simsearch
The “simsearch” command-line tool (also available as “chemfp
simsearch”) carries out a similarity search of an FPS or FPB
fingerprint file. The chemfp.simsearch() function implements
similar functionality in the Python API. See the examples of
k-nearest and threshold searches, and of saving the output in
CSV and NumPy’s npz formats.
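To make the two search modes concrete, here is a minimal brute-force sketch in plain Python. It is not chemfp's implementation and does not use the chemfp API; the function names, the sample fingerprints, and the choice to score two empty fingerprints as 0.0 are assumptions made for illustration.

```python
# Illustrative sketch only: brute-force Tanimoto search over byte fingerprints.
# Nothing here is part of chemfp; all names are hypothetical.

def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return popcount(a & b) / union


def threshold_search(query, targets, threshold):
    """Return every (target_id, score) with score >= threshold."""
    hits = []
    for target_id, fp in targets:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((target_id, score))
    return hits


def knearest_search(query, targets, k, threshold=0.0):
    """Return the k highest-scoring hits at or above the threshold."""
    hits = threshold_search(query, targets, threshold)
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:k]


# Hypothetical 16-bit fingerprints, hex-encoded as they would be in an FPS file.
targets = [
    ("T1", bytes.fromhex("ff00")),
    ("T2", bytes.fromhex("f0f0")),
    ("T3", bytes.fromhex("000f")),
]
query = bytes.fromhex("ff0f")

threshold_hits = threshold_search(query, targets, 0.3)
nearest = knearest_search(query, targets, k=1)
```

A threshold search returns every target at or above the cutoff, while a k-nearest search keeps only the k best of those. The command-line tool exposes the same two ideas through --threshold and --k-nearest.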
simsearch command-line options
The following comes from simsearch --help:
Usage: simsearch [OPTIONS] TARGET_FILENAME
Search an FPS or FPB file for similar fingerprints.
Options:
-k, --k-nearest, --k K Select the k nearest neighbors (use 'all'
for all neighbors)
-t, --threshold FLOAT RANGE Minimum similarity score threshold
[0.0<=x<=1.0]
--beta FLOAT Tversky beta parameter (default: the value
of --alpha)
--alpha FLT Tversky alpha parameter (default: 1.0)
-q, --queries PATH Filename containing the query fingerprints
--NxN Use the targets as the queries, and exclude
the self-similarity term
--query TEXT query as a structure record (default format:
'smi')
--hex-query, --hex HEX_STR query in hex
--query-id STR id for the query or hex-query (default:
'Query1')
--query-format, --in FORMAT input query format (default uses the file
extension, else 'fps')
--target-format FORMAT input target format (default uses the file
extension, else 'fps')
--query-type STRING fingerprint type string if the queries are
structures (default: use the target
fingerprint type)
--id-tag NAME tag containing the record id if --query-
format is an SD file)
--errors [strict|report|ignore]
how should structure parse errors be
handled? (default=ignore)
--delimiter VALUE Delimiter style for SMILES and InChI files.
Forces '-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI
file. Forces '-R has_header=1'.
-R NAME=VALUE Specify a reader argument
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the default
support for CXSMILES extensions. Forces '-R
cxsmiles=1' or '-R cxsmiles=0'.
--ordering, --order [decreasing-score|increasing-score|decreasing-score-plus|increasing-score-plus|reverse]
Specify how the hits are ordered, rather
than use the search-specific default
-o, --output FILENAME output filename (default is stdout)
--out FORMAT Output format. One of 'simsearch', 'csv',
'tsv', or 'npz' (default: based on filename,
or 'simsearch')
--include-metadata / --no-metadata
With --no-metadata, do not include header
metadata in 'simsearch' output format.
--include-empty / --no-include-empty
In csv or tsv output, include a line for
queries with no hits (the default)
--empty-target-id STR In csv or tsv output, the target id for a
query with no hits (default: '*')
--empty-score STR In csv or tsv output, the score for a query
with no hits (default: 'NaN')
--precision [1|2|3|4|5|6|7|8|9|10]
Number of digits in Tanimoto score (default:
based on the fingerprint size)
-c, --count Report counts
-j, --num-threads N The number of threads to use. -1 means the
default value (which is 8 for this
computer), and can be set using
$OMP_NUM_THREADS. 0 and 1 both mean single-
threaded. (default: -1)
-b, --batch-size INTEGER RANGE Number of fingerprints to process at a time
[x>=1]
--scan Scan the file to find matches (low memory
overhead)
--memory Build and search an in-memory data structure
(faster for multiple queries)
--no-mmap Don't use mmap to read uncompressed FPB
files. May give better performance on
networked file systems, at the expense of
higher memory use.
--times / --no-times Write timing information to stderr
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--version Show the version and exit.
--license-check Check the license and report results to
stdout.
--license-file FILENAME Specify a chemfp license file
--traceback Print the traceback on KeyboardInterrupt
--help Show this message and exit.
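The --alpha and --beta options listed above select a Tversky search. As a rough sketch of what those parameters control, the standard Tversky index can be written in plain Python as below. This is not chemfp code; the function name is hypothetical, and the assignment of alpha to the query-only bits and beta to the target-only bits is an assumption. The default of beta falling back to alpha mirrors the help text.

```python
# Illustrative sketch only: the standard Tversky index on byte fingerprints.
# Assumption: alpha weights bits found only in the query, beta weights bits
# found only in the target.

def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def tversky(query: bytes, target: bytes, alpha: float = 1.0, beta=None) -> float:
    if beta is None:
        beta = alpha  # same default rule as the --beta help text
    q = int.from_bytes(query, "big")
    t = int.from_bytes(target, "big")
    common = popcount(q & t)
    only_query = popcount(q) - common
    only_target = popcount(t) - common
    denominator = alpha * only_query + beta * only_target + common
    if denominator == 0:
        return 0.0  # assumed convention for a zero denominator
    return common / denominator


query = bytes.fromhex("ff0f")   # hypothetical fingerprint, 12 bits set
target = bytes.fromhex("ff00")  # hypothetical fingerprint, 8 bits set

same_as_tanimoto = tversky(query, target)                   # alpha = beta = 1.0
same_as_dice = tversky(query, target, alpha=0.5)            # alpha = beta = 0.5
target_in_query = tversky(query, target, alpha=0.0, beta=1.0)
```

With alpha = beta = 1.0 the index reduces to Tanimoto, and with alpha = beta = 0.5 it reduces to the Dice coefficient. Unequal values make the score asymmetric, which is useful for substructure-like comparisons.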
The supported --ordering parameters are:
* decreasing-score: sort hits from highest to lowest score
* increasing-score: sort hits from lowest to highest score
* decreasing-score-plus: decreasing-score, with ties broken by index/id
* increasing-score-plus: increasing-score, with ties broken by index/id
* reverse: reverse the output order
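The ordering names above can be read as sort keys. The following plain-Python sketch shows one interpretation. It is not how chemfp implements ordering; the function name is hypothetical, and breaking ties by ascending target id is an assumption (the help text only says "index/id").

```python
# Illustrative sketch only: one reading of the --ordering names as sort keys.
# hits is a list of (target_id, score) tuples.

def order_hits(hits, ordering):
    if ordering == "decreasing-score":
        # Python's sort is stable, so tied hits keep their input order here;
        # the real tool may not make that guarantee.
        return sorted(hits, key=lambda hit: -hit[1])
    if ordering == "increasing-score":
        return sorted(hits, key=lambda hit: hit[1])
    if ordering == "decreasing-score-plus":
        # Assumption: ties are broken by ascending target id.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if ordering == "increasing-score-plus":
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError(f"unsupported ordering: {ordering!r}")


# Hypothetical hits with one tied score.
hits = [("B", 0.5), ("A", 0.5), ("C", 0.9)]

decreasing_plus = order_hits(hits, "decreasing-score-plus")
increasing_plus = order_hits(hits, "increasing-score-plus")
reversed_hits = order_hits(hits, "reverse")
```

The "-plus" variants matter when output must be reproducible across runs, because tied scores otherwise have no defined relative order.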
Examples:
* Find the nearest 2 ChEMBL fingerprints given a SMILES string. Write the
results to stdout in "simsearch" format, each query and its hits on one
line:
% simsearch --query c1ccccc1P chembl_34.fpb -k 2
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#targets=chembl_34.fpb
2 Query1 CHEMBL119405 0.4666667 CHEMBL14092 0.4285714
* Generate an NxN matrix and save the results in an npz file compatible with
a SciPy sparse matrix.
% simsearch --NxN distinct.fps -o distinct.npz
* Use query fingerprints from a file (in FPS format) to search target
fingerprints (in FPB format), for fingerprints with a Tanimoto similarity of
at least 0.41. Write the matches to stdout in "csv" format, with one row for
each query hit. If a query has no hits then use "*" (the default) for the
target id and specify "NA" for the score.
% simsearch --queries queries.fps targets.fpb --threshold 0.41 \
--out csv --empty-score NA
query_id,target_id,score
22525101,22525003,0.4261364
22525101,22525019,0.4224599
22525101,22525016,0.9161290
22525102,*,NA
22525103,*,NA
22525104,22525016,0.4100418
* Do the same search but save the results to a tsv (tab-separated) file. The
format is inferred from the output filename.
% simsearch --queries queries.fps targets.fpb --threshold 0.41 \
--empty-score NA -o results.tsv --no-progress
% head -7 results.tsv
query_id target_id score
22525101 22525003 0.4261364
22525101 22525019 0.4224599
22525101 22525016 0.9161290
22525102 * NA
22525103 * NA
22525104 22525016 0.4100418
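The csv and tsv outputs shown above can be read back with Python's standard csv module. The sketch below groups hits by query and treats the "*" placeholder row as "no hits". The function name is hypothetical, and the sample text is copied from the csv example output above rather than produced by running the tool.

```python
# Illustrative sketch only: read simsearch csv/tsv output into a dictionary.
import csv
import io
from collections import defaultdict


def read_hits(infile, delimiter=",", empty_target_id="*"):
    """Map each query id to a list of (target_id, score) hits.

    Rows whose target id equals empty_target_id are placeholders for
    queries with no hits; those queries map to an empty list.
    """
    hits = defaultdict(list)
    for row in csv.DictReader(infile, delimiter=delimiter):
        query_hits = hits[row["query_id"]]  # creates the entry if missing
        if row["target_id"] == empty_target_id:
            continue  # placeholder row; its score (e.g. "NA") is not parsed
        query_hits.append((row["target_id"], float(row["score"])))
    return dict(hits)


# Sample data copied from the csv example output in this section.
sample = io.StringIO(
    "query_id,target_id,score\n"
    "22525101,22525003,0.4261364\n"
    "22525101,22525019,0.4224599\n"
    "22525101,22525016,0.9161290\n"
    "22525102,*,NA\n"
    "22525103,*,NA\n"
    "22525104,22525016,0.4100418\n"
)
hits_by_query = read_hits(sample)
```

For a tsv file such as results.tsv, open the file with newline="" and pass delimiter="\t". If --empty-target-id was changed on the command line, pass the same value as empty_target_id. Note that this sketch cannot distinguish the placeholder from a real target whose id happens to be "*".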