chemfp shardsearch
The “chemfp shardsearch” command-line tool does a similarity search of multiple target files, possibly in parallel, and merges the results.
It can be used to search very large datasets that do not fit into RAM, datasets that are too large for the FPB file format, or datasets that are naturally viewed as multiple files (e.g., datasets with a large quarterly release and small weekly cumulative updates).
By convention each target file is referred to as a “shard”. The following searches all shards matching the glob pattern “shard*.fpb” and outputs the targets which are at least 0.2 similar to the query SMILES, with the output in csv format:
% chemfp shardsearch --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
--threshold 0.2 shard*.fpb --out csv
query_id,target_id,score
Query1,CHEMBL327316,0.2203390
Query1,CHEMBL539424,0.2340426
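The merge step can be pictured with a short Python sketch. This is only an illustration of the idea, not chemfp's implementation: the function names, the `(target_id, score)` tuple layout, the sample ids and scores, and the choice of decreasing-score order are all assumptions made for the example.

```python
import heapq
from itertools import chain


def merge_threshold_hits(per_shard_hits, threshold):
    """Combine threshold-search hits from every shard into one list."""
    merged = [hit for hit in chain.from_iterable(per_shard_hits)
              if hit[1] >= threshold]
    # Decreasing score, with the id as a tie-breaker so the order is stable.
    # (An arbitrary choice for this sketch.)
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged


def merge_knearest_hits(per_shard_hits, k):
    """Keep the k best hits overall, given each shard's own best hits."""
    return heapq.nlargest(k, chain.from_iterable(per_shard_hits),
                          key=lambda hit: hit[1])


# Hypothetical per-shard results as (target_id, score) pairs.
shard_a = [("A1", 0.22)]
shard_b = [("B1", 0.23), ("B2", 0.15)]

print(merge_threshold_hits([shard_a, shard_b], 0.2))
print(merge_knearest_hits([shard_a, shard_b], 1))
```

For a k-nearest search, the k best hits overall must be among the k best hits of each individual shard, so it is enough to merge the per-shard top-k lists and keep the best k.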
Shardsearch can be configured to process a user-defined number of
queries against all of the shards in parallel (up to --num-threads
searches at a time), or to visit each shard in order, which is useful
if the entire dataset does not fit into memory.
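The following Python sketch shows the general shape of searching one batch of queries against every shard through a thread pool. It is a toy model under stated assumptions, not how shardsearch is implemented: fingerprints are plain Python integers, each shard is an in-memory dict, and the names `tanimoto`, `search_shard`, and `shard_search` are hypothetical. Pure-Python threads also do not give real CPU parallelism, so this only illustrates the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def search_shard(queries, shard, threshold):
    """Search one batch of queries against one shard (dict of id -> fp)."""
    hits = []
    for query_id, query_fp in queries:
        for target_id, target_fp in shard.items():
            score = tanimoto(query_fp, target_fp)
            if score >= threshold:
                hits.append((query_id, target_id, score))
    return hits


def shard_search(queries, shards, threshold, num_threads=2):
    """Run one batch against all shards in a thread pool, then merge."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(search_shard, queries, shard, threshold)
                   for shard in shards]
        hits = [hit for future in futures for hit in future.result()]
    # Group by query, best score first (an arbitrary choice for this sketch).
    return sorted(hits, key=lambda hit: (hit[0], -hit[2], hit[1]))


# Hypothetical 4-bit fingerprints split across two shards.
queries = [("Query1", 0b1111)]
shards = [{"T1": 0b1111, "T2": 0b0001}, {"T3": 0b0011}]
print(shard_search(queries, shards, threshold=0.4))
```

With `num_threads=1` the pool processes one shard at a time, which corresponds to visiting each shard in order.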
If the shards are on a network file system then you should compress the FPB files with ZStandard, that is, as fpb.zst files. Most of the time will be spent transferring the data from the remote computer, decompression is fast, and ZStandard compresses better than gzip.
The rest of this chapter contains the output from chemfp shardsearch --help.
chemfp shardsearch command-line options
The following comes from chemfp shardsearch --help:
Usage: chemfp shardsearch [OPTIONS] [TARGET_FILENAMES]...
Search 1 or more fingerprint files for similarity.
Options:
-k, --k-nearest, --k K Select the k nearest neighbors (use 'all'
for all neighbors)
-t, --threshold FLOAT RANGE Minimum similarity score threshold
[0.0<=x<=1.0]
--beta FLOAT Tversky beta parameter (default: the value
of --alpha)
--alpha FLT Tversky alpha parameter (default: 1.0)
-q, --queries PATH Filename containing the query fingerprints
--query TEXT query as a structure record (default format:
'smi')
--hex-query, --hex HEX_STR query in hex
--query-id STR id for the query or hex-query (default:
'Query1')
--query-format, --in FORMAT input query format (default uses the file
extension, else 'fps')
--target-format FORMAT input target format (default uses the file
extension, else 'fps')
--load [first|once|each-time] If 'first', load all the targets into memory
before searching. If 'once' (the default),
wait to load until needed. If 'each-time',
(re)load each target for each input batch.
--query-type STRING fingerprint type string if the queries are
structures (default: use the target
fingerprint type)
--id-tag NAME tag containing the record id (if
--query-format is an SD file)
--errors [strict|report|ignore]
how should structure parse errors be
handled? (default=ignore)
--delimiter VALUE Delimiter style for SMILES and InChI files.
Forces '-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI
file. Forces '-R has_header=1'.
-R NAME=VALUE Specify a reader argument
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the default
support for CXSMILES extensions. Forces '-R
cxsmiles=1' or '-R cxsmiles=0'.
--ordering, --order [decreasing-score|increasing-score|decreasing-score-plus|increasing-score-plus|reverse]
Specify how the hits are ordered, rather
than use the search-specific default
-j, --num-threads N The number of threads to use. -1 means the
default value (6 for this computer). 0 and 1
both mean single-threaded. (default: -1)
--num-omp-threads INTEGER RANGE
The number of OpenMP threads to use for each
shard search. Only useful when there are
multiple queries. Mixing regular threads and
OpenMP threads may crash on macOS. (default:
1) [1<=x<=16]
-o, --output FILENAME output filename (default is stdout)
--out FORMAT Output format. One of 'simsearch', 'csv', or
'tsv' (default: based on filename, or
'simsearch')
--include-metadata / --no-metadata
With --no-metadata, do not include header
metadata in 'simsearch' output format.
--include-empty / --no-include-empty
In csv or tsv output, include a line for
queries with no hits (the default)
--empty-target-id STR In csv or tsv output, the target id for a
query with no hits (default: '*')
--empty-score STR In csv or tsv output, the score for a query
with no hits (default: 'NaN')
--precision [1|2|3|4|5|6|7|8|9|10]
Number of digits in Tanimoto score (default:
based on the fingerprint size)
-b, --batch-size INTEGER RANGE Number of fingerprints to process at a time
(default: 100) [x>=1]
--no-mmap Don't use mmap to read uncompressed FPB
files. May give better performance on
networked file systems, at the expense of
higher memory use.
--times / --no-times Write timing information to stderr
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--version Show the version and exit.
--license-check Check the license and report results to
stdout.
--help Show this message and exit.
Use shardsearch to do a similarity search between the query fingerprints
(from stdin or --queries) and the target fingerprints in 1 or more target
fingerprint files. Each target is called a "shard".
A shard search reads a batch of fingerprints of size `--batch-size`
(default: 100) then carries out a similarity search between that batch and
each of the shards. The searches are done in a thread pool with
`--num-threads` / `-j` threads. Each query x shard search uses
`--num-omp-threads`, which defaults to 1 OpenMP thread.
Good thread allocation is tricky. The main rule-of-thumb is to minimize data
transfer. If you have many queries and the targets are all memory-mapped FPB
files then set the number of OpenMP threads to 8 and the thread pool size to
1. If you have one query then OpenMP is not used.
If the data fits into RAM then the total number of threads should not be
much larger than the number of memory channels.
If your disk or network is slow, you might use ZStandard to compress the FPB
file to fpb.zst format, which is about 1/3rd the size. This is relatively
fast to uncompress into RAM.
There are three strategies for how to load the target files. If `--load` is
"first" then all targets are loaded when shardsearch starts. This is best if
you have enough memory. If `--load` is "once" then targets are loaded when
first needed. Note that shard search by default will memory-map an FPB file,
which counts as virtual size but doesn't need much RAM.
If `--load` is "each-time" then the shard is loaded only when it is time to
process it in the thread pool, then it is closed, and re-loaded if it's
needed again. You might use this if, for example, you have 1 query and only
enough space for two shards (in fpb.zst format) to be loaded into memory at
the same time.
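The three `--load` strategies described in the help text can be modelled with a small Python sketch. This is a simplified illustration and not chemfp code: the `ShardPool` class, its `get` method, and the `loader` callable are hypothetical names, and real loading (memory-mapping, decompression, closing files) is reduced to a single function call.

```python
class ShardPool:
    """Toy model of the 'first', 'once', and 'each-time' load strategies."""

    def __init__(self, filenames, loader, load="once"):
        if load not in ("first", "once", "each-time"):
            raise ValueError(f"unsupported load strategy: {load!r}")
        self.filenames = list(filenames)
        self.loader = loader  # hypothetical callable: filename -> shard
        self.load = load
        self.cache = {}
        if load == "first":
            # 'first': load every shard before any search starts.
            for filename in self.filenames:
                self.cache[filename] = loader(filename)

    def get(self, filename):
        if self.load == "each-time":
            # 'each-time': never keep the shard; reload on every use.
            return self.loader(filename)
        if filename not in self.cache:
            # 'once': load lazily on first use, then keep it.
            self.cache[filename] = self.loader(filename)
        return self.cache[filename]


# Record which files the (fake) loader is asked to read.
calls = []

def fake_loader(filename):
    calls.append(filename)
    return f"<contents of {filename}>"

pool = ShardPool(["shard1.fpb", "shard2.fpb"], fake_loader, load="each-time")
pool.get("shard1.fpb")
pool.get("shard1.fpb")
print(calls)
```

The trade-off is memory against repeated I/O: "first" and "once" keep every loaded shard available, while "each-time" holds only the shards currently being searched.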