chemfp shardsearch

The “chemfp shardsearch” command-line tool does a similarity search of multiple target files, possibly in parallel, and merges the results.

It can be used to search very large datasets that do not fit into RAM, datasets that are too large for the FPB file format, or datasets that are naturally viewed as multiple files (for example, datasets with a large quarterly release and small weekly cumulative updates).

By convention each target file is referred to as a “shard”. The following searches all shards matching the glob pattern “shard*.fpb” and outputs the targets which are at least 0.2 similar to the query SMILES, with the output in csv format:

% chemfp shardsearch --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
        --threshold 0.2 shard*.fpb --out csv
query_id,target_id,score
Query1,CHEMBL327316,0.2203390
Query1,CHEMBL539424,0.2340426

Shardsearch can be configured to search a user-defined number of queries against all of the shards in parallel (up to --num-threads shard searches at a time), or to visit each shard in order, which is useful if the entire dataset does not fit into memory.
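
The two processing modes can be pictured with a small Python sketch. It is not chemfp's implementation or API: the fingerprints are plain integers, and `tanimoto`, `search_shard`, and `shard_search` are hypothetical names chosen for illustration. Each shard is searched in a thread pool of `num_threads` workers (1 means the shards are visited one at a time), and the per-shard hits are merged into a single list.

```python
from concurrent.futures import ThreadPoolExecutor

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def search_shard(query, shard, threshold):
    """Return (target_id, score) pairs in one shard at or above the threshold."""
    hits = []
    for target_id, fp in shard:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((target_id, score))
    return hits

def shard_search(query, shards, threshold, num_threads=1):
    """Search every shard, then merge the hits by decreasing score."""
    with ThreadPoolExecutor(max_workers=max(1, num_threads)) as pool:
        per_shard = pool.map(lambda shard: search_shard(query, shard, threshold), shards)
        merged = [hit for hits in per_shard for hit in hits]
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))

# Two toy shards; the ids and bit patterns are made up.
shards = [
    [("A1", 0b1111), ("A2", 0b0001)],
    [("B1", 0b0111), ("B2", 0b1000)],
]
hits = shard_search(0b0111, shards, threshold=0.5, num_threads=2)
```

In this toy model the merged result is the same whether `num_threads` is 1 or 2; only the order in which the shards are visited changes.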

If the shards are on a network file system then you should compress the FPB files with ZStandard, that is, store them as fpb.zst files. Most of the time will be spent transferring the data from the remote computer, decompression is fast, and ZStandard compresses better than gzip.
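
A back-of-the-envelope estimate shows why compression can help on a slow link. The Python sketch below is an illustration only: the 10,000 MB shard size, the 100 MB/s network rate, and the 1000 MB/s decompression rate are hypothetical numbers, and the 1/3 ratio is the approximate figure quoted in the --help text later in this chapter. It also assumes the transfer and the decompression happen one after the other.

```python
def load_seconds(size_mb, network_mb_per_s, ratio=1.0, decompress_mb_per_s=None):
    """Estimate seconds to fetch a shard, optionally compressed to `ratio` of its size."""
    transfer = size_mb * ratio / network_mb_per_s
    # Decompression cost is charged on the uncompressed size.
    decompress = size_mb / decompress_mb_per_s if decompress_mb_per_s else 0.0
    return transfer + decompress

# All numbers below are assumptions chosen for illustration.
uncompressed = load_seconds(10_000, network_mb_per_s=100)
compressed = load_seconds(10_000, network_mb_per_s=100, ratio=1 / 3,
                          decompress_mb_per_s=1000)
```

Under these assumptions the compressed shard loads in well under half the time of the uncompressed one. Measure your own network and storage before relying on such an estimate.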

The rest of this chapter contains the output from chemfp shardsearch --help.

chemfp shardsearch command-line options

The following comes from chemfp shardsearch --help:

Usage: chemfp shardsearch [OPTIONS] [TARGET_FILENAMES]...

  Search 1 or more fingerprint files for similarity.

Options:
  -k, --k-nearest, --k K          Select the k nearest neighbors (use 'all'
                                  for all neighbors)
  -t, --threshold FLOAT RANGE     Minimum similarity score threshold
                                  [0.0<=x<=1.0]
  --beta FLOAT                    Tversky beta parameter (default: the value
                                  of --alpha)
  --alpha FLT                     Tversky alpha parameter (default: 1.0)
  -q, --queries PATH              Filename containing the query fingerprints
  --query TEXT                    query as a structure record (default format:
                                  'smi')
  --hex-query, --hex HEX_STR      query in hex
  --query-id STR                  id for the query or hex-query (default:
                                  'Query1')
  --query-format, --in FORMAT     input query format (default uses the file
                                  extension, else 'fps')
  --target-format FORMAT          input target format (default uses the file
                                  extension, else 'fps')
  --load [first|once|each-time]   If 'first', load all the targets into memory
                                  before searching. If 'once' (the default),
                                  wait to load until needed. If 'each-time',
                                  (re)load each target for each input batch.
  --query-type STRING             fingerprint type string if the queries are
                                  structures (default: use the target
                                  fingerprint type)
  --id-tag NAME                   tag containing the record id (if --query-
                                  format is an SD file)
  --errors [strict|report|ignore]
                                  how should structure parse errors be
                                  handled? (default=ignore)
  --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                  Forces '-R delimiter=VALUE'.
  --has-header                    Skip the first line of a SMILES or InChI
                                  file. Forces '-R has_header=1'.
  -R NAME=VALUE                   Specify a reader argument
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                  support for CXSMILES extensions. Forces '-R
                                  cxsmiles=1' or '-R cxsmiles=0'.
  --ordering, --order [decreasing-score|increasing-score|decreasing-score-plus|increasing-score-plus|reverse]
                                  Specify how the hits are ordered, rather
                                  than use the search-specific default
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (6 for this computer). 0 and 1
                                  both mean single-threaded. (default: -1)
  --num-omp-threads INTEGER RANGE
                                  The number of OpenMP threads to use for each
                                  shard search. Only useful when there are
                                  multiple queries. Mixing regular threads and
                                  OpenMP threads may crash on macOS. (default:
                                  1)  [1<=x<=16]
  -o, --output FILENAME           output filename (default is stdout)
  --out FORMAT                    Output format. One of 'simsearch', 'csv', or
                                  'tsv' (default: based on filename, or
                                  'simsearch')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include header
                                  metadata in 'simsearch' output format.
  --include-empty / --no-include-empty
                                  In csv or tsv output, include a line for
                                  queries with no hits (the default)
  --empty-target-id STR           In csv or tsv output, the target id for a
                                  query with no hits (default: '*')
  --empty-score STR               In csv or tsv output, the score for a query
                                  with no hits (default: 'NaN')
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -b, --batch-size INTEGER RANGE  Number of fingerprints to process at a time
                                  (default: 100)  [x>=1]
  --no-mmap                       Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --version                       Show the version and exit.
  --license-check                 Check the license and report results to
                                  stdout.
  --help                          Show this message and exit.

  Use shardsearch to do a similarity search between the query fingerprints
  (from stdin or --queries) and the target fingerprints in 1 or more target
  fingerprint files. Each target is called a "shard".

  A shard search reads a batch of fingerprints of size `--batch-size`
  (default: 100) then carries out a similarity search between that batch and
  each of the shards. The searches are done in a thread pool with `--num-
  threads` / `-j` threads. Each query x shard search uses `--num-omp-threads`,
  which defaults to 1 OpenMP thread.

  Good thread allocation is tricky. The main rule-of-thumb is to minimize data
  transfer. If you have many queries and the targets are all memory-mapped FPB
  files then set the number of OpenMP threads to 8 and the thread pool size to
  1. If you have one query then OpenMP is not used.

  If the data fits into RAM then the total number of threads should not be
  much larger than the number of memory channels.

  If your disk or network is slow, you might use ZStandard to compress the FPB
  file to fpb.zst format, which is about 1/3rd the size. This is relatively
  fast to uncompress into RAM.

  There are three strategies for how to load the target files. If `--load` is
  "first" then all targets are loaded when shardsearch starts. This is best if
  you have enough memory. If `--load` is "once" then targets are loaded when
  first needed. Note that shard search by default will memory-map an FPB file,
  which counts as virtual size but doesn't need much RAM.

  If `--load` is "each-time" then the shard is loaded only when it is time to
  process it in the thread pool, then it is closed, and re-loaded if it's
  needed again. You might use this if, for example, you have 1 query and only
  enough space for two shards (in fpb.zst format) to be loaded into memory at
  the same time.
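
The three `--load` strategies described in the help text differ only in when a shard is opened and how long it is kept. The following Python sketch counts the opens for each strategy. It is a toy model, not chemfp's code: `ShardSet`, `opener`, `get`, and `count_opens` are hypothetical names, and a "shard" here is just a filename string.

```python
class ShardSet:
    """Toy model of the 'first', 'once', and 'each-time' load strategies."""

    def __init__(self, filenames, load, opener):
        self.filenames = list(filenames)
        self.load = load
        self.opener = opener
        self.cache = {}
        if load == "first":
            # Open everything up front, before any search starts.
            for name in self.filenames:
                self.cache[name] = opener(name)

    def get(self, name):
        """Return the opened shard for one batch x shard search."""
        if self.load == "each-time":
            # Never cached: the caller drops it after the search.
            return self.opener(name)
        if name not in self.cache:
            # 'once': open lazily on first use, then keep it.
            self.cache[name] = self.opener(name)
        return self.cache[name]

def count_opens(load, num_batches, filenames=("a.fpb", "b.fpb")):
    """Count how many times shards are opened over `num_batches` query batches."""
    opens = []

    def opener(name):
        opens.append(name)
        return name

    shards = ShardSet(filenames, load, opener)
    for _ in range(num_batches):
        for name in filenames:
            shards.get(name)
    return len(opens)
```

With three batches and two shards, "first" and "once" each open every shard a single time (2 opens in total), while "each-time" opens a shard for every batch (6 opens). With zero batches, only "first" has opened anything.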