simsearch

The “simsearch” command-line tool (also available as “chemfp simsearch”) carries out a similarity search of an FPS or FPB fingerprint file. The chemfp.simsearch() function implements similar functionality in the Python API. See examples of k-nearest and threshold search, and of saving the output in CSV and NumPy’s npz formats.
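
To make the two search modes concrete, here is a minimal pure-Python sketch of a k-nearest and a threshold search over tiny byte-string fingerprints. It is not chemfp's implementation and does not use the chemfp API. The helper names (popcount, tversky, search) and the sample fingerprints are hypothetical. It uses the textbook Tversky index, which reduces to the Tanimoto score when alpha and beta are both 1.0, and it copies the option defaults described in the help text below (alpha of 1.0, beta defaulting to alpha).

```python
def popcount(fp):
    """Count the on-bits in a fingerprint given as bytes."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query, target, alpha=1.0, beta=None):
    """Textbook Tversky index; alpha == beta == 1.0 gives the Tanimoto score."""
    if beta is None:
        beta = alpha  # same default rule as the --beta option
    common = popcount(bytes(q & t for q, t in zip(query, target)))
    only_query = popcount(query) - common
    only_target = popcount(target) - common
    denominator = alpha * only_query + beta * only_target + common
    # Assumption in this sketch: an undefined 0/0 score is reported as 0.0.
    return common / denominator if denominator else 0.0


def search(query, targets, k=None, threshold=0.0, alpha=1.0, beta=None):
    """Return (target_id, score) hits at or above threshold, best first."""
    hits = []
    for target_id, target_fp in targets:
        score = tversky(query, target_fp, alpha, beta)
        if score >= threshold:
            hits.append((target_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if k is None else hits[:k]


# Hypothetical 16-bit fingerprints, written as hex strings like an FPS record.
targets = [
    ("T1", bytes.fromhex("ff00")),
    ("T2", bytes.fromhex("0f00")),
    ("T3", bytes.fromhex("00ff")),
]
query = bytes.fromhex("ff0f")

nearest = search(query, targets, k=2)          # like "-k 2"
above = search(query, targets, threshold=0.5)  # like "--threshold 0.5"

for target_id, score in nearest:
    print(f"{target_id}\t{score:.7f}")
```

With this sample data the k=2 search keeps T1 and T2, in that order, and the 0.5 threshold search keeps only T1. A real search tool would use far faster popcount and pruning techniques than this linear scan.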

simsearch command-line options

The following comes from simsearch --help:

Usage: simsearch [OPTIONS] TARGET_FILENAME

  Search an FPS or FPB file for similar fingerprints.

Options:
  -k, --k-nearest, --k K          Select the k nearest neighbors (use 'all'
                                  for all neighbors)
  -t, --threshold FLOAT RANGE     Minimum similarity score threshold
                                  [0.0<=x<=1.0]
  --beta FLOAT                    Tversky beta parameter (default: the value
                                  of --alpha)
  --alpha FLOAT                   Tversky alpha parameter (default: 1.0)
  -q, --queries PATH              Filename containing the query fingerprints
  --NxN                           Use the targets as the queries, and exclude
                                  the self-similarity term
  --query TEXT                    query as a structure record (default format:
                                  'smi')
  --hex-query, --hex HEX_STR      query in hex
  --query-id STR                  id for the query or hex-query (default:
                                  'Query1')
  --query-format, --in FORMAT     input query format (default uses the file
                                  extension, else 'fps')
  --target-format FORMAT          input target format (default uses the file
                                  extension, else 'fps')
  --query-type STRING             fingerprint type string if the queries are
                                  structures (default: use the target
                                  fingerprint type)
  --id-tag NAME                   tag containing the record id (if --query-
                                  format is an SD file)
  --errors [strict|report|ignore]
                                  how should structure parse errors be
                                  handled? (default=ignore)
  --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                  Forces '-R delimiter=VALUE'.
  --has-header                    Skip the first line of a SMILES or InChI
                                  file. Forces '-R has_header=1'.
  -R NAME=VALUE                   Specify a reader argument
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                  support for CXSMILES extensions. Forces '-R
                                  cxsmiles=1' or '-R cxsmiles=0'.
  --ordering, --order [decreasing-score|increasing-score|decreasing-score-plus|increasing-score-plus|reverse]
                                  Specify how the hits are ordered, rather
                                  than use the search-specific default
  -o, --output FILENAME           output filename (default is stdout)
  --out FORMAT                    Output format. One of 'simsearch', 'csv',
                                  'tsv', or 'npz' (default: based on filename,
                                  or 'simsearch')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include header
                                  metadata in 'simsearch' output format.
  --include-empty / --no-include-empty
                                  In csv or tsv output, include a line for
                                  queries with no hits (the default)
  --empty-target-id STR           In csv or tsv output, the target id for a
                                  query with no hits (default: '*')
  --empty-score STR               In csv or tsv output, the score for a query
                                  with no hits (default: 'NaN')
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -c, --count                     Report counts
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (which is 8 for this
                                  computer), and can be set using
                                  $OMP_NUM_THREADS. 0 and 1 both mean single-
                                  threaded. (default: -1)
  -b, --batch-size INTEGER RANGE  Number of fingerprints to process at a time
                                  [x>=1]
  --scan                          Scan the file to find matches (low memory
                                  overhead)
  --memory                        Build and search an in-memory data structure
                                  (faster for multiple queries)
  --no-mmap                       Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --version                       Show the version and exit.
  --license-check                 Check the license and report results to
                                  stdout.
  --license-file FILENAME         Specify a chemfp license file
  --traceback                     Print the traceback on KeyboardInterrupt
  --help                          Show this message and exit.

  The supported --ordering parameters are:

    * decreasing-score: sort hits from highest to lowest score
    * increasing-score: sort hits from lowest to highest score
    * decreasing-score-plus: decreasing-score, with ties broken by index/id
    * increasing-score-plus: increasing-score, with ties broken by index/id
    * reverse: reverse the output order

  Examples:

  * Find the nearest 2 ChEMBL fingerprints given a SMILES string. Write the
  results to stdout in "simsearch" format, each query and its hits on one
  line:

    % simsearch --query c1ccccc1P chembl_34.fpb -k 2
    #Simsearch/1
    #num_bits=2048
    #type=Tanimoto k=2 threshold=0.0
    #software=chemfp/4.2
    #targets=chembl_34.fpb
    2     Query1  CHEMBL119405    0.4666667       CHEMBL14092     0.4285714

  * Generate an NxN matrix and save the results in an npz file compatible with
  a SciPy sparse matrix.

    % simsearch --NxN distinct.fps -o distinct.npz

  * Use query fingerprints from a file (in FPS format) to search target
  fingerprints (in FPB format), for fingerprints with a Tanimoto similarity of
  at least 0.41. Write the matches to stdout in "csv" format, with one row for
  each query hit. If a query has no hits then use "*" (the default) for the
  target id and specify "NA" for the score.

    % simsearch --queries queries.fps targets.fpb --threshold 0.41 \
          --out csv --empty-score NA
    query_id,target_id,score
    22525101,22525003,0.4261364
    22525101,22525019,0.4224599
    22525101,22525016,0.9161290
    22525102,*,NA
    22525103,*,NA
    22525104,22525016,0.4100418

  * Do the same search but save the results to a tsv (tab-separated) file. The
  format is inferred from the output filename.

    % simsearch --queries queries.fps targets.fpb --threshold 0.41 \
          --empty-score NA -o results.tsv --no-progress
    % head -7 results.tsv
    query_id      target_id       score
    22525101      22525003        0.4261364
    22525101      22525019        0.4224599
    22525101      22525016        0.9161290
    22525102      *       NA
    22525103      *       NA
    22525104      22525016        0.4100418
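
The csv output shown in the examples above can be consumed with Python's standard csv module. The following is a minimal sketch, not part of chemfp: read_hits and decreasing_score_plus are hypothetical helper names, and the sample text is copied from the csv example above. It assumes the default "*" placeholder marks a query with no hits, so the placeholder score ("NA" here) is never parsed as a number. The sorting helper only approximates the decreasing-score-plus ordering by breaking ties on the target id string.

```python
import csv
import io
from collections import defaultdict


def read_hits(csv_text, empty_target_id="*"):
    """Group csv-format rows into {query_id: [(target_id, score), ...]}."""
    hits = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        query_hits = hits[row["query_id"]]  # creates an entry even with no hits
        if row["target_id"] != empty_target_id:
            query_hits.append((row["target_id"], float(row["score"])))
    return dict(hits)


def decreasing_score_plus(query_hits):
    """Sort from highest to lowest score, breaking ties by target id."""
    return sorted(query_hits, key=lambda hit: (-hit[1], hit[0]))


# Sample text copied from the csv example output above.
sample = """\
query_id,target_id,score
22525101,22525003,0.4261364
22525101,22525019,0.4224599
22525101,22525016,0.9161290
22525102,*,NA
22525103,*,NA
22525104,22525016,0.4100418
"""

hits = read_hits(sample)
for query_id, query_hits in hits.items():
    ordered = decreasing_score_plus(query_hits)
    best = ordered[0] if ordered else None
    print(query_id, len(query_hits), best)
```

If a data set could contain a real target whose id is "*", pass a different --empty-target-id on the command line and the same value as empty_target_id here. For the tsv output, passing delimiter="\t" to csv.DictReader should be the only change needed.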