chemfp spherex

The “chemfp spherex” command-line tool implements the sphere exclusion selection algorithm. See Sphere exclusion for an example of (undirected) sphere exclusion from a set of references, and Directed sphere exclusion for an example of directed sphere exclusion.

This functionality is also available from Python using the high-level chemfp.spherex() function, with corresponding examples at Sphere exclusion and Directed sphere exclusion.

The rest of this chapter contains the output from chemfp spherex --help.

chemfp spherex command-line options

The following comes from chemfp spherex --help:

Usage: chemfp spherex [OPTIONS] CANDIDATES

  Diversity selection using the sphere exclusion algorithm.

Options:
  -t, --threshold FLOAT           Maximum similarity (default: 1.0)
  -n, --num-picks N               Number of picks (default: 'all')
  --dise                          Use directed sphere exclusion
  --dise-references FILENAME      DISE reference structures or fingerprints
                                  (default uses the Gobbi & Lee structures)
  --dise-format FORMAT            Format of the DISE reference file (default
                                  uses the file extension, else 'fps')
  --ranks PATH                    File containing fingerprint rank values
  --ranks-default N               Default rank value if candidate id not found
                                  in the ranks file (default: 2**32-1)
                                  [0<=x<=4294967296]
  --ranks-format FORMAT           Format for the ranks file (can be 'tsv' or a
                                  fingerprint format)
  --ranks-has-header / --ranks-no-header
                                  Skip the first line of the ranks file
  --in, --candidates-format TEXT  Format of the candidates file (default uses
                                  filename extension, or 'fps')
  --references PATH               Fingerprint file containing reference
                                  fingerprints to avoid (the fingerprints you
                                  have)
  --references-format FORMAT      Format of the references file (default uses
                                  filename extension, or 'fps')
  --pick-id STR                   Initial candidate id (if no reference file).
                                  Can be used more than once.
  --pick-id-file PATH             File containing initial candidate ids, one
                                  per line
  --randomize / --no-randomize    Use --randomize (default for undirected
                                  picking) to randomly pick from the available
                                  candidates, or --no-randomize (default for
                                  directed picking) to pick the candidate with
                                  the smallest arena index.
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --mmap / --no-mmap              Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (which is 8 for this
                                  computer), and can be set using
                                  $OMP_NUM_THREADS. 0 and 1 both mean single-
                                  threaded. (default: -1)
  --include-members / --no-members
                                  Include ids and scores for fingerprint
                                  members in each sphere
  --save-picks-format PATH        Specify the format for the picked
                                  fingerprints.
  --save-candidates PATH          Write remaining candidate fingerprints to
                                  the named file.
  --save-candidates-format FORMAT
                                  Specify the format for the remaining
                                  candidate fingerprints.
  --save-picks PATH               Write picked fingerprints to the named file.
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -o, --output PATH               Write output to the named file instead of
                                  stdout.
  --out TEXT                      Output format. Must be one of 'chemfp' (the
                                  default), 'csv', or 'tsv' with optional
                                  compression
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include header
                                  metadata in 'spherex' or 'centroid' output
                                  formats.
  --include-empty / --no-include-empty
                                  In csv and tsv format with --include-members,
                                  include picks with no hits (the default)
  --empty-hit-id TEXT             The hit id if --include-empty outputs a pick
                                  with no hits (default: '*')
  --empty-score TEXT              The score if --include-empty outputs a pick
                                  with no hits (default: 'NaN')
  --pick-time / --no-pick-time    Include the elapsed time for each pick
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2022-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --no-type-warnings              Do not show fingerprint type warnings.
  --show-type-warnings            Show fingerprint type warnings.
  --help                          Show this message and exit.

  Select diverse fingerprints using the sphere exclusion algorithm (Hudson et
  al. (1996) QSAR, https://doi.org/10.1002/qsar.19960150402) with optional
  ranking for directed sphere exclusion (Gobbi and Lee (2003) JCICS,
  https://doi.org/10.1021/ci025554v).

  This method iteratively picks `--num-picks` / `-n` fingerprints from a set of
  candidates such that the fingerprint is not within a given threshold of
  similarity to any previously selected fingerprint. The default `--threshold`
  of 1.0 means only fingerprints identical to an earlier pick are excluded.
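
The picking loop described above can be illustrated with a minimal pure-Python sketch. It is not chemfp's implementation: the `tanimoto` and `sphere_exclusion` names, the use of Python sets as fingerprints, and the choice to treat a similarity equal to the threshold as inside the sphere are all assumptions made for illustration.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(candidates, threshold, num_picks=None, seed=None):
    """Return a list of (center_id, [(member_id, score), ...]) spheres.

    candidates: dict mapping id -> set of on-bit indices (hypothetical layout).
    """
    rng = random.Random(seed)
    remaining = dict(candidates)
    spheres = []
    while remaining and (num_picks is None or len(spheres) < num_picks):
        # Undirected picking: choose the next sphere center at random.
        center_id = rng.choice(sorted(remaining))
        center_fp = remaining[center_id]
        # Assumption: a score at or above the threshold is inside the sphere.
        members = []
        for other_id, other_fp in remaining.items():
            score = tanimoto(center_fp, other_fp)
            if score >= threshold:
                members.append((other_id, score))
        # Remove the center and its members from the candidate pool.
        remaining.pop(center_id, None)
        for member_id, _ in members:
            remaining.pop(member_id, None)
        spheres.append((center_id, members))
    return spheres


fingerprints = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {10, 11},
    "d": {10, 11, 12},
    "e": {20},
}
spheres = sphere_exclusion(fingerprints, threshold=0.5, seed=42)
for center_id, members in spheres:
    print(center_id, len(members))
```

Whichever center is chosen first, no two picked centers in this toy data set end up at or above the threshold of each other, which is the property the algorithm is meant to guarantee.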

  = Undirected picking =

  When no ranks are specified, the fingerprints are picked at random from the
  remaining candidate fingerprints. Use `--no-randomize` to select the
  fingerprints in fingerprint index order, which is based on the number of
  bits set in the fingerprint.

  = Directed picking =

  In directed picking, the fingerprints are picked in rank order, from
  smallest rank to largest. If multiple fingerprints have the same rank then
  by default the first is used. Use `--randomize` to randomize the order in a
  rank.
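
As a rough illustration of rank-ordered picking, the sketch below visits candidates from smallest rank to largest and breaks ties by taking the first candidate in index order. The function name, the list-of-tuples candidate layout, and the `default_rank` parameter (modelled on the `--ranks-default` option) are assumptions for this example, not chemfp's API.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def directed_sphere_exclusion(candidates, ranks, threshold, default_rank=2**32 - 1):
    """Return the picked center ids, in pick order.

    candidates: list of (id, set of on-bit indices), in index order.
    ranks: dict mapping id -> rank; ids not listed get default_rank.
    """
    # Sort once by (rank, index) so ties go to the first candidate.
    order = sorted(
        range(len(candidates)),
        key=lambda i: (ranks.get(candidates[i][0], default_rank), i),
    )
    excluded = set()
    centers = []
    for i in order:
        if i in excluded:
            continue
        center_id, center_fp = candidates[i]
        centers.append(center_id)
        # Exclude every remaining candidate inside this sphere.
        for j, (_, other_fp) in enumerate(candidates):
            if j not in excluded and tanimoto(center_fp, other_fp) >= threshold:
                excluded.add(j)
        excluded.add(i)
    return centers


candidates = [
    ("a", {1, 2, 3, 4}),
    ("b", {1, 2, 3, 5}),
    ("c", {10, 11}),
    ("d", {10, 11, 12}),
    ("e", {20}),
]
ranks = {"b": 1, "d": 2, "a": 3}
picked = directed_sphere_exclusion(candidates, ranks, threshold=0.5)
print(picked)
```

Here "b" is picked first because it has the smallest rank, which removes "a" from consideration even though "a" also has an explicit rank.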

  There are several ways to specify the ranks. The `--dise` option uses the
  three SMILES from the DISE paper by Gobbi and Lee to generate reference
  fingerprints and rank the candidate fingerprints by successive similarity to
  the references. Use a `--dise-references` file to specify different
  reference structures or fingerprints.

  The ranks can be specified in a `--ranks` file, in one of several formats.
  The 'tsv' format contains two tab-separated columns and an optional header.
  The first column is the candidate fingerprint id, the second column is its
  associated rank, which must be an integer or float.

  The 'txt' format contains one id per line and an optional header. The rank
  is 1 for the first id, 2 for the second, and so on.

  If the ranks file is a fingerprint file then the rank is 1 for the id of the
  first fingerprint, 2 for the second, and so on.
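
The two text-based rank layouts described above could be read along the following lines. This is only a sketch: the function names are hypothetical, the ids are made up, and details such as skipping blank lines, converting every tsv rank with `float()`, and keeping the first position of a repeated id are assumptions rather than documented chemfp behavior.

```python
import io


def read_tsv_ranks(lines, has_header=False):
    """Parse 'id<TAB>rank' lines into a dict of id -> rank (as a float)."""
    ranks = {}
    for lineno, line in enumerate(lines):
        if has_header and lineno == 0:
            continue  # skip the header line
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        candidate_id, rank = line.split("\t")
        ranks[candidate_id] = float(rank)
    return ranks


def read_txt_ranks(lines, has_header=False):
    """Parse one id per line; the first id gets rank 1, the second 2, ..."""
    ranks = {}
    position = 0
    for lineno, line in enumerate(lines):
        if has_header and lineno == 0:
            continue
        candidate_id = line.rstrip("\r\n")
        if not candidate_id:
            continue
        position += 1
        # Assumption: a repeated id keeps its first position.
        ranks.setdefault(candidate_id, position)
    return ranks


tsv_ranks = read_tsv_ranks(
    io.StringIO("id\trank\nCHEMBL1\t3\nCHEMBL2\t1.5\n"), has_header=True
)
txt_ranks = read_txt_ranks(io.StringIO("CHEMBL9\nCHEMBL7\n"))
print(tsv_ranks)
print(txt_ranks)
```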

  = References =

  Use a `--references` fingerprint file to remove all candidate fingerprints
  which are within `--threshold` similarity of any of the reference
  fingerprints.
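
The effect of a references file can be pictured as a filtering step over the candidates. The sketch below uses Python sets as stand-in fingerprints; the function name is hypothetical, and treating a similarity equal to the threshold as "within" it is an assumption.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def remove_near_references(candidates, references, threshold):
    """Keep only the candidates that are below the threshold to every reference."""
    return {
        candidate_id: fp
        for candidate_id, fp in candidates.items()
        if all(tanimoto(fp, ref_fp) < threshold for ref_fp in references)
    }


candidates = {"a": {1, 2, 3, 4}, "c": {10, 11}, "e": {20}}
references = [{1, 2, 3, 5}]  # the fingerprints you already have
kept = remove_near_references(candidates, references, threshold=0.5)
print(sorted(kept))
```

Candidate "a" has a similarity of 0.6 to the single reference, so it is dropped, while "c" and "e" share no bits with it and remain available for picking.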

  = Initial picks =

  The initial picks can be specified by id (this cannot be combined with
  `--references`) either by using one `--pick-id` option per id, or using
  `--pick-id-file`, with one id per line.

  NOTE: `--pick-id-file` and a "txt"-formatted `--ranks` file are similar but
  not identical. When a pick id is specified, it is always included in the
  output, even if that fingerprint was included in an earlier picked sphere.
  (In that case its count is 0, because its sphere doesn't even include
  itself.) In addition, when only some pick ids are specified then the
  remaining ids are by default picked at random, while ids with no specified
  rank are by default picked in index order. (These can be changed with
  `--randomize` and `--no-randomize`.)

  = Output options =

  The picks can be saved in one of several `--out` output formats. The default
  "spherex" format writes the information about each sphere on a single line.
  By default this includes the center id and number of members in the sphere.
  Use `--include-members` to include the member ids and scores.

  The "centroid" format is similar to the "spherex" format, but with different
  column headers. This format matches the default "centroid" output format for
  the "chemfp butina" command, which should make it easy to swap one option in
  for the other. NOTE: a future version of spherex will likely default to
  "centroid" output.

  The "csv" and "tsv" formats print one sphere hit on each line, in comma- or
  tab-delimited columns. By default this is only the sphere center id and its
  count. With `--include-members` each line contains the sphere center id, the
  hit id, and its score. If a sphere contains no members (which may occur if a
  pick id is specified but the fingerprint is in another sphere) then a
  synthetic record is generated with an id of `--empty-hit-id` and score of
  `--empty-score`. Use `--no-include-empty` to skip this record.
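
The handling of empty spheres in the per-hit output can be sketched as follows. This is an illustration of the rule described above, not chemfp's writer: the function name is hypothetical, the header line is omitted because its exact column names are not given here, and the three-decimal score formatting is an assumption.

```python
import csv
import io


def sphere_rows(spheres, include_empty=True, empty_hit_id="*", empty_score="NaN"):
    """Yield (center_id, hit_id, score) rows, one per sphere member."""
    for center_id, members in spheres:
        if members:
            for hit_id, score in members:
                yield (center_id, hit_id, f"{score:.3f}")
        elif include_empty:
            # Synthetic record for a pick whose sphere has no members.
            yield (center_id, empty_hit_id, empty_score)


# Made-up data: sphere "z" is empty, as can happen for an explicit pick id.
spheres = [("a", [("a", 1.0), ("b", 0.6)]), ("z", [])]
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(sphere_rows(spheres))
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Passing `include_empty=False` to the sketch drops the synthetic record, mirroring what `--no-include-empty` is described as doing.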

  After sphere picking finishes, the remaining candidate fingerprints can be
  saved to the `--save-candidates` file.