chemfp spherex
The “chemfp spherex” command-line tool implements the sphere exclusion selection algorithm. See Sphere exclusion for an example of (undirected) sphere exclusion from a set of references, and Directed sphere exclusion for an example of directed sphere exclusion.
This functionality is also available from Python using the high-level chemfp.spherex() function, with corresponding examples at Sphere exclusion and Directed sphere exclusion.
The rest of this chapter contains the output from chemfp spherex --help.
chemfp spherex command-line options
The following comes from chemfp spherex --help:
Usage: chemfp spherex [OPTIONS] CANDIDATES
Diversity selection using the sphere exclusion algorithm.
Options:
  -t, --threshold FLOAT           Maximum similarity (default: 1.0)
  -n, --num-picks N               Number of picks (default: 'all')
  --dise                          Use directed sphere exclusion
  --dise-references FILENAME      DISE reference structures or fingerprints
                                  (default uses the Gobbi & Lee structures)
  --dise-format FORMAT            Format of the DISE reference file (default
                                  uses the file extension, else 'fps')
  --ranks PATH                    File containing fingerprint rank values
  --ranks-default N               Default rank value if candidate id not found
                                  in the ranks file (default: 2**32-1)
                                  [0<=x<=4294967296]
  --ranks-format FORMAT           Format for the ranks file (can be 'tsv' or a
                                  fingerprint format)
  --ranks-has-header / --ranks-no-header
                                  Skip the first line of the ranks file
  --in, --candidates-format TEXT  Format of the candidates file (default uses
                                  filename extension, or 'fps')
  --references PATH               Fingerprint file containing reference
                                  fingerprints to avoid (the fingerprints you
                                  have)
  --references-format FORMAT      Format of the references file (default uses
                                  filename extension, or 'fps')
  --pick-id STR                   Initial candidate id (if no reference file).
                                  Can be used more than once.
  --pick-id-file PATH             File containing initial candidate ids, one
                                  per line
  --randomize / --no-randomize    Use --randomize (default for undirected
                                  picking) to randomly pick from the available
                                  candidates, or --no-randomize (default for
                                  directed picking) to pick the candidate with
                                  the smallest arena index.
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --mmap / --no-mmap              Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (which is 8 for this
                                  computer), and can be set using
                                  $OMP_NUM_THREADS. 0 and 1 both mean single-
                                  threaded. (default: -1)
  --include-members / --no-members
                                  Include ids and scores for fingerprint
                                  members in each sphere
  --save-picks-format PATH        Specify the format for the picked
                                  fingerprints.
  --save-candidates PATH          Write remaining candidate fingerprints to
                                  the named file.
  --save-candidates-format FORMAT
                                  Specify the format for the remaining
                                  candidate fingerprints.
  --save-picks PATH               Write picked fingerprints to the named file.
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -o, --output PATH               Write output to the named file instead of
                                  stdout.
  --out TEXT                      Output format. Must be one of 'chemfp' (the
                                  default), 'csv', or 'tsv' with optional
                                  compression
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include header
                                  metadata in 'spherex' or 'centroid' output
                                  formats.
  --include-empty / --no-include-empty
                                  In csv and tsv format with --include-hits,
                                  include picks with no hits (the default)
  --empty-hit-id TEXT             The hit id if --include-empty outputs a pick
                                  with no hits (default: '*')
  --empty-score TEXT              The score if --include-empty outputs a pick
                                  with no hits (default: 'NaN')
  --pick-time / --no-pick-time    Include the elapsed time for each pick
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2022-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --no-type-warnings              Do not show fingerprint type warnings.
  --show-type-warnings            Show fingerprint type warnings.
  --help                          Show this message and exit.
Select diverse fingerprints using the sphere exclusion algorithm (Hudson et
al. (1996) QSAR, https://doi.org/10.1002/qsar.19960150402) with optional
ranking for directed sphere exclusion (Gobbi and Lee (2003) JCICS,
https://doi.org/10.1021/ci025554v).
This method iteratively picks `--num-picks` / `-n` fingerprints from a set of
candidates such that the fingerprint is not within a given threshold of
similarity to any previously selected fingerprint. The default `--threshold`
of 1.0 means only fingerprints identical to an earlier pick are excluded.
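The picking loop described above can be sketched in plain Python. This is an illustration only, not chemfp's implementation: fingerprints are stored as Python integers, the helper names `tanimoto` and `sphere_exclusion` are hypothetical, and the sketch assumes a candidate joins a sphere when its Tanimoto score to the center is at least the threshold.

```python
import random

def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def sphere_exclusion(candidates, threshold, num_picks=None, seed=None):
    """Return a list of (center_id, [(member_id, score), ...]) spheres.

    candidates maps an id to an int fingerprint. Assumption: a candidate
    belongs to a sphere when its score to the center is >= threshold.
    """
    rng = random.Random(seed)
    remaining = dict(candidates)
    picks = []
    while remaining and (num_picks is None or len(picks) < num_picks):
        # Undirected picking: choose the next center at random.
        center_id = rng.choice(sorted(remaining))
        center_fp = remaining.pop(center_id)
        members = [(center_id, 1.0)]  # the center is a member of its own sphere
        for other_id, other_fp in list(remaining.items()):
            score = tanimoto(center_fp, other_fp)
            if score >= threshold:
                # Exclude everything inside the sphere from later picks.
                members.append((other_id, score))
                del remaining[other_id]
        picks.append((center_id, members))
    return picks
```

With a threshold of 1.0 this sketch only merges identical fingerprints into one sphere, so every distinct fingerprint is eventually picked.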
= Undirected picking =
When no ranks are specified, the fingerprints are picked at random from the
remaining candidate fingerprints. Use `--no-randomize` to select the
fingerprints in fingerprint index order, which is based on the number of
bits set in the fingerprint.
= Directed picking =
In directed picking, the fingerprints are picked in rank order, from
smallest rank to largest. If multiple fingerprints have the same rank then
by default the first is used. Use `--randomize` to randomize the order in a
rank.
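Directed picking can be sketched the same way. The code below is a simplified illustration under stated assumptions, not chemfp's code: the function names are hypothetical, fingerprints are Python integers, ties in rank keep the input order (the "first is used" default), and ids without a rank get the documented `--ranks-default` value of 2**32-1.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def directed_sphere_exclusion(candidates, ranks, threshold,
                              default_rank=2**32 - 1):
    """Return [(center_id, [member_id, ...]), ...] picked in rank order."""
    # sorted() is stable, so candidates with equal rank keep their input order.
    order = sorted(candidates, key=lambda cid: ranks.get(cid, default_rank))
    excluded = set()
    picks = []
    for center_id in order:
        if center_id in excluded:
            continue  # already inside an earlier sphere
        excluded.add(center_id)
        members = [center_id]
        for other_id in order:
            if other_id in excluded:
                continue
            if tanimoto(candidates[center_id], candidates[other_id]) >= threshold:
                excluded.add(other_id)
                members.append(other_id)
        picks.append((center_id, members))
    return picks
```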
There are several ways to specify the ranks. The `--dise` option uses the
three SMILES from the DISE paper by Gobbi and Lee to generate reference
fingerprints and rank the candidate fingerprints by successive similarity to
the references. Use a `--dise-references` file to specify different
reference structures or fingerprints.
The ranks can be specified in a `--ranks` file, in one of several formats.
The 'tsv' format contains two tab-separated columns and an optional header.
The first column is the candidate fingerprint id, the second column is its
associated rank, which must be an integer or float.
The 'txt' format contains one id per line and an optional header. The rank
is 1 for the first id, 2 for the second, and so on.
If the ranks file is a fingerprint file then the rank is 1 for the id of the
first fingerprint, 2 for the second, and so on.
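As a rough illustration of the two text-based ranks layouts described above, the following sketch turns lines of each format into an id-to-rank mapping. The parser names are hypothetical and this is not how chemfp reads the files; it only mirrors the stated rules (two tab-separated columns for 'tsv', position-based ranks starting at 1 for 'txt', and an optional header line).

```python
def parse_tsv_ranks(lines, has_header=False):
    """Parse 'id<TAB>rank' lines into {id: rank}; ranks may be int or float."""
    ranks = {}
    for lineno, line in enumerate(lines):
        if has_header and lineno == 0:
            continue  # skip the header line
        line = line.rstrip("\n")
        if not line:
            continue
        candidate_id, rank_text = line.split("\t")
        ranks[candidate_id] = float(rank_text)
    return ranks

def parse_txt_ranks(lines, has_header=False):
    """Parse one id per line; the first id gets rank 1, the second 2, ..."""
    ids = [line.strip() for line in lines]
    if has_header:
        ids = ids[1:]
    ids = [candidate_id for candidate_id in ids if candidate_id]
    return {candidate_id: rank for rank, candidate_id in enumerate(ids, start=1)}
```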
= References =
Use a `--references` fingerprint file to remove all candidate fingerprints
which are within `--threshold` similarity of any of the reference
fingerprints.
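The reference filter amounts to dropping every candidate that is too close to something you already have. A minimal sketch, assuming integer fingerprints, a hypothetical helper name `remove_near_references`, and "within threshold" meaning a Tanimoto score at or above the threshold:

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def remove_near_references(candidates, references, threshold):
    """Keep only candidates whose score to every reference is below threshold."""
    return {
        candidate_id: fp
        for candidate_id, fp in candidates.items()
        if all(tanimoto(fp, ref_fp) < threshold for ref_fp in references)
    }
```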
= Initial picks =
The initial picks can be specified by id (this cannot be combined with
`--references`) either by using one `--pick-id` option per id, or using
`--pick-id-file`, with one id per line.
NOTE: `--pick-id-file` and a "txt"-formatted `--ranks` file are similar but
not identical. When a pick id is specified, it is always included in the
output, even if that fingerprint was included in an earlier picked sphere. (In
that case its count is 0, because its sphere doesn't even include itself.)
In addition, when only some pick ids are specified then the remaining ids by
default are picked at random, while ids with an unspecified rank are by
default picked in index order. (These can be changed with `--randomize` and
`--no-randomize`.)
= Output options =
The picks can be saved in one of several `--out` output formats. The default
"spherex" format writes the information about each sphere on a single line.
By default this includes the center id and number of members in the sphere.
Use `--include-members` to include the member ids and scores.
The "centroid" format is similar to the "spherex" format, but with different
column headers. This format matches the default "centroid" output format for
the "chemfp butina" command, which should make it easy to swap one option in
for the other. NOTE: a future version of spherex will likely default to
"centroid" output.
The "csv" and "tsv" formats print one sphere hit on each line, in comma- or
tab-delimited columns. By default this is only the sphere center id and its
counts. With `--include-members` each line contains the sphere center id, the
hit id, and its score. If a sphere contains no members (which may occur if a
pick id is specified but the fingerprint is in another sphere) then a
synthetic record is generated with an id of `--empty-hit-id` and score of
`--empty-score`. Use `--no-include-empty` to skip this record.
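To make the empty-sphere rule concrete, here is a small sketch of a one-hit-per-line writer. It is not chemfp's output code: the function name, the column header names, and the three-decimal score formatting are all assumptions made for illustration. Only the '*' and 'NaN' defaults come from the option descriptions.

```python
import csv
import io

def write_sphere_hits(spheres, outfile, delimiter=",", include_empty=True,
                      empty_hit_id="*", empty_score="NaN"):
    """Write one (center, hit, score) row per sphere member.

    spheres is a list of (center_id, [(hit_id, score), ...]) pairs.
    """
    writer = csv.writer(outfile, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["center_id", "hit_id", "score"])  # hypothetical header names
    for center_id, members in spheres:
        if not members:
            # A pick with no hits gets a synthetic row unless it is suppressed.
            if include_empty:
                writer.writerow([center_id, empty_hit_id, empty_score])
            continue
        for hit_id, score in members:
            writer.writerow([center_id, hit_id, f"{score:.3f}"])
```

Passing `delimiter="\t"` gives the tab-separated variant, and `include_empty=False` mirrors the effect of `--no-include-empty`.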
After sphere picking finishes, the remaining candidate fingerprints can be
saved to the `--save-candidates` file.