chemfp maxmin
The “chemfp maxmin” command-line tool implements the MaxMin diversity selection algorithm. See MaxMin diversity selection for an example of selecting diverse fingerprints from a set of candidates, and MaxMin diversity selection including references for an example of selecting diverse fingerprints from a set of candidates which are also diverse from a set of references.
This functionality is also available from Python using the high-level chemfp.maxmin() function, with an example at Select diverse fingerprints with MaxMin.
The rest of this chapter contains the output from chemfp maxmin --help and chemfp maxmin --help-formats.
chemfp maxmin command-line options
The following comes from chemfp maxmin --help:
Usage: chemfp maxmin [OPTIONS] CANDIDATES
  Diversity selection using the MaxMin algorithm.
Options:
  -n, --num-picks N               Number of picks (default: 'all')
  -t, --threshold FLOAT           Maximum similarity (default: 1.0)
  --all-equal                     Continue picking past --num-picks if the
                                  pick score is unchanged
  --pick-id STR                   Candidate id to use for the initial pick
                                  (default: use heapsweep)
  --pick-index INTEGER            Candidate index to use for the initial pick
                                  (default: use heapsweep)
  --in, --candidates-format TEXT  Format of the candidates file (default uses
                                  filename extension, or 'fps')
  --references PATH               Fingerprint file containing reference
                                  fingerprints to avoid (the fingerprints you
                                  have)
  --references-format FORMAT      Format of the references file (default uses
                                  filename extension, or 'fps')
  --randomize / --no-randomize    Use --randomize (the default) to shuffle the
                                  candidates before starting MaxMin
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --neighbors FILENAME            For each pick, includes the nearest neighbor
                                  and score from fingerprints in FILENAME
  --neighbors-format FORMAT       Format of the neighbors file (default uses
                                  filename extension, or 'fps')
  --mmap / --no-mmap              Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --save-picks PATH               Write picked fingerprints to the named file.
  --save-picks-format PATH        Specify the format for the picked
                                  fingerprints.
  --save-candidates PATH          Write remaining candidate fingerprints to
                                  the named file.
  --save-candidates-format FORMAT
                                  Specify the format for the remaining
                                  candidate fingerprints.
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -o, --output PATH               Write output to the named file instead of
                                  stdout.
  --out TEXT                      Output format. Must be one of 'diversity'
                                  (the default), 'csv', or 'tsv' with optional
                                  compression
  --pick-time / --no-pick-time    Include the elapsed time for each pick
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-formats                  Describe the output formats.
  --help                          Show this message and exit.
The MaxMin algorithm iteratively picks fingerprints from a set of candidates
such that the newly picked fingerprint has the smallest Tanimoto similarity
compared to any previously picked fingerprint, and optionally also the
smallest Tanimoto similarity to the reference fingerprints.
This process is repeated until `-n`/`--num-picks` fingerprints have been
picked, or until every remaining candidate is more than `-t`/`--threshold`
similar to some picked fingerprint, or until no candidates are left. For
example, to select all fingerprints with a maximum
Tanimoto score of 0.2 use `--threshold 0.2`. The fingerprints are
selected from the CANDIDATES file, which should be in FPS or FPB format with
optional compression. Use `--in` / `--candidates-format` to specify the
format, otherwise maxmin will infer it from the filename extension, or
default to "fps".
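The selection loop described above can be sketched in plain Python. This is a minimal, quadratic-time illustration and not chemfp's implementation: fingerprints are assumed to be stored as Python integers used as bitsets, and `tanimoto` and `maxmin_picks` are hypothetical helper names introduced only for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention assumed here for two all-zero fingerprints
    return bin(a & b).count("1") / union


def maxmin_picks(fingerprints, first_index=0, num_picks=None, threshold=1.0):
    """Return (index, score) pairs in pick order.

    The score of a pick is its largest similarity to any earlier pick.
    """
    if not fingerprints or num_picks == 0:
        return []
    # max_sim[i] = largest similarity between candidate i and the picks so far
    max_sim = {i: 0.0 for i in range(len(fingerprints)) if i != first_index}
    picks = [(first_index, 0.0)]
    newest = first_index
    while max_sim and (num_picks is None or len(picks) < num_picks):
        # Only the newest pick can raise a candidate's running maximum.
        for i in max_sim:
            score = tanimoto(fingerprints[i], fingerprints[newest])
            if score > max_sim[i]:
                max_sim[i] = score
        best = min(max_sim, key=max_sim.get)
        if max_sim[best] > threshold:
            break  # every remaining candidate is too similar to the picks
        picks.append((best, max_sim.pop(best)))
        newest = best
    return picks
```

With `threshold=1.0` and `num_picks=None` the loop runs until no candidates are left, which corresponds to the three stopping conditions listed above.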
By default the initial pick is selected using the heapsweep algorithm, which
finds a fingerprint with the smallest maximum Tanimoto to any other
fingerprint. Use `--pick-id` to specify the first fingerprint by id, or
`--pick-index` by index. (In practice fingerprint 0 is often the most
diverse fingerprint.)
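The criterion for the initial pick (smallest maximum Tanimoto to any other fingerprint) can be written as a brute-force search. The sketch below only illustrates the criterion; it is not the heapsweep algorithm, which the text above names but does not describe. Fingerprints are again assumed to be Python integers, and `most_isolated_index` is a hypothetical name.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention assumed here for two all-zero fingerprints
    return bin(a & b).count("1") / union


def most_isolated_index(fingerprints):
    """Return (index, score) for the fingerprint whose nearest neighbor
    is the least similar, using an O(N**2) comparison of all pairs."""
    if len(fingerprints) < 2:
        raise ValueError("need at least two fingerprints")
    best_index, best_score = None, None
    for i, fp in enumerate(fingerprints):
        # Largest similarity between fingerprint i and any other fingerprint.
        nearest = max(
            tanimoto(fp, other)
            for j, other in enumerate(fingerprints)
            if j != i
        )
        if best_score is None or nearest < best_score:
            best_index, best_score = i, nearest
    return best_index, best_score
```

Ties are broken here by taking the first index found, which is an arbitrary choice of this sketch.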
If `--references` is specified then any picked candidate fingerprint must
also be dissimilar from all of the fingerprints in the reference
fingerprints. The model behind the terms is that you want to pick diverse
fingerprints from a vendor catalog which are also diverse from your in-house
reference compounds. Use `--references-format` to specify the file format
instead of letting maxmin infer it from the filename extension.
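One way to picture the effect of references is to seed each candidate's running maximum similarity with its similarity to the reference set before any pick is made. The sketch below assumes integer bitset fingerprints and simply starts with the candidate least similar to the references; that choice of first pick, and the name `maxmin_with_references`, are assumptions of this example rather than documented chemfp behavior.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention assumed here for two all-zero fingerprints
    return bin(a & b).count("1") / union


def maxmin_with_references(candidates, references, num_picks):
    """Pick candidates that are dissimilar to the references and to each other.

    Returns (candidate index, score) pairs, where the score is the largest
    similarity to any reference or earlier pick.
    """
    # Start from each candidate's largest similarity to the reference set.
    max_sim = {
        i: max((tanimoto(fp, ref) for ref in references), default=0.0)
        for i, fp in enumerate(candidates)
    }
    picks = []
    while max_sim and len(picks) < num_picks:
        best = min(max_sim, key=max_sim.get)
        picks.append((best, max_sim.pop(best)))
        # The new pick now also counts as something to stay away from.
        for i in max_sim:
            score = tanimoto(candidates[i], candidates[best])
            if score > max_sim[i]:
                max_sim[i] = score
    return picks
```

Because the references contribute to the score from the start, the first pick can have a non-zero score in this sketch.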
The candidates are shuffled before the MaxMin algorithm starts, to give a
sense of how MaxMin is affected by arbitrary tie-breaking. Use
`--no-randomize` to disable shuffling, otherwise the default is to `--randomize`.
For reproducibility use `--seed` to specify the seed for the pseudo-random
number generator or -1 to use a random seed. The seed value used is
available in the output metadata.
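The reproducibility idea can be illustrated with Python's standard `random` module. This is only an analogy: chemfp's own random number generator is not described here, so the same seed will not give the same order as the command-line tool. `shuffled_order` is a hypothetical helper.

```python
import random


def shuffled_order(num_candidates, seed=-1):
    """Return (seed_used, order), where order is a shuffled list of indices.

    A seed of -1 means "pick one at random", mirroring the option above.
    """
    if seed == -1:
        # Draw a seed in the range 0 .. 2**64-1, inclusive.
        seed = random.SystemRandom().randrange(2**64)
    rng = random.Random(seed)
    order = list(range(num_candidates))
    rng.shuffle(order)
    # Returning the seed lets the caller record it and rerun the same shuffle.
    return seed, order
```

Calling `shuffled_order(10, seed=42)` twice gives the same order both times, while `seed=-1` reports the seed it chose so the run can be repeated later.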
If the `--neighbors` fingerprint file is specified then for each pick maxmin
will search for a 1-nearest neighbor and include the neighbor's id and score
in the output.
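A 1-nearest-neighbor lookup for a single pick can be sketched as a linear scan. The example assumes integer bitset fingerprints and a list of `(id, fingerprint)` pairs standing in for the neighbors file; `nearest_neighbor` is a hypothetical name and the scan is not how chemfp performs the search.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention assumed here for two all-zero fingerprints
    return bin(a & b).count("1") / union


def nearest_neighbor(pick_fp, neighbors):
    """Return (neighbor_id, neighbor_score) for the most similar neighbor.

    `neighbors` is a list of (id, fingerprint) pairs. Returns (None, None)
    if the list is empty.
    """
    best_id, best_score = None, None
    for neighbor_id, fp in neighbors:
        score = tanimoto(pick_fp, fp)
        if best_score is None or score > best_score:
            best_id, best_score = neighbor_id, score
    return best_id, best_score
```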
Use `-o`/`--output` to write the pick information to a file instead of to
stdout. The three supported formats are "diversity" (the default), "csv",
and "tsv". Use `--out` to specify the format, otherwise it will be inferred
from the filename extension, and default to "diversity". The `--date` and
`--no-date` options affect the "diversity" metadata.
See `--help-formats` for details.
By default the Tanimoto scores will be formatted with the minimum number of
digits needed to distinguish every possible Tanimoto score for the given bit
size. Use `--precision` to change that value.
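One plausible way to derive such a default is from the smallest gap between two distinct Tanimoto scores. For fingerprints with n bits, scores are fractions a/b with b at most n, and the closest distinct pair is 1/(n-1) and 1/n, which differ by 1/(n(n-1)). The sketch below picks the fewest digits whose step size does not exceed that gap. This rule is an assumption, not chemfp's documented formula, although it agrees with the sample outputs in this chapter (7 digits for 2048 bits, 4 digits for 64 bits). `default_precision` is a hypothetical name.

```python
def default_precision(num_bits):
    """Fewest decimal digits that keep all distinct Tanimoto scores distinct.

    Assumes the smallest gap between two scores is 1/(num_bits*(num_bits-1)).
    """
    if num_bits < 2:
        return 1
    inverse_gap = num_bits * (num_bits - 1)
    digits = 1
    # Integer arithmetic avoids floating-point rounding in the comparison.
    while 10**digits < inverse_gap:
        digits += 1
    return digits
```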
Use `--pick-time` to also include the total elapsed time needed to make each
pick.
Use `--save-picks` to write the picked fingerprints to a file, in
`--save-picks-format` format if specified, otherwise based on the filename
extension, or default to "fps".
Use `--save-candidates` to write the remaining (unpicked) candidates to a
file, in `--save-candidates-format` format if specified, otherwise based on
the filename extension, or default to "fps".
A progress bar will be shown unless the output is a terminal. Use
`--progress` to always include a progress bar, or `--no-progress` to disable
the progress bar. Alternatively set $CHEMFP_PROGRESS to "on", "off", or the
number of seconds to delay until showing a progress bar.
Examples:
1) Find the 5 most diverse fingerprints in ChEMBL 33
% chemfp maxmin chembl_33.fpb -n 5
#Diversity/1
#num_bits=2048
#type=maxmin threshold=1.0 num-picks=5 all-equal=0 randomize=1 seed=4011161669
#software=chemfp/4.2
#candidates=chembl_33.fpb
#date=2024-05-31T11:53:42
i pick_id score
1 CHEMBL2105487 0.0000000
2 CHEMBL3690458 0.0000000
3 CHEMBL4300465 0.0000000
4 CHEMBL2227836 0.0000000
5 CHEMBL1200718 0.0000000
2) Find the 5 most diverse fingerprints in ChEMBL 33 which are also diverse
from ChEMBL 32, report them in csv format, and include timing details for
each pick and the overall process:
% chemfp maxmin chembl_33.fpb --references chembl_32.fpb -n 5 --out csv \
--pick-time --times
pick_id,score,pick_time
CHEMBL5172589,0.2250000,24.90
CHEMBL5183404,0.2315789,31.40
CHEMBL5189138,0.2361111,36.65
CHEMBL5170888,0.2500000,59.62
CHEMBL5190323,0.2526316,61.99
T_init: 0.05 T_pick: 61.99 #picks: 5 picks/s: 0.08 T_total: 62.06
Supported maxmin formats
The following comes from chemfp maxmin --help-formats:
The "chemfp maxmin" command supports three output formats: "diversity", "csv",
and "tsv".
The "diversity" format follows the same form as the FPS format. It is
line-oriented with a header followed by the picks.
% chemfp maxmin distinct.fps -n 2
#Diversity/1
#num_bits=64
#type=maxmin threshold=1.0 num-picks=2 all-equal=0 randomize=1 seed=4211965637
#software=chemfp/4.2
#candidates=distinct.fps
#date=2024-05-31T11:35:47
i pick_id score
1 id1 0.0000
2 id5 0.0513
The header contains a "magic" line describing the format and version, followed
by key/value metadata fields. The "type" contains the parameters used to
generate the results. The other lines should be self-explanatory. Use
`--no-date` to exclude the "date" line, or `--date` to specify a given date.
The pick results come after the header, in tab-separated format. The first
line contains the column headers, followed by the data values.
The columns depend on what maxmin options are used. The default columns are:
* i - the pick number, starting at 1
* pick_id - the identifier for the picked fingerprint
* score - the maximum Tanimoto similarity between the pick and any previous pick
In a `--neighbors` search, the two additional columns are:
* neighbor_id - the identifier for the selected nearest fingerprint
* neighbor_score - the Tanimoto similarity with that fingerprint
If `--pick-time` is included then the additional column is:
* pick_time - the total elapsed time since the start of processing, in seconds
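The header and column layout described above can be read with a few lines of Python. The following is a minimal sketch based only on the structure shown in this section (a magic line, `#key=value` metadata lines, then tab-separated column headers and rows); `parse_diversity` is a hypothetical helper and not part of chemfp.

```python
def parse_diversity(text):
    """Parse "diversity" output into (metadata, columns, rows).

    metadata maps header keys to string values, columns is the list of
    column names, and rows is a list of dicts keyed by column name.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#Diversity/"):
        raise ValueError("missing '#Diversity/' magic line")
    metadata = {}
    columns = None
    rows = []
    for line in lines[1:]:
        if not line:
            continue
        if columns is None and line.startswith("#"):
            # Header metadata: split on the first "=" only, so that values
            # such as the "type" parameters keep their own "=" characters.
            key, _, value = line[1:].partition("=")
            metadata[key] = value
        elif columns is None:
            columns = line.split("\t")
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return metadata, columns, rows
```

All values are returned as strings; converting `score` to a float or `i` to an int is left to the caller because the set of columns depends on the options used.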
The "csv" and "tsv" formats contain only the pick information, as
comma-separated or tab-separated values, formatted for import by Excel and other
spreadsheets. The "i" column is omitted.
% chemfp maxmin distinct.fps -n 2 --out csv
pick_id,score
id1,0.0000
id5,0.0513
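Output in the "csv" format can be loaded with Python's standard `csv` module. The sketch below assumes the default columns (`pick_id` and `score`) shown in the example above; `read_picks_csv` is a hypothetical helper name.

```python
import csv
import io


def read_picks_csv(text):
    """Return a list of (pick_id, score) tuples from maxmin csv output."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["pick_id"], float(row["score"])) for row in reader]
```

For the "tsv" format the same approach should work by passing `delimiter="\t"` to `csv.DictReader`.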