chemfp maxmin

The “chemfp maxmin” command-line tool implements the MaxMin diversity selection algorithm. See MaxMin diversity selection for an example of selecting diverse fingerprints from a set of candidates, and MaxMin diversity selection including references for an example of selecting diverse fingerprints from a set of candidates which are also diverse from a set of references.

This functionality is also available from Python using the high-level chemfp.maxmin() function, with an example at Select diverse fingerprints with MaxMin.
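
For scripts that drive the command-line tool rather than the Python API, the following is a minimal sketch of assembling a `chemfp maxmin` argument list. The helper `build_maxmin_argv` and its parameter names are hypothetical and not part of chemfp; it uses only options that appear in the `--help` output in this chapter, and the seed value is arbitrary.

```python
import shlex


def build_maxmin_argv(candidates, num_picks=None, threshold=None,
                      references=None, seed=None, out_format=None,
                      output=None, include_date=True):
    """Hypothetical helper: build an argument list for `chemfp maxmin`.

    Only flags documented in `chemfp maxmin --help` are emitted.
    """
    argv = ["chemfp", "maxmin", candidates]
    if num_picks is not None:
        argv += ["--num-picks", str(num_picks)]
    if threshold is not None:
        argv += ["--threshold", str(threshold)]
    if references is not None:
        argv += ["--references", references]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if out_format is not None:
        argv += ["--out", out_format]
    if output is not None:
        argv += ["--output", output]
    if not include_date:
        argv.append("--no-date")
    return argv


# A fixed seed and no 'date' metadata, so repeated runs can be compared.
argv = build_maxmin_argv("chembl_33.fpb", num_picks=5, seed=12345,
                         include_date=False)
print(shlex.join(argv))
# prints: chemfp maxmin chembl_33.fpb --num-picks 5 --seed 12345 --no-date

# To actually run it (requires chemfp to be installed):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual filenames; `shlex.join` is used here only to display the equivalent shell command.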

The rest of this chapter contains the output from chemfp maxmin --help and chemfp maxmin --help-formats.

chemfp maxmin command-line options

The following comes from chemfp maxmin --help:

Usage: chemfp maxmin [OPTIONS] CANDIDATES

  Diversity selection using the MaxMin algorithm.

Options:
  -n, --num-picks N               Number of picks (default: 'all')
  -t, --threshold FLOAT           Maximum similarity (default: 1.0)
  --all-equal                     Continue picking past --num-picks if the
                                  pick score is unchanged
  --pick-id STR                   Candidate id to use for the initial pick
                                  (default: use heapsweep)
  --pick-index INTEGER            Candidate index to use for the initial pick
                                  (default: use heapsweep)
  --in, --candidates-format TEXT  Format of the candidates file (default uses
                                  filename extension, or 'fps')
  --references PATH               Fingerprint file containing reference
                                  fingerprints to avoid (the fingerprints you
                                  have)
  --references-format FORMAT      Format of the references file (default uses
                                  filename extension, or 'fps')
  --randomize / --no-randomize    Use --randomize (the default) to shuffle the
                                  candidates before starting MaxMin
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --neighbors FILENAME            For each pick, includes the nearest neighbor
                                  and score from fingerprints in FILENAME
  --neighbors-format FORMAT       Format of the neighbors file (default uses
                                  filename extension, or 'fps')
  --mmap / --no-mmap              Use --no-mmap to read uncompressed FPB files
                                  without mmap. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --save-picks PATH               Write picked fingerprints to the named file.
  --save-picks-format PATH        Specify the format for the picked
                                  fingerprints.
  --save-candidates PATH          Write remaining candidate fingerprints to
                                  the named file.
  --save-candidates-format FORMAT
                                  Specify the format for the remaining
                                  candidate fingerprints.
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -o, --output PATH               Write output to the named file instead of
                                  stdout.
  --out TEXT                      Output format. Must be one of 'diversity'
                                  (the default), 'csv', or 'tsv' with optional
                                  compression
  --pick-time / --no-pick-time    Include the elapsed time for each pick
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-formats                  Describe the output formats.
  --help                          Show this message and exit.

  The MaxMin algorithm iteratively picks fingerprints from a set of candidates
  such that each newly picked fingerprint has the smallest maximum Tanimoto
  similarity to the previously picked fingerprints and, optionally, to the
  reference fingerprints.

  This process is repeated until `-n`/`--num-picks` fingerprints have been
  picked, or until every remaining candidate has a similarity greater than
  `-t`/`--threshold` to at least one picked fingerprint, or until no
  candidates are left. For example, to select all fingerprints with a maximum
  Tanimoto score of 0.2 use `--threshold 0.2`.

  The fingerprints are selected from the CANDIDATES file, which should be in
  FPS or FPB format with optional compression. Use `--in` /
  `--candidates-format` to specify the format, otherwise maxmin will infer it
  from the filename extension, or default to "fps".

  By default the initial pick is selected using the heapsweep algorithm, which
  finds a fingerprint with the smallest maximum Tanimoto to any other
  fingerprint. Use `--pick-id` to specify the first fingerprint by id, or
  `--pick-index` by index. (In practice fingerprint 0 is often the most
  diverse fingerprint.)

  If `--references` is specified then any picked candidate fingerprint must
  also be dissimilar from all of the reference fingerprints. The model behind
  the terms is that you want to pick diverse fingerprints from a vendor
  catalog which are also diverse from your in-house reference compounds. Use
  `--references-format` to specify the file format instead of letting maxmin
  infer it from the filename extension.

  The candidates are shuffled before the MaxMin algorithm starts, to give a
  sense of how MaxMin is affected by arbitrary tie-breaking. Use
  `--no-randomize` to disable shuffling, otherwise the default is to
  `--randomize`. For reproducibility use `--seed` to specify the seed for the
  pseudo-random number generator, or -1 to use a random seed. The seed value
  used is available in the output metadata.

  If the `--neighbors` fingerprint file is specified then for each pick maxmin
  will search for a 1-nearest neighbor and include the neighbor's id and score
  in the output.

  Use `-o`/`--output` to write the pick information to a file instead of to
  stdout. The three supported formats are "diversity" (the default), "csv",
  and "tsv". Use `--out` to specify the format, otherwise it will be inferred
  from the filename extension, and default to "diversity". The `--date` and
  `--no-date` options affect the "diversity" metadata.

  See `--help-formats` for details.

  By default the Tanimoto scores will be formatted with the minimum number of
  digits needed to distinguish every possible Tanimoto score for the given bit
  size. Use `--precision` to change that value.

  Use `--pick-time` to also include the total elapsed time needed to make each
  pick.

  Use `--save-picks` to write the picked fingerprints to a file, in
  `--save-picks-format` format if specified, otherwise based on the filename
  extension, or default to "fps".

  Use `--save-candidates` to write the remaining (unpicked) candidates to a
  file, in `--save-candidates-format` format if specified, otherwise based on
  the filename extension, or default to "fps".

  A progress bar will be shown unless the output is a terminal. Use
  `--progress` to always include a progress bar, or `--no-progress` to disable
  the progress bar. Alternatively set $CHEMFP_PROGRESS to "on", "off", or the
  number of seconds to delay until showing a progress bar.

  Examples:

  1) Find the 5 most diverse fingerprints in ChEMBL 33:

    % chemfp maxmin chembl_33.fpb -n 5
    #Diversity/1
    #num_bits=2048
    #type=maxmin threshold=1.0 num-picks=5 all-equal=0 randomize=1 seed=4011161669
    #software=chemfp/4.2
    #candidates=chembl_33.fpb
    #date=2024-05-31T11:53:42
    i     pick_id score
    1     CHEMBL2105487   0.0000000
    2     CHEMBL3690458   0.0000000
    3     CHEMBL4300465   0.0000000
    4     CHEMBL2227836   0.0000000
    5     CHEMBL1200718   0.0000000

  2) Find the 5 most diverse fingerprints in ChEMBL 33 which are also diverse
  from ChEMBL 32, report them in csv format, and include timing details for
  each pick and the overall process:

    % chemfp maxmin chembl_33.fpb --references chembl_32.fpb -n 5 --out csv \
           --pick-time --times
    pick_id,score,pick_time
    CHEMBL5172589,0.2250000,24.90
    CHEMBL5183404,0.2315789,31.40
    CHEMBL5189138,0.2361111,36.65
    CHEMBL5170888,0.2500000,59.62
    CHEMBL5190323,0.2526316,61.99
    T_init: 0.05 T_pick: 61.99 #picks: 5 picks/s: 0.08 T_total: 62.06
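
The selection loop described in the help text above can be illustrated with a naive pure-Python sketch. It is not chemfp's implementation. Fingerprints are modelled as Python integers, and the names `tanimoto` and `naive_maxmin` are hypothetical. Without references the first pick is taken by index (similar in spirit to `--pick-index`) rather than by heapsweep, and candidates are not shuffled. The sketch also assumes that a candidate is rejected only when its score is strictly greater than the threshold.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def naive_maxmin(candidates, num_picks=None, threshold=1.0,
                 references=(), first_index=0):
    """Naive MaxMin over a list of (id, fingerprint) pairs.

    Returns a list of (id, score), where score is the pick's highest
    similarity to any earlier pick or reference at the time it was picked.
    """
    remaining = list(candidates)
    # max_sim[i] = highest similarity of remaining[i] to anything to avoid
    max_sim = [max((tanimoto(fp, ref) for ref in references), default=0.0)
               for _, fp in remaining]
    picks = []
    while remaining and (num_picks is None or len(picks) < num_picks):
        if not picks and not references:
            best = first_index  # assumption: caller chooses the first pick
        else:
            best = min(range(len(remaining)), key=max_sim.__getitem__)
        score = max_sim[best]
        if score > threshold:
            break  # every remaining candidate is too similar
        pick_id, pick_fp = remaining.pop(best)
        max_sim.pop(best)
        picks.append((pick_id, score))
        # Update each remaining candidate against the new pick.
        for i, (_, fp) in enumerate(remaining):
            sim = tanimoto(fp, pick_fp)
            if sim > max_sim[i]:
                max_sim[i] = sim
    return picks


fingerprints = [("A", 0b00001111), ("B", 0b00001110),
                ("C", 0b11110000), ("D", 0b00111100)]
for pick_id, score in naive_maxmin(fingerprints):
    print(f"{pick_id}\t{score:.4f}")
# prints:
# A	0.0000
# C	0.0000
# D	0.3333
# B	0.7500
```

Each pick costs one pass over the remaining candidates, so the sketch is only suitable for small inputs. Passing `threshold=0.5` to the same data stops after A, C and D, because the only remaining candidate, B, has a similarity of 0.75 to A.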

Supported maxmin formats

The following comes from chemfp maxmin --help-formats:

The "chemfp maxmin" command supports three output formats: "diversity", "csv",
and "tsv".

The "diversity" format follows the same form as the FPS format. It is line-
oriented with a header followed by the picks.

  % chemfp maxmin distinct.fps -n 2
  #Diversity/1
  #num_bits=64
  #type=maxmin threshold=1.0 num-picks=2 all-equal=0 randomize=1 seed=4211965637
  #software=chemfp/4.2
  #candidates=distinct.fps
  #date=2024-05-31T11:35:47
  i     pick_id score
  1     id1     0.0000
  2     id5     0.0513

The header contains a "magic" line describing the format and version, followed
by key/value metadata fields. The "type" contains the parameters used to
generate the results. The other lines should be self-explanatory. Use
`--no-date` to exclude the "date" line, or `--date` to specify a given date.

The pick results come after the header, in tab-separated format. The first
line contains the column headers, followed by the data values.

The columns depend on what maxmin options are used. The default columns are:

  * i - the pick number, starting at 1
  * pick_id - the identifier for the picked fingerprint
  * score - the maximum Tanimoto similarity between the pick and any previous pick

In a `--neighbors` search, the two additional columns are:

  * neighbor_id - the identifier for the selected nearest fingerprint
  * neighbor_score - the Tanimoto similarity with that fingerprint

If `--pick-time` is included then the additional column is:

  * pick_time - the total elapsed time since the start of processing, in seconds

The "csv" and "tsv" formats contain only the pick information, as comma-
separated or tab-separated values, formatted for import by Excel and other
spreadsheets. The "i" column is omitted.

  % chemfp maxmin distinct.fps -n 2 --out csv
  pick_id,score
  id1,0.0000
  id5,0.0513
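
As an illustration of the "diversity" layout described above, here is a minimal parsing sketch using only the Python standard library. The function name `parse_diversity` is hypothetical and not part of chemfp, and the sketch assumes the file looks like the example shown earlier: a `#Diversity/` magic line, `#key=value` header lines, then tab-separated columns.

```python
import csv


def parse_diversity(text):
    """Split "diversity" output into (version, metadata, picks).

    metadata maps header keys to their raw string values, and picks is a
    list of dicts keyed by the column names, with all values kept as strings.
    """
    lines = text.splitlines()
    magic = "#Diversity/"
    if not lines or not lines[0].startswith(magic):
        raise ValueError("missing '#Diversity/' magic line")
    version = lines[0][len(magic):]

    metadata = {}
    pos = 1
    while pos < len(lines) and lines[pos].startswith("#"):
        key, _, value = lines[pos][1:].partition("=")
        metadata[key] = value
        pos += 1

    # The first non-header line holds the column names.
    picks = list(csv.DictReader(lines[pos:], delimiter="\t"))
    return version, metadata, picks


sample = "\n".join([
    "#Diversity/1",
    "#num_bits=64",
    "#type=maxmin threshold=1.0 num-picks=2 all-equal=0 randomize=1 seed=4211965637",
    "#software=chemfp/4.2",
    "#candidates=distinct.fps",
    "#date=2024-05-31T11:35:47",
    "i\tpick_id\tscore",
    "1\tid1\t0.0000",
    "2\tid5\t0.0513",
]) + "\n"

version, metadata, picks = parse_diversity(sample)
print(version, metadata["num_bits"], [p["pick_id"] for p in picks])
# prints: 1 64 ['id1', 'id5']
```

Because the columns are looked up by name, the same sketch also handles the extra `neighbor_id`, `neighbor_score`, and `pick_time` columns when they are present. The "csv" and "tsv" outputs have no header block and can be read directly with `csv.DictReader`.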