chemfp heapsweep

The “chemfp heapsweep” command-line tool implements the heapsweep diversity selection algorithm. See Heapsweep diversity selection for an example of selecting diverse fingerprints from a set of references.

This functionality is also available from Python using the high-level chemfp.heapsweep() function, with an example at Select diverse fingerprints with Heapsweep.

The main use case is to find the globally most diverse fingerprint, or the few most diverse fingerprints, in a dataset. Heapsweep can keep picking beyond those, but I'm not sure the additional picks are scientifically useful.
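To make "globally most diverse" concrete, here is a minimal brute-force sketch in plain Python. It assumes that the most diverse fingerprint is the one whose nearest neighbor has the lowest Tanimoto similarity, and it stores each fingerprint as a Python integer. It is not chemfp's implementation and does not use the chemfp API. The names `popcount`, `tanimoto`, and `most_diverse` are hypothetical helpers for illustration only.

```python
def popcount(x: int) -> int:
    """Count the number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    if union == 0:
        # Convention used in this sketch: two empty fingerprints score 0.0.
        return 0.0
    return popcount(a & b) / union


def most_diverse(fps):
    """Return (index, score) for the fingerprint whose nearest neighbor
    has the lowest similarity (assumed definition of 'most diverse').

    Brute force: compares every fingerprint against every other one.
    """
    if len(fps) < 2:
        raise ValueError("need at least two fingerprints")
    best_index, best_score = None, None
    for i, fp in enumerate(fps):
        # Similarity to the nearest neighbor among all other fingerprints.
        nearest = max(tanimoto(fp, other)
                      for j, other in enumerate(fps) if j != i)
        if best_score is None or nearest < best_score:
            best_index, best_score = i, nearest
    return best_index, best_score


# Four toy 8-bit fingerprints; the last one shares almost no bits with the rest.
fingerprints = [0b11110000, 0b11100000, 0b11111000, 0b00001111]
index, score = most_diverse(fingerprints)
print(index, score)
```

The brute-force search needs a number of comparisons that grows with the square of the dataset size, so it is only useful for checking the idea on small inputs.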

The rest of this chapter contains the output from chemfp heapsweep --help.

chemfp heapsweep command-line options

The following comes from chemfp heapsweep --help:

Usage: chemfp heapsweep [OPTIONS] CANDIDATES

  Diversity selection using the heapsweep algorithm.

Options:
  -t, --threshold FLOAT           Maximum similarity (default: 1.0)
  -n, --num-picks N               Number of picks (default: 'all')
  --all-equal                     Continue picking past --num-picks if the
                                  pick score is unchanged
  --in, --candidates-format TEXT  Format of the candidates file (default uses
                                  filename extension, or 'fps')
  --randomize / --no-randomize    Use --randomize (the default) to shuffle the
                                  candidates before starting MaxMin
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --mmap / --no-mmap              Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --neighbors FILENAME            For each pick, includes the nearest neighbor
                                  and score from fingerprints in FILENAME
  --neighbors-format FORMAT       Format of the neighbors file (default uses
                                  filename extension, or 'fps')
  --save-picks PATH               Write picked fingerprints to the named file.
  --save-picks-format PATH        Specify the format for the picked
                                  fingerprints.
  --save-candidates PATH          Write remaining candidate fingerprints to
                                  the named file.
  --save-candidates-format FORMAT
                                  Specify the format for the remaining
                                  candidate fingerprints.
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  -o, --output PATH               Write output to the named file instead of
                                  stdout.
  --out TEXT                      Output format. Must be one of 'diversity'
                                  (the default), 'csv', or 'tsv' with optional
                                  compression
  --pick-time / --no-pick-time    Include the elapsed time for each pick
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help                          Show this message and exit.
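As an illustration of how some of these options combine, the following sketch builds a command line in Python without running it. It uses only options listed in the help text above (`--num-picks`, `--seed`, `--date`, `--output`). The helper name `build_heapsweep_command` and the filenames `candidates.fps` and `picks.diversity` are hypothetical.

```python
import shlex
from datetime import datetime


def build_heapsweep_command(candidates, num_picks=1, seed=None,
                            date=None, output=None):
    """Build an argument list for 'chemfp heapsweep' (hypothetical helper)."""
    args = ["chemfp", "heapsweep", "--num-picks", str(num_picks)]
    if seed is not None:
        # The help text gives the valid range as 0 to 2**64-1, or -1 for random.
        if not (-1 <= seed <= 2**64 - 1):
            raise ValueError("seed must be -1 or between 0 and 2**64-1")
        args += ["--seed", str(seed)]
    if date is not None:
        # ISO 8601 with second resolution, like '2025-02-07T11:10:15'.
        args += ["--date", date.isoformat(timespec="seconds")]
    if output is not None:
        args += ["--output", output]
    args.append(candidates)
    return args


command = build_heapsweep_command(
    "candidates.fps",          # hypothetical input file
    num_picks=1,
    seed=12345,
    date=datetime(2025, 2, 7, 11, 10, 15),
    output="picks.diversity",  # hypothetical output file
)
print(shlex.join(command))
```

Setting both `--seed` and `--date` should make repeated runs easier to compare, since the shuffle order and the header date no longer change between runs. To run the command, the list could be passed to `subprocess.run(command, check=True)` on a system where chemfp is installed.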