chemfp butina

The “chemfp butina” command-line tool implements Butina clustering (also called Taylor-Butina clustering). See Butina on the command-line for an example. The chemfp 4.1 release notes include additional examples.

This functionality is also available from Python using the high-level chemfp.butina() function, with an example at Butina clustering.

The rest of this chapter contains the output from chemfp butina --help.

chemfp butina command-line options

The following comes from chemfp butina --help:

Usage: chemfp butina [OPTIONS] FILENAME

  Cluster using the Butina/leader-follower algorithm.

  If FILENAME is not specified then read the fingerprints or similarity matrix
  from stdin.

Options:
  --in [fps|fps.gz|fps.zst|fpb|flush|npz]
                                  Specify the input file format, either
                                  fingerprint or similarity matrix (default is
                                  based on filename extension, or 'fps')
  --matrix FILE
  --matrix-format [npz]           File format for --matrix (only 'npz' is
                                  supported)
  -o, --output PATH               Output filename
  --out TEXT                      Output format. Must be one of 'centroid'
                                  (the default), 'csv', 'tsv', or 'flat', with
                                  optional compression
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (which is 8 for this
                                  computer), and can be set using
                                  $OMP_NUM_THREADS. 0 and 1 both mean single-
                                  threaded. (default: -1)
  --precision [1|2|3|4|5|6|7|8|9|10]
                                  Number of digits in Tanimoto score (default:
                                  based on the fingerprint size)
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  -t, --NxN-threshold, --threshold FLOAT
                                  Threshold when generating the NxN similarity
                                  matrix from fingerprints (default: 0.7)
  --seed N                        Specify the random number generator seed
                                  between 0 and 2**64-1, inclusive, or use -1
                                  to have one picked at random (default: -1)
  --include-members / --no-members
                                  The default writes all cluster members. With
                                  --no-members only write the cluster centers.
  --rescore / --no-rescore        Rescore moved false singletons and merged
                                  fingerprints to their new cluster center
  --renumber / --no-renumber      By default, use sequential cluster ids
                                  starting from 1. With --no-renumber use the
                                  internal cluster ids.
  --rename / --no-rename          Use --no-rename to use the internal member
                                  type names instead of renaming them to use
                                  only 'CENTER' and 'MEMBER'
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include header
                                  metadata in 'chemfp' and 'flat' output
                                  formats.
  --times / --no-times            Write timing information to stderr
  -d, --debug                     Print debug information to stderr. Use twice
                                  for more debug output.
  --help                          Show this message and exit.

NxN matrix options (for fingerprint input):
  --save-matrix, --save FILE  If specified, save the intermediate NxN matrix
                              to the named file
  --save-format [npz]         File format for --save-matrix (only 'npz' is
                              supported)

Butina clustering options:
  --tiebreaker [randomize|first|last]
                                  When multiple candidates have the same
                                  number of neighbors, 'randomize' picks the
                                  next cluster center at random while 'first'
                                  and 'last' pick the next candidate in
                                  increasing or decreasing index order.
  -n, --num-clusters N            After clustering, merge the members of the
                                  smallest cluster into other clusters until
                                  there are only N clusters  [x>=1]
  --butina-threshold FLOAT        Minimum Butina cluster threshold (default:
                                  0.0, uses the threshold from the similarity
                                  matrix)
  --false-singletons, --fs [keep|follow-neighbor|nearest-center]
                                  If 'follow-neighbor' (the default) move each
                                  false singleton to the cluster of its
                                  nearest neighbor. If 'nearest-center' move
                                  it to the closest center (requires
                                  fingerprints). If 'keep' leave it as a
                                  singleton group.

  This program implements several variations of the Butina clustering method
  described in Darko Butina's "Unsupervised Data Base Clustering Based on
  Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
  Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci. 1999, 39,
  747-750.

  The general approach is:

  1) Generate an NxN Tanimoto similarity matrix for a given threshold

  2) Sort rows by the number of neighbors in each row, from the most
  neighbors to the fewest.

  By default chemfp will randomize the order of the rows for a given number of
  neighbors. This means that re-running Butina clustering will almost
  certainly give different results. Either specify the initial RNG seed using
  `--seed` or use a tiebreaker of 'first' or 'last' to use a fixed sort order.

  3) Apply sphere exclusion, in sorted order, to the sorted rows. The first
  row is the center of the first cluster, and its neighbors are members of the
  first cluster.

  4) Repeat the process until done.

  Each fingerprint is assigned to exactly one cluster. Once assigned, it will
  not be used to create a new cluster center, nor be added to another cluster,
  even if it is sufficiently similar.

  = False Singletons =

  This process can lead to "false singletons" when a fingerprint forms a new
  cluster center but all of its neighbors are already assigned to another
  cluster.

  The chemfp butina implementation offers three possibilities for handling
  false singletons:

  * 'keep' - leave the false singleton in its own cluster.

  * 'follow-neighbor' - move the false singleton to the same cluster as its
  first nearest-neighbor.

  * 'nearest-center' - move the false singleton to the nearest cluster center.

  Note: there may be multiple neighbors with the same similarity as the nearest
  neighbor. Chemfp currently always arbitrarily uses the first nearest
  neighbor. A future version may support choosing the neighbor at random from
  all equally-similar neighbors.

  Note: there may be multiple cluster centers which are equally similar to a
  false singleton. Chemfp currently always arbitrarily uses one of these
  nearest centers. A future version may support choosing the center at
  random from all equally-similar centers.

  = Pruning =

  If --num-clusters / -n is specified, and is smaller than the number of
  identified clusters, then chemfp will use a post-processing step to reduce
  the number of clusters.

  The clusters are ordered by size, from smallest to largest. The smallest
  cluster is selected, with ties broken by selecting the first created
  cluster. Each member is processed (from last to first) to find a nearest-
  neighbor in another cluster, with the member then added to that cluster
  before processing the next member.

  It is possible that a fingerprint may be reassigned multiple times during
  the pruning process.

  = Fingerprints and/or npz similarity matrix =

  The "chemfp butina" command accepts a fingerprint dataset, an npz similarity
  matrix, or both.

  When given a fingerprint dataset, it generates a sparse NxN Tanimoto
  similarity matrix with the similarity threshold given by --NxN-threshold /
  --threshold / -t. Use --save-matrix to save the matrix to an npz file.

  When given a similarity matrix, it carries out the Butina clustering on the
  matrix but operations which require fingerprints, like pruning and the
  "nearest-center" method for false singleton assignment, are not supported.
  The default --butina-threshold of 0.0 means all neighbors in the matrix will
  be used. Matrix values smaller than the Butina threshold are ignored, which
  is useful for parameter tuning as a matrix can be generated once at a lower
  threshold and then re-used at higher Butina thresholds.

  When given both a fingerprint data set and a sparse matrix using --matrix,
  the NxN matrix is used for the Butina clustering, and the methods which
  require fingerprints are also supported.

  = Output formats =

  By default the clusters are written in "centroid" format to stdout. The
  format writes one line per cluster, along with a cluster member count and
  optionally including the member ids and scores.

  Use "--out" to specify alternate formats. The "flat" format is a tab-
  delimited description of the fingerprint members, one member per line in
  fingerprint order. The "csv" and "tsv" formats are similar, but include the
  cluster size for each row, and are in cluster order.

  If "--out" is not specified then the format is based on the --output / -o
  filename, or "centroid" if that doesn't work.

  = Examples =

  1) Cluster fingerprints at a threshold of 0.4 (could also use '-t' or
  '--threshold'):

    chemfp butina input.fps --NxN-threshold 0.4

  2) Cluster fingerprints at a threshold of 0.4, keep false singletons as
  false singletons, write the output in 'flat' format, and use the full
  internal names, to see which centers are false singletons:

    chemfp butina input.fps -t 0.4 --false-singletons keep --no-rename --out flat

  3) Cluster fingerprints at a threshold of 0.45, move false singletons to the
  nearest cluster center, reduce the number of clusters to 20, and write the
  output in 'tsv' format:

   chemfp butina benzodiazepines.fps --threshold 0.45 \
      --fs nearest-center --num-clusters 20 --out tsv

  4) Cluster fingerprints at a threshold of 0.6, save the intermediate NxN
  matrix to the file 'chembl_33_60.npz', and write the Butina clusters to
  'chembl_33_60.centroids':

   chemfp butina chembl_33.fpb -t 0.6 --save-matrix chembl_33_60.npz \
      -o chembl_33_60.centroids

  5) Use the saved matrix as the input to Butina clustering, with a Butina
  threshold of 0.7. Save the results to 'chembl_33_70.centroids':

   chemfp butina chembl_33_60.npz --butina-threshold 0.7 \
      -o chembl_33_70.centroids

  6) Reduce the number of identified clusters (at 0.6 threshold) from ~330K to
  250K using the pre-computed NxN similarity matrix for the clustering, and
  fingerprint searches to merge clusters:

   chemfp butina chembl_33.fpb --matrix chembl_33_60.npz \
      --num-clusters 250000 -o chembl_33_70_pruned.centroids
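
The following sketch is not part of the --help output. It illustrates, in plain Python and without using chemfp, the general approach described above: drop matrix entries below the Butina threshold, sort rows by neighbor count with a tiebreaker, apply sphere exclusion, then optionally move false singletons with the 'follow-neighbor' rule. The function name butina_sketch, its parameters, and the edge-list input are hypothetical and chosen for illustration only. It is not chemfp's implementation, it assumes a symmetric similarity matrix, and it omits 'nearest-center' and pruning because those need fingerprints.

```python
import random


def butina_sketch(num_fps, edges, butina_threshold=0.0, tiebreaker="first",
                  seed=None, false_singletons="follow-neighbor"):
    """Illustrative Butina clustering over (i, j, score) edges of a symmetric
    sparse similarity matrix. Returns {center index: sorted member indices}."""
    # Build the neighbor rows, ignoring scores below the Butina threshold.
    rows = [[] for _ in range(num_fps)]
    for i, j, score in edges:
        if score >= butina_threshold:
            rows[i].append((j, score))
            rows[j].append((i, score))

    # The tiebreaker decides the order of rows with equal neighbor counts.
    if tiebreaker == "first":
        tie_key = [i for i in range(num_fps)]
    elif tiebreaker == "last":
        tie_key = [-i for i in range(num_fps)]
    elif tiebreaker == "randomize":
        rng = random.Random(seed)  # seed=None gives a different order each run
        tie_key = [rng.random() for _ in range(num_fps)]
    else:
        raise ValueError(f"unknown tiebreaker: {tiebreaker!r}")

    # Sort once, from the most neighbors to the fewest.
    order = sorted(range(num_fps), key=lambda i: (-len(rows[i]), tie_key[i]))

    # Sphere exclusion: an unassigned row becomes a center and claims
    # every neighbor that is still unassigned.
    center_of = {}
    for i in order:
        if i in center_of:
            continue
        center_of[i] = i
        for j, _score in rows[i]:
            center_of.setdefault(j, i)

    if false_singletons == "follow-neighbor":
        sizes = {}
        for center in center_of.values():
            sizes[center] = sizes.get(center, 0) + 1
        for i in range(num_fps):
            # A false singleton is a lone center that does have neighbors.
            if center_of[i] == i and sizes[i] == 1 and rows[i]:
                # max() keeps the first of several equally similar neighbors.
                nearest, _score = max(rows[i], key=lambda pair: pair[1])
                center_of[i] = center_of[nearest]
    elif false_singletons != "keep":
        raise ValueError(f"unsupported option: {false_singletons!r}")

    clusters = {}
    for i in range(num_fps):
        clusters.setdefault(center_of[i], []).append(i)
    return clusters


# Six made-up fingerprints; index 4 has no neighbors at all.
EDGES = [(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.85), (2, 3, 0.75), (3, 5, 0.72)]

print(butina_sketch(6, EDGES))
print(butina_sketch(6, EDGES, false_singletons="keep"))
print(butina_sketch(6, EDGES, butina_threshold=0.8))
```

In this toy data, fingerprint 5 is a false singleton: its only neighbor, 3, has already been claimed by the cluster centered on 2. With 'follow-neighbor' it joins that cluster, and with 'keep' it stays on its own. Raising butina_threshold to 0.8 drops the two weakest edges, which mirrors how a matrix generated once at a lower threshold can be re-used at higher Butina thresholds.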