chemfp butina
The “chemfp butina” command-line tool implements Butina clustering (also called Taylor-Butina clustering). See Butina on the command-line for an example. The chemfp 4.1 release notes include additional examples.
This functionality is also available from Python using the high-level chemfp.butina() function, with an example at Butina clustering.
The rest of this chapter contains the output from chemfp butina --help.
chemfp butina command-line options
The following comes from chemfp butina --help:
Usage: chemfp butina [OPTIONS] FILENAME
Cluster using the Butina/leader-follower algorithm.
If FILENAME is not specified then read the fingerprints or similarity matrix
from stdin.
Options:
--in [fps|fps.gz|fps.zst|fpb|flush|npz]
Specify the input file format, either
fingerprint or similarity matrix (default is
based on filename extension, or 'fps')
--matrix FILE
--matrix-format [npz] File format for --matrix (only 'npz' is
supported)
-o, --output PATH Output filename
--out TEXT Output format. Must be one of 'centroid'
(the default), 'csv', 'tsv', or 'flat', with
optional compression
-j, --num-threads N The number of threads to use. -1 means the
default value (which is 8 for this
computer), and can be set using
$OMP_NUM_THREADS. 0 and 1 both mean single-
threaded. (default: -1)
--precision [1|2|3|4|5|6|7|8|9|10]
Number of digits in Tanimoto score (default:
based on the fingerprint size)
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
-t, --NxN-threshold, --threshold FLOAT
Threshold when generating the NxN similarity
matrix from fingerprints (default: 0.7)
--seed N Specify the random number generator seed
between 0 and 2**64-1, inclusive, or use -1
to have one picked at random (default: -1)
--include-members / --no-members
The default writes all cluster members. With
--no-members only write the cluster centers.
--rescore / --no-rescore Rescore moved false singletons and merged
fingerprints to their new cluster center
--renumber / --no-renumber By default, use sequential cluster ids
starting from 1. With --no-renumber use the
internal cluster ids.
--rename / --no-rename Use --no-rename to use the internal member
type names instead of renaming them to use
only 'CENTER' and 'MEMBER'
--include-metadata / --no-metadata
With --no-metadata, do not include header
metadata in 'chemfp' and 'flat' output
formats.
--times / --no-times Write timing information to stderr
-d, --debug Print debug information to stderr. Use twice
for more debug output.
--help Show this message and exit.
NxN matrix options (for fingerprint input):
--save-matrix, --save FILE If specified, save the intermediate NxN matrix
to the named file
--save-format [npz] File format for --save-matrix (only 'npz' is
supported)
Butina clustering options:
--tiebreaker [randomize|first|last]
When multiple candidates have the same
number of neighbors, 'randomize' picks the
next cluster center at random while 'first'
and 'last' pick the next candidate in
increasing or decreasing index order.
-n, --num-clusters N After clustering, merge the members of the
smallest cluster into other clusters until
there are only N clusters [x>=1]
--butina-threshold FLOAT Minimum Butina cluster threshold (default:
0.0, uses the threshold from the similarity
matrix)
--false-singletons, --fs [keep|follow-neighbor|nearest-center]
If 'follow-neighbor' (the default) move
false singletons to the cluster of its
nearest neighbor. If 'nearest-center' move
to the closest center (requires
fingerprints). If 'keep' leave as a
singleton group.
This program implements several variations of the Butina clustering method
described in Darko Butina's "Unsupervised Data Base Clustering Based on
Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci. 1999, 39,
747-750.
The general approach is:
1) Generate an NxN Tanimoto similarity matrix for a given threshold
2) Sort rows by the number of neighbors in each row, from the most
neighbors to the fewest.
By default chemfp will randomize the order of the rows for a given number of
neighbors. This means that re-running Butina clustering will almost
certainly give different results. Either specify the initial RNG seed using
`--seed` or use a tiebreaker of 'first' or 'last' to use a fixed sort order.
3) Apply sphere exclusion, in sorted order, to the sorted rows. The first
row is the center of the first cluster, and its neighbors are members of the
first cluster.
4) Repeat the process until done.
A fingerprint can only be assigned to a single cluster, and will not be used
to create a new cluster center, nor be added to another cluster, even if it
is sufficiently similar.
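The steps above can be illustrated with a small pure-Python sketch. It is not chemfp's implementation: the function name `butina_sketch`, the list-of-sets input, and the way the tiebreaker is applied (reorder first, then a stable sort by neighbor count) are assumptions made for illustration only.

```python
import random


def butina_sketch(neighbors, tiebreaker="first", seed=None):
    """Toy Butina clustering (illustrative only, not chemfp's code).

    neighbors[i] is the set of indices at or above the similarity
    threshold for fingerprint i, excluding i itself.
    Returns a list of (center, members) tuples in creation order.
    """
    order = list(range(len(neighbors)))
    if tiebreaker == "randomize":
        random.Random(seed).shuffle(order)  # random order within equal counts
    elif tiebreaker == "last":
        order.reverse()                     # decreasing index order within ties
    # Stable sort: most neighbors first, ties keep the order chosen above.
    order.sort(key=lambda i: -len(neighbors[i]))

    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue  # already a center or a member, so never reused
        members = [j for j in sorted(neighbors[i]) if j not in assigned]
        assigned.add(i)
        assigned.update(members)
        clusters.append((i, members))
    return clusters


# Four fingerprints in a chain: only 0-1, 1-2 and 2-3 are similar pairs.
chain = [{1}, {0, 2}, {1, 3}, {2}]
print(butina_sketch(chain, "first"))  # [(1, [0, 2]), (3, [])]
print(butina_sketch(chain, "last"))   # [(2, [1, 3]), (0, [])]
```

In both runs the leftover fingerprint has a neighbor, but that neighbor already belongs to another cluster, so it ends up alone.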
= False Singletons =
This process can lead to "false singletons" when a fingerprint forms a new
cluster center but all of its neighbors are already assigned to another
cluster.
The chemfp butina implementation offers three possibilities for handling
false singletons:
* 'keep' - leave the false singleton in its own cluster.
* 'follow-neighbor' - move the false singleton to the same cluster as its
first nearest-neighbor.
* 'nearest-center' - move the false singleton to the nearest cluster center.
Note: there may be multiple neighbors with the same similarity as the nearest
neighbor. Chemfp currently always arbitrarily uses the first nearest
neighbor. A future version may support choosing the neighbor at random from
all equally-similar neighbors.
Note: there may be multiple cluster centers which are equally similar to a
false singleton. Chemfp currently always arbitrarily uses one of these
nearest centers. A future version may support choosing the center at
random from all equally-similar centers.
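As a rough illustration of the 'follow-neighbor' option, the sketch below moves each false singleton into the cluster of its first nearest neighbor. The function name `follow_neighbor`, the dict-based cluster layout, and the (index, score) neighbor lists are hypothetical and do not reflect chemfp's internal data structures.

```python
def follow_neighbor(clusters, neighbors):
    """Toy 'follow-neighbor' reassignment (illustrative only).

    clusters:  dict mapping center -> list of members (center excluded)
    neighbors: dict mapping index -> list of (other_index, score)
    """
    # Map every fingerprint to the center of the cluster that holds it.
    cluster_of = {}
    for center, members in clusters.items():
        cluster_of[center] = center
        for member in members:
            cluster_of[member] = center

    result = {center: list(members) for center, members in clusters.items()}
    for center, members in clusters.items():
        hits = neighbors.get(center, [])
        if members or not hits:
            continue  # a real cluster, or a true singleton with no neighbors
        # max() returns the first hit with the highest score,
        # which mirrors "first nearest-neighbor" in this sketch.
        nearest, _score = max(hits, key=lambda hit: hit[1])
        del result[center]
        result[cluster_of[nearest]].append(center)
    return result


clusters = {1: [0, 2], 3: [], 5: []}
neighbors = {3: [(2, 0.8)], 5: []}   # 3 is a false singleton, 5 a true one
print(follow_neighbor(clusters, neighbors))  # {1: [0, 2, 3], 5: []}
```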
= Pruning =
If --num-clusters / -n is specified, and is smaller than the number of
identified clusters, then chemfp will use a post-processing step to reduce
the number of clusters.
The clusters are ordered by size, from smallest to largest. The smallest
cluster is selected, with ties broken by selecting the first created
cluster. Each member is processed (from last to first) to find a nearest-
neighbor in another cluster, with the member then added to that cluster
before processing the next member.
It is possible that a fingerprint may be reassigned multiple times during
the pruning process.
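The pruning step can be sketched as follows. This is a simplified reading of the description above, not chemfp's code: the function name `prune`, the list-of-lists cluster layout (center first), the `sim` callback, and counting the center toward the cluster size are all assumptions.

```python
def prune(clusters, sim, num_clusters):
    """Toy pruning (illustrative only).

    clusters:     non-empty lists of fingerprint ids, in creation order
    sim:          function (id, id) -> similarity score
    num_clusters: target number of clusters
    """
    clusters = [list(cluster) for cluster in clusters]
    while len(clusters) > max(num_clusters, 1):
        # min() returns the first smallest cluster, i.e. the first created.
        smallest = min(range(len(clusters)), key=lambda k: len(clusters[k]))
        dissolved = clusters.pop(smallest)
        for fp in reversed(dissolved):  # process members from last to first
            # Pick the cluster that contains the nearest neighbor of fp.
            best = max(clusters, key=lambda c: max(sim(fp, other) for other in c))
            best.append(fp)  # added before the next member is processed
    return clusters


# Hypothetical 1-D "fingerprints": similarity falls off with distance.
position = [0.0, 0.1, 0.2, 1.0, 1.1, 0.5]
sim = lambda a, b: 1.0 - abs(position[a] - position[b])
print(prune([[0, 1, 2], [3, 4], [5]], sim, 2))  # [[0, 1, 2, 5], [3, 4]]
```

Because each dissolved member is appended to a surviving cluster, a fingerprint moved in one round can be moved again if its new cluster is later the smallest.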
= Fingerprints and/or npz similarity matrix =
The "chemfp butina" command accepts a fingerprint dataset, an npz similarity
matrix, or both.
When given a fingerprint dataset, it generates a sparse NxN Tanimoto
similarity matrix with the similarity threshold given by --NxN-threshold /
--threshold / -t. Use --save-matrix to save the matrix to an npz file.
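To make the idea of a thresholded NxN Tanimoto matrix concrete, here is a minimal pure-Python sketch that stores fingerprints as integers used as bit vectors. The names `tanimoto` and `sparse_nxn`, the row-of-tuples layout, and scoring two empty fingerprints as 0.0 are assumptions for illustration; chemfp uses its own optimized search code and the npz format.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return bin(a & b).count("1") / union


def sparse_nxn(fps, threshold=0.7):
    """Return rows[i] = [(j, score), ...] for every pair scoring >= threshold."""
    rows = [[] for _ in fps]
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            score = tanimoto(fps[i], fps[j])
            if score >= threshold:
                rows[i].append((j, score))  # the matrix is symmetric,
                rows[j].append((i, score))  # so store both directions
    return rows


fps = [0b1111, 0b0111, 0b0011, 0b1000]
print(sparse_nxn(fps, 0.7))  # [[(1, 0.75)], [(0, 0.75)], [], []]
```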
When given a similarity matrix, it carries out the Butina clustering on the
matrix but operations which require fingerprints, like pruning and the
"nearest-center" method for false singleton assignment, are not supported.
The default --butina-threshold of 0.0 means all neighbors in the matrix will
be used. Matrix values smaller than the Butina threshold are ignored, which
is useful for parameter tuning as a matrix can be generated once at a lower
threshold then re-used at higher Butina thresholds.
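The effect of a higher Butina threshold on a saved matrix can be sketched as a simple filter. The function name `apply_butina_threshold` and the row-of-tuples layout are hypothetical; the sketch only shows that entries below the threshold are dropped before neighbors are counted.

```python
def apply_butina_threshold(rows, butina_threshold=0.0):
    """Keep only the (index, score) entries at or above the Butina threshold."""
    return [[(j, score) for j, score in row if score >= butina_threshold]
            for row in rows]


# A hypothetical matrix generated once at an NxN threshold of 0.6 ...
rows_60 = [[(1, 0.65), (2, 0.9)], [(0, 0.65)], [(0, 0.9)]]
# ... and re-used at a Butina threshold of 0.7.
rows_70 = apply_butina_threshold(rows_60, 0.7)
print(rows_70)                          # [[(2, 0.9)], [], [(0, 0.9)]]
print([len(row) for row in rows_70])    # [1, 0, 1]
```

With the default of 0.0 every stored entry is kept, which matches using the threshold from the similarity matrix itself.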
When given both a fingerprint data set and a sparse matrix using --matrix,
the NxN matrix is used for the Butina clustering, and the methods which
require fingerprints are also supported.
= Output formats =
By default the clusters are written in "centroid" format to stdout. The
format writes one line per cluster, along with a cluster member count and
optionally including the member ids and scores.
Use "--out" to specify alternate formats. The "flat" format is a tab-
delimited description of the fingerprint members, one member per line in
fingerprint order. The "csv" and "tsv" formats are similar, but include the
cluster size for each row, and are in cluster order.
If "--out" is not specified then the format is based on the --output / -o
filename, or "centroid" if that doesn't work.
= Examples =
1) Cluster fingerprints at a threshold of 0.4 (could also use '-t' or
'--threshold'):
chemfp butina input.fps --NxN-threshold 0.4
2) Cluster fingerprints at a threshold of 0.4, keep false singletons as
false singletons, write the output in 'flat' format, and use the full
internal names, to see which centers are false singletons:
chemfp butina input.fps -t 0.4 --false-singletons keep --no-rename --out flat
3) Cluster fingerprints at a threshold of 0.45, move false singletons to the
nearest cluster center, reduce the number of clusters to 20, and write the
output in 'tsv' format:
chemfp butina benzodiazepines.fps --threshold 0.45 \
--fs nearest-center --num-clusters 20 --out tsv
4) Cluster fingerprints at a threshold of 0.6, use an initial seed of 12345,
save the intermediate NxN matrix to the file 'chembl_33_60.npz', and write
the Butina clusters to 'chembl_33_60.centroids':
chemfp butina chembl_33.fpb -t 0.6 --seed 12345 --save-matrix chembl_33_60.npz \
-o chembl_33_60.centroids
5) Use the saved matrix as the input to Butina clustering, with a Butina
threshold of 0.7. Save the results to 'chembl_33_70.centroids':
chemfp butina chembl_33_60.npz --butina-threshold 0.7 \
-o chembl_33_70.centroids
6) Reduce the number of identified clusters (at 0.6 threshold) from ~330K to
250K using the pre-computed NxN similarity matrix for the clustering, and
fingerprint searches to merge clusters:
chemfp butina chembl_33.fpb --matrix chembl_33_60.npz \
--num-clusters 250000 -o chembl_33_60_pruned.centroids