What’s new in chemfp 4.1

Released 17 May 2023.

The main themes for chemfp 4.1 are “CXSMILES”, “Butina clustering”, “CSV support”, and “embrace click and require third-party dependencies.”

Required dependencies on click and tqdm

Chemfp 4.1 switched to click for command-line processing. Previously it used argparse from the Python standard library, but I found it easier to develop and maintain click-based processing.

The change resulted in a lot of minor differences. The error messages have changed. Command-line arguments are processed in a different order, which may change when and how errors are detected. Argparse supported -h as an alias for --help but click does not. Some of the exit codes have changed.

There is a subtle change in how the --delimiter and --has-header options work with the -R option for reader_args. Previously these were processed in left-to-right order. In chemfp 4.1 the delimiter and has-header options are processed after -R, and override those settings.

The change took a lot of work, but also led to some improvements. For example, rdkit2fps now accepts --morgan1 through --morgan4 as ways to set the Morgan radius (the old --morgan is now an alias for --morgan2), and the click API makes it easier to group --help options.

Chemfp specifies click as a required dependency. If click is not installed then tools like pip should install it automatically.

In fact, this is chemfp’s first required dependency. (chemfp optionally supports NumPy, SciPy and other packages if they are installed, but those are not required for core operations).

In order to avoid install-time dependencies, chemfp 4.0 included a copy of the progress bar package tqdm with the distribution, which was used if tqdm was not already installed. With chemfp 4.1, tqdm is the second required dependency, and the vendored copy of tqdm has been removed.

CXSMILES

Chemfp 4.1 now reads SMILES files as containing CXSMILES by default. A line of a CXSMILES file looks like:

C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example cxsmiles structure

with the SMILES followed by an optional CXSMILES extension followed by the title or additional columns. There must be a single space between the SMILES and the CXSMILES extension, and the extension must start and end with the pipe character |.

Use --no-cxsmiles to parse the record as a SMILES. This keeps any | term as part of the record id/title. (The default is --cxsmiles, which parses the input as CXSMILES.)

These options are equivalent to the reader arg -R cxsmiles=0 (to disable CXSMILES parsing) and -R cxsmiles=1 (to enable), though they are processed after the -R option.

An alternative method to separate the structure from the identifier is to use tab-separated fields and specify --delimiter tab.
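
The same setting is available from the toolkit APIs as the “cxsmiles” reader argument. Here's a minimal sketch, assuming RDKit is installed; “example.smi” is a hypothetical input file:

# A sketch, assuming RDKit is available; "example.smi" is a hypothetical file.
from chemfp import rdkit_toolkit as T

# Parse each record as plain SMILES; any |...| term stays in the id/title.
with T.read_molecules("example.smi", reader_args={"cxsmiles": False}) as reader:
    for mol in reader:
        pass  # process each RDKit molecule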

Toolkit-specific details

CXSMILES support proved tricky to handle. For one, there is no easy way to identify the end of a CXSMILES extension without a full CXSMILES parser. Simply looking for a pipe character followed by a space isn’t enough, because a CXSMILES can recursively embed other CXSMILES, which can in turn contain spaces.

In addition, the underlying cheminformatics toolkits have different levels of support for the extensions, and don’t all handle identifiers the same way that chemfp does.

  • Open Babel does not support CXSMILES. Instead, chemfp uses heuristics to attempt to identify and remove the extension, and passes only the SMILES to Open Babel for processing.

  • CDK expects CXSMILES by default, with no option to disable it. If --no-cxsmiles is specified then chemfp will extract the SMILES from the rest of the line and have CDK parse only the SMILES.

  • RDKit has the option to enable or disable CXSMILES processing, but doesn’t handle identifiers the same way chemfp does, so chemfp tries to process the identifier and pass the rest to RDKit.

  • OEChem uses OEFormat_SMI for SMILES files and OEFormat_CXSMILES for CXSMILES files. Chemfp will set the appropriate OEChem format flag depending on the cxsmiles option.

  • Chemfp’s “text” toolkit is not a cheminformatics-based toolkit. It can only parse records from SMILES and SDF files and, for SMILES, figure out which part is the SMILES and which is the identifier. If the “cxsmiles” reader argument is true then it will use the same heuristics as the Open Babel parser to identify the end of the CXSMILES extension.

CXSMILES output

The toolkit wrappers can write SMILES files. By default they do not include the CXSMILES extensions. Pass the writer argument cxsmiles=True to save with the CXSMILES extensions.

# T is a chemfp toolkit wrapper, e.g. from chemfp import rdkit_toolkit as T
with T.open_molecule_writer(
        "output.smi", writer_args={"cxsmiles": True}) as writer:
    writer.write_molecules(mols)

New structure format specifiers

Chemfp 4.1 added the “cxsmi”, “cxsmistring”, and “sdf3k” format specifiers.

The “cxsmi” format can be used to read and write files in SMILES format. It is the same as the “smi” format except that “cxsmi” writes CXSMILES by default while “smi” does not. The “cxsmi” format is the default for filenames which use the “.cxsmi” extension.

The “cxsmistring” format parses a SMILES string with optional CXSMILES extension, but does not process any identifier which may be present.

The “sdf3k” format is the same as the “sdf” format except that “sdf3k” always writes records in V3000 format, while “sdf” only uses V3000 if required. The “sdf3k” format is the default for filenames which use the “.sdf3k” extension.
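
Here's a sketch of using the new “cxsmistring” format from the toolkit API, assuming RDKit is installed; the CXSMILES string is the example from earlier:

# A sketch, assuming RDKit; parse_molecule() is the toolkit-level parser.
from chemfp import rdkit_toolkit as T

mol = T.parse_molecule(
    "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5|", "cxsmistring")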

New SearchResult(s) attributes

The SearchResults class now has the following new attributes:

  • num_bits: the number of bits in the fingerprint

  • alpha: the Tversky alpha value, which is 1.0 for Tanimoto

  • beta: the Tversky beta value, which is 1.0 for Tanimoto

  • matrix_type: an integer enum value describing the matrix type (see below)

  • matrix_type_name: a string name describing the matrix type

  • target_start: the start position in the target arena (non-zero only if the targets were a subarena)

These were added so the SearchResults could be saved to and loaded from an npz file without losing important chemfp metadata.

Previously the target_ids attribute returned all of the ids for the target arena, even if the targets were a subarena slice of the main arena. This is useful because the search result indices are given in arena coordinates, not in subarena coordinates.

However, it proved confusing when, for example, wanting a list of just the relevant target ids.

In chemfp 4.1 the target_ids now returns only the relevant target identifiers, and the new target_arena_ids returns the full identifiers from the target arena.
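
Here's a sketch of the difference, assuming a hypothetical “targets.fps” file and the low-level threshold_tanimoto_search_arena() function from chemfp.search:

# A sketch; "targets.fps" is a hypothetical file.
import chemfp
from chemfp import search

arena = chemfp.load_fingerprints("targets.fps")
subarena = arena[100:200]   # a subarena slice of the main arena

results = search.threshold_tanimoto_search_arena(
    arena[:10], subarena, threshold=0.7)

results.target_ids        # ids for only the 100 subarena fingerprints (new in 4.1)
results.target_arena_ids  # the full id list from the underlying target arena
results.target_start      # 100, the subarena's offset into the arena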

The matrix_type value is from a new MatrixType enum in the chemfp.search module:

>>> from chemfp.search import MatrixType
>>> list(MatrixType)
[<MatrixType.NxM: 0>, <MatrixType.NxN: 1>,
<MatrixType.NO_DIAGONAL: 2>, <MatrixType.UPPER_TRIANGULAR: 3>,
<MatrixType.STRICTLY_UPPER_TRIANGULAR: 4>]

It describes the expected content of the underlying matrix. The values are used in the new Butina implementation as a check that an imported npz file or csr matrix is of the expected type. The values are:

  • “NxM” - can contain a score for any combination of query and target

  • “NxN” - the same, except the queries and targets are identical

  • “NO_DIAGONAL” - same as “NxN” except the diagonal is not included

  • “UPPER_TRIANGULAR” - same as “NxN” except limited to the upper triangle

  • “STRICTLY_UPPER_TRIANGULAR” - “UPPER_TRIANGULAR” but without the diagonal

The corresponding matrix_type_name values are “NxM”, “NxN”, “no-diagonal”, “upper-triangular”, and “strictly-upper-triangular”.

>>> results.matrix_type
<MatrixType.NxM: 0>
>>> results.matrix_type_name
'NxM'

Chemfp currently does not generate an “upper-triangular” matrix type.

The SearchResult also has the new target_start and target_ids properties, matching the SearchResults.

Save simsearch results to “npz” files

Similarity search results can now be saved as NumPy npz files, as SciPy sparse csr matrices. This makes it easier to import search results into other analysis tools.

For example, I’ll do an NxN threshold search of a set of benzodiazepine fingerprints, with a cutoff of 0.4, and save the results to benzodiazepine.npz:

% simsearch --NxN -t 0.4 ../benzodiazepine.fps.gz -o benzodiazepine.npz
queries: 100%|██████████████████████████████████████| 12386/12386 [00:00<00:00, 32889.98 fps/s]

From the Python shell I’ll import the output file into SciPy and show the target indices and scores for row index 686, which has 5 elements:

>>> from scipy import sparse
>>> arr = sparse.load_npz("benzodiazepine.npz")
>>> arr
<12386x12386 sparse matrix of type '<class 'numpy.float64'>'
     with 9541836 stored elements in Compressed Sparse Row format>
>>> arr[686]
<1x12386 sparse matrix of type '<class 'numpy.float64'>'
      with 5 stored elements in Compressed Sparse Row format>
>>> arr[686].indices
array([ 744,  950, 6083,   77,  196], dtype=int32)
>>> arr[686].data
array([0.49090909, 0.48214286, 0.53225806, 0.49019608, 0.625     ])

The indices aren’t that useful on their own. You’ll likely want to map them back to the original identifiers, which you can do because chemfp stores them in the npz file as well.

This is possible because a NumPy npz file is a zipfile containing NumPy arrays in “npy” format. By convention the SciPy package stores the sparse matrix information in the zipfile entries “format.npy”, “shape.npy”, “indptr.npy”, “indices.npy”, and “data.npy”, which you can see on the command-line using the unzip program:

% unzip -l benzodiazepine.npz
Archive:  benzodiazepine.npz
  Length      Date    Time    Name
---------  ---------- -----   ----
      600  01-01-1980 00:00   chemfp.npy
 38167472  01-01-1980 00:00   indices.npy
    49676  01-01-1980 00:00   indptr.npy
      140  01-01-1980 00:00   format.npy
      144  01-01-1980 00:00   shape.npy
 76334816  01-01-1980 00:00   data.npy
   396480  01-01-1980 00:00   ids.npy
---------                     -------
114949328                     7 files

Chemfp adds two or three file entries to the list. The “chemfp.npy” entry contains chemfp-specific metadata, which I’ll get to later. If the file “ids.npy” exists then it contains an array of identifiers used for both the queries and the targets. The file entries “query_ids.npy” and “target_ids.npy” store the query and target identifiers, respectively, but they are not used here because an NxN search has the same ids for the queries and targets.

The NumPy load function returns an NpzFile object with ways to get a list of the available files and load the contents. I’ll show how that can be used to get the target ids for the hits of row 686:

>>> import numpy as np
>>> z = np.load("benzodiazepine.npz")
>>> z.files
['chemfp', 'indices', 'indptr', 'format', 'shape', 'data', 'ids']
>>> ids = z["ids"]
>>> ids
array(['76175', '2802905', '18631850', ..., '16198616', '16153475',
       '11979873'], dtype='<U8')
>>> len(ids)
12386
>>> [ids[i] for i in arr[686].indices]
['21327973', '21327970', '13106808', '12991903', '21817691']
>>> z.close()

(The explicit NpzFile close ensures the underlying zipfile gets closed at that point, rather than depending on Python’s garbage collector. The NpzFile is also a context manager, which means you could use it in a “with” statement instead.)

chemfp.npy file entry in an npz file

The “chemfp.npy” entry in the npz file contains chemfp-specific metadata. It’s a JSON dictionary, stored as a string, in a NumPy array, encoded in “npy” format, in a zipfile. While that sounds complicated, it’s meant to be relatively easy to load using NumPy’s load command, which the previous section used to load the array of identifiers:

>>> import json
>>> import numpy as np
>>> with np.load("benzodiazepine.npz") as z:
...   d = json.loads(z["chemfp"].item())
...
>>> import pprint
>>> pprint.pprint(d)
{'alpha': 1.0,
 'beta': 1.0,
 'format': 'multiple',
 'matrix_type': 'no-diagonal',
 'method': 'Tversky',
 'num_bits': 2048}

The most important of these is format, which can be “single” or “multiple”. It is “single” if the npz file stores a single chemfp SearchResult, and “multiple” if it stores a chemfp SearchResults.

The method indicates the scores were calculated with Tversky similarity, using the alpha and beta values of 1.0. This setting is much better known as “the Tanimoto”.

The matrix_type describes the expected matrix type, described previously.

The num_bits is the number of bits in the fingerprints used to generate the Tanimoto/Tversky scores. Chemfp uses it in two places. First, it determines the default number of digits to display when printing a Tanimoto score as a decimal. All 166-bit fingerprints can be distinguished with only 5 digits of precision, while a 2048-bit fingerprint needs 7. These values are available from the get_tanimoto_precision() function:

>>> import chemfp.bitops
>>> chemfp.bitops.get_tanimoto_precision(166)
5
>>> chemfp.bitops.get_tanimoto_precision(2048)
7

Second, Tanimoto scores don’t use the full range of the 64-bit double that chemfp uses internally to store the score. If the number of fingerprint bits is small enough then it’s possible to sort the numbers by only comparing parts of the bit-representation of the score, which improves the overall sort performance. Chemfp also doesn’t use special values like the 64-bit double representations for -0.0, infinity, or “NaN”/“not a number” values, which further improves performance.

If you plan to generate Tanimoto similarity scores from another tool and import them into chemfp, then you should set this value appropriately. If the scores come from another sort of similarity calculation, or you don’t know how the number of bits affects things, then it is safe to use the value 2**31-1, which is the default chemfp uses if num_bits is not specified. NaN and infinity are not supported.

Working with npz files in chemfp

The new load_npz() and save_npz() functions in the chemfp.search module will load and save, respectively, similarity search results to an npz file, along with the identifiers and the “chemfp.npy” metadata.

The search results can be a SearchResults, returned when using multiple queries against multiple targets, or a single SearchResult, returned when using a single fingerprint to search the targets or when taking a row of a SearchResults. They can also be the high-level objects returned from chemfp.simsearch(), though only the low-level SearchResult or SearchResults are loaded, without the timing information and other metadata in the high-level objects:

>>> import chemfp
>>> results = chemfp.simsearch(queries="queries.fps", targets="targets.fps", k=5)
targets.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 33.7Mbytes/s]
queries.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 75.7Mbytes/s]
queries: 100%|█████████████████████████████████████| 100/100 [00:00<00:00, 63138.70 fps/s]
>>> results
MultiQuerySimsearch('5-nearest Tanimoto search. #queries=100,
#targets=100 (search: 1.87 ms total: 39.59 ms)', result=SearchResults(#queries=100, #targets=100))
>>>
>>> import chemfp.search
>>> chemfp.search.save_npz("k5.npz", results)
>>> k5_results = chemfp.search.load_npz("k5.npz")
>>> k5_results
SearchResults(#queries=100, #targets=100)

You can also save the results using the new save() method for the similarity search result objects:

>>> results.save("k5.npz")

Importing SciPy csr matrices to chemfp

The new from_csr() function in the chemfp.search module converts a SciPy csr (“compressed sparse row”) array into a chemfp SearchResults instance. In the following I’ll start with a regular NumPy array, turn it into a sparse matrix, then convert that to a SearchResults:

>>> import numpy
>>> import scipy.sparse
>>> import chemfp.search
>>> arr = numpy.array([[0.0, 0.0, 0.0, 0.9, 0.8], [0.7, 0.0, 0.0, 0.6, 0.0]])
>>> arr
array([[0. , 0. , 0. , 0.9, 0.8],
       [0.7, 0. , 0. , 0.6, 0. ]])
>>> csr = scipy.sparse.csr_array(arr)
>>> csr
<2x5 sparse array of type '<class 'numpy.float64'>'
      with 4 stored elements in Compressed Sparse Row format>
>>> sr = chemfp.search.from_csr(csr)
>>> sr
SearchResults(#queries=2, #targets=5)
>>> sr[0].get_indices_and_scores()
[(3, 0.9), (4, 0.8)]
>>> sr[1].get_indices_and_scores()
[(0, 0.7), (3, 0.6)]

Additional from_csr parameters let you specify the query and target ids, matrix type, the number of fingerprint bits, and the Tversky alpha and beta parameters. See the warnings earlier about how num_bits affects chemfp.
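
Here's a sketch of what that might look like; the keyword names below are assumptions based on the parameter list above, not a verbatim signature:

# A sketch; the keyword names are assumptions, not a verbatim signature.
import numpy
import scipy.sparse
import chemfp.search
from chemfp.search import MatrixType

arr = numpy.array([[0.0, 0.0, 0.9], [0.7, 0.0, 0.0]])
csr = scipy.sparse.csr_array(arr)

sr = chemfp.search.from_csr(
    csr,
    query_ids=["q1", "q2"],          # assumed keyword
    target_ids=["t1", "t2", "t3"],   # assumed keyword
    num_bits=2048,                   # see the num_bits notes above
    alpha=1.0, beta=1.0,             # Tanimoto
    matrix_type=MatrixType.NxM)      # assumed keyword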

Butina clustering

Chemfp 4.1 includes Butina clustering, both from the command-line and as a high-level API function. More complete examples of both are given below. Here’s an idea to get you started, first on the command-line:

% chemfp butina chembl_30.fpb --threshold 0.7 -o chembl_30.centroids

and then using the Python API:

import chemfp
clusters = chemfp.butina(fingerprints="chembl_30.fpb", NxN_threshold=0.7)
print(f"#clusters: {len(clusters)}")
clusters.save("chembl_30.centroids")

The basic Butina clustering algorithm is from: Darko Butina, “Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets”, Journal of Chemical Information and Computer Sciences 1999 39 (4), 747-750 DOI: 10.1021/ci9803381.

(This is sometimes referred to as “Taylor” or “Taylor-Butina” clustering due to Robin Taylor, “Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals”, Journal of Chemical Information and Computer Sciences 1995 35 (1) 59–67. DOI: 10.1021/ci00023a009. The methods are similar, but Taylor did not apply it to clustering.)

In short: compute the NxN similarity matrix. For each fingerprint, count the number of neighbors which are within some minimum similarity threshold. Sort the fingerprints by count, from most neighbors to least. Pick the first fingerprint as the first cluster center, and include all of its neighbors as cluster members. Repeat with the remaining fingerprints which are not already assigned to a cluster, until all fingerprints are assigned.
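
This is not chemfp's implementation, but a minimal plain-Python sketch of that loop may make the bookkeeping clearer, given a precomputed neighbor list where neighbors[i] holds the indices within the threshold of fingerprint i:

# Not chemfp's implementation - a minimal sketch of the basic Butina loop.
def butina_sketch(neighbors):
    # Sort fingerprint indices by neighbor count, most neighbors first.
    order = sorted(range(len(neighbors)),
                   key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # The center plus all of its not-yet-assigned neighbors form a cluster.
        members = [center] + [j for j in neighbors[center] if j not in assigned]
        assigned.update(members)
        clusters.append((center, members))
    return clusters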

Chemfp supports several variants of the Butina algorithm: 1) breaking ties in the Butina ordering, 2) dealing with false singletons, and 3) pruning the final number of clusters.

tiebreaker

The initial ordering by count, from largest to smallest, nearly always results in ties, where many fingerprints have the same number of neighbors. How are ties broken?

By default chemfp will “randomize” the order, using an initial seed from Python’s random number generator, which means each time you run the algorithm you are likely to get different results. You can specify the seed if you want consistent output, or you can choose to break ties by order in chemfp’s internal fingerprint arena, either “first” or “last”. The fingerprints are ordered from the smallest number of bits to the largest, so “first” will bias the initial pick towards small molecules, while “last” will bias it towards larger ones with more internal diversity.

Here’s an example how to specify the tiebreaker on the command-line:

% chemfp butina dataset.fps --tiebreaker first

and an example using the API, which also sets the seed:

import chemfp
chemfp.butina("dataset.fps", tiebreaker="randomize", seed=12345)

false-singletons

If there is only one fingerprint in the cluster, then that fingerprint (which is the cluster center) is termed a “singleton.” A “false singleton” is a singleton with one or more neighbors within the minimum similarity threshold but where those neighbors were all assigned to one or more other clusters.

Chemfp supports three ways to deal with a false singleton. The default is “follow-neighbor”: move the false singleton to the same cluster as its first nearest-neighbor. (If there are multiple equally-nearest neighbors then currently chemfp makes an arbitrary, implementation-defined choice. A future version may break ties using a random choice.)

The second option is to “keep” the false singleton as a singleton.

The third is to move the false singleton to the cluster with the “nearest-center”. (Ties are currently broken by an arbitrary, implementation-defined choice. A future version may break ties using a random choice.) This method was described by Niklas Blomberg, David A. Cosgrove, and Peter W. Kenny, JCAMD 23, pp 513–525 (2009), DOI: 10.1007/s10822-009-9264-5, though chemfp’s implementation does not yet support a user-specified minimum required center threshold.

The “nearest-center” variant requires the original fingerprints in order to find the nearest cluster center. The other two options only require the NxN similarity matrix.

Here’s an example of keeping false singletons using the command-line tool, with --fs as the short form of --false-singletons:

% chemfp butina chembl_30.fpb --threshold 0.7  --fs keep

Here’s what it looks like in the Python API:

import chemfp
clusters = chemfp.butina("chembl_30.fpb", NxN_threshold=0.7, false_singletons="keep")
clusters.save(None) # write to stdout

Pruning

Chemfp also implements a post-processing step to reduce the number of clusters. The clusters are ordered by size, from smallest to largest. The smallest cluster is selected, with ties broken by selecting the first created cluster. Each fingerprint in the cluster is processed (currently from last to first) to find its nearest-neighbor in another cluster, with the fingerprint added to that cluster before processing the next fingerprint in the smallest cluster. (If fingerprint X moved to cluster C and fingerprint Y’s nearest neighbor is X then Y will also be moved to cluster C.)

The post-processing step requires the original fingerprints and cannot be done with the NxN similarity matrix.

This algorithm was devised by a chemfp customer, who also funded its integration into chemfp.

Here’s a command-line example of pruning a dataset down to 1,000 clusters:

% chemfp butina dataset.fpb -t 0.6 -n 1000 -o dataset.centroids

and the equivalent version using the API:

import chemfp
with chemfp.butina("dataset.fpb", NxN_threshold=0.6, num_clusters=1000) as clusters:
    clusters.save("daset.centroids")

Other options

If fingerprints are available then the default will rescore moved fingerprints (i.e., false singletons moved to a “nearest-center”, and fingerprints moved by the post-processing pruning step) against their new cluster center. Rescoring can be disabled.

By default the output clusters are ordered and numbered so the largest cluster (after any moves) is first, though there’s an option to show the original Butina ordering, which chemfp uses internally. By default the cluster center is labeled “CENTER” and the others are labeled “MEMBER”. If renaming is not used then the other labels are “FALSE_SINGLETON” (for singletons which are cluster centers), “MOVED_FALSE_SINGLETON” (for false singletons which were reassigned to another cluster), and “MERGED” (for fingerprints which were moved to another cluster as part of the post-processing cluster pruning step).

Butina on the command-line

The Butina algorithm is available as a subcommand of chemfp. Using a small set of fingerprints as an example:

% cat distinct.fps
1200ea4311018100      id1
894dc00a870b0800      id2
82b180a0943d0254      id3
d04a450013d8a479      id4
8025719c2c907c2e      id5
918125cf96a2a360      id6
6b0d4215ca47fc03      id7
2eb000ce2e4337ef      id8
849a6de873554c59      id9
b67121e96f48a365      id10
2ab0f5b64cb4b8cf      id11
e7fedcefa4e7560e      id12

I’ll cluster at a threshold of 0.41:

% chemfp butina distinct.fps -t 0.41
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
i     center_id       count   members
1     id8     5       id8     1.0000  id10    0.4651  id12    0.4400  id11    0.4130  id5     0.3256
2     id9     2       id9     1.0000  id4     0.4103
3     id3     1       id3     1.0000
4     id7     1       id7     1.0000
5     id1     1       id1     1.0000
6     id2     1       id2     1.0000
7     id6     1       id6     1.0000

This output follows the usual chemfp output format style where the first line describes the format, followed by header lines starting with ‘#’, then the output data proper, using tab-separated columns.

The default uses a variable number of columns, with one line per cluster. The first column is the cluster id (an integer), the second is the fingerprint id for the center fingerprint, and the third is the number of elements in the cluster (including the center).

After the count are the id and score of every fingerprint in the cluster, starting with the cluster center.

Pruning the number of clusters

Use the --num-clusters option (or -n for short) to prune the clusters down to a smaller number. In the following I’ve specified I want only 3 clusters, so four of the clusters will be pruned and their fingerprints moved to other clusters:

% chemfp butina distinct.fps -t 0.41 --num-clusters 3
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=3601395311 false-singletons=follow-neighbor num-clusters=3 rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
i     center_id       count   members
1     id8     8       id8     1.0000  id10    0.4651  id12    0.4400  id11    0.4130  id5     0.3256  id3     0.2381  id4     0.1702  id9     0.2917
2     id6     2       id6     1.0000  id1     0.2353
3     id7     2       id7     1.0000  id2     0.2973

Alternate output formats

Some other options include disabling the metadata in the output (the lines starting with “#”), only showing the cluster centers, and specifying the initial seed for random number generation:

% chemfp butina distinct.fps -t 0.41 --no-members --no-metadata --seed 19573998
i     center_id       count
1     id8     5
2     id9     2
3     id3     1
4     id7     1
5     id1     1
6     id2     1
7     id6     1

The “flat” variant of the centroid format writes one output line for each fingerprint, with the centroid id in the first column, the fingerprint id in the second, the cluster element label in the third, and the score in the fourth:

% chemfp butina distinct.fps -t 0.41 --out flat --seed 19573998
#Centroid-flat/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
centroid      id      type    score
5     id1     CENTER  1.0000
6     id2     CENTER  1.0000
3     id3     CENTER  1.0000
2     id4     MEMBER  0.4103
1     id5     MEMBER  0.3256
7     id6     CENTER  1.0000
4     id7     CENTER  1.0000
1     id8     CENTER  1.0000
2     id9     CENTER  1.0000
1     id10    MEMBER  0.4651
1     id11    MEMBER  0.4130
1     id12    MEMBER  0.4400

The --no-rename option tells the “flat” output to use the full internal labels for each fingerprint, which in this case shows that “id5” was a false singleton moved to cluster 1, and “id1”, “id2”, “id3”, “id4”, and “id9” were moved by the pruning step:

% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
   --seed 3601395311 --no-metadata --out flat --no-rename
centroid      id      type    score
2     id1     MERGED  0.2353
3     id2     MERGED  0.2973
1     id3     MERGED  0.2381
1     id4     MERGED  0.1702
1     id5     MOVED_FALSE_SINGLETON   0.3256
2     id6     CENTER  1.0000
3     id7     CENTER  1.0000
1     id8     CENTER  1.0000
1     id9     MERGED  0.2917
1     id10    MEMBER  0.4651
1     id11    MEMBER  0.4130
1     id12    MEMBER  0.4400

The “csv” and “tsv” formats are comma- and tab-separated formats which are a hybrid between the “centroid” and “flat” formats. They have one output line for each fingerprint, ordered by cluster (largest cluster first) and then by score within each cluster, with the cluster center always first.

The output fields are the cluster id, the fingerprint id, the number of elements in the cluster, the fingerprint type (“CENTER”, “MEMBER”, and more if using --no-rename), and the score.

% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
    --seed 3601395311 --out csv
cluster,id,count,type,score
1,id8,8,CENTER,1.0000
1,id10,8,MEMBER,0.4651
1,id12,8,MEMBER,0.4400
1,id11,8,MEMBER,0.4130
1,id5,8,MEMBER,0.3256
1,id3,8,MEMBER,0.2381
1,id4,8,MEMBER,0.1702
1,id9,8,MEMBER,0.2917
2,id6,2,CENTER,1.0000
2,id1,2,MEMBER,0.2353
3,id7,2,CENTER,1.0000
3,id2,2,MEMBER,0.2973

The output was formatted using Python’s csv module, with the “excel” and “excel-tab” dialects, respectively, which means the results should be easy to import into Excel. The writer may use special quoting rules should the identifier contain a comma, newline, or other special character.
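
For example, here's how Python's “excel” dialect quotes a hypothetical identifier containing an embedded comma:

# Standard-library illustration of the "excel" dialect quoting rules;
# the identifier with the embedded comma is hypothetical.
import csv
import io

out = io.StringIO()
csv.writer(out, dialect="excel").writerow(
    [1, "CHEMBL1,salt form", 8, "MEMBER", "0.4651"])
print(out.getvalue(), end="")   # 1,"CHEMBL1,salt form",8,MEMBER,0.4651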

Butina parameter tuning

The Butina parameters almost certainly require some tuning to find the appropriate threshold, and to decide if cluster pruning is appropriate.

The Butina algorithm is quite fast. It takes just a few seconds, in the default case, to cluster the NxN similarity matrix from the 2+ million fingerprints in ChEMBL. Most of the time is spent formatting the output correctly!

On the other hand, even with chemfp it can take 30 minutes or even a couple of hours to generate the matrix in the first place - making tuning an exercise in patience!

This is where the new “npz” format comes in handy. You can generate the NxN matrix once, using the lower-bound similarity threshold for your search space, and save the result to an npz file. This pre-computed matrix can then be used as input to Butina clustering. All chemfp then needs to do is count the number of entries at or above the Butina threshold, which is much faster than computing millions of Tanimoto calculations.

(Note: currently only a “no-diagonal” NxN matrix is supported. A future version may allow a full NxN matrix. Also, the entries should be ordered by decreasing score. If they are not in that order, chemfp will reorder them.)

You can use the simsearch command to generate the npz file but it’s easier to use the --save-matrix option of the Butina command. In the following I’ll write the output to /dev/null and I’ll show timing details for the individual steps:

% chemfp butina --threshold 0.4 benzodiazepine.fps.gz \
   --save-matrix benzodiazepine.npz --times -o /dev/null
load_fingerprints 0.06 NxN 0.50 save_matrix 2.79 cluster 0.00
  rescore 0.00 output 0.01 total 3.43

What took the longest was the time to write the matrix to an npz file. The relatively low threshold of 0.4, combined with the high similarity of the benzodiazepines, means there are 9,541,836 non-zero entries out of 153,400,610 possible non-zero values (the diagonals are not included), giving a density of 6.2%.

I can now cluster with the saved NxN matrix instead of using fingerprints, first using all of the entries in the matrix:

% chemfp butina benzodiazepine.npz --times -o /dev/null
load_matrix 0.25 cluster 0.00 output 0.01 total 0.34

and then using --butina-threshold to change the minimum similarity threshold for the Butina algorithm:

% chemfp butina benzodiazepine.npz --times -o /dev/null \
   --butina-threshold 0.5
load_matrix 0.25 cluster 0.00 output 0.02 total 0.35

Number of Butina clusters in ChEMBL

Those timing numbers probably don’t look that impressive for the small benzodiazepine data set. I’ll switch to “chembl_30_60.npz”, which contains the NxN similarity matrix for ChEMBL 30 at a threshold of 0.60 using RDKit’s default Morgan2 fingerprints, and see how the number of clusters changes for 0.60, 0.61, 0.62, … 0.70. I can get this number by looking at the last line of the output. I’ll also show the breakdown of the times for each clustering run:

% chemfp butina chembl_30_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 0.77 cluster 0.35 output 3.11 total 4.51
327022        CHEMBL4790513   1       CHEMBL4790513   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.61 --times | tail -1
load_matrix 0.73 cluster 0.37 output 2.96 total 4.35
354362        CHEMBL1980515   1       CHEMBL1980515   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.62 --times | tail -1
load_matrix 0.75 cluster 0.39 output 2.96 total 4.40
377894        CHEMBL1477846   1       CHEMBL1477846   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.63 --times | tail -1
load_matrix 0.74 cluster 0.42 output 3.00 total 4.47
405647        CHEMBL1497171   1       CHEMBL1497171   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.64 --times | tail -1
load_matrix 0.78 cluster 0.45 output 3.12 total 4.67
433390        CHEMBL287360    1       CHEMBL287360    1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.65 --times | tail -1
load_matrix 0.75 cluster 0.45 output 3.05 total 4.56
464418        CHEMBL3472504   1       CHEMBL3472504   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.66 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4235635.50/s]
load_matrix 0.75 cluster 0.50 output 3.22 total 4.79
499199        CHEMBL1387287   1       CHEMBL1387287   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.67 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4124939.25/s]
load_matrix 0.74 cluster 0.52 output 3.13 total 4.70
536382        CHEMBL471678    1       CHEMBL471678    1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.68 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4162832.16/s]
load_matrix 0.74 cluster 0.51 output 3.17 total 4.73
573835        CHEMBL1402501   1       CHEMBL1402501   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.69 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4030396.68/s]
load_matrix 0.74 cluster 0.53 output 3.24 total 4.81
617398        CHEMBL4552562   1       CHEMBL4552562   1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.70 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 3597378.39/s]
load_matrix 0.76 cluster 0.59 output 3.34 total 4.99
657984        CHEMBL1724866   1       CHEMBL1724866   1.0000000

The progress bars appear once clustering takes more than 0.5 seconds. You can use --progress or --no-progress to always enable or disable them, or set the environment variable “CHEMFP_PROGRESS” to “1” or “0”.

Each of these clustering jobs took about 5 seconds to run, with most of the time spent writing the output (!). That’s a lot of overhead when I only want the count.

High-level Butina API

You can also do Butina clustering using chemfp’s Python API, using the new chemfp.butina() function, as in the following:

>>> import chemfp
>>> clusters = chemfp.butina(matrix="chembl_30_60.npz", butina_threshold=0.70, progress=False)
>>> clusters
ButinaClusters('2136187x2136187 matrix, butina-threshold=0.7
tiebreaker=randomize seed=4095063231
false-singletons=follow-neighbor, took 694.85 ms', result=[657865
clusters], picker=...)

There’s a lot going on there! I loaded the npz file to use as the search matrix, then carried out a Butina clustering with a threshold of 0.7. I also disabled the progress bars, which currently don’t have the 0.5 second delay.

The ButinaClusters includes timing information. This analysis did not save the results, so it was much faster, at about 0.7 seconds.

The returned ButinaClusters repr(), which is designed for interactive use like this, summarizes the analysis parameters. The ButinaClusters itself gives a list-like API to the 657,865 clusters found.

>>> clusters.butina_threshold
0.7
>>> clusters.seed
4095063231
>>> clusters[0]
ButinaCluster(cluster_idx=0, #members=1540)
>>> clusters[0].members[:10]
[2123772, 2119185, 2119187, 2119188, 2119190, 2119192, 2119199, 2119202, 2119206, 2119215]
>>> [clusters.matrix.target_ids[i] for i in clusters[0].members[:3]]
['CHEMBL3958999', 'CHEMBL3907845', 'CHEMBL3908293']

The API makes it easy (and significantly faster!) to produce the parameter scan of the previous section. One optimization is to load the matrix once, so it can be re-used for each of the butina calls:

>>> import chemfp.search
>>> m = chemfp.search.load_npz("chembl_30_60.npz")
>>> m
SearchResults(#queries=2136187, #targets=2136187)
>>> m.matrix_type
<MatrixType.NO_DIAGONAL: 2>
>>> min(row.min() for row in m if row)
0.6
>>> max(row.max() for row in m if row)
1.0
>>> chemfp.butina(matrix=m, progress=False)
ButinaClusters('2136187x2136187 matrix, tiebreaker=randomize
seed=2054201120 false-singletons=follow-neighbor, took 513.50 ms',
result=[327076 clusters], picker=...)

With that in place, I’ll count the number of Butina clusters for every threshold from 0.6 to 0.7:

>>> import time
>>> if 1:
...   start_time = time.time()
...   for threshold in range(60, 71, 1):
...     t = threshold/100.0
...     print(t, len(chemfp.butina(matrix=m, butina_threshold=t, progress=False)))
...   print("Elapsed time:", time.time()-start_time)
...
0.6 326954
0.61 354204
0.62 377855
0.63 405589
0.64 433320
0.65 464492
0.66 499330
0.67 536273
0.68 573910
0.69 617330
0.7 657888
Elapsed time: 6.543123960494995

That’s significantly faster!

Use the ButinaClusters.save() method to save the results to a file. The output format can be specified directly, or inferred by the filename extension:

>>> clusters.save("output.txt")
>>> open("output.txt").readline()
'#Centroid/1 include-members=1\n'
>>> clusters.save("output.flat")
'#Centroid-flat/1 include-members=1\n'
>>> clusters.save("output.txt", format="flat")
>>> open("output.txt").readline()
'#Centroid-flat/1 include-members=1\n'
>>> clusters.save("output.csv")
>>> open("output.csv").readline()
'cluster,id,count,type,score\n'

Butina clustering with fingerprints

When fingerprints are passed to the butina function, it will generate the NxN similarity matrix at the given threshold (default: 0.7) before doing the clustering:

>>> with chemfp.butina(fingerprints="benzodiazepine.fps.gz", NxN_threshold=0.9) as clusters:
...   print(f"#clusters: {len(clusters)}")
...   print(f"#elements in largest cluster: {len(clusters[0])}")
...   ids = clusters.matrix.target_ids
...   print("ids:", ", ".join(ids[i] for i in clusters[0]))
...
#clusters: 9845
#elements in largest cluster: 13
ids: 9806545, 15978353, 18346686, 18346717, 21941544, 9914920,
21941549, 23376564, 9806982, 9828811, 9850353, 21941556, 21941589

Some of the Butina clustering variants require fingerprints and cannot work with only a similarity matrix. These can both be passed to the function, in this case using filenames:

>>> chemfp.butina(matrix="benzodiazepine.npz", false_singletons="nearest-center")
Traceback (most recent call last):
     ...
ValueError: Cannot use false_singletons='nearest-center' without fingerprints
>>> chemfp.butina(matrix="benzodiazepine.npz", fingerprints="benzodiazepine.fps.gz",
...   false_singletons="nearest-center")
ButinaClusters('12386 fingerprints, NxN-threshold=0.7
tiebreaker=randomize seed=3668915671 false-singletons=nearest-center
rescore=1, took 333.22 ms', result=[314 clusters], picker=...)

Butina clusters and pandas

The ButinaClusters.to_pandas() method for the clusters returned by the chemfp.butina() function exports the fingerprint details to a pandas data frame, with one row per fingerprint:

>>> import chemfp
>>> clusters = chemfp.butina(matrix="benzodiazepine.npz")
>>> clusters.to_pandas()
       cluster        id    type     score
1632         1  10572033  CENTER  1.000000
1720         1  18968942  MEMBER  1.000000
5632         1  18665677  MEMBER  0.814815
5637         1  18968923  MEMBER  0.814815
4441         1  10223808  MEMBER  0.811321
...        ...       ...     ...       ...
9995       225  20360404  CENTER  1.000000
9645       225  20360405  MEMBER  0.984127
11260      226  20360429  CENTER  1.000000
11069      226  20360430  MEMBER  0.985294
12385      227  11979873  CENTER  1.000000

[12386 rows x 4 columns]

The method takes several parameters, including the column titles to use and whether to rename the types to “CENTER” and “MEMBER” or keep the internal names. The following shows only the moved false singletons:

>>> clusters.to_pandas(columns=["i", "ID", "type", "score"],
...   rename=False).query("type == 'MOVED_FALSE_SINGLETON'")
         i        ID                   type     score
7097     1  16001255  MOVED_FALSE_SINGLETON  0.786885
2421     1  10980601  MOVED_FALSE_SINGLETON  0.711538
12088    1  20711717  MOVED_FALSE_SINGLETON  0.636364
12341    1   9854335  MOVED_FALSE_SINGLETON  0.631579
3046     1  20063249  MOVED_FALSE_SINGLETON  0.629630
...    ...       ...                    ...       ...
12210  118  21797426  MOVED_FALSE_SINGLETON  0.509091
7844   124  25262692  MOVED_FALSE_SINGLETON  0.690141
7245   144  21489312  MOVED_FALSE_SINGLETON  0.588235
3526   149  19865801  MOVED_FALSE_SINGLETON  0.560606
7302   162  22153741  MOVED_FALSE_SINGLETON  0.746032

[86 rows x 4 columns]

If you want to be really hard-core, you can dig down to the low-level fingerprint assignments:

>>> clusters.assignments.to_pandas()
       assignment_type  cluster_idx     score
0                    2           11  0.477273
1                    2           11  0.413043
2                    2           11  0.500000
3                    2           11  0.500000
4                    2           11  0.404255
...                ...          ...       ...
12381                2          121  0.402878
12382                2          121  0.402878
12383                2          134  0.493243
12384                2          134  0.439490
12385                1          226  1.000000

[12386 rows x 3 columns]

or work with them directly, in-place, as a NumPy array with a complex dtype:

>>> clusters.assignments.as_numpy()
array([(2,  11, 0.47727273), (2,  11, 0.41304348), (2,  11, 0.5       ),
       ..., (2, 134, 0.49324324), (2, 134, 0.43949045),
       (1, 226, 1.        )],
      dtype={'names': ['assignment_type', 'cluster_idx', 'score'],
      'formats': ['<i4', '<i4', '<f8'], 'offsets': [0, 4, 8],
      'itemsize': 16, 'aligned': True})

This part of the API is still under development and aspects may change in the next release.

Changed default output format name

The command-line tools simsearch, maxmin, spherex, and heapsweep have a new default format name. The default output format hasn’t changed, only its name, as specified by the --out option or the filename extension. The previous default format name was “chemfp”. The new defaults are:

  • “simsearch” for simsearch

  • “diversity” for maxmin and heapsweep

  • “spherex” for spherex

These names correspond to the name used in the first line of their respective outputs.

This change paves the way for supporting new output formats, without having “chemfp” mean different things depending on the context.

For backwards compatibility, the old “chemfp” format is still recognized. It should not be used in new code.

Output metadata options

The chemfp formats generally start with “metadata” lines which describe the output format, the parameters used to generate the output, the relevant software versions, and so on. They are meant to provide data provenance, while being easy to filter out - the metadata lines all start with a “#” while the data lines do not.

If this provenance data is not needed, for example, in a data pipeline where the results are immediately consumed by another program, then use the new --no-metadata option to omit metadata from the output. The default is --include-metadata.

% echo "C methane" | cdk2fps --MACCS --no-metadata
000000000000000000000000000000000000008000    methane

The chemfp.open_fingerprint_writer() function now supports an include_metadata parameter. If False, the header metadata is not included in the FPS output.

>>> import chemfp
>>> with chemfp.open_fingerprint_writer(None, include_metadata=False) as writer:
...   writer.write_fingerprint("AB123", b"\xfa\x49")
...
fa49  AB123

There are new command-line options for the “date” line in the metadata. The default saves the current time (in UTC) to the output, which means the output is not bytewise reproducible; reproducible output is sometimes useful.

The new --date option lets you specify the date to use, as an ISO date string. The new --no-date omits the date line from the metadata.

NOTE: A future version of chemfp will support local time zones in FPS files. This will use the zoneinfo module added to Python 3.9, with fallback to tzdata for Python 3.8. (Note: Python 3.8 end-of-life for Python core maintainer support is October 2024.)

spherex changes

There are several changes related to chemfp’s spherical exclusion implementation, called “spherex”.

Multi-threaded

For each iteration, the spherex algorithm first picks a fingerprint (using one of several methods) then finds all other remaining fingerprints which are sufficiently close to the newly picked fingerprint, and assigns them to a new cluster.

In chemfp 4.1 the “finds all other remaining fingerprints” step has been parallelized using OpenMP.

The default number of threads depends on the OpenMP configuration, and likely the number of available cores. I’ve found that you can use several times more threads than cores and still get useful speedup, so if performance is important you should try larger values.

As with simsearch, you can specify the number of threads using the OMP_NUM_THREADS environment variable. You can also set it on the spherex command-line using the --num-threads/-j option, and with the chemfp.spherex() API using the num_threads parameter.

Parallelism introduces some non-determinism in the order that sphere members are added to a cluster. The diversity result classes now implement PicksAndScores.reorder() and Neighbors.reorder() with the same orderings as SearchResult.reorder(), which can be used to sort the list by decreasing score, and generate a canonical ordering.

The new PicksAndScores.move_pick_index_to_first() function moves the given member to the first element. Together with the reorder function these can be used to put the list in canonical order, with the cluster center as the first element.

Output format

By default the spherex command generates output in “spherex” format (previously called “chemfp” format). While the format hasn’t changed, there are additional guarantees about the default output order. The sphere center is always the first element in the members list (if present), and the remaining members are always given in a canonical order, sorted by decreasing score.

Spherex now also supports the “centroid” output format, matching the same format for Butina clustering. The idea is to make it easier to switch between the two algorithms.

In this release “centroid” output by default only includes the picked sphere centers. Use --include-members to include all of the sphere members in the output. NOTE! The default “centroid” output from chemfp butina is currently different from the one from chemfp spherex. The butina output includes both the center and the members, and requires the --no-members option to output only the centers.

The long-term goal is to make these the same, where the default includes the members. For future compatibility, please use --no-members if you really want the current default.

The --include-members and --no-members options are aliases for the older --include-hits and --no-hits options. The older options are currently supported, for backwards compatibility, but hidden from the --help. They should not be used and will likely eventually be removed.

spherex does not yet implement the “flat” centroid format of the butina command.

New ranking formats

There are two new ways to specify fingerprint ranks for directed sphere exclusion when using the --ranks option. Instead of specifying a rank order or weight, they order the picks by file position - the first pick is the first id in the file, the second pick is the next available id, and so on.

The “txt” format contains one line per id, with an optional header. Use --ranks-has-header if the first line is a header.

Alternatively you can specify a fingerprint file, which will be processed in input order. (As a reminder, by default FPB files are sorted in popcount order. It’s possible to store the FPB file in input order using the fpcat command-line tool, but in that case it’s probably better to use the original FPS directly.)

The fingerprint ranking option exists for cases where the fingerprints in the candidate file are in rank order. In that case, the file is specified once for the candidates and again for --ranks:

chemfp spherex dataset.fps --ranks dataset.fps

csv background

Here is some background about what “csv” means in a chemfp context. It sets the stage for the next section.

The term csv is an initialism for “Comma-Separated Values”, but it is also used to describe several other related formats, like tab-separated values. I will use it to mean files that can be parsed with Python’s csv module.

In chemfp, the “csv” format is the dialect of the comma-separated format following the quoting rules that Excel uses. (Quoting rules are used to, for example, allow a comma in a field without it being interpreted as a separator.) It is the same as Python’s “excel” dialect, which chemfp accepts as an alias.

In chemfp, the “tsv” format is the dialect of the tab-separated format following the quoting rules that Excel uses. It is the same as Python’s “excel-tab” dialect, which chemfp accepts as an alias.

Python’s csv module lets you specify your own csv dialect, with options for which delimiter character to use, how escaping and quoting work, and whether or not initial spaces should be ignored. These are:

  • delimiter: The character used to separate fields

  • escapechar: On reading, the escapechar removes any special meaning from the following character

  • quotechar: A one-character string used to quote fields containing special characters

  • quoting: Controls if the reader should recognize quoting

  • skipinitialspace: If True, ignore leading spaces in a field.

For details, see the Python documentation.

The chemfp csv2fps command-line tool lets you specify these via command-line options.

It starts with the specified --dialect (defaulting to “csv”), then applies the other specified terms, like --separator. Certain characters are difficult to write on the command-line so the --separator, --quotechar and --escapechar also accept the aliases “tab”, “backslash”, “space”, “quote”, “doublequote”, “singlequote”, or “bang”.

The get_dialect() function in the new chemfp.csv_readers module is the equivalent in the Python API. The returned CSVDialect can be used with both the chemfp API and Python’s csv module.
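
Here's a sketch of using it; the keyword names passed to get_dialect() are assumptions based on the options listed above, and “example.csv” is a hypothetical file:

# A sketch; the get_dialect() keyword names are assumptions.
import csv
from chemfp.csv_readers import get_dialect

dialect = get_dialect("csv", delimiter="\t", skipinitialspace=True)

with open("example.csv", newline="") as infile:
    for row in csv.reader(infile, dialect=dialect):
        print(row)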

“sniffing”

The csv module has a way to “sniff” the dialect. It reads the start of the file and uses a number of heuristics to determine the dialect parameters, and to determine if the first line is a header.

Chemfp implements the special “sniff” dialect which uses this method to get the (hopefully) correct settings.

My experience with it has been a bit iffy; it strongly depends on the dataset you have. It’s better to use it with chemfp csv2fps --describe to get an initial idea of the actual format than to use it as the default dialect.

MolPort csv file

Some datasets are in csv format with a structure (usually SMILES or InChI) stored in one of the columns, and an identifier stored in another. One such example is the MolPort distribution, which I’ll use for these examples.

It is a tab-delimited format with a header line containing titles, followed by the data. Here’s an example, which unfortunately overflows the available display space, so I’ve trimmed it to the first 70 or so characters:

% gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | head -2
SMILES    SMILES_CANONICAL    MOLPORTID    STANDARD_INCHI    INCHIKEY...
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1   Cl.CCC(C(=O)OC1CC2CCC(C...

That’s still enough to see there are a couple of SMILES fields, a MolPort id, and an InChI. You can also see it’s a tab-separated file, though I’ve converted the tabs to multiple spaces for this output.

There are a lot of tools for working with CSV files, including the popular xsv and its variants. Those are very fast and flexible, with ways to filter, join, and manipulate the data. For example, the following line for the Bash shell will select the SMILES and MOLPORTID columns (from a tab-delimited file), then re-format the comma-delimited output to tab-delimited, to get a tab-delimited SMILES file with a header:

% gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz |  \
     xsv select SMILES,MOLPORTID -d $'\t' | xsv fmt -t $'\t' | head -3
SMILES        MOLPORTID
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1      MolPort-009-675-630
CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1   MolPort-009-675-631

This uses Bash’s ANSI-C quoting to insert the tab character. There are other ways to insert the tab character, depending on your shell and/or terminal emulator.

csv2fps command

While it’s not hard to convert a CSV file to SMILES for input to chemfp fingerprint generation, with chemfp 4.1 you can now process CSV files directly using the chemfp csv2fps command.

I’ll use “fulldb_smiles-009-000-000--009-499-999.txt.gz” from the MolPort distribution, but because that’s rather long I’ll refer to it using the “$FILENAME” environment variable.

csv2fps --describe

When getting started with a new CSV file, use the special --describe option to have chemfp describe what it thinks the file contains. By default it parses the file using the “csv” dialect, that is, comma-delimited with a header. It tries to show the titles and the columns of the first data line:

% chemfp csv2fps --describe $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: csv
  delimiter: ',' (default)
  quotechar: '"' (default)
  escapechar: None (default)
  doublequote: True (default)
  skipinitialspace: False (default)
  quoting: minimal
has_header: True
Titles:
  #1: 'SMILES\tSMILES_CANONICAL\tMOLPORTID\tSTANDARD_INCHI\tINCHIK ... '
Columns for the first row:
  #1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1\tCl.CCC(C(=O)OC1CC ... '
  #2: '14-17H'
  #3: '3'
  #4: '9-12H2'
  #5: '1-2H3;1H\tVEUYDINLFCGWRM-UHFFFAOYSA-N\tin stock'

Long strings are trimmed and replaced with “…” so they easily fit into an 80-column display.

The output fields are printed using Python’s string encoding. The many “\t” characters are strong evidence that the file is tab-separated.

I can use the “sniff” dialect to have Python’s csv module use heuristics to identify the dialect:

% chemfp csv2fps --describe --dialect sniff $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect:
  delimiter: '\t'
  quotechar: '"' (default)
  escapechar: None (default)
  doublequote: False
  skipinitialspace: False (default)
  quoting: minimal
has_header: True
Titles:
  #1: 'SMILES'
  #2: 'SMILES_CANONICAL'
  #3: 'MOLPORTID'
  #4: 'STANDARD_INCHI'
  #5: 'INCHIKEY'
  #6: 'AVAILABILITY'
Columns for the first row:
  #1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1'
  #2: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)c1ccccc1'
  #3: 'MolPort-009-675-630'
  #4: 'InChI=1S/C18H25NO2.ClH/c1-3-17(13-7-5-4-6-8-13)18(20)21-16- ... '
  #5: 'VEUYDINLFCGWRM-UHFFFAOYSA-N'
  #6: 'in stock'

or I can specify the “tsv” dialect, which gives the same titles and same columns for the first line.

Use --no-header if the CSV file does not start with a header. For this file that option produces the following:

% chemfp csv2fps --describe --dialect tsv $FILENAME --no-header
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: tsv
  delimiter: '\t'
  quotechar: '"' (default)
  escapechar: None (default)
  doublequote: True (default)
  skipinitialspace: False (default)
  quoting: minimal
has_header: False
Titles: not available
Columns for the first row:
  #1: 'SMILES'
  #2: 'SMILES_CANONICAL'
  #3: 'MOLPORTID'
  #4: 'STANDARD_INCHI'
  #5: 'INCHIKEY'
  #6: 'AVAILABILITY'

csv2fps and column specifiers

The default for the csv2fps command is to parse the file in the “csv” dialect (or “tsv” if the file ends in “.tsv”), treat the first line as a header, get the identifiers from the first column and the SMILES from the second column, and generate RDKit Morgan2 fingerprints. I know the file is tab-delimited (see the previous section), so I’ll specify the “tsv” format and show the first two lines of fingerprint output:

% chemfp csv2fps -d tsv $FILENAME | head -8 | fold -w 70
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0
 useBondTypes=1
#software=RDKit/2021.09.4 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T09:10:17
0200000000000000000001000000800000000800000000000000000000000000000000
0040000000000004000000002020000000000000000000080000000000000000000000
4000000000000000000000040000000080000000000000000000000000008800000040
0000000000000000000040800000000000000000000008000000040202000001000000
4000000200000000008000000000000000000000001040000020000000002000100000
0000000000000000000000000000000000040100000020000000000002000000000000
0000000000000000400000000000000000000000000000000200000040300000000000
0000000000000000000000        Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1
0200000000000000008001000000200000000000000000000000001000002000000000
0800000000000000000000000020000000000000000000000000000000000000000000
0020000000000000000000040000000000000000000001000000000000408000000000
0000020000000000000000000000000000000000000002000000000204000001004000
0000000001000000008000000000000000000000000000000000000000002000100000
0000000000000000000000000000000000000000000020000000000108000000000000
0080000000000000400400000000000000000000000000010200000000200002000000
0000000000400000000000        CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1

This ended up interpreting the “SMILES” column as the id and the “SMILES_CANONICAL” column as the SMILES, which is why csv2fps is able to generate output, though most people likely want the id to come from the MOLPORTID column.

Use --id-column (or --id-col for short) to specify which column contains the ids. In the following I’ll specify it by title, and I’ll suppress the header lines (the “metadata”) using the --no-metadata option, so the output contains just the first two fingerprint records:

% chemfp csv2fps -d tsv $FILENAME --id-col MOLPORTID --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000    MolPort-009-675-630
02000000000000000080010000002000000000000000000000000010000020000000000800000000
00000000000000002000000000000000000000000000000000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000000200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000000000000000000001020000000020
00020000000000000000400000000000    MolPort-009-675-631

You can only specify a column title if the file contains headers.

You can also specify a column by its number, starting from 1. I’ll use --molecule-column (or --mol-col for short) to read the SMILES from column #2 (“SMILES_CANONICAL”), and --id-col to read the id from column #3 (“MOLPORTID”):

% chemfp csv2fps -d tsv $FILENAME --id-col 3 --mol-column 2 --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000   MolPort-009-675-630
02000000000000000080010000002000000000000000000000000010000020000000000800000000
00000000000000002000000000000000000000000000000000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000000200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000000000000000000001020000000020
00020000000000000000400000000000   MolPort-009-675-631

If the specified column name is an integer then it is treated as an index, with 1 indicating the first column (0 and negative values are out of range).

If the first character of the column specifier is an “@” then the rest of the specifier is used as the column title. For example, if the second column is titled “10” then use “@10” as the column specifier. If the column title is “@A” then use “@@A” as the column specifier.

If the first character of the column specifier is a “#” then the rest of the specifier must be an integer, which is interpreted as the column index. For example, use “#3” to match the third column. If the column title starts with a “#” then use the “@” prefix (eg, if the column is “#entry” then specify it as “@#entry”).
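
For example, the following is a sketch of the same id and molecule selection as before, but written with the “#” and “@” prefixes made explicit (neither is required here, since the columns are unambiguous):

% chemfp csv2fps -d tsv $FILENAME --id-col "#3" --mol-col "@SMILES_CANONICAL" --no-metadata | head -2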

Specify the csv2fps input structure format

By default csv2fps parses the molecule column as SMILES (in “smi” format). Use --format to specify another format. For example, the MolPort file column STANDARD_INCHI contains an InChI string, and the INCHIKEY column contains the corresponding InChIKey. I’ll read the structures from the former column, and the id from the latter, and have chemfp parse the molecule as “inchi”:

% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
  --format inchi --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000      VEUYDINLFCGWRM-UHFFFAOYSA-N
02000000000000000080010000000000000000000000000000000000000000000010000800000000
00000000000000002000000000000000000000000000000800000000000000000004000000000000
00000000000000000000210000000000000000008000000000000002000000004000000000000000
00000000000000020000000002040000010000000000000000000000008000000000000000000000
00000000000000000000200010000000000010000000000000000000000000000100000000200000
00000000000000000000000000000008000440040800080000000000000000000001020000000200
00000000000000000000000000000000      BAHWBLGZSGIKQC-UHFFFAOYSA-N

If you leave out the format, or specify “smi” format, then it will try to interpret the InChI strings as SMILES, which will fail. As a safety mechanism, csv2fps will exit if none of the molecules in the first 100 records can be parsed. That looks like:

% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
    --format smi --no-metadata | head -2 | fold
  ... many error lines removed ...
[11:56:09] SMILES Parse Error: syntax error while parsing: InChI=1S/C10H8C
l2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)
[11:56:09] SMILES Parse Error: Failed parsing SMILES 'InChI=1S/C10H8Cl2N4/
c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)' for in
put: 'InChI=1S/C10H8Cl2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5
H,(H3,13,14,15,16)'
Error: Each of the first 100 records contained structure errors.
  Final error: RDKit cannot parse the SMILES 'InChI=1S/C10H8Cl2N4/c11-6-3-1-
2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)', column 'STANDAR
D_INCHI' (#4), record 100, line 101, 'fulldb_smiles-009-000-000--009-499-9
99.txt.gz'.
Exiting.

Use the -R command-line option to specify toolkit- and format-specific reader arguments. As a shortcut to enable or disable CXSMILES extension support, use --cxsmiles (enabled by default) and --no-cxsmiles.
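
Here’s a sketch of what that might look like; I’m assuming the default RDKit toolkit is used and that its SMILES reader accepts a “sanitize” reader argument:

% chemfp csv2fps $FILENAME -d tsv --no-cxsmiles -R sanitize=false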

Structure formats with embedded identifiers

The previous examples read the id from one column, and the molecule from another, as a SMILES or InChI string. Sometimes the column contains a more complex record, like a SMILES record with both a SMILES string and identifier, or even an embedded SDF record. (The CSV format supports multi-line column entries.)

Use the --id-from-molecule option (or --id-from-mol for short) to have csv2fps get the identifier from the parsed molecule record. For a SMILES or InChI record the identifier defaults to the rest of the column entry after the structure; this can be changed with the --delimiter option.

If the column entry is an SDF record then by default the identifier comes from the title line. Use --id-tag to get the identifier from the named data value.
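
Here’s a sketch of what that might look like, using a hypothetical file “records.csv” whose “MOLFILE” column holds SDF records with a “Compound_ID” data item:

% chemfp csv2fps records.csv --mol-col MOLFILE --format sdf --id-from-mol --id-tag Compound_ID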

Specify an alternate fingerprint type

The examples so far generated RDKit Morgan fingerprints. To use an alternative fingerprint type, specify a chemfp type string with --type, or use the --using option to get the fingerprint type from the metadata of an existing fingerprint file.

The following uses Open Babel to generate FP2 fingerprints:

% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type OpenBabel-FP2 \
   | head -8 | fold
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/3.1.0 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:21:04
00260200002002000600060000020000001000100000000000000080010400001000500080000001
100a001820010100010200000020c004000000020000a20808100800000004400000880002800005
0090015000003008c008080000080000400000200000000010000000400008000007010000011000
0000000140000000      MolPort-009-675-630
00060000004004200b0006000001000c0000009000000000000000c0000800001800408300000000
00862038000100800102000008000000000000020000000000000400000400000020000002880006
00b0004000003008c0080900800800000c1000220000000000000000000000000006010400011000
0032800000000000      MolPort-009-675-631

The following uses CDK to generate CDK’s own implementation of the 881-bit PubChem fingerprints:

% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type CDK-Pubchem \
   | head -8 | fold
#FPS1
#num_bits=881
#type=CDK-Pubchem/2.0
#software=CDK/2.8 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:22:50
005e0c002000000000000000000000000080060000003c0200000000000000800000007800000000
00b03c8719604c10c10020001140044b100040000004000010118010001110044c01a90861040064
03801111e01913077101000000000000000000000000000000000000000000        MolPort-009-675-
630
00de0c000000000000000000000000000000000000000c0600000000000000800200007800080000
0030148319204c00410300001140844a080041000004000010118111201110064c01898c29041006
69001111e01811017100000000000000000000000000000000000000000000        MolPort-009-675-
631

The chemfp type string may contain fingerprint parameters. The following uses OEChem to generate OEGraphSim circular fingerprints, folded to 128 bits instead of the default 4096 bits:

% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 \
   --type "OpenEye-Circular numbits=128" | head -8 | fold
#FPS1
#num_bits=128
#type=OpenEye-Circular/2 numbits=128 minradius=0 maxradius=5 atype=Arom|AtmNum|C
hiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:24:44
905a04205c4473c607f95c06123b2026      MolPort-009-675-630
a07873c1407c5c7617830d8428910170      MolPort-009-675-631

Finally, here’s an example using --using to get the fingerprint type from the metadata of an existing fingerprint file:

% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --using reference.fps | \
        head -7 | fold
#FPS1
#num_bits=4096
#type=OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 atype=Arom|AtmNum|
Chiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:29:18
00000000000000000000000000000000000000000000080000000010000000000000000800000000
00200000000040000000000000000000000000000000000000000000000008000004000001080000
00000200000000000000000200000000000000000000100000000000002000200000000000010800
00008000000000000000000000000000000000008000040000000000040000000000000000002000
00008000020000000000000000000000000000040000000000000000000000000000000000800000
00000000000080000000000000000000000000000000000000000000000000000000000000000000
00800000000002000000000000000020002000000100000000080000000000000000000000000000
00000000000000000000000000000000000020000000000000000000000000000000080000000000
00000000000080008000000000000008000000002400000000000000000000000000000000020000
00000000000000004000000100100000000048000801840000000000001000000000020000000000
00000000000000000000000000000000000000000000000000000000000000000000000000001000
20008000000000000000000000200000000000000000000000000000000000000400000000000000
0020000804000000000000000000000000000000000000000000000000000000      MolPort-
009-675-630

CSV files with fingerprint data

Instead of storing a molecule, the CSV file might contain fingerprint data, represented in one of various encodings like hex or base64.

If this is the case, use --fingerprint-column (or --fp-column or --fp-col for short) to indicate that the CSV file contains fingerprint data, and to indicate which column contains that data.

csv2fps supports the same decoder options as sdf2fps. The default, --hex, treats the fingerprint as hex-encoded, with bit 0 as the first bit of the first byte. Use chemfp csv2fps --help for the full list of fingerprint decoder options.
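
For example, a hypothetical “fingerprints.csv” with ids in an “ID” column and hex-encoded fingerprints in an “FP_HEX” column might be converted with the following sketch (the --hex is shown even though it is the default):

% chemfp csv2fps fingerprints.csv --id-col ID --fp-col FP_HEX --hex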

CSV character encoding

The “character set” describes how the text in the CSV file is encoded as a sequence of bytes. By default csv2fps assumes the CSV file is UTF-8 encoded.

This may be a problem with files written by programs which use another encoding, like “utf16” or “cp1252” on Windows, or “latin1” from older Unix tools.

Use --encoding to specify the input encoding, which must be one of Python’s supported encodings.

Use --encoding-errors to specify how to handle input which cannot be decoded. This must be one of the supported Python error handlers: “strict”, “ignore”, “replace”, or “backslashreplace”.
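
For example, here’s a sketch of reading a hypothetical CSV file exported from a Windows program:

% chemfp csv2fps exported.csv --encoding cp1252 --encoding-errors replace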

csv2fps processing errors

csv2fps can run into two main types of problems when processing a row of the file: 1) the row may not have enough columns, and 2) the toolkit may be unable to parse the molecule column or the fingerprint column.

The --errors option specifies how to handle molecule and fingerprint parse errors. The default is “report” for molecules and “strict” for fingerprints. The “strict” handler prints an error message and stops processing. The “report” handler prints an error message and keeps on processing. The “keep” handler simply keeps on processing.

The --csv-errors option specifies how to handle CSV parse errors. The default is “strict”, which requires the id and molecule (or fingerprint) columns to exist, and prints an error message and exits if they do not. The other choices are “report” and “keep”.
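
For example, the following sketch reports both molecule parse errors and CSV parse errors but keeps on processing:

% chemfp csv2fps $FILENAME -d tsv --errors report --csv-errors report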

csv2fps TODO

There is currently no high-level chemfp.csv2fps Python function which is equivalent to the command-line tool. This will be added in a future version of chemfp. Until then you can use one of the CSV readers described below.

chemfp.csv_readers module

Chemfp 4.1 added the csv_readers module with functions for working with CSV files.

To start with, there is a utility function for specifying the CSV dialect. It starts with a base dialect and lets you override individual values. The result can be used as a chemfp dialect and as input to Python’s own csv module.

>>> from chemfp import csv_readers
>>> csv_readers.get_dialect("csv")
CSVDialect(delimiter=',', quotechar='"', escapechar=None,
doublequote=True, skipinitialspace=False, quoting=0 ('minimal'),
lineterminator='\r\n', strict=False)
>>> csv_readers.get_dialect("csv", doublequote=False)
CSVDialect(delimiter=',', quotechar='"', escapechar=None,
doublequote=False, skipinitialspace=False, quoting=0 ('minimal'),
lineterminator='\r\n', strict=False)

The lowest-level interface reads a CSV file, parses any available header, and iterates over rows in the file.

>>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz"
>>> reader = csv_readers.read_csv_rows(filename, dialect="tsv")
>>> reader.titles
['SMILES', 'SMILES_CANONICAL', 'MOLPORTID', 'STANDARD_INCHI',
'INCHIKEY', 'AVAILABILITY']
>>> next(reader)
['Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1', 'Cl.CCC(C(=O)OC1CC2CCC
(C1)N2C)c1ccccc1', 'MolPort-009-675-630', 'InChI=1S/C18H25NO2.ClH/c
1-3-17(13-7-5-4-6-8-13)18(20)21-16-11-14-9-10-15(12-16)19(14)2;/h4-
8,14-17H,3,9-12H2,1-2H3;1H', 'VEUYDINLFCGWRM-UHFFFAOYSA-N', 'in sto
ck']
>>> reader.close()

It can also be used as a context manager, if you don’t want to call close() explicitly or depend on Python’s garbage collector.
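
For example (a minimal sketch, reusing the same file):

>>> with csv_readers.read_csv_rows(filename, dialect="tsv") as reader:
...     print(reader.titles[:2])
...
['SMILES', 'SMILES_CANONICAL']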

The read_csv_ids_and_molecules_with_parser() function gives a higher-level, albeit somewhat complicated, way to iterate over (id, molecule) pairs in the file. It requires a function which takes the text of the molecule field and returns an (id, molecule) pair – the id is only used if id_column is None:

>>> def convert(mol_text):
...    return (None, (len(mol_text), mol_text))
...
>>> reader = csv_readers.read_csv_ids_and_molecules_with_parser(
...   filename, convert, dialect="tsv", id_column=3, mol_column=1)
>>> reader
CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol
from column 'SMILES' (column #1), dialect='tsv', record_format=None,
filename='fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> for title in reader.titles: print(repr(title))
...
'SMILES'
'SMILES_CANONICAL'
'MOLPORTID'
'STANDARD_INCHI'
'INCHIKEY'
'AVAILABILITY'
>>> next(reader)
('MolPort-009-675-630', (40, 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1'))

This is easily modified to return only the molecule text rather than the (length, text) pair.
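
For example, here’s a sketch where the parser simply returns the molecule text as the “molecule”:

>>> def convert(mol_text):
...    return (None, mol_text)
...
>>> reader = csv_readers.read_csv_ids_and_molecules_with_parser(
...   filename, convert, dialect="tsv", id_column=3, mol_column=1)
>>> next(reader)
('MolPort-009-675-630', 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1')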

The columns can be specified as integers (the value 1 means the first column) or strings (to use the named column title).

This API is designed to be compatible with the “id and molecule” parser created by the toolkit wrappers:

>>> from chemfp import openeye_toolkit
>>> converter = openeye_toolkit.make_id_and_molecule_parser("smi")
>>> reader = csv_readers.read_csv_ids_and_molecules_with_parser(
...   filename, converter, dialect="tsv", id_column=3, mol_column=1)
>>> next(reader)
('MolPort-009-675-630', <oechem.OEGraphMol; proxy of <Swig Object
of type 'OEGraphMolWrapper *' at 0x10e184db0> >)

NOTE: If you want to use a toolkit wrapper to parse molecules from a CSV file then you almost certainly want to use the new - and simpler - toolkit wrapper function read_csv_ids_and_molecules.

Finally, use read_csv_ids_and_fingerprints() to read the ids and encoded fingerprints from a file. The default uses the “hex” decoder (chemfp.bitops.hex_decode()) but you can pass in your own function returning the number of bits (which can be None) and the decoded fingerprint as a byte string:

>>> def decode_fp(fp_text):
...   x = 42  # dummy value; a real decoder would compute this from fp_text
...   return 32, x.to_bytes(4, "big")
...
>>> reader
CSVFingerprintReader(id from column 'MOLPORTID' (column #3), fp from
column 'SMILES_CANONICAL' (column #2), dialect='tsv', filename=
'fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> next(reader)
('MolPort-009-675-630', b'\x00\x00\x00*')

New toolkit wrapper functions to read CSV files

The toolkit wrappers have a new read_csv_ids_and_molecules() function which reads ids and molecules from a CSV file, using a specified structure format.

>>> from chemfp import rdkit_toolkit as T
>>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz"
>>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv",
...   id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi")
>>>
>>> reader
CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol
from column 'STANDARD_INCHI' (column #4), dialect='tsv', record_format=
'inchi', filename='fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> next(reader)
('MolPort-009-675-630', <rdkit.Chem.rdchem.Mol object at 0x10b0077d0>)

This works for all of the supported cheminformatics toolkits:

>>> from chemfp import rdkit_toolkit as T
>>> from chemfp import openbabel_toolkit as T
>>> from chemfp import openeye_toolkit as T
>>> from chemfp import cdk_toolkit as T

There is currently no direct way to read the molecules and convert them to fingerprints as an (id, fingerprint) pair. You will need to handle the conversion yourself, like the following:

>>> from chemfp import rdkit_toolkit as T
>>> fptype = T.morgan(fpSize=128)
>>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv",
...   id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi")
>>>
>>> id_fp_iter = ((id, fptype.from_mol(mol)) for (id, mol) in reader)
>>> next(id_fp_iter)
('MolPort-009-675-630', b'&\x15HD\xca\xa2\xc0\x00A\x00o\x02P\x00\xc0:')

The next version of chemfp will likely have a function for this operation.

translate command

Chemfp uses a “toolkit wrapper” to provide a consistent API for reading and writing structure files for each of the supported toolkits, and for generating fingerprints using the toolkit-specific fingerprint generators.

Chemfp 4.1 adds a new chemfp translate command-line tool which uses that wrapper API to read structure files in one format and write them to a file in another format.

If you do not specify which toolkit to use (via the --in-toolkit option, or --in-tk or -T for short), chemfp will use the value of the “CHEMFP_TOOLKIT” environment variable, which must be a comma-separated list of toolkit names. If that variable is not set then it uses the string “rdkit,openbabel,openeye,cdk”, that is, it first checks for the RDKit toolkit, then the Open Babel toolkit, then OpenEye’s OEChem toolkit, and finally the CDK toolkit (via the jpype Python/Java adapter).
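
For example, to prefer CDK and fall back to RDKit (a sketch; any comma-separated subset of the toolkit names will do):

% export CHEMFP_TOOLKIT=cdk,rdkit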

These examples will use the RDKit toolkit.

If unspecified, the translate command reads and writes in “smi” format, including possible CXSMILES extensions, and the output will be canonical SMILES but without CXSMILES:

% echo "OCC ethyl alcohol" | chemfp translate
CCO ethyl alcohol
% echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | chemfp translate
C[C@H](F)[C@H](C)[C@@H](C)Br example

Open Babel generates the same canonical SMILES:

% echo "OCC ethyl alcohol" | chemfp translate -T openbabel
CCO ethyl alcohol

Use --in and --out to specify the input and output format:

% echo "OCC phenol" | chemfp translate --out sdf3k
phenol
     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 3 2 0 0 0
M  V30 BEGIN ATOM
M  V30 1 O 0.000000 0.000000 0.000000 0
M  V30 2 C 0.000000 0.000000 0.000000 0
M  V30 3 C 0.000000 0.000000 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$
% echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | \
    chemfp translate --out cxsmi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example

or let the formats be determined by the input and output filename extensions:

% chemfp translate phenol.smi -o phenol.inchi
% cat phenol.inchi
InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 phenol

While chemfp can convert between different formats, it is not meant as a replacement for a tool like Open Babel, which is able to add or remove hydrogens, assign coordinates, and more as part of the translation process.

Translate with the ‘text’ toolkit

Chemfp includes the “text” toolkit, which knows just enough about SMILES and SDF formatted files to read and write records. It is not a cheminformatics toolkit and it does not handle chemistry or format conversion.

As part of chemfp 4.1’s CXSMILES support, the text toolkit now uses heuristics to try to identify the CXSMILES extension in a SMILES file record.

This means the text toolkit can be used with translate as a way to remove the CXSMILES extensions from a SMILES file, without fully processing the SMILES string into a molecule object:

% cat test.smi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example
% chemfp translate test.smi -T text
C[C@H](F)[C@H](C)[C@@H](C)Br example
% chemfp translate test.smi -T text --out cxsmi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example

Unlike the chemistry toolkits, which only support UTF-8-encoded input formats, the text toolkit supports other extended ASCII encodings, like latin1, which means it can be used to convert between two different encodings:

% chemfp translate -T text --in-encoding latin1 latin1.sdf -o utf8.sdf

However, you should almost certainly use a dedicated conversion tool for this purpose, like iconv:

% iconv -f latin1 -t utf-8 latin1.sdf > utf8.sdf

Translate via an intermediate format

The translate command can also do the conversion in two steps: it reads the record in the input format, converts it internally to an intermediate “via” format (which defaults to “sdf”), then parses that intermediate record and writes it in the output format.

For example, if the input file is in OpenEye’s “oeb” format, and you want to convert it to Java code which will create the corresponding CDK molecule, then you can use OEChem to convert the OEB input to SDF and have CDK convert the SDF to “cdkjava” format, with the following:

% chemfp translate -T openeye theobromine.oeb -U cdk --out cdkjava
{
  IChemObjectBuilder builder = DefaultChemObjectBuilder.getInstance();
  IAtomContainer mol = builder.newInstance(IAtomContainer.class);
  IAtom a1 = builder.newInstance(IAtom.class,"C");
  a1.setFormalCharge(0);
  a1.setPoint2d(new Point2d(2.1348, 0.7541));
  mol.addAtom(a1);
     ....

Admittedly this is mostly a technically interesting gimmick as I haven’t come up with a good example where this would be a useful way to convert between two chemistry formats.

One place it might be useful is handling a legacy Latin-1 encoded structure file (or one which uses another extended ASCII encoding) and converting it to another chemistry format. The following uses the “text” toolkit to read the Latin-1 encoded “latin1.sdf” into an intermediate UTF-8-encoded SDF, then has RDKit translate the result into SMILES, sending the results to stdout:

% chemfp translate -T text --in-encoding latin1 -U rdkit latin1.sdf

translate_record function

The toolkit wrappers have a new translate_record function which converts a single record from one format to another. By default it parses a “smi” record into a molecule, then creates and returns a record in “smi” format. The following shows how the four supported cheminformatics toolkits canonicalize phenol:

>>> import chemfp
>>> from chemfp import rdkit_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'
>>> from chemfp import openeye_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'c1ccc(cc1)O phenol\n'
>>> from chemfp import openbabel_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'
>>> from chemfp import cdk_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'C1=CC=C(C=C1)O phenol\n'

while here you can see they all generate the same InChI string:

>>> import chemfp
>>> for T in (chemfp.rdkit, chemfp.openeye, chemfp.openbabel, chemfp.cdk):
...   print(T.name.rjust(10),
...         T.translate_record("c1ccccc1", in_format="smistring", out_format="inchistring"))
...
     rdkit InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
   openeye InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
 openbabel InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
       cdk InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H

Structure I/O helper functions

Chemfp 4.0 introduced a few helper functions to make it easier to read and write structure files, and process structure records, when the format was known.

These were:

  • from_smi(): parse a SMILES record into a molecule

  • from_smi_file(): read a SMILES file iterating over molecules

  • to_smi(): convert a molecule to SMILES

  • to_smi_file(): open an output file for writing molecule as SMILES

plus equivalents for SDF and InChI, and functions for reading and writing SMILES strings and InChI strings (these contain only a single structure, with neither identifier nor newline).

These names proved hard to remember because they were not consistent with the rest of the chemfp API and did not include enough information about what they did.

For example, if I wanted to read a SMILES file I would start typing read_ then use tab-completion to see what was available. That would show read_ids_and_molecules and read_molecules, but not from_smi_file.

Even if I remembered to use from_smi_file, I would forget if the iterator only returned molecules, or the more useful (id, molecule) pairs.

I would also forget that from_smi was to read a record in “smi” format, and not (as I thought) to read a file in “smi” format.

The new shortcut functions in chemfp 4.1 are:

  • read_smi_ids_and_molecules(): read a SMILES file iterating over (id, molecule) pairs.

  • read_smi_molecules(): read a SMILES file iterating over molecules

  • parse_smi(): parse a “smi” record into a molecule

  • create_smi(): convert a molecule to a “smi” record

  • open_smi_writer(): open an output file for writing molecules as SMILES

These are also available for “sdf” and “inchi” formats, as well as the parse/create functions for “smistring” and “inchistring”, which are the format names for just the SMILES and InChI components of the “smi” and “inchi” records, respectively.
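
For example, here’s a sketch using the RDKit wrapper; the output matches the RDKit canonicalization shown earlier in the translate_record examples:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_smi("c1ccccc1O phenol")
>>> T.create_smi(mol)
'Oc1ccccc1 phenol\n'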

A few other formats, like “helm” and “fasta”, are available on a toolkit-specific basis, but are probably not useful and are likely to be removed in the future unless people tell me they like them.

In addition, “*_from_string()” and “*_to_string()” variants are now supported, like read_smi_ids_and_molecules_from_string() and open_smi_writer_to_string().
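
Here’s a sketch of reading from a string (I’m assuming the content is passed as the first argument):

>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_smi_ids_and_molecules_from_string("CCO ethanol\nOc1ccccc1 phenol\n")
>>> [id for id, mol in reader]
['ethanol', 'phenol']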

The old shortcuts are still available but generate a PendingDeprecationWarning. They will be removed in a future version of chemfp.

Other API changes

Chemfp uses a random number generator for some of the diversity selection and clustering options. The RNG is initialized with a 64-bit integer seed. If the seed is not specified then chemfp uses Python’s RNG to generate the seed at random (Python’s RNG uses a number of methods to generate its own seed). Chemfp 4.0 requested a 64-bit number from Python, which takes about 20 digits to represent in the output parameters and drew the eye far more than it was useful. Chemfp 4.1 requests a 32-bit number as the initial seed, which shouldn’t meaningfully affect the scientific results.

The FingerprintArena.copy() method now accepts an ids argument. If specified, it is the list of ids to use in the copy. This is especially useful when making a subset from a list of indices and you want to be able to map the new subset back to the original indices. In that case use:

b = a.copy(indices=offsets, ids=offsets)

so the fingerprint at position i in the new arena was at position b.ids[i] in the original arena.
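
Here’s a sketch (the fingerprint file name and offsets are just for illustration):

>>> import chemfp
>>> a = chemfp.load_fingerprints("reference.fps")
>>> offsets = [5, 2, 9]
>>> b = a.copy(indices=offsets, ids=offsets)
>>> # b.ids[0] is 5, the original position of the first fingerprint in b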

The diversity pickers now support a “max_time” parameter, to stop picking after a specified number of seconds (given as a 64-bit float). The timeout is checked after each pick has finished. This helps give smooth progress bars or other interactive feedback about the picking process.

DEPRECATION NOTICE: The bitops.byte_difference() and bitops.hex_difference() functions have been renamed bitops.byte_xor() and bitops.hex_xor(), mostly because I couldn’t remember what “difference” meant but I knew exactly what “xor” meant. The old functions still work, but generate a PendingDeprecationWarning. They will likely be removed in a future version of chemfp.