What’s new in chemfp 4.2

Released 10 July 2024.

The main additions to chemfp 4.2 are:

  • the “simarray” methods to generate a complete (dense) all-by-all NumPy matrix using several comparison metrics;

  • support for the latest RDKit fingerprint generators, which includes new options like count simulation;

  • improved support for CDK fingerprints and new support for the jCompoundMapper fingerprint types;

  • support for Python 3.12 and NumPy 2.0, PEP 517-builds with pyproject.toml, and initial support for type annotations.

Warnings about likely upcoming API changes

I am planning some changes for the chemfp 5.0 release.

Warning

Changing rdkit2fps default fingerprint type to to Morgan

For the chemfp 5.0 release I will likely change the default rdkit2fps fingerprint type from RDKit-Fingerprint (the Daylight-like fingerprints) to the Morgan fingerprint with radius 3. If you currently use the default of RDKit fingerprints then you should add --RDK now so this change won’t affect you. Let me know if this is an issue for you.

Warning

Removing from_* and to_* toolkit helper functions.

The toolkit shortcut functions matching the pattern “from_{format}”, “to_{format}”, “from_{format}_file” and “to_{format}_file” were added in chemfp 4.0. In practice these were inconsistent with other parts of the chemfp API so they were deprecated by 4.1, with a PendingDeprecationWarning suggesting you use the “parse_{format}” and “create_{format}” functions. For example, use parse_smistring() instead of from_smistring().

These old functions now generate a DeprecationWarning and will be removed in chemfp 5.0.

Warning

Removing bitops.byte_difference() and bitops.hex_difference()

The chemfp.bitops module functions byte_difference() and hex_difference() functions were deprecated in favor of byte_xor and hex_xor. Starting with 4.1 these generated a PendingDeprecationWarning. They now generate a DeprecationWarning and will be removed for chemfp 5.0.

Warning

Adding a default delay for progress bars.

Chemfp 4.0 added progress bars, which are great for long-running tasks, but sometimes chemfp can show three or four progress bars in a fraction of a second, where I don’t care to see any progress bars at all. While I’ve gotten into the habit of using progress=False for when I know I don’t care, I think a better solution is to introduce a minimum delay, like 0.5 seconds, before showing any progress bar.

This 4.2 release adds the initial support for that option, like passing progress=0.5, or using the CHEMFP_PROGRESS environment variable, though the default is to always show a progress bar.

Warning

Many object repr strings are likely to change.

As I get feedback about people using the API, and more experience using it myself, I’ve noticed various places where I can improve the repr.

Warning

SearchResult’s “result” will likely change to “out”.

The high-level chemfp.simsearch() function returns a high-level search result object which itself contains a result attribute containing the low-level search search object.

I’ve found that referring to the low-level object as result.result to be very confusing.

With chemfp 4.2 that object is also available as result.out, following the “out” parameter convention used in NumPy. If it proves better I will make the old result attribute generate a PendingDeprecationWarning in chemfp 5.0, for eventual removal by chemfp 5.1 or 5.2.

Warning

The “npz” JSON metadata array format will likely change.

The new simarray functionality has the option to store metadata as a JSON string in a NumPy array, but the JSON schema is different than the JSON used in the simsearch “npz” format . Based on that experience, I plan to change the “npz” JSON format so the two are more consistent.

Simarray

Chemfp has new support for creating the complete set of comparisons between a fingerprint set and itself (“symmetric” or “NxN”), the complete set of comparison between a query fingerprint set and a target fingerprint set (“NxM”) or between a query fingerprint and a target fingerprint set (“N”).

chemfp.simarray()

The primary use case is to generate a NumPy array as input to sci-kit clustering algorithms and other algorithms which require the full (dense) matrix as input. Here is an example using the chemfp.simarray() function:

>>> import chemfp
>>> simarr = chemfp.simarray(targets="benzodiazepine_morgan2.fps.gz", as_distance=True)
benzodiazepine_morgan2.fps.gz: 100%|███████████████| 492k/492k [00:00<00:00, 15.1Mbytes/s]
scores: 100%|██████████████████████████████████████| 76.7M/76.7M [00:01<00:00, 73.0M/s]
>>> simarr
SimarrayResult(metric=<1-Tanimoto distance>, out=<12386x12386 symmetric array of dtype float64>,
query_ids=target_ids=['1688', '1963', '2118', ...], times=<process: 1.05 s, total: 1.11 s>)
>>> simarr.out[:3,:3]
array([[0.        , 0.65      , 0.51923077],
       [0.65      , 0.        , 0.46428571],
       [0.51923077, 0.46428571, 0.        ]])
>>> import sklearn.cluster
>>> agg_complete = sklearn.cluster.AgglomerativeClustering(linkage="complete", metric="precomputed")
>>> agg_complete.fit(simarr.out)
AgglomerativeClustering(linkage='complete', metric='precomputed')

In this case it took about 50ms to load the 12,386 fingerprints and 1.05 seconds to generate the complete distance matrix, computed as 1-Tanimoto. A complete timing breakdown is:

>>> simarr.get_times_description()
'load_targets: 54.84 ms init: 200.03 us process: 1.05 s total: 1.11 s'

with the raw times (in seconds) available as the simarr.times dictionary.

Use save() to save the array in NumPy’s “npy” format, or in a “bin”ary format containing only the raw data bytes:

>>> simarr.save("benzodiazepine.npy")

chemfp simarray command-line tool

On the command-line, the new chemfp simarray subcommand will generate the all-by-all comparison and save the result to a NumPy “npy” file containing:

% chemfp simarray benzodiazepine_morgan2.fps.gz --as-distance \
       --no-lower-triangle -o benzodiazepine_NxN.npy
benzodiazepine_morgan2.fps.gz: 100%|█████| 492k/492k [00:00<00:00, 17.7Mbytes/s]
scores: 100%|███████████████████████████████| 76.7M/76.7M [00:00<00:00, 86.5M/s]
% python
  ...
>>> import numpy as np
>>> arr = np.load("benzodiazepine_NxN.npy")
>>> arr[:3,:3]
array([[0.        , 0.65      , 0.51923077],
       [0.        , 0.        , 0.46428571],
       [0.        , 0.        , 0.        ]])

The --no-lower-triangle (or include_lower_triangle=False in the API) leaves the lower triangle as the zero values. This option improves the overall performance but does not save any space!

The “npy” file contains up to four arrays, though the np.load() only returns the first one, with the comparison values. The second contains metadata about the comparison, as a JSON-encoded 1-element NumPy string array. The third contains the target ids (or simply the ids for an NxN comparison), and the fourth contains the query ids, if relevant. You can use np.load() with a file handle to read these arrays directly, or use chemfp’s new load_simarray():

>>> import chemfp
>>> simarr = chemfp.load_simarray("benzodiazepine_NxN.npy")
>>> simarr
SimarrayFileContent(metric=<1-Tanimoto distance>,
out=<12386x12386 upper-triangular array of dtype float64>,
query_ids=target_ids=['1688', '1963', '2118', ...])
>>> import pprint
>>> pprint.pprint(simarr.get_metadata())
{'dtype': 'float64',
'format': 'multiple',
'matrix_type': 'upper-triangular',
'method': 'Tanimoto as_distance=1',
'metric': {'as_distance': True,
            'is_distance': True,
            'is_similarity': False,
            'name': 'Tanimoto'},
 'metric_description': '1-Tanimoto distance',
 'num_bits': 2048,
 'shape': (12386, 12386)}

See the chemfp.simarray_io documentation for a description of the JSON content.

Finally, the chemfp.simarray() API function creates the entire comparison array in-memory, as well as the chemfp simarray subcommand when generating “npy” output. This may be a problem when computing a large comparison array which may not fit into memory.

Generating large arrays

If you want to create an array which exceeds available memory, the chemfp simarray subcommand has special support for computing 2GB portions of the output array a time, and writing the results to a “bin” file. This file contains the raw data bytes for the array, equivalent to NumPy’s arr.tobytes().

The following example computes the cosine similarity (as 32-bit floats) for 100,000 randomly selected PubChem fingerprints. The raw bytes are saved to pubchem_100K.bin, with additional data about the calculation saved to pubchem_100K_meta.npy:

% chemfp simarray --metric cosine --dtype float32 pubchem_100K.fpb \
      -o pubchem_100K.bin --metadata-output pubchem_100K_meta.npy
scores: 100%|█████████████████| 10.0G/10.0G [01:22<00:00, 122M/s]
% ls -lh pubchem_100K.bin
-rw-r--r--  1 dalke  admin    37G Jun 18 10:09 pubchem_100K.bin

That’s 122 million comparisons per second, or just under 90 seconds for 100Kx100K = 10 billion comparisons, on my 16GB laptop which has 12GB in use for other things.

The “bin” file can be memory-mapped directly into NumPy, if you know the shape and dtype:

>>> import numpy as np
>>> arr = np.memmap("pubchem_100K.bin",
     shape=(100_000, 100_000), dtype=np.float32)
>>> arr[5000:5003,5000:5003]
memmap([[1.        , 0.6666667 , 0.47619048],
        [0.6666667 , 1.        , 0.3809524 ],
        [0.47619048, 0.3809524 , 1.        ]], dtype=float32)

The auxillary “metadata” npy file is the same format as the normal simarray npy file except the first array contains a 0x0 array with the appropriate NumPy data. The JSON shape value in the second array can be used to get the size, followed by the ids in the third array.

If you have the “bin” and auxillary “npy” file then the easiest way to access the data is with chemfp.load_simarray():

>>> import chemfp
>>> simarr = chemfp.load_simarray("pubchem_100K.bin",
...        metadata_source="pubchem_100K_meta.npy")
>>> simarr
SimarrayFileContent(metric=<cosine similarity>, out=<100000x100000
symmetric array of dtype float32>, query_ids=target_ids=[
'783', '1038', '23913', ...])
>>>
>>> import numpy as np
>>> np.count_nonzero(simarr.out)
9807887295

By default is uses a memory-map to read a “bin” file as read-only, while it loads an “npy” into memory. Use the mmap_mode to change these behaviors.

Single fingerprint queries and NxM queries

The simarray() function can also be used to compute all the comparisons between a query fingerprint and a set of target fingerprints, like the following example which computes the Tanimoto similarity between CHEMBL113 and all of ChEMBL 33, then shows the histogram counts (which acts as a log scale graphic):

>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_33.fpb")
>>> query_fp = a.get_fingerprint_by_id("CHEMBL113")
>>> simarr = chemfp.simarray(query_fp=query_fp, targets=a)
scores: 100%|█████████████████████████████████████| 2.37M/2.37M
[00:00<00:00, 3.99M/s]
>>> simarr
SimarrayResult(metric=<Tanimoto similarity>, out=<2372674 vector of
dtype float64>, query_ids=None, target_ids=['CHEMBL17564',
'CHEMBL4300465', 'CHEMBL4302507', ...], times=<process: 620.04 ms,
total: 623.88 ms>)
>>> import numpy as np
>>> for count, bin in zip(*np.histogram(simarr.out, bins=20)):
...   print(f"{bin:.2f}: {count}")
...
0.00: 190089
0.05: 1682625
0.10: 455922
0.15: 34920
0.20: 5233
0.25: 2074
0.30: 912
0.35: 482
0.40: 222
0.45: 82
0.50: 60
0.55: 40
0.60: 7
0.65: 3
0.70: 2
0.75: 0
0.80: 0
0.85: 0
0.90: 0
0.95: 1

It can also be used to generate the NxM comparison matrix between two sets of fingerprints, like the following which computes the Hamming distances between two randomly selected subsets of size 5,000 and 20,000 (without overlaps), then generates the same sort of histogram:

>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_33.fpb")
>>> subset1, subset2 = a.train_test_split(
              train_size=5_000, test_size=20_000)
>>> simarr = chemfp.simarray(queries=subset1, targets=subset2,
          metric="Hamming", progress=False)
>>> simarr.get_out_description()
'5000x20000 array of dtype uint16'
>>> import numpy as np
>>> for count, bin in zip(*np.histogram(simarr.out, bins=20)):
...   print(f"{int(bin):4d}: {count}")
...
   0: 277
  14: 10988
  29: 482755
  44: 7157451
  59: 29524500
  74: 36458980
  89: 16786454
 103: 5651194
 118: 2229461
 133: 980313
 148: 396547
 163: 171267
 178: 91759
 193: 33880
 207: 16056
 222: 5979
 237: 1896
 252: 213
 267: 26
 282: 4

This functionality is also available through the simarray subcommand, both for a single query:

% chemfp simarray \
    --hex-query  000000003000000001d414c913a3914380b03a6a1b \
    maccs166.fpb -o maccs.npy

and for a set of queries against targets:

% chemfp simarray --queries 10_queries.fps chembl_33.fpb \
    --as-distance -o jaccard_distance.npy

Metrics and data types

In simarray, the term “metric” refers to the fingerprint comparison method, which may be a similarity (“Tanimoto”, “Dice”, or “cosine”), a distance (“Hamming”), or one of three different “abcd” methods described in the next subsection. There is also an option to convert a similarity into a distance.

Some metrics supports multiple NumPy data types. If neither the metric nor dtype is specified then chemfp will compute the Tanimoto similarity using the float64 datatype. (What C calls a “double”) and if the dtype is “float32” it will use a 32-bit float, which takes half the space with no information less for 4096-bit fingerprints or smaller. (2117/4142 and 2094/4097 are the same float32 value.)

If 4 bytes per score is too high and you are willing to bin similar scores together, use the “uint16” dtype to compute Tanimoto as a scaled unsigned 16-bit integer computed as int(score * 65535).

On the other hand, use “rational64” or “rational32” if you want the exact ratio, stored as a structured data type with pair of respectively uint32 or uint16 values containing the numerator (“p”) and denominator (“q”), where the numerator is the intersection popcount and the denominator is the union popcount. The rational32 dtype uses the same space as float32 but with no information loss.

The Dice similarity also supports the float64 (the default), float32, scale uint16, rational64, and rational32 dtypes.

The cosine similarity only supports float64, float32, and scaled uint16.

The Hamming distance only supports the uint16 dtype.

If the metric is “Tanimoto”, “Dice”, or “cosine” and as_distance=True (in the API) or --as-distance (on the command-line), then the distance value 1-similarity will be computed instead, or the corresponding scaled version.

If the similarity is 0/0 then the float64, float32, and scale uint16 values will be 0, while the distance version wil be 1. The rational similarity will be p=q=0 while the distance will be p=q=n where n is the number of fingerprint bits.

The “abcd” data type

What if you want a different metric?

There are scores to choose from. The paper Holliday, J.D., Hu, C.-Y. and Willett, P. (2002) Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-strings. Combinatorial Chemistry and High Throughput Screening 5(2) pp.155-166 compares 22 different ones, and suggests researches also consider the “Russell/Rao, Baroni-Urbani/Buser, SimpleMatch, Stiles, Ochiai/Cosine, and Kulczynski(2) coefficients.”

The paper has a table which describes those coefficients in terms of “a”, “b”, “c”, and “d”. For example, when expressed as a Python or C expression:

Baroni-Urbani/Buser:  (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)

What are “a”, “b”, “c”, and “d”? From the paper:

Assume that the bit-strings XP and XQ denote two molecules P and Q, respectively, and that each of these strings contains a total of n bit positions. Assume further that b and c of these n bits are set to one only in XP and only in XQ, respectively, that a of these n bits are set to one in both XP and XQ, and that d of these n bits are not set in either XP and XQ (so that n = a + b + c + d).

In chemfp, this is the “Sheffield” convention, and the corresponding metric stores the comparison in a structured data type with 4 uint16 element named “a”, “b”, “c”, and “d”. Here’s an example:

>>> import chemfp
>>> simarr = chemfp.simarray(targets="distinct.fps", metric="Sheffield", progress=False)
>>> simarr.get_out_description()
'12x12 symmetric array of dtype abcd'
>>> simarr.out.dtype
dtype([('a', '<u2'), ('b', '<u2'), ('c', '<u2'), ('d', '<u2')])

The easiest way to compute the Baroni-Urbani/Buser association coefficient is using NumPy’s own operators:

import numpy as np
def baroni_urbani(abcd_arr):
  a, b, c, d = abcd_arr["a"], abcd_arr["b"], abcd_arr["c"], abcd_arr["d"]
  numerator = np.sqrt(a * d) + a
  return numerator / (numerator + b + c)

which generates the following values:

>>> baroni_urbani(simarr.out)[:4, :4]
array([[1.        , 0.43166688, 0.35355338, 0.37293836],
       [0.43166688, 1.        , 0.4552797 , 0.42      ],
       [0.35355338, 0.4552797 , 1.        , 0.35606548],
       [0.37293836, 0.42      , 0.35606548, 1.        ]], dtype=float32)

No doubt there are faster methods, including clever ones which mutate the array in-place. That exercise is left to the reader.

chemfp’s three “abcd” metrics

The previous section referenced a paper by Holliday et al. which used what chemfp calls the “Sheffield” convention for the “a”, “b”, “c”, and “d” values. That is one of three conventions you are are likely to come across in cheminformatics papers.

The most common is the “Willett” convention:

  • a is the number of bits “on” in fingerprint A

  • b is the number of bits “on” in fingerprint B

  • c is the number of bits “on” in both fingerprints A and B

  • d is the number of bits “off” in both fingerprints A and B

This is used by Sheffield publications between roughly the mid 1980s to roughly the mid 2000s, especially those by Peter Willett, as well as those external to Sheffield. (chemfp internally uses this convention.)

The next most common is the “Sheffield” convention, which is:

  • a is the number of bits “on” in both fingerprints A and B

  • b is the number of bits “on” in fingerprint A which are “off” in fingerprint B

  • c is the number of bits “on” in fingerprint B which are “off” in fingerprint A

  • d is the number of bits “off” in both fingerprints A and B

The “Sheffield” convention was used by papers from the Sheffield group up until the mid 1980s, and in most Sheffield papers past about 2008, or starting in 2002 if John Holliday is a co-author. The naming is, alas, confusing because Willett was at Sheffield and Willett’s later papers switched from “Willett” convention to “Sheffield” convention. I chose it because it best reflects the impact of Willett’s work promoting molecular similarity during the 1980s and 1990s.

In third place, and far less common, is the “Daylight” convention, which comes from John Bradshaw, is described in the Daylight theory page on “Fingerprints - Screening and Similarity”, and is used by the Daylight software:

  • a is the number of bits “on” in fingerprint A which are “off” in fingerprint B

  • b is the number of bits “on” in fingerprint B which are “off” in fingerprint A

  • c is the number of bits “on” in both fingerprints A and B

  • d is the number of bits “off” in both fingerprints A and B

The low-level simarray API

The high-level API is built on a lower-level API based around a SimarrayProcessor, which is created via one of three class methods. Once created, use next(n) to process at least n comparisons (it may process more depending on the size of the matrix and number of threads; the function returns the number actually processed) or use process_all() to process all remaining comparisons, along with a progress bar:

>>> import chemfp
>>> a = chemfp.load_fingerprints("distinct.fps")
>>> import chemfp.search
>>> processor = chemfp.search.SimarrayProcessor.from_symmetric(a, metric="Hamming")
>>> processor.num_to_process
78
>>> processor.next(5)
12
>>> processor.process_all()
scores: 100%|█████████████████████████| 66/66 [00:00<00:00, 1048576.00/s]

The three class constructors are SimarrayProcessor.from_symmetric() (shown in the example), SimarrayProcessor.from_NxM() and SimarrayProcessor.from_query_fp().

I do not expect most people to use the low-level API. If you do, let me know as I’m curious to know what you think.

RDKit fingerprint generation

Chemfp’s support for the RDKit Morgan, AtomPair, Torsion, and RDKit-Fingerprint fingerprint types have changed.

Important

The default Morgan fingerprint radius is now 3 instead of 2!

Since around 2020 the RDKit toolkit has been migrating how fingerprints are generated, from a function-based API to a more object-oriented one based around fingerprint “generators”, with new features only available through the new API.

The fingerprints from the new generators are not exactly the same as the old, in part because a few of the more obscure parameters (like tgtDensity) have been removed, and a few others renamed, but the underlying methods are the same.

Chemfp 4.2 support both old-style and new-style fingerprint generation by using different version numbers. For example, the “RDKit-Morgan/1” fingerprint type generates old-style fingerprints while “RDKit-Morgan/2” generates new-style fingerprints. If the version is unspecified then chemfp will use the most recent version available in your version of RDKit.

Count simulation

The most interesting new feature in RDKit’s fingerprint generator is “count simulation”, which lets a binary fingerprint capture some of the abilities of a count fingerprint by setting additional bits based on the corresponding feature count.

For example, by default RDKit’s Morgan fingerprint sets one and only one bit if a feature is present, no matter how many times it is present. Consider this example:

>>> from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
>>> morgan3_gen = GetMorganGenerator()
>>> DataStructs.TanimotoSimilarity(morgan3_gen.GetFingerprint(mol1),
...                                morgan3_gen.GetFingerprint(mol2))
0.3684210526315789
>>> DataStructs.TanimotoSimilarity(morgan3_gen.GetCountFingerprint(mol1),
...                                morgan3_gen.GetCountFingerprint(mol2))
0.5172413793103449

Careful examination shows the binary fingerprint only considers feature presence or absence, like in the following where feature 1873 is present 5 times in the Morgan count fingerprint, but only bit 1873 is set in the corresponding binary fingerprint:

>>> morgan3_gen.GetCountFingerprint(mol1).GetNonzeroElements()
{389: 1, 392: 1, 496: 2, 888: 1, 1088: 3, 1171: 1, 1199: 2, 1380: 1,
1457: 1, 1750: 2, 1804: 1, 1873: 5, 1947: 1}
>>> list(morgan3_gen.GetFingerprint(mol1).GetOnBits())
[389, 392, 496, 888, 1088, 1171, 1199, 1380, 1457, 1750, 1804, 1873, 1947]
>>> len(morgan3_gen.GetFingerprint(mol1).GetOnBits())
13

Count simulation lets us include some of the count information in the binary fingerprint, by setting multiple bits based on the counts. I’ll create a Morgan fingerprint generator with count simulation enabled, and with specified countBounds of “1, 4”. This says that if the count is at least 1 then set the first bit, and if it’s at least 4 then set the second (consecutive) bit. Since only one feature is present 4-or-more times, the number of expected bits should increase by one, and get us a bit closer to the count Tanimoto:

>>> count_sim_gen = GetMorganGenerator(countSimulation=True, countBounds=[1, 4])
>>> DataStructs.TanimotoSimilarity(count_sim_gen.GetFingerprint(mol1),
...                                count_sim_gen.GetFingerprint(mol2))
0.4
>>> list(count_sim_gen.GetFingerprint(mol1).GetOnBits())
[128, 294, 350, 712, 778, 784, 866, 992, 1452, 1560, 1698, 1699, 1776, 1846]
>>> len(count_sim_gen.GetFingerprint(mol1).GetOnBits())
14

You may have noticed how the bit values changed. This is due to how count emulation works. It generate the bits in a smaller fingerprint space (scaled by the number of bounds, i.e., bits to set), then scales them back up using the bounds to determine which bits to set. The implementation works like this, although in C++:

>>> scaled_fp_gen = GetMorganGenerator(fpSize=2048//2)
>>> counts = scaled_fp_gen.GetCountFingerprint(mol1).GetNonzeroElements()
>>> counts
{64: 3, 147: 1, 175: 2, 356: 1, 389: 1, 392: 1, 433: 1, 496: 2, 726: 2, 780: 1, 849: 5, 888: 1, 923: 1}
>>> bits = set()
>>> for bitno, count in counts.items():
...   if count >= 1: bits.add(bitno*2)
...   if count >= 4: bits.add(bitno*2+1)
...
>>> sorted(bits)
[128, 294, 350, 712, 778, 784, 866, 992, 1452, 1560, 1698, 1699, 1776, 1846]

If countSimulation is enabled but countBounds is not set then the default is “1,2,4,8”, which results in a pretty close match to the actual count fingerprint of 0.517 computed earlier:

>>> morgan3_count_sim_gen = GetMorganGenerator(countSimulation=True)
>>> DataStructs.TanimotoSimilarity(morgan3_count_sim_gen.GetFingerprint(mol1),
...                                morgan3_count_sim_gen.GetFingerprint(mol2))
0.52

For an RDKit blog post about the topic, see: https://greglandrum.github.io/rdkit-blog/posts/2021-07-06-simulating-counts.html

RDKit-Fingerprint type changes

The RDKit-Fingerprint (RDKit’s version of the Daylight path, with tree support) has changed from version 2 to version 3.

The “RDKit-Fingerprint/3” options are:

  • fpSize: number of bits in the fingerprint (default: 2048)

  • minPath: minimum number of bonds (default: 1)

  • maxPath: maximum number of bonds (default: 7)

  • countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)

  • countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)

  • nBitsPerFeature: number of bits to set for each path feature (default: 2)

  • useHs: include information about the number of hydrogens on each atom? (default: True)

  • branchedPaths: include both branched and unbranched paths (default: True)

  • useBondOrder: use both bond orders in the path hashes (default: True)

The old nBitsPerHash has been renamed nBitsPerFeature, to match the RDKit API change, and the count simulation parameters are new.

The default type string is “RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7”, where the remaining parameters, if not specified, have their default value.

This is the default fingerprint type when you use “rdkit2fps”. It is also available by specifying ‑‑RDK or ‑‑RDK/3. The older fingerprint type is available with ‑‑RDK/2. Alternatively you can specify the type string using ‑‑type.

In the Python API this fingerprint type object is available as chemfp.rdkit_toolkit.rdk or chemfp.rdkit_toolkit.rdk_v3, with the older type available as chemfp.rdkit_toolkit.rdk_v2.

RDKit-Morgan type changes

The RDKit-Morgan fingerprint (RDKit’s version of the ECFP and FCFP circular fingerprints) has changed from version 1 to version 2.

Important

The default radius has changed from radius=2 to radius=3, to reflect the default value in GetMorganGenerator()! Previously RDKit did not have a suggested default, so I choose radius=2.

The “RDKit-Morgan/2” options are:

  • fpSize: number of bits in the fingerprint (default: 2048)

  • radius: radius for the Morgan algorithm (default: 3) (was 2 in v1!)

  • useFeatures: use chemical-feature invariants (default: 0)

  • countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)

  • countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)

  • includeChirality: include chirality information in the bond invariants (default: 0)

  • useBondTypes: (default: 1)

  • includeRingMembership: if 1, include ring membership in the atom invariants (default: 1)

  • includeRedundantEnvironments: if 1, include redundant environments in the fingerprint (default: 0)

The old useChirality has been renamed to includeChirality. The includeRingMembership is new (version 1 always included ring membership), as are the count simulation options.

The new default type string is “RDKit-Morgan/2 fpSize=2048 radius=3 useFeatures=0”, where the remaining parameters, if not specified, have their default value. The fpSize and radius parameters have changed position from version 1. The useFeatures is always included so people can tell if this is an ECFP-like or FCFP-like fingerprint. If useFeatures is True then chemfp will use RDKit’s rdFingerprintGenerator.AtomInvariantsGenerator() as the atom invariant generator.

To specify the Morgan/2 fingerprint type (with radius=3) with rdkit2fps use ‑‑morgan or ‑‑morgan3. Use ‑‑morgan1, ‑‑morgan2, ‑‑morgan3, or ‑‑morgan4 to specify a radius of 1, 2, 3, or 4. Append a /2 to be specific about the version, e.g. ‑‑morgan/2.

To use the older Morgan/1 fingerprint type, append /1, that is, use ‑‑morgan/1 for the Morgan/1 fingerprints (with radius=2) and use ‑‑morgan1/1, ‑‑morgan2/1, ‑‑morgan3/1, or ‑‑morgan4/1 to use a Morgan/1 fingerprint with radius of 1, 2, 3, or 4, respectively.

In the Python API these fingerprint type objects are available in the chemfp.rdkit_toolkit module as:

RDKit-AtomPair type changes

The RDKit-AtomPair has changed from version 2 to version 3.

The “RDKit-AtomPair/3” options are:

  • fpSize: number of bits in the fingerprint (default: 2048)

  • minDistance:minimum bond distance for two atoms to be considered a pair (default: 1)

  • maxDistance: maximum bond distance for two atoms to be considered a pair (default: 30)

  • countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)

  • countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)

  • includeChirality: if 1, chirality will be used in the atom invariants (default: 0)

  • use2D: if 1, use a 2D distance matrix, if 0 use the 3D matrix from the first set of conformers, or return an empty fingerprint if no conformers (default: 1)

The AtomPair/2 minLength and maxLength have been renamed to minDistance and maxDistance in AtomPair/3.

The AtomPair/2 nBitsPerEntry setting is no longer available in AtomPair/3. The old default was 4. This can be emulated with count simulation.

The new default type string is “RDKit-AtomPair/3 fpSize=2048 minDistance=1 maxDistance=30 countSimulation=1”, where the remaining parameters, if not specified, have their default value.

To specify the AtomPair/3 fingerprint type with rdkit2fps use ‑‑pair or ‑‑pair/3. For the older AtomPair/2 fingerprint type, use ‑‑pair/2.

In the Python API these fingerprint type objects are available in the chemfp.rdkit_toolkit module as pair for the most recent version, or pair_v3 for AtomPair/3 and pair_v2 for AtomPair/2.

RDKit-Torsion type changes

The RDKit-Torsion (topological torsion fingerprints) has changed from version 3 to version 4.

The “RDKit-Torsion/4” options are:

  • fpSize: number of bits in the fingerprint (default: 2048)

  • torsionAtomCount: the number of atoms to include in the ‘torsions’ (default: 4)

  • countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 1)

  • countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)

  • includeChirality: if 1, include chirality in the atom invariants (default: 0)

  • onlyShortestPaths: if 1, only include the shortest paths between the start and end atoms, not all paths (default: 0)

The targetSize parameter is version 3 is now torsionAtomCount. The count simulation and onlyShortestPaths options are new. The nBitsPerEntry from version 3 can be implemented with count simulation. I think the old nBitsPerEntry=4 is equivalent to countBounds with “1,2,4,8” and other nBitsPerEntry values equivalent to “1,2,3,4,…,$nBitsPerEntry”.

The new default type string is “RDKit-Torsion/4 fpSize=2048 torsionAtomCount=4 countSimulation=1”, where the remaining parameters, if not specified, have their default value.

To specify the Torsion/4 fingerprint type with rdkit2fps use ‑‑torsion or ‑‑torsion/4. For the older Torsion/3 fingerprint type, use ‑‑torsion/3.

In the Python API these fingerprint type objects are available in the chemfp.rdkit_toolkit module as torsion or torsion_v4 for Torsion/4 and torsion_v3 for Torsion/3.

CDK updates

Chemfp now supports cdk 2.9, which includes the improvements to the PubChem fingerprint generator. There’s also a new hydrogens reader argument for the SMILES and SDF readers, to specify that implicit hydrogens should be made explicit, or vice versa.

“hydrogens” reader argument

Many of the CDK fingerprint types require all hydrogens be explicit hydrogens (that is, present as atoms in the molecular graph) while a couple require all hydrogens be implicit (that is, as hydrogen counts on some other atom.) To figure out which is which, read the CDK documentation or see the guidance from cdk2fps --help.

On the other hand, the CDK SMILES and SDF parser leave the hydrogens in the form present in the input, so the SMILES “C” has four implicit hydrogens while “[H][CH]([H])[H]” has three explicit hydrogens and one implicit hydrogen on the carbon.

Before chemfp 4.2 you needed to prepare the inputs SMILES or SDF to have the hydrogens in the correct form.

Chemfp 4.2 add the hydrogens reader argument, which describes how to process the hydrogens.

The default of as-is leaves them in the existing state.

The value make-explicit converts all implicit hydrogens to explicit form.

The value make-implicit converts all explicit hydrogens to implicit form.

The value make-nonchiral-implicit converts all explicit hydrogens to implicit form except for chiral hydrogens.

The following examples use the chemfp translate <chemfp_translate> subcommand to convert an input structure file to an output structure file using the CDK. By default reads and writes in SMILES format. I’ll have it process the SMILES for methane which has a mixture of hydrogen representations, to show how “make-explicit” and “make-implicit” do:

% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=as-is
C([H])([H])[H] methane
% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=make-explicit
C([H])([H])([H])[H] methane
% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=make-implicit
C methane

The following shows how the explicit chiral hydrogen is not converted to an implicit hydrogen when using the “make-nonchiral-implicit” setting:

% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk
[C@]([H])(Cl)(F)N example
% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk -R hydrogens=make-implicit
[C@H](Cl)(F)N example
% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk -R hydrogens=make-nonchiral-implicit
C([H])(Cl)(F)N example

The hydrogens reader argument is also available through the API:

>>> from chemfp.cdk_toolkit import parse_smistring
>>> parse_smistring("[H][CH2][H]").getAtomCount()
3
>>> parse_smistring("[H][CH2][H]", hydrogens="make-implicit").getAtomCount()
1
>>> parse_smistring("[H][CH2][H]", hydrogens="make-explicit").getAtomCount()
5

and through the reader_args parameter for the reader API:

>>> import chemfp
>>> for id, fp in chemfp.read_molecule_fingerprints(
...       "CDK-Daylight", "small.smi", reader_args={"hydrogens": "make-explicit"}):
...   print(id, fp)
     <many lines omitted>

CDK PubChem fingerprints

The PubChemFingerprinter CDK 2.9 no longer requires explicit hydrogens and it uses a ring set definition that better matches the CACTVS substructure keys.

The chemfp fingerprint type string for the new version is “CDK-Pubchem/2.9”, instead of the old “CDK-Pubchem/2.0”.

jCompoundMapper

jCompoundMapper is a Java package for fingerprint generation dating from around 2010. The publication is: Hinselmann, G., Rosenbaum, L., Jahn, A. et al. jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints. J Cheminform 3, 3 (2011). https://doi.org/10.1186/1758-2946-3-3

The paper “Effectiveness of molecular fingerprints for exploring the chemical space of natural products” by Boldini, Ballabio, Consonni, Todeschini, Grisoni, and Sieber, J. Cheminform. (2024) 16:35 https://doi.org/10.1186/s13321-024-00830-3 benchmarked “20 molecular fingerprints from four different sources” on natural products and concluded that jCompoundMapper’s All Shortest Paths (ASP) and Local Path Environments (LSTAR) fingerprints were promising.

I therefore decided to include jCompoundMapper fingerprint types as part of the chemfp 4.2 release, at the very least to help others replicate the results.

Integration is tricky. To start, you need to get a copy of one of the jCompoundMapper jar files and put the filename it on your CLASSPATH after the CDK jar. The package is so old that it’s not compatible with the modern CDK. Thankfully, CDK has a startup option to use a backwards compatible implementation, which must be specified when the JVM starts. Chemfp will set that option if it sees either of the two expected jCompoundMapper jar filenames on the CLASSPATH. However, the standard CDK fingerprints don’t work when in backwards-compatibility mode.

For the full setup details see the chemfp.jcmapper_types documentation.

The suppported fingerprint types and default type strings are:

  • Depth-First Search (DFS) - “jCMapper-DFS hashsize=4096 searchDepth=7 atomLabel=ELEMENT_NEIGHBOR”

  • All Shortest Paths (ASP): - “jCMapper-ASP hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR”

  • Local Path Environments (LSTAR): - “jCMapper-LSTAR hashsize=4096 searchDepth=6 atomLabel=ELEMENT_NEIGHBOR”

  • Topological Molprint-like fingerprints (RAD2D) - “jCMapper-RAD2D hashsize=4096 searchDepth=3 atomLabel=ELEMENT_SYMBOL”

  • 2-point topological pharmacophore pairs (PH2) - “jCMapper-PH2 hashsize=4096 searchDepth=8”

  • 3-point topological pharmacophore triples (PH3) - “jCMapper-PH3 hashsize=4096 searchDepth=5”

  • 2-point topological atom type pairs (AP2D) - “jCMapper-AP2D hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR”

  • 3-point topological atom type triplets (AT2D)w: - “jCMapper-AT2D hashsize=4096 searchDepth=5 atomLabel=ELEMENT_NEIGHBOR”

The generated fingerprint is hashsize bits long, which must be a positive integer. Most fingerprint types takes a searchDepth which must be a non- negative integer. It specifies the maximum path length, circular environment radius, or shell radius to consider.

Most of the fingerprint types support alternative ways to assign a label to a given atom type, based on different atom and extended atom properties, which in turn affects fingerprint generation. The supported atomLabel methods are:

  • “CDK_ATOM_TYPES”: CDK atom types (eg, ‘C.sp2’, ‘O.minus’)

  • “ELEMENT_SYMBOL”: element symbol (eg, ‘C’, ‘O’)

  • “ELEMENT_NEIGHBOR”: element and number of heavy atom neighbors (eg, ‘C.2’)

  • “ELEMENT_NEIGHBOR_RING”: element, ring type, and number of heavy atom neighbors (eg, ‘C.a.2’)

  • “DAYLIGHT_INVARIANT”: “Atomic number, number of heavy atom neighbors, valency minus the number of connected hydrogens, atomic mass, atomic charge, number of connected hydrogens” (eg, ‘6.2.3.12.0.1’ for a carbon in a benzole ring)

  • “DAYLIGHT_INVARIANT_RING”: DAYLIGHT_INVARIANT followed by a flag if the atom is in a ring (eg, ‘6.2.3.12.0.1.1’)

With cdk2fps use --type to specify the type string, like:

cdk2fps --type jCMapper-DFS dataset.smi -o dataset.fps

Run cdk2fps --help-jcmapper on the command-line for more examples.

In the chemfp API use chemfp.get_fingerprint_type(), like the following two examples:

import chemfp
fptype = chemfp.get_fingerprint_type("jCMapper-LSTAR hashsize=1024"
# or
fptype = chemfp.get_fingerprint_type("jCMapper-LSTAR", {"hashsize": 1024})

Modern Python

Behind the scenes there was a lot of work to bring chemfp up to date with modern Python.

This release includes support for Python 3.12, which mostly meant updating to a version of Cython which supported that Python version. Python 3.12 warned me about a number of places where I didn’t quote strings correctly, which was nice. The biggest change was that Python 3.12’s datetime module now requires timezone information, while earlier Python versions didn’t support a built-in timezone. While I fixed this for now, for the chemfp 5.0 release I will likely change the FPS format to support local time in the date, rather than insist on UTC.

I have also fixed a few warnings from Pandas about upcoming API changes, like switching to “iloc” and some strange interaction causing the Cython-generated diversity interface to warn about upcoming copy-on-write changes. (I say “strange” because I could not reproduce it in regular Cpython.)

NumPy 2.0 came out just before the chemfp 4.2 release. The only issues were a couple of places where the repr() of a NumPy string array changed its form, which caused some of the simarray regression tests to fail. (The simarray result repr includes the first few query and target ids, as a hint for what the result contains.)

I’ve added initial support for type annotations. I don’t use them myself so would like feedback.

Finally, I switched the build system from the “legacy” interface to “PEP 517” interface. This turned out to be rather more difficult then I expected because, I think, the people who designed PEP 517 support for setuptools didn’t have much experience with building Python/C extensions, or for working offline like I do.

The changes only affect people building or testing from a source release. I’ve updated the README with details about the issues and how to resolve possible problems.

Other changes

  • Passing a float or integer to the progress argument now specifies the initial delay before showing a progress bar. (As before, True means to always use the progress bar, False means to never use the progress bar, and if it’s a callable then it must implement the tqdm.tqdm function API.)

  • The initial delay can be specified by the CHEMFP_PROGRESS environment variable either as (finite) float or integer, to mean the delay in seconds, or as “on” and “off” to always enable or disable the progress bar. The CHEMP_PROGRESSBAR environment variable is still supported. If “1” (on) or “0” (off) it overrides the CHEMFP_PROGRESS setting.

  • The simsearch command has a new --ordering option, to specify the array order. It can be “increasing-score”, “decreasing-score”, “increasing-score-plus”, “decreasing-score-plus”, and “reverse”. The “*-plus” options break ties consistently, based on index or id, depending on the search type.

  • The Butina --times now also reports the sort time, if needed.

  • The MaxMin output #type line now uses num-initial-picks instead of --num-initial-picks.

  • The simsearch “npz” JSON metadata array now also includes the array shape.

  • Added a suppress_log_output() context manager to the toolkit APIs, to disable the toolkit-specific warning messages sent to the underlying C API. It is used like:

with tk.suppress_log_output():
   tk.parse_smistring(smi)  # may generate a warning message.
  • The high-level chemfp.simsearch() function and the lower-level chemfp.search methods now take a num_threads parameter, to specify directly the number of threads to use. If None (the default), this uses the value returned by chemfp.get_num_threads().

  • Fixed spherex --dise-references so it actually works.

  • Replaced the function chemfp.get_max_threads() (documented as unusable) with chemfp.get_omp_num_threads(), which makes an OpenMP team and returns the number of threads OpenMP actually creates, or 1 if OpenMP is not available.

  • Added the new arena method get_popcount_offsets(). If there is a popcount index (which means the fingerprints are ordered by popcount) then this returns an array.array(“I”) of offsets, where the fingerprints with popcount p are at index i where arr[p] <= i < arr[p+1]. If there is no index, this returns None.

  • Added SearchResults.to_numpy_array() to get the similarity matrix for NxN and NxM search results as a dense NumPy array. The returned matrix has the same shape, (#queries, #targets) as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.

  • Added Marvin support for RDKit in parse_molecule(), parse_id_and_molecule(), create_string() and create_bytes(), by using the “mrv” format.

  • FPB id lookup now supports negative indexing.