What’s new in chemfp 4.2¶
Released 10 July 2024.
The main additions to chemfp 4.2 are:
the “simarray” methods to generate a complete (dense) all-by-all NumPy matrix using several comparison metrics;
support for the latest RDKit fingerprint generators, which includes new options like count simulation;
improved support for CDK fingerprints and new support for the jCompoundMapper fingerprint types;
support for Python 3.12 and NumPy 2.0, PEP 517-builds with pyproject.toml, and initial support for type annotations.
Warnings about likely upcoming API changes¶
I am planning some changes for the chemfp 5.0 release.
Warning
Changing rdkit2fps default fingerprint type to to Morgan
For the chemfp 5.0 release I will likely change the default rdkit2fps
fingerprint type from RDKit-Fingerprint (the Daylight-like
fingerprints) to the Morgan fingerprint with radius 3. If you
currently use the default of RDKit fingerprints then you should add
--RDK
now so this change won’t affect you. Let me know if this is
an issue for you.
Warning
Removing from_*
and to_*
toolkit helper functions.
The toolkit shortcut functions matching the pattern “from_{format}”,
“to_{format}”, “from_{format}_file” and “to_{format}_file” were added
in chemfp 4.0. In practice these were inconsistent with other parts
of the chemfp API so they were deprecated by 4.1, with a
PendingDeprecationWarning suggesting you use the “parse_{format}” and
“create_{format}” functions. For example, use parse_smistring()
instead of from_smistring()
.
These old functions now generate a DeprecationWarning and will be removed in chemfp 5.0.
Warning
Removing bitops.byte_difference() and bitops.hex_difference()
The chemfp.bitops
module functions byte_difference() and
hex_difference() functions were deprecated in favor of byte_xor
and hex_xor
. Starting
with 4.1 these generated a PendingDeprecationWarning. They now
generate a DeprecationWarning and will be removed for chemfp 5.0.
Warning
Adding a default delay for progress bars.
Chemfp 4.0 added progress bars, which are great for long-running
tasks, but sometimes chemfp can show three or four progress bars in a
fraction of a second, where I don’t care to see any progress bars at
all. While I’ve gotten into the habit of using progress=False
for
when I know I don’t care, I think a better solution is to introduce a
minimum delay, like 0.5 seconds, before showing any progress bar.
This 4.2 release adds the initial support for that option, like
passing progress=0.5
, or using the CHEMFP_PROGRESS
environment
variable, though the default is to always show a progress bar.
Warning
Many object repr strings are likely to change.
As I get feedback about people using the API, and more experience using it myself, I’ve noticed various places where I can improve the repr.
Warning
SearchResult’s “result” will likely change to “out”.
The high-level chemfp.simsearch()
function returns a high-level
search result object which itself contains a result
attribute
containing the low-level search search object.
I’ve found that referring to the low-level object as result.result
to be very confusing.
With chemfp 4.2 that object is also available as result.out
,
following the “out” parameter convention used in NumPy. If it proves
better I will make the old result
attribute generate a
PendingDeprecationWarning in chemfp 5.0, for eventual removal by
chemfp 5.1 or 5.2.
Warning
The “npz” JSON metadata array format will likely change.
The new simarray functionality has the option to store metadata as a JSON string in a NumPy array, but the JSON schema is different than the JSON used in the simsearch “npz” format . Based on that experience, I plan to change the “npz” JSON format so the two are more consistent.
Simarray¶
Chemfp has new support for creating the complete set of comparisons between a fingerprint set and itself (“symmetric” or “NxN”), the complete set of comparison between a query fingerprint set and a target fingerprint set (“NxM”) or between a query fingerprint and a target fingerprint set (“N”).
chemfp.simarray()¶
The primary use case is to generate a NumPy array as input to sci-kit
clustering algorithms and other algorithms which require the full
(dense) matrix as input. Here is an example using the
chemfp.simarray()
function:
>>> import chemfp
>>> simarr = chemfp.simarray(targets="benzodiazepine_morgan2.fps.gz", as_distance=True)
benzodiazepine_morgan2.fps.gz: 100%|███████████████| 492k/492k [00:00<00:00, 15.1Mbytes/s]
scores: 100%|██████████████████████████████████████| 76.7M/76.7M [00:01<00:00, 73.0M/s]
>>> simarr
SimarrayResult(metric=<1-Tanimoto distance>, out=<12386x12386 symmetric array of dtype float64>,
query_ids=target_ids=['1688', '1963', '2118', ...], times=<process: 1.05 s, total: 1.11 s>)
>>> simarr.out[:3,:3]
array([[0. , 0.65 , 0.51923077],
[0.65 , 0. , 0.46428571],
[0.51923077, 0.46428571, 0. ]])
>>> import sklearn.cluster
>>> agg_complete = sklearn.cluster.AgglomerativeClustering(linkage="complete", metric="precomputed")
>>> agg_complete.fit(simarr.out)
AgglomerativeClustering(linkage='complete', metric='precomputed')
In this case it took about 50ms to load the 12,386 fingerprints and 1.05 seconds to generate the complete distance matrix, computed as 1-Tanimoto. A complete timing breakdown is:
>>> simarr.get_times_description()
'load_targets: 54.84 ms init: 200.03 us process: 1.05 s total: 1.11 s'
with the raw times (in seconds) available as the simarr.times
dictionary.
Use save()
to save the array in NumPy’s
“npy” format, or in a “bin”ary format containing only the raw data
bytes:
>>> simarr.save("benzodiazepine.npy")
chemfp simarray command-line tool¶
On the command-line, the new chemfp simarray subcommand will generate the all-by-all comparison and save the result to a NumPy “npy” file containing:
% chemfp simarray benzodiazepine_morgan2.fps.gz --as-distance \
--no-lower-triangle -o benzodiazepine_NxN.npy
benzodiazepine_morgan2.fps.gz: 100%|█████| 492k/492k [00:00<00:00, 17.7Mbytes/s]
scores: 100%|███████████████████████████████| 76.7M/76.7M [00:00<00:00, 86.5M/s]
% python
...
>>> import numpy as np
>>> arr = np.load("benzodiazepine_NxN.npy")
>>> arr[:3,:3]
array([[0. , 0.65 , 0.51923077],
[0. , 0. , 0.46428571],
[0. , 0. , 0. ]])
The --no-lower-triangle
(or include_lower_triangle=False
in
the API) leaves the lower triangle as the zero values. This option
improves the overall performance but does not save any space!
The “npy” file contains up to four arrays, though the np.load()
only returns the first one, with the comparison values. The second
contains metadata about the comparison, as a JSON-encoded 1-element
NumPy string array. The third contains the target ids (or simply the
ids for an NxN comparison), and the fourth contains the query ids, if
relevant. You can use np.load()
with a file handle to read these
arrays directly, or use chemfp’s new load_simarray()
:
>>> import chemfp
>>> simarr = chemfp.load_simarray("benzodiazepine_NxN.npy")
>>> simarr
SimarrayFileContent(metric=<1-Tanimoto distance>,
out=<12386x12386 upper-triangular array of dtype float64>,
query_ids=target_ids=['1688', '1963', '2118', ...])
>>> import pprint
>>> pprint.pprint(simarr.get_metadata())
{'dtype': 'float64',
'format': 'multiple',
'matrix_type': 'upper-triangular',
'method': 'Tanimoto as_distance=1',
'metric': {'as_distance': True,
'is_distance': True,
'is_similarity': False,
'name': 'Tanimoto'},
'metric_description': '1-Tanimoto distance',
'num_bits': 2048,
'shape': (12386, 12386)}
See the chemfp.simarray_io
documentation for a description of
the JSON content.
Finally, the chemfp.simarray()
API function
creates the entire comparison array in-memory, as well as the
chemfp simarray subcommand when generating
“npy” output. This may be a problem when computing a large comparison
array which may not fit into memory.
Generating large arrays¶
If you want to create an array which exceeds available memory, the
chemfp simarray subcommand has special
support for computing 2GB portions of the output array a time, and
writing the results to a “bin” file. This file contains the raw data
bytes for the array, equivalent to NumPy’s arr.tobytes()
.
The following example computes the cosine similarity (as 32-bit
floats) for 100,000 randomly selected PubChem fingerprints. The raw
bytes are saved to pubchem_100K.bin
, with additional data about
the calculation saved to pubchem_100K_meta.npy
:
% chemfp simarray --metric cosine --dtype float32 pubchem_100K.fpb \
-o pubchem_100K.bin --metadata-output pubchem_100K_meta.npy
scores: 100%|█████████████████| 10.0G/10.0G [01:22<00:00, 122M/s]
% ls -lh pubchem_100K.bin
-rw-r--r-- 1 dalke admin 37G Jun 18 10:09 pubchem_100K.bin
That’s 122 million comparisons per second, or just under 90 seconds for 100Kx100K = 10 billion comparisons, on my 16GB laptop which has 12GB in use for other things.
The “bin” file can be memory-mapped directly into NumPy, if you know the shape and dtype:
>>> import numpy as np
>>> arr = np.memmap("pubchem_100K.bin",
shape=(100_000, 100_000), dtype=np.float32)
>>> arr[5000:5003,5000:5003]
memmap([[1. , 0.6666667 , 0.47619048],
[0.6666667 , 1. , 0.3809524 ],
[0.47619048, 0.3809524 , 1. ]], dtype=float32)
The auxillary “metadata” npy file is the same format as the normal
simarray npy file except the first array contains a 0x0 array with the
appropriate NumPy data. The JSON shape
value in the second array
can be used to get the size, followed by the ids in the third array.
If you have the “bin” and auxillary “npy” file then the easiest way to
access the data is with chemfp.load_simarray()
:
>>> import chemfp
>>> simarr = chemfp.load_simarray("pubchem_100K.bin",
... metadata_source="pubchem_100K_meta.npy")
>>> simarr
SimarrayFileContent(metric=<cosine similarity>, out=<100000x100000
symmetric array of dtype float32>, query_ids=target_ids=[
'783', '1038', '23913', ...])
>>>
>>> import numpy as np
>>> np.count_nonzero(simarr.out)
9807887295
By default is uses a memory-map to read a “bin” file as read-only,
while it loads an “npy” into memory. Use the mmap_mode
to change
these behaviors.
Single fingerprint queries and NxM queries¶
The simarray()
function can also be used to compute all the
comparisons between a query fingerprint and a set of target
fingerprints, like the following example which computes the Tanimoto
similarity between CHEMBL113 and all of ChEMBL 33, then shows the
histogram counts (which acts as a log scale graphic):
>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_33.fpb")
>>> query_fp = a.get_fingerprint_by_id("CHEMBL113")
>>> simarr = chemfp.simarray(query_fp=query_fp, targets=a)
scores: 100%|█████████████████████████████████████| 2.37M/2.37M
[00:00<00:00, 3.99M/s]
>>> simarr
SimarrayResult(metric=<Tanimoto similarity>, out=<2372674 vector of
dtype float64>, query_ids=None, target_ids=['CHEMBL17564',
'CHEMBL4300465', 'CHEMBL4302507', ...], times=<process: 620.04 ms,
total: 623.88 ms>)
>>> import numpy as np
>>> for count, bin in zip(*np.histogram(simarr.out, bins=20)):
... print(f"{bin:.2f}: {count}")
...
0.00: 190089
0.05: 1682625
0.10: 455922
0.15: 34920
0.20: 5233
0.25: 2074
0.30: 912
0.35: 482
0.40: 222
0.45: 82
0.50: 60
0.55: 40
0.60: 7
0.65: 3
0.70: 2
0.75: 0
0.80: 0
0.85: 0
0.90: 0
0.95: 1
It can also be used to generate the NxM comparison matrix between two sets of fingerprints, like the following which computes the Hamming distances between two randomly selected subsets of size 5,000 and 20,000 (without overlaps), then generates the same sort of histogram:
>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_33.fpb")
>>> subset1, subset2 = a.train_test_split(
train_size=5_000, test_size=20_000)
>>> simarr = chemfp.simarray(queries=subset1, targets=subset2,
metric="Hamming", progress=False)
>>> simarr.get_out_description()
'5000x20000 array of dtype uint16'
>>> import numpy as np
>>> for count, bin in zip(*np.histogram(simarr.out, bins=20)):
... print(f"{int(bin):4d}: {count}")
...
0: 277
14: 10988
29: 482755
44: 7157451
59: 29524500
74: 36458980
89: 16786454
103: 5651194
118: 2229461
133: 980313
148: 396547
163: 171267
178: 91759
193: 33880
207: 16056
222: 5979
237: 1896
252: 213
267: 26
282: 4
This functionality is also available through the simarray subcommand, both for a single query:
% chemfp simarray \
--hex-query 000000003000000001d414c913a3914380b03a6a1b \
maccs166.fpb -o maccs.npy
and for a set of queries against targets:
% chemfp simarray --queries 10_queries.fps chembl_33.fpb \
--as-distance -o jaccard_distance.npy
Metrics and data types¶
In simarray, the term “metric” refers to the fingerprint comparison method, which may be a similarity (“Tanimoto”, “Dice”, or “cosine”), a distance (“Hamming”), or one of three different “abcd” methods described in the next subsection. There is also an option to convert a similarity into a distance.
Some metrics supports multiple NumPy data types. If neither the metric nor dtype is specified then chemfp will compute the Tanimoto similarity using the float64 datatype. (What C calls a “double”) and if the dtype is “float32” it will use a 32-bit float, which takes half the space with no information less for 4096-bit fingerprints or smaller. (2117/4142 and 2094/4097 are the same float32 value.)
If 4 bytes per score is too high and you are willing to bin similar
scores together, use the “uint16” dtype to compute Tanimoto as a
scaled unsigned 16-bit integer computed as int(score * 65535)
.
On the other hand, use “rational64” or “rational32” if you want the exact ratio, stored as a structured data type with pair of respectively uint32 or uint16 values containing the numerator (“p”) and denominator (“q”), where the numerator is the intersection popcount and the denominator is the union popcount. The rational32 dtype uses the same space as float32 but with no information loss.
The Dice similarity also supports the float64 (the default), float32, scale uint16, rational64, and rational32 dtypes.
The cosine similarity only supports float64, float32, and scaled uint16.
The Hamming distance only supports the uint16 dtype.
If the metric is “Tanimoto”, “Dice”, or “cosine” and
as_distance=True
(in the API) or --as-distance
(on the
command-line), then the distance value 1-similarity
will be
computed instead, or the corresponding scaled version.
If the similarity is 0/0 then the float64, float32, and scale uint16
values will be 0, while the distance version wil be 1. The rational
similarity will be p=q=0
while the distance will be p=q=n
where n
is the number of fingerprint bits.
The “abcd” data type¶
What if you want a different metric?
There are scores to choose from. The paper Holliday, J.D., Hu, C.-Y. and Willett, P. (2002) Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-strings. Combinatorial Chemistry and High Throughput Screening 5(2) pp.155-166 compares 22 different ones, and suggests researches also consider the “Russell/Rao, Baroni-Urbani/Buser, SimpleMatch, Stiles, Ochiai/Cosine, and Kulczynski(2) coefficients.”
The paper has a table which describes those coefficients in terms of “a”, “b”, “c”, and “d”. For example, when expressed as a Python or C expression:
Baroni-Urbani/Buser: (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)
What are “a”, “b”, “c”, and “d”? From the paper:
Assume that the bit-strings XP and XQ denote two molecules P and Q, respectively, and that each of these strings contains a total of n bit positions. Assume further that b and c of these n bits are set to one only in XP and only in XQ, respectively, that a of these n bits are set to one in both XP and XQ, and that d of these n bits are not set in either XP and XQ (so that n = a + b + c + d).
In chemfp, this is the “Sheffield” convention, and the corresponding metric stores the comparison in a structured data type with 4 uint16 element named “a”, “b”, “c”, and “d”. Here’s an example:
>>> import chemfp
>>> simarr = chemfp.simarray(targets="distinct.fps", metric="Sheffield", progress=False)
>>> simarr.get_out_description()
'12x12 symmetric array of dtype abcd'
>>> simarr.out.dtype
dtype([('a', '<u2'), ('b', '<u2'), ('c', '<u2'), ('d', '<u2')])
The easiest way to compute the Baroni-Urbani/Buser association coefficient is using NumPy’s own operators:
import numpy as np
def baroni_urbani(abcd_arr):
a, b, c, d = abcd_arr["a"], abcd_arr["b"], abcd_arr["c"], abcd_arr["d"]
numerator = np.sqrt(a * d) + a
return numerator / (numerator + b + c)
which generates the following values:
>>> baroni_urbani(simarr.out)[:4, :4]
array([[1. , 0.43166688, 0.35355338, 0.37293836],
[0.43166688, 1. , 0.4552797 , 0.42 ],
[0.35355338, 0.4552797 , 1. , 0.35606548],
[0.37293836, 0.42 , 0.35606548, 1. ]], dtype=float32)
No doubt there are faster methods, including clever ones which mutate the array in-place. That exercise is left to the reader.
chemfp’s three “abcd” metrics¶
The previous section referenced a paper by Holliday et al. which used what chemfp calls the “Sheffield” convention for the “a”, “b”, “c”, and “d” values. That is one of three conventions you are are likely to come across in cheminformatics papers.
The most common is the “Willett” convention:
a is the number of bits “on” in fingerprint A
b is the number of bits “on” in fingerprint B
c is the number of bits “on” in both fingerprints A and B
d is the number of bits “off” in both fingerprints A and B
This is used by Sheffield publications between roughly the mid 1980s to roughly the mid 2000s, especially those by Peter Willett, as well as those external to Sheffield. (chemfp internally uses this convention.)
The next most common is the “Sheffield” convention, which is:
a is the number of bits “on” in both fingerprints A and B
b is the number of bits “on” in fingerprint A which are “off” in fingerprint B
c is the number of bits “on” in fingerprint B which are “off” in fingerprint A
d is the number of bits “off” in both fingerprints A and B
The “Sheffield” convention was used by papers from the Sheffield group up until the mid 1980s, and in most Sheffield papers past about 2008, or starting in 2002 if John Holliday is a co-author. The naming is, alas, confusing because Willett was at Sheffield and Willett’s later papers switched from “Willett” convention to “Sheffield” convention. I chose it because it best reflects the impact of Willett’s work promoting molecular similarity during the 1980s and 1990s.
In third place, and far less common, is the “Daylight” convention, which comes from John Bradshaw, is described in the Daylight theory page on “Fingerprints - Screening and Similarity”, and is used by the Daylight software:
a is the number of bits “on” in fingerprint A which are “off” in fingerprint B
b is the number of bits “on” in fingerprint B which are “off” in fingerprint A
c is the number of bits “on” in both fingerprints A and B
d is the number of bits “off” in both fingerprints A and B
The low-level simarray API¶
The high-level API is built on a lower-level API based around a
SimarrayProcessor
, which is created via one of three class
methods. Once created, use next(n)
to process at least n comparisons (it may process more depending on
the size of the matrix and number of threads; the function returns the
number actually processed) or use process_all()
to process all remaining
comparisons, along with a progress bar:
>>> import chemfp
>>> a = chemfp.load_fingerprints("distinct.fps")
>>> import chemfp.search
>>> processor = chemfp.search.SimarrayProcessor.from_symmetric(a, metric="Hamming")
>>> processor.num_to_process
78
>>> processor.next(5)
12
>>> processor.process_all()
scores: 100%|█████████████████████████| 66/66 [00:00<00:00, 1048576.00/s]
The three class constructors are
SimarrayProcessor.from_symmetric()
(shown in the example),
SimarrayProcessor.from_NxM()
and
SimarrayProcessor.from_query_fp()
.
I do not expect most people to use the low-level API. If you do, let me know as I’m curious to know what you think.
RDKit fingerprint generation¶
Chemfp’s support for the RDKit Morgan, AtomPair, Torsion, and RDKit-Fingerprint fingerprint types have changed.
Important
The default Morgan fingerprint radius is now 3 instead of 2!
Since around 2020 the RDKit toolkit has been migrating how fingerprints are generated, from a function-based API to a more object-oriented one based around fingerprint “generators”, with new features only available through the new API.
The fingerprints from the new generators are not exactly the same as
the old, in part because a few of the more obscure parameters (like
tgtDensity
) have been removed, and a few others renamed, but the
underlying methods are the same.
Chemfp 4.2 support both old-style and new-style fingerprint generation by using different version numbers. For example, the “RDKit-Morgan/1” fingerprint type generates old-style fingerprints while “RDKit-Morgan/2” generates new-style fingerprints. If the version is unspecified then chemfp will use the most recent version available in your version of RDKit.
Count simulation¶
The most interesting new feature in RDKit’s fingerprint generator is “count simulation”, which lets a binary fingerprint capture some of the abilities of a count fingerprint by setting additional bits based on the corresponding feature count.
For example, by default RDKit’s Morgan fingerprint sets one and only one bit if a feature is present, no matter how many times it is present. Consider this example:
>>> from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
>>> morgan3_gen = GetMorganGenerator()
>>> DataStructs.TanimotoSimilarity(morgan3_gen.GetFingerprint(mol1),
... morgan3_gen.GetFingerprint(mol2))
0.3684210526315789
>>> DataStructs.TanimotoSimilarity(morgan3_gen.GetCountFingerprint(mol1),
... morgan3_gen.GetCountFingerprint(mol2))
0.5172413793103449
Careful examination shows the binary fingerprint only considers feature presence or absence, like in the following where feature 1873 is present 5 times in the Morgan count fingerprint, but only bit 1873 is set in the corresponding binary fingerprint:
>>> morgan3_gen.GetCountFingerprint(mol1).GetNonzeroElements()
{389: 1, 392: 1, 496: 2, 888: 1, 1088: 3, 1171: 1, 1199: 2, 1380: 1,
1457: 1, 1750: 2, 1804: 1, 1873: 5, 1947: 1}
>>> list(morgan3_gen.GetFingerprint(mol1).GetOnBits())
[389, 392, 496, 888, 1088, 1171, 1199, 1380, 1457, 1750, 1804, 1873, 1947]
>>> len(morgan3_gen.GetFingerprint(mol1).GetOnBits())
13
Count simulation lets us include some of the count information in the
binary fingerprint, by setting multiple bits based on the counts. I’ll
create a Morgan fingerprint generator with count simulation enabled,
and with specified countBounds
of “1, 4”. This says that if the
count is at least 1 then set the first bit, and if it’s at least 4
then set the second (consecutive) bit. Since only one feature is
present 4-or-more times, the number of expected bits should increase
by one, and get us a bit closer to the count Tanimoto:
>>> count_sim_gen = GetMorganGenerator(countSimulation=True, countBounds=[1, 4])
>>> DataStructs.TanimotoSimilarity(count_sim_gen.GetFingerprint(mol1),
... count_sim_gen.GetFingerprint(mol2))
0.4
>>> list(count_sim_gen.GetFingerprint(mol1).GetOnBits())
[128, 294, 350, 712, 778, 784, 866, 992, 1452, 1560, 1698, 1699, 1776, 1846]
>>> len(count_sim_gen.GetFingerprint(mol1).GetOnBits())
14
You may have noticed how the bit values changed. This is due to how count emulation works. It generate the bits in a smaller fingerprint space (scaled by the number of bounds, i.e., bits to set), then scales them back up using the bounds to determine which bits to set. The implementation works like this, although in C++:
>>> scaled_fp_gen = GetMorganGenerator(fpSize=2048//2)
>>> counts = scaled_fp_gen.GetCountFingerprint(mol1).GetNonzeroElements()
>>> counts
{64: 3, 147: 1, 175: 2, 356: 1, 389: 1, 392: 1, 433: 1, 496: 2, 726: 2, 780: 1, 849: 5, 888: 1, 923: 1}
>>> bits = set()
>>> for bitno, count in counts.items():
... if count >= 1: bits.add(bitno*2)
... if count >= 4: bits.add(bitno*2+1)
...
>>> sorted(bits)
[128, 294, 350, 712, 778, 784, 866, 992, 1452, 1560, 1698, 1699, 1776, 1846]
If countSimulation
is enabled but countBounds
is not set then
the default is “1,2,4,8”, which results in a pretty close match to the
actual count fingerprint of 0.517 computed earlier:
>>> morgan3_count_sim_gen = GetMorganGenerator(countSimulation=True)
>>> DataStructs.TanimotoSimilarity(morgan3_count_sim_gen.GetFingerprint(mol1),
... morgan3_count_sim_gen.GetFingerprint(mol2))
0.52
For an RDKit blog post about the topic, see: https://greglandrum.github.io/rdkit-blog/posts/2021-07-06-simulating-counts.html
RDKit-Fingerprint type changes¶
The RDKit-Fingerprint (RDKit’s version of the Daylight path, with tree support) has changed from version 2 to version 3.
The “RDKit-Fingerprint/3” options are:
fpSize: number of bits in the fingerprint (default: 2048)
minPath: minimum number of bonds (default: 1)
maxPath: maximum number of bonds (default: 7)
countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)
countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)
nBitsPerFeature: number of bits to set for each path feature (default: 2)
useHs: include information about the number of hydrogens on each atom? (default: True)
branchedPaths: include both branched and unbranched paths (default: True)
useBondOrder: use both bond orders in the path hashes (default: True)
The old nBitsPerHash
has been renamed nBitsPerFeature
, to
match the RDKit API change, and the count simulation parameters are
new.
The default type string is “RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7”, where the remaining parameters, if not specified, have their default value.
This is the default fingerprint type when you use “rdkit2fps”. It is
also available by specifying ‑‑RDK
or ‑‑RDK/3
. The older
fingerprint type is available with ‑‑RDK/2
. Alternatively you can
specify the type string using ‑‑type
.
In the Python API this fingerprint type object is available as
chemfp.rdkit_toolkit.rdk
or
chemfp.rdkit_toolkit.rdk_v3
, with the older type available as
chemfp.rdkit_toolkit.rdk_v2
.
RDKit-Morgan type changes¶
The RDKit-Morgan fingerprint (RDKit’s version of the ECFP and FCFP circular fingerprints) has changed from version 1 to version 2.
Important
The default radius has changed from radius=2 to radius=3, to reflect the default value in GetMorganGenerator()! Previously RDKit did not have a suggested default, so I choose radius=2.
The “RDKit-Morgan/2” options are:
fpSize: number of bits in the fingerprint (default: 2048)
radius: radius for the Morgan algorithm (default: 3) (was 2 in v1!)
useFeatures: use chemical-feature invariants (default: 0)
countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)
countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)
includeChirality: include chirality information in the bond invariants (default: 0)
useBondTypes: (default: 1)
includeRingMembership: if 1, include ring membership in the atom invariants (default: 1)
includeRedundantEnvironments: if 1, include redundant environments in the fingerprint (default: 0)
The old useChirality
has been renamed to includeChirality
. The
includeRingMembership
is new (version 1 always included ring
membership), as are the count simulation options.
The new default type string is “RDKit-Morgan/2 fpSize=2048 radius=3
useFeatures=0”, where the remaining parameters, if not specified, have
their default value. The fpSize
and radius
parameters have
changed position from version 1. The useFeatures
is always
included so people can tell if this is an ECFP-like or FCFP-like
fingerprint. If useFeatures
is True then chemfp will use RDKit’s
rdFingerprintGenerator.AtomInvariantsGenerator()
as the atom
invariant generator.
To specify the Morgan/2 fingerprint type (with radius=3) with
rdkit2fps use ‑‑morgan
or ‑‑morgan3
. Use ‑‑morgan1
,
‑‑morgan2
, ‑‑morgan3
, or ‑‑morgan4
to specify a radius of
1, 2, 3, or 4. Append a /2 to be specific about the version,
e.g. ‑‑morgan/2
.
To use the older Morgan/1 fingerprint type, append /1
, that is,
use ‑‑morgan/1
for the Morgan/1 fingerprints (with radius=2) and
use ‑‑morgan1/1
, ‑‑morgan2/1
, ‑‑morgan3/1
, or
‑‑morgan4/1
to use a Morgan/1 fingerprint with radius of 1, 2, 3,
or 4, respectively.
In the Python API these fingerprint type objects are available in the
chemfp.rdkit_toolkit
module as:
morgan
: latest Morgan fingerprint, radius=3morgan1
: latest Morgan fingerprint, radius=1morgan2
: latest Morgan fingerprint, radius=2morgan3
: latest Morgan fingerprint, radius=3morgan4
: latest Morgan fingerprint, radius=4morgan_v2
: Morgan/2, radius = 3morgan1_v2
: Morgan/2, radius = 1morgan2_v2
: Morgan/2, radius = 2morgan3_v2
: Morgan/2, radius = 3morgan4_v2
: Morgan/2, radius = 4morgan_v1
: Morgan/1, radius=2 (not radius=3!)morgan1_v1
: Morgan/1, radius = 1morgan2_v1
: Morgan/1, radius = 2morgan3_v1
: Morgan/1, radius = 3morgan4_v1
: Morgan/1, radius = 4
RDKit-AtomPair type changes¶
The RDKit-AtomPair has changed from version 2 to version 3.
The “RDKit-AtomPair/3” options are:
fpSize: number of bits in the fingerprint (default: 2048)
minDistance:minimum bond distance for two atoms to be considered a pair (default: 1)
maxDistance: maximum bond distance for two atoms to be considered a pair (default: 30)
countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 0)
countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)
includeChirality: if 1, chirality will be used in the atom invariants (default: 0)
use2D: if 1, use a 2D distance matrix, if 0 use the 3D matrix from the first set of conformers, or return an empty fingerprint if no conformers (default: 1)
The AtomPair/2 minLength
and maxLength
have been renamed to
minDistance
and maxDistance
in AtomPair/3.
The AtomPair/2 nBitsPerEntry
setting is no longer available in
AtomPair/3. The old default was 4. This can be emulated with count
simulation.
The new default type string is “RDKit-AtomPair/3 fpSize=2048 minDistance=1 maxDistance=30 countSimulation=1”, where the remaining parameters, if not specified, have their default value.
To specify the AtomPair/3 fingerprint type with rdkit2fps use
‑‑pair
or ‑‑pair/3
. For the older AtomPair/2 fingerprint type,
use ‑‑pair/2.
In the Python API these fingerprint type objects are available in the
chemfp.rdkit_toolkit
module as pair
for the most recent version, or
pair_v3
for AtomPair/3 and
pair_v2
for AtomPair/2.
RDKit-Torsion type changes¶
The RDKit-Torsion (topological torsion fingerprints) has changed from version 3 to version 4.
The “RDKit-Torsion/4” options are:
fpSize: number of bits in the fingerprint (default: 2048)
torsionAtomCount: the number of atoms to include in the ‘torsions’ (default: 4)
countSimulation: simulate count fingerprints by setting more bits for higher counts (default: 1)
countBounds: list of minimum counts needed to set the corresponding bit during count simulation (default: None)
includeChirality: if 1, include chirality in the atom invariants (default: 0)
onlyShortestPaths: if 1, only include the shortest paths between the start and end atoms, not all paths (default: 0)
The targetSize
parameter is version 3 is now
torsionAtomCount
. The count simulation and onlyShortestPaths
options are new. The nBitsPerEntry
from version 3 can be
implemented with count simulation. I think the old nBitsPerEntry=4
is equivalent to countBounds
with “1,2,4,8” and other
nBitsPerEntry values equivalent to “1,2,3,4,…,$nBitsPerEntry”.
The new default type string is “RDKit-Torsion/4 fpSize=2048 torsionAtomCount=4 countSimulation=1”, where the remaining parameters, if not specified, have their default value.
To specify the Torsion/4 fingerprint type with rdkit2fps use
‑‑torsion
or ‑‑torsion/4
. For the older Torsion/3 fingerprint
type, use ‑‑torsion/3
.
In the Python API these fingerprint type objects are available in the
chemfp.rdkit_toolkit
module as torsion
or
torsion_v4
for Torsion/4 and
torsion_v3
for Torsion/3.
CDK updates¶
Chemfp now supports cdk 2.9, which includes the improvements to the
PubChem fingerprint generator. There’s also a new hydrogens
reader
argument for the SMILES and SDF readers, to specify that implicit
hydrogens should be made explicit, or vice versa.
“hydrogens” reader argument¶
Many of the CDK fingerprint types require all hydrogens be explicit hydrogens (that is, present as atoms in the molecular graph) while a couple require all hydrogens be implicit (that is, as hydrogen counts on some other atom.) To figure out which is which, read the CDK documentation or see the guidance from cdk2fps --help.
On the other hand, the CDK SMILES and SDF parser leave the hydrogens in the form present in the input, so the SMILES “C” has four implicit hydrogens while “[H][CH]([H])[H]” has three explicit hydrogens and one implicit hydrogen on the carbon.
Before chemfp 4.2 you needed to prepare the inputs SMILES or SDF to have the hydrogens in the correct form.
Chemfp 4.2 add the hydrogens
reader argument, which describes how
to process the hydrogens.
The default of as-is
leaves them in the existing state.
The value make-explicit
converts all implicit hydrogens to
explicit form.
The value make-implicit
converts all explicit hydrogens to
implicit form.
The value make-nonchiral-implicit
converts all explicit hydrogens
to implicit form except for chiral hydrogens.
The following examples use the chemfp translate <chemfp_translate> subcommand to convert an input structure file to an output structure file using the CDK. By default reads and writes in SMILES format. I’ll have it process the SMILES for methane which has a mixture of hydrogen representations, to show how “make-explicit” and “make-implicit” do:
% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=as-is
C([H])([H])[H] methane
% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=make-explicit
C([H])([H])([H])[H] methane
% echo '[H][CH]([H])[H] methane' | chemfp translate -T cdk -R hydrogens=make-implicit
C methane
The following shows how the explicit chiral hydrogen is not converted to an implicit hydrogen when using the “make-nonchiral-implicit” setting:
% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk
[C@]([H])(Cl)(F)N example
% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk -R hydrogens=make-implicit
[C@H](Cl)(F)N example
% echo 'N[C@]([H])(F)Cl example' | chemfp translate -T cdk -R hydrogens=make-nonchiral-implicit
C([H])(Cl)(F)N example
The hydrogens
reader argument is also available through the API:
>>> from chemfp.cdk_toolkit import parse_smistring
>>> parse_smistring("[H][CH2][H]").getAtomCount()
3
>>> parse_smistring("[H][CH2][H]", hydrogens="make-implicit").getAtomCount()
1
>>> parse_smistring("[H][CH2][H]", hydrogens="make-explicit").getAtomCount()
5
and through the reader_args
parameter for the reader API:
>>> import chemfp
>>> for id, fp in chemfp.read_molecule_fingerprints(
... "CDK-Daylight", "small.smi", reader_args={"hydrogens": "make-explicit"}):
... print(id, fp)
<many lines omitted>
CDK PubChem fingerprints¶
The PubChemFingerprinter CDK 2.9 no longer requires explicit hydrogens and it uses a ring set definition that better matches the CACTVS substructure keys.
The chemfp fingerprint type string for the new version is “CDK-Pubchem/2.9”, instead of the old “CDK-Pubchem/2.0”.
jCompoundMapper¶
jCompoundMapper is a Java package for fingerprint generation dating from around 2010. The publication is: Hinselmann, G., Rosenbaum, L., Jahn, A. et al. jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints. J Cheminform 3, 3 (2011). https://doi.org/10.1186/1758-2946-3-3
The paper “Effectiveness of molecular fingerprints for exploring the chemical space of natural products” by Boldini, Ballabio, Consonni, Todeschini, Grisoni, and Sieber, J. Cheminform. (2024) 16:35 https://doi.org/10.1186/s13321-024-00830-3 benchmarked “20 molecular fingerprints from four different sources” on natural products and concluded that jCompoundMapper’s All Shortest Paths (ASP) and Local Path Environments (LSTAR) fingerprints were promising.
I therefore decided to include jCompoundMapper fingerprint types as part of the chemfp 4.2 release, at the very least to help others replicate the results.
Integration is tricky. To start, you need to get a copy of one of the jCompoundMapper jar files and put the filename it on your CLASSPATH after the CDK jar. The package is so old that it’s not compatible with the modern CDK. Thankfully, CDK has a startup option to use a backwards compatible implementation, which must be specified when the JVM starts. Chemfp will set that option if it sees either of the two expected jCompoundMapper jar filenames on the CLASSPATH. However, the standard CDK fingerprints don’t work when in backwards-compatibility mode.
For the full setup details see the chemfp.jcmapper_types
documentation.
The suppported fingerprint types and default type strings are:
Depth-First Search (DFS) - “jCMapper-DFS hashsize=4096 searchDepth=7 atomLabel=ELEMENT_NEIGHBOR”
All Shortest Paths (ASP): - “jCMapper-ASP hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR”
Local Path Environments (LSTAR): - “jCMapper-LSTAR hashsize=4096 searchDepth=6 atomLabel=ELEMENT_NEIGHBOR”
Topological Molprint-like fingerprints (RAD2D) - “jCMapper-RAD2D hashsize=4096 searchDepth=3 atomLabel=ELEMENT_SYMBOL”
2-point topological pharmacophore pairs (PH2) - “jCMapper-PH2 hashsize=4096 searchDepth=8”
3-point topological pharmacophore triples (PH3) - “jCMapper-PH3 hashsize=4096 searchDepth=5”
2-point topological atom type pairs (AP2D) - “jCMapper-AP2D hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR”
3-point topological atom type triplets (AT2D)w: - “jCMapper-AT2D hashsize=4096 searchDepth=5 atomLabel=ELEMENT_NEIGHBOR”
The generated fingerprint is hashsize
bits long, which must be a
positive integer. Most fingerprint types takes a searchDepth
which
must be a non- negative integer. It specifies the maximum path length,
circular environment radius, or shell radius to consider.
Most of the fingerprint types support alternative ways to assign a
label to a given atom type, based on different atom and extended atom
properties, which in turn affects fingerprint generation. The
supported atomLabel
methods are:
“CDK_ATOM_TYPES”: CDK atom types (eg, ‘C.sp2’, ‘O.minus’)
“ELEMENT_SYMBOL”: element symbol (eg, ‘C’, ‘O’)
“ELEMENT_NEIGHBOR”: element and number of heavy atom neighbors (eg, ‘C.2’)
“ELEMENT_NEIGHBOR_RING”: element, ring type, and number of heavy atom neighbors (eg, ‘C.a.2’)
“DAYLIGHT_INVARIANT”: “Atomic number, number of heavy atom neighbors, valency minus the number of connected hydrogens, atomic mass, atomic charge, number of connected hydrogens” (eg, ‘6.2.3.12.0.1’ for a carbon in a benzole ring)
“DAYLIGHT_INVARIANT_RING”: DAYLIGHT_INVARIANT followed by a flag if the atom is in a ring (eg, ‘6.2.3.12.0.1.1’)
With cdk2fps use --type
to specify the type string, like:
cdk2fps --type jCMapper-DFS dataset.smi -o dataset.fps
Run cdk2fps --help-jcmapper on the command-line for more examples.
In the chemfp API use chemfp.get_fingerprint_type()
, like the
following two examples:
import chemfp
fptype = chemfp.get_fingerprint_type("jCMapper-LSTAR hashsize=1024"
# or
fptype = chemfp.get_fingerprint_type("jCMapper-LSTAR", {"hashsize": 1024})
Modern Python¶
Behind the scenes there was a lot of work to bring chemfp up to date with modern Python.
This release includes support for Python 3.12, which mostly meant updating to a version of Cython which supported that Python version. Python 3.12 warned me about a number of places where I didn’t quote strings correctly, which was nice. The biggest change was that Python 3.12’s datetime module now requires timezone information, while earlier Python versions didn’t support a built-in timezone. While I fixed this for now, for the chemfp 5.0 release I will likely change the FPS format to support local time in the date, rather than insist on UTC.
I have also fixed a few warnings from Pandas about upcoming API changes, like switching to “iloc” and some strange interaction causing the Cython-generated diversity interface to warn about upcoming copy-on-write changes. (I say “strange” because I could not reproduce it in regular Cpython.)
NumPy 2.0 came out just before the chemfp 4.2 release. The only issues were a couple of places where the repr() of a NumPy string array changed its form, which caused some of the simarray regression tests to fail. (The simarray result repr includes the first few query and target ids, as a hint for what the result contains.)
I’ve added initial support for type annotations. I don’t use them myself so would like feedback.
Finally, I switched the build system from the “legacy” interface to “PEP 517” interface. This turned out to be rather more difficult then I expected because, I think, the people who designed PEP 517 support for setuptools didn’t have much experience with building Python/C extensions, or for working offline like I do.
The changes only affect people building or testing from a source release. I’ve updated the README with details about the issues and how to resolve possible problems.
Experimental support for single-link clustering¶
I expect most people will use scikit-learn or similar tools for
“small” (<50,000 fingerprint) data sets, where the full comparison
matrix easily fits into 16 GB of RAM, with float32 scores. For
multi-million fingerprint data sets, chemfp already has a Butina
implementation which works directly on chemfp’s SearchResults
sparse array.
With this release I have initial support for single link clustering, along with the option to build the dendrogram. It is only available as a low-level API, without the high-level or command-line components you saw for simarray.
At this point I’m looking for people who are interested in evaluating it and providing feedback, as I don’t have the right sort of experience.
If this interests you, take a look at the example in the CHANGELOG and/or contact me.
Other changes¶
Passing a float or integer to the
progress
argument now specifies the initial delay before showing a progress bar. (As before, True means to always use the progress bar, False means to never use the progress bar, and if it’s a callable then it must implement the tqdm.tqdm function API.)The initial delay can be specified by the CHEMFP_PROGRESS environment variable either as (finite) float or integer, to mean the delay in seconds, or as “on” and “off” to always enable or disable the progress bar. The CHEMP_PROGRESSBAR environment variable is still supported. If “1” (on) or “0” (off) it overrides the CHEMFP_PROGRESS setting.
The simsearch command has a new
--ordering
option, to specify the array order. It can be “increasing-score”, “decreasing-score”, “increasing-score-plus”, “decreasing-score-plus”, and “reverse”. The “*-plus” options break ties consistently, based on index or id, depending on the search type.The Butina
--times
now also reports the sort time, if needed.The MaxMin output
#type
line now usesnum-initial-picks
instead of--num-initial-picks
.The simsearch “npz” JSON metadata array now also includes the array shape.
Added a
suppress_log_output()
context manager to the toolkit APIs, to disable the toolkit-specific warning messages sent to the underlying C API. It is used like:
with tk.suppress_log_output():
tk.parse_smistring(smi) # may generate a warning message.
The high-level
chemfp.simsearch()
function and the lower-levelchemfp.search
methods now take a num_threads parameter, to specify directly the number of threads to use. If None (the default), this uses the value returned bychemfp.get_num_threads()
.Fixed spherex
--dise-references
so it actually works.Replaced the function
chemfp.get_max_threads()
(documented as unusable) withchemfp.get_omp_num_threads()
, which makes an OpenMP team and returns the number of threads OpenMP actually creates, or 1 if OpenMP is not available.Added the new arena method
get_popcount_offsets()
. If there is a popcount index (which means the fingerprints are ordered by popcount) then this returns an array.array(“I”) of offsets, where the fingerprints with popcount p are at index i wherearr[p] <= i < arr[p+1]
. If there is no index, this returns None.Added
SearchResults.to_numpy_array()
to get the similarity matrix for NxN and NxM search results as a dense NumPy array. The returned matrix has the same shape, (#queries, #targets) as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.Added Marvin support for RDKit in parse_molecule(), parse_id_and_molecule(), create_string() and create_bytes(), by using the “mrv”
format
.FPB id lookup now supports negative indexing.