What’s new in chemfp 5.0¶

Version 5.0b2 released 20 August 2025.

The main additions to chemfp 5.0 are:

Update the FPB format to handle over 1 billion fingerprints.
New chemfp shardsearch command-line tool which does similarity search across multiple target files and merges the result.
New chemfp simhistogram / chemfp simhist command-line tool and corresponding chemfp.simhistogram() high-level API function to create a histogram of similarity scores.
Experimental support to generate sparse count fingerprints using RDKit along with methods to convert sparse count fingerprints to binary fingerprints and vice versa.
Implementations of the 4860-bit Klekota-Roth fingerprint for the OpenEye and RDKit toolkits.

See below for more details.

Backwards-incompatible changes¶

Dropped support for Python 3.8.

If no explicit fingerprint type is specified then rdkit2fps now generates Morgan fingerprints with radius 3 instead of RDKit’s Daylight-like fingerprints. Pass an explicit --RDK on the command-line to get the RDKit fingerprint type. This change was documented in the chemfp 4.2 release notes.

The toolkit helper functions matching the pattern “from_{format}”, “to_{format}”, “from_{format}_file” and “to_{format}_file” (like chemfp.rdkit_toolkit.from_smi to parse a SMILES record into a molecule) were added in chemfp 4.0 and documented as deprecated in chemfp 4.1. Using them in chemfp 4.2 generated a DeprecationWarning. With chemfp 5.0 they have been removed. Instead, use the “parse_{format}” and “create_{format}” functions, like parse_smi() instead of from_smi().

The chemfp.bitops functions byte_difference() and hex_difference() have been removed. Using them generated a PendingDeprecationWarning in chemfp 4.1 and a DeprecationWarning in chemfp 4.2. Instead, use byte_xor and hex_xor.

The fingerprint type methods from_mol() and from_mols() were documented as deprecated in chemfp 4.2. Using them now generates a DeprecationWarning. They will be removed by chemfp 5.2. Instead, use compute_fingerprint() or compute_fingerprints().

The high-level chemfp.simsearch(), chemfp.maxmin(), chemfp.spherex(), and chemfp.heapsweep() functions provide access to the low-level data structure as the attribute out. Originally this was stored in the result attribute, but experience showed that referring to the low-level object as result.result was more confusing than using result.out. The out alias for result was added in chemfp 4.2. Using result in chemfp 5.0 gives a PendingDeprecationWarning. In chemfp 5.1 it will give a DeprecationWarning, with removal in chemfp 5.2.

WARNING: RDKit 2024.09.3 fixed a bug in how Morgan fingerprints handled chiral atoms (see “Morgan fingerprints with chirality distinguish chiral atoms too early”), causing some fingerprints to change when includeChirality=1. This does not affect the default Morgan fingerprints, which have includeChirality=0.

Previously this would trigger a version number change in the chemfp fingerprint, from RDKit-Morgan/2 to RDKit-Morgan/3. However, by design the version number change triggers downstream compatibility warnings, because chemfp has no way to figure out if the version difference is meaningful for a given set of options, and I haven’t figured out a good solution.

WARNING: RDKit version 2024.9.2 changed the implementation of includeRedundantEnvironments=1 but the RDKit-Morgan version number has not changed, for the same reason as above.

NOT YET: The chemfp 4.2 release notes mention that chemfp 5.0 will change so that progress bars aren’t shown unless there is a certain minimum delay. This change has not been made, but may be in the future.

NOT YET: The chemfp 4.2 release notes mention that the “npz” JSON metadata array format “will likely change”, to be more consistent with then-new simarray JSON. This has not yet happened but may in the future.

Updated FPB format¶

The original FPB format could not handle more than about 270 million fingerprints due to an implementation detail in the “HASH” chunk. Chemfp 5.0 supports a new “HSH2” chunk which should handle at least 2 billion fingerprints, though it has only been tested with 1 billion fingerprints.

See the FPB specification for details.

Warning

Even though the --max-spool-size of the fpcat command-line tool helps limit the amount of RAM needed, the final FPB generation step loads all of the ids into memory. You will likely need about 35-40 GB of RAM to generate an FPB file with 1 billion fingerprints.

shardsearch¶

The new chemfp shardsearch command-line tool does similarity search across multiple target files, possibly in parallel, and merges the result.

Chemfp’s simsearch searches a single target file. Sometimes it is useful to search multiple target files, each of which I’ll refer to as a “shard.”

The most common case is to work with data sets in the ~1-10 billion fingerprint range, where it is difficult or impossible to store all of the data into a single FPB file, even with the new HSH2 chunk mentioned above. Instead, split the data set into a dozen or two shards of 100M-200M fingerprints each.

% chemfp shardsearch --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
        --threshold 0.2 shard*.fpb --out csv
query_id,target_id,score
Query1,CHEMBL327316,0.2203390
Query1,CHEMBL539424,0.2340426
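The merge step can be pictured with a short sketch. It is not chemfp's implementation: it assumes each shard search has already produced a list of (target_id, score) hits for one query, the helper name merge_shard_hits is hypothetical, and the split of the two hits above into two shards is made up for illustration.

```python
import heapq
from itertools import chain

def merge_shard_hits(per_shard_hits, k=None):
    """Merge per-shard (target_id, score) hit lists for a single query.

    Hypothetical helper for illustration; not part of chemfp.
    Returns the hits sorted by decreasing score, limited to the
    k best hits when k is given.
    """
    all_hits = chain.from_iterable(per_shard_hits)
    if k is None:
        return sorted(all_hits, key=lambda hit: hit[1], reverse=True)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

# Pretend each shard returned one of the two hits shown above.
shard1_hits = [("CHEMBL327316", 0.2203390)]
shard2_hits = [("CHEMBL539424", 0.2340426)]

merged = merge_shard_hits([shard1_hits, shard2_hits])
best = merge_shard_hits([shard1_hits, shard2_hits], k=1)
```

A threshold search only needs the concatenate-and-sort branch, while a k-nearest search must re-select the k best hits after merging, because each shard contributes its own local top k.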

Another common case is to handle data sets which are larger than available RAM. Chemfp by default uses memory-mapped FPB files, which lets it work with datasets that are larger than memory. While the operating system will cache the most commonly used parts of the file, as the file size increases the cache becomes less useful, which means the data must be re-read from the much slower hard disk or network file system.

What should take seconds in RAM may instead take 20 minutes.

Shardsearch can be configured to process a user-defined number of queries against each shard in order. If the shard fits into memory then the data is only loaded once.

If the shards are on a network file system then you should compress the FPB files with ZStandard, that is, as fpb.zst files, because most of the time will be spent transferring the data from the remote computer, and decompressing is fast.

Finally, some data sets have a monthly or quarterly primary release with daily or weekly updates for records which changed. It can be easier to use one shard for each update cycle, do the shardsearch, and post-process the output to handle possible duplicates than to create a single FPB file for the entire dataset.

simhistogram¶

The new “simhistogram” feature computes a histogram of the Tanimoto similarity scores in a fingerprint dataset with itself (“NxN”), or between two different fingerprint datasets (“NxM”). It is available using the chemfp simhistogram command-line tool or in the Python API using the chemfp.simhistogram() function.

The command-line tool, when used to generate an NxM histogram, can also include the NxN histograms for each of the two input datasets.

These histograms can give a sense of the overall similarity distribution. For example, the following image shows the distribution of Tanimoto scores between ChEBI and ChEMBL, and compares that distribution with the distribution of scores for each dataset with itself.

Histograms of the Tanimoto score distribution of ChEBI vs ChEMBL, and of each dataset with itself. Based on 100,000 random samples (without duplicates) using RDKit Morgan fingerprint with radius 2.

It was generated with the following command. The listing shows the command, the progress bar, and the first 27 lines of output (the first line of output starts with #simhistogram/1):

% chemfp simhist --queries chebi_morgan.fpb chembl_35.fpb \
      --num-samples 100_000 --include-inputs
NxM: 100%|██████████████████████████| 100000/100000 [00:03<00:00, 31624.51/s]
#simhistogram/1 identity-bin=0
#type=sample matrix-type=NxM bins=100 num-samples=100000 seed=1719484950
#size=281256950220 queries=113658 targets=2474590
#identical=0
#average=0.090 min=0.085 max=0.095
#queries-type=sample matrix-type=upper-triangular bins=100 num-samples=100000 seed=2822803993
#queries-size=6459013653 N=113658
#queries-identical=3
#queries-average=0.098 min=0.093 max=0.103
#targets-type=sample matrix-type=upper-triangular bins=100 num-samples=100000 seed=336081226
#targets-size=3061796596755 N=2474590
#targets-identical=0
#targets-average=0.111 min=0.106 max=0.116
#queries=chebi_morgan.fpb
#targets=chembl_35.fpb
start end     count   percent queries_count   queries_percent targets_count   targets_percent
0.00  0.01    1981    1.981   3492    3.492   110     0.110
0.01  0.02    2166    2.166   2436    2.436   278     0.278
0.02  0.03    2837    2.837   3087    3.087   533     0.533
0.03  0.04    4048    4.048   4014    4.014   1043    1.043
0.04  0.05    5646    5.646   5696    5.696   2023    2.023
0.05  0.06    7541    7.541   7633    7.633   3756    3.756
0.06  0.07    8779    8.779   8854    8.854   5775    5.775
0.07  0.08    9532    9.532   9401    9.401   7931    7.931
0.08  0.09    9937    9.937   9563    9.563   9842    9.842
0.09  0.10    8784    8.784   8215    8.215   9944    9.944
0.10  0.11    8901    8.901   8102    8.102   11587   11.587

The simhist subcommand is an alias for simhistogram.

This shows the default output format, which includes a header section containing metadata about the calculation. Use --out or an appropriate output filename extension to get just the histogram data formatted as “tsv” or “csv”.

simhistogram output header¶

The output contains a header followed by the histogram lines. The first header line identifies the format version and subtype. The #type line describes how the histogram was generated, including any parameters. In this case it contains 100,000 scores randomly sampled from an NxM similarity matrix, placed into 100 bins. The initial RNG seed, in this case generated by Python’s built-in RNG, is 1719484950.

The #size line reports the ChEBI and ChEMBL datasets have 113,658 and 2,474,590 fingerprints, respectively, so the full NxM comparison matrix contains over 281 billion scores.

Based on the histogram data (computed as the weighted average of bin count * bin center), the #average ChEBI vs. ChEMBL score is 0.090. This may not be the actual average score because the actual scores may be anywhere in their bins, but it is certainly between 0.085 and 0.095 (the min and max values on that line), because in this case the bin width is 0.01 and there are no #identical scores. (Internally chemfp always places the identical scores in their own bin, with a bin center of 1.0 and width of 0.0, even with the default of --no-identity-bin, which affects how average, min, and max are calculated.)
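The arithmetic behind those three numbers can be sketched as follows. This is a minimal sketch, not chemfp's code: the function name histogram_average_bounds is hypothetical, and it assumes the lower and upper bounds come from placing every score at the start or end of its bin. It also ignores the separate zero-width bin that chemfp keeps for identical scores.

```python
def histogram_average_bounds(edges, counts):
    """Estimate the average score from binned counts.

    Hypothetical helper for illustration. Returns (average, low, high)
    where the average puts every score at its bin center, and low/high
    put every score at its bin start/end.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    starts, ends = edges[:-1], edges[1:]
    low = sum(c * s for c, s in zip(counts, starts)) / total
    high = sum(c * e for c, e in zip(counts, ends)) / total
    average = sum(c * (s + e) / 2
                  for c, s, e in zip(counts, starts, ends)) / total
    return average, low, high

# Two bins of width 0.5: three scores in the first, one in the second.
average, low, high = histogram_average_bounds([0.0, 0.5, 1.0], [3, 1])
```

With equal-width bins the bounds are always half a bin width below and above the estimate, which matches the 0.090, 0.085, and 0.095 values in the header for a bin width of 0.01.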

The lines starting #queries- give the corresponding details for the strictly upper triangle of the similarity scores of the --queries dataset with itself, and the lines starting #targets- do the same for the targets.

The #queries and #targets lines describe the source for the queries and targets, which is typically the filename.

By default (with --no-identity-bin) there are N bins, where N is the value of --num-bins, and each bin has the width 1/N. The first N-1 bins are half-closed/half-open, while the last bin is closed. For example, if there are 2 bins then bin 0 contains the count of scores S where 0 <= S < 0.5, and bin 1 contains the count of scores where 0.5 <= S <= 1.0.

With --identity-bin, the first N bins are all half-closed/half-open, and an additional last bin contains the count of scores which are exactly 1.0.
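The binning rule can be written as a small function. This is a sketch under stated assumptions rather than chemfp's implementation: the name bin_index is hypothetical, and floating-point rounding of scores that fall very close to a bin edge is glossed over.

```python
def bin_index(score, num_bins, identity_bin=False):
    """Return the bin number for a Tanimoto score in [0.0, 1.0].

    Hypothetical helper for illustration; not part of chemfp.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score == 1.0:
        # With an identity bin, exact matches go into the extra bin N.
        # Otherwise the last regular bin is closed and includes 1.0.
        return num_bins if identity_bin else num_bins - 1
    # Regular bins are closed on the left and open on the right.
    return min(int(score * num_bins), num_bins - 1)
```

For example, with 2 bins a score of 0.5 lands in bin 1, and a score of 1.0 lands in bin 1 by default or in bin 2 when the identity bin is enabled.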

simhistogram output data¶

The lines after the header contain the histogram data, starting with column titles.

If the output is a single histogram (the default) then the column titles are “start”, “end”, “count”, and “percent”. If the output contains three histograms (using --include-inputs) then the column titles are “start”, “end”, “count”, “percent”, “queries_count”, “queries_percent”, “targets_count”, and “targets_percent”.

The “start” and “end” columns contain the bin start and end positions, which for bin i (where 0 <= i < N) are i/N and (i+1)/N, respectively. The start and end positions of the optional --identity-bin bin are always 1.0.

The “count” column contains the number of scores in the given bin, and “percent” is that count expressed as a percentage of the number of samples.

The “queries_count”, “queries_percent”, “targets_count”, and “targets_percent” columns contain the query-specific or target-specific equivalents to “count” and “percent”.

The “csv” or “tsv” output formats only show the output data. For example, the following generates a histogram in CSV format with 11 bins (10 bins plus a special bin for scores which are exactly 1.0) using 100 million samples, without duplicates, of the over 3 trillion Tanimoto scores when comparing ChEMBL 35 with itself:

% chemfp simhist chembl_35.fpb --num-bins 10 --identity-bin \
    --num-samples 100_000_000 --seed 12345 --out csv --no-progress
0.0,0.1,41311159,41.311
0.1,0.2,56230689,56.231
0.2,0.3,2362965,2.363
0.3,0.4,81778,0.082
0.4,0.5,9764,0.010
0.5,0.6,2465,0.002
0.6,0.7,776,0.001
0.7,0.8,267,0.000
0.8,0.9,93,0.000
0.9,1.0,29,0.000
1.0,1.0,15,0.000

The --seed initializes the RNG used for sampling, making the output reproducible. The --no-progress disables the progress bar.

High-level chemfp.simhistogram function¶

Use the new chemfp.simhistogram() function to create a histogram using Python.

The API is influenced by the NumPy histogram function API, which uses bins (NOT num_bins!) to specify the number of bins. The histogram result can be assigned to (hist, bin_edges) containing the histogram values and the bin edges, respectively.

For example:

>>> import chemfp
>>> hist, bin_edges = chemfp.simhistogram(targets="chembl_35.fpb", bins=5, progress=False)
>>> hist
array('Q', [24369, 627, 4, 0, 0])
>>> bin_edges
array('d', [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

The actual SimHistogramResult object (which implements the item lookup [0] and [1] used for the above 2-element tuple assignment) stores more information about the results, query parameters, and timings:

>>> result = chemfp.simhistogram(targets="chembl_35.fpb", bins=10,
...              num_samples=100_000_000, seed=12345)
scores: 100%|█████████████████████████████████████| 100M/100M [00:19<00:00, 5.12M/s]
>>> result
SimHistogramResult(<sampled upper-triangle, 100000000 pairs (0.003%), seed=12345,
10 bins>, times=<total: 19.56 s>)
>>> result.bins
array('Q', [41311159, 56230689, 2362965, 81778, 9764, 2465, 776, 267, 93, 44])
>>> result.edges
array('d', [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
>>> result.num_identical
15
>>> result.seed
12345
>>> f"{result.num_processed:,} of {result.total_size:,}"
'100,000,000 of 3,061,796,596,755'
>>> result.times
{'load_queries': None, 'load_targets': 0.00724029541015625, 'init': 0.0054018497467041016,
'process': 19.547301769256592, 'total': 19.55998182296753}
>>> result.close()

sparse count fingerprints¶

Chemfp 5.0 adds experimental and limited support for sparse count fingerprints.

The new FPC format is a text-based format for sparse count fingerprints.

The new chemfp rdkit2fpc subcommand uses RDKit to generate sparse count fingerprints in FPC format from structure files.

The new chemfp fpc2fps subcommand implements several different methods to convert a sparse count fingerprint into a (dense) binary fingerprint in FPS or FPB format.

The new chemfp fps2fpc subcommand converts binary fingerprints into FPC format.

FPC format¶

The new FPC format for sparse count fingerprints is almost identical to the FPS format for binary fingerprints. Here is an example:

#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-15T11:17:47+00:00
847433064,2245897107,2551683561       CHEMBL183419
864674487,1215180924,1600340860:2,2215059400,2245384272:2,2246728737:2,3542456614:2,3994088662:2      CHEMBL16264

The main difference is in the fingerprint record lines, where the hex-encoded fingerprint is replaced with a text encoding for sparse count fingerprints.

Each sparse count fingerprint contains a sequence of zero or more features. If there are no features then the corresponding text encoding is *, otherwise it’s a comma-separated list of text-encoded features.

Each feature contains a feature id (also called a bit position), which is a non-negative integer less than 2^64, and a feature count, which is a non-negative integer less than 2^32.

The feature encoding is the string with the feature id, followed by a colon, followed by the count, as in “{feature id}:{feature count}”. For example, if the feature id is 87501 and the feature count is 505 then the encoded feature is “87501:505”.

As a special case, if the feature count is 1 then the feature may be encoded as just the feature id, eg, “87501” instead of “87501:1”.

A feature count of 0 is valid, in which case the feature may be ignored.

Lastly, the feature ids must be in increasing order.
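The encoding rules above are simple enough to sketch in a few lines of Python. This is an illustrative parser and formatter for the fingerprint field only, not chemfp's own reader. The function names are hypothetical, and it assumes that “increasing order” means strictly increasing feature ids.

```python
def parse_fpc_features(text):
    """Parse the sparse count field of an FPC record into (id, count) pairs.

    Hypothetical helper for illustration; not part of chemfp.
    """
    if text == "*":
        return []  # "*" encodes a fingerprint with no features
    features = []
    previous_id = -1
    for term in text.split(","):
        id_text, sep, count_text = term.partition(":")
        feature_id = int(id_text)
        count = int(count_text) if sep else 1  # a bare id means a count of 1
        if not 0 <= feature_id < 2**64:
            raise ValueError(f"feature id out of range: {feature_id}")
        if not 0 <= count < 2**32:
            raise ValueError(f"feature count out of range: {count}")
        if feature_id <= previous_id:
            # Assumption: ids must be strictly increasing.
            raise ValueError("feature ids must be in increasing order")
        previous_id = feature_id
        features.append((feature_id, count))
    return features

def format_fpc_features(features):
    """Format (id, count) pairs using the short form for counts of 1."""
    if not features:
        return "*"
    return ",".join(str(i) if c == 1 else f"{i}:{c}" for i, c in features)

co2 = parse_fpc_features("864942730:2,1517923320:2,2245900962,2962078081")
```

Here co2 is a list of four (feature id, count) pairs, two with a count of 2 and two with the implicit count of 1.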

rdkit2fpc¶

The rdkit2fpc subcommand uses RDKit to generate Morgan, RDKit (Daylight-like), AtomPair, and Torsion count fingerprints, with Morgan fingerprints of radius 3 as the default if a specific type is not given. Here are examples of each type using CO2 read from stdin as a SMILES file:

% echo "O=C=O carbon dioxide" | chemfp rdkit2fpc --morgan
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-18T12:30:31+00:00
864942730:2,1517923320:2,2245900962,2962078081        carbon dioxide
%
% echo "O=C=O carbon dioxide" | chemfp rdkit2fpc --RDK
#FPC1
#type=RDKit-CountFingerprint/3 minPath=1 maxPath=7
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-18T12:30:49+00:00
1488142319,1628256115:2,1638855635,2172716083:2       carbon dioxide
%
%  echo "O=C=O carbon dioxide" | chemfp rdkit2fpc --pair
#FPC1
#type=RDKit-AtomPairCount/3 minDistance=1 maxDistance=30
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-18T12:31:02+00:00
1721921:2,1723682     carbon dioxide
%
% echo "O=C=O carbon dioxide" | chemfp rdkit2fpc --torsion
#FPC1
#type=RDKit-TorsionCount/4 torsionAtomCount=4
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-18T12:31:05+00:00
*     carbon dioxide

fpc2fps¶

Chemfp does not currently support similarity search directly on FPC files. Instead, sparse count fingerprints must be converted to FPS or FPB format using the chemfp fpc2fps command, then searched as binary fingerprints.

The FPC format supports (2^64 * 2^32) = 2^96 distinct features. There is no general purpose method to convert them to a short dense fingerprint (typically with a few thousand bits) where the generalized Jaccard index/Tanimoto score between any two fingerprints is preserved.

On the other hand, approximate methods may be good enough. Let N be the number of bits in the binary fingerprint and for a given feature let i be the feature id (ie, the sparse bit position) and n be the feature count.

One classic method is to “fold” each feature. For each feature, compute the binary bit position b by computing b = i % N then set that bit to 1. Ignore the count.

This is the default RDKit method, following the footsteps of Daylight which introduced this approach.

The “fold” method only distinguishes between counts of zero and not-zero, and cannot distinguish between two feature ids which fold to the same bit. In the following, which uses the Morgan3 count fingerprint for CO2, you can see that four output bits are set. They come from the four non-zero feature ids:

% printf '30:2,1517923320:2,2245900962,2962078081\tCO2\n' | \
    chemfp fpc2fps --num-bits 32 --fold
#FPS1
#num_bits=32
#type=fold/1 num_bits=32
06000041      CO2
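The folding step is easy to reproduce by hand. The following is a minimal sketch, not chemfp's implementation: the function name fold is hypothetical, and it assumes the hex encoding stores bit b in byte b // 8 at bit position b % 8, which is consistent with the output shown above.

```python
def fold(features, num_bits):
    """Fold (feature_id, count) pairs into a binary fingerprint.

    Hypothetical helper for illustration; not part of chemfp.
    """
    fp = bytearray((num_bits + 7) // 8)
    for feature_id, count in features:
        if count == 0:
            continue  # a zero count sets no bit
        bitno = feature_id % num_bits  # the count itself is ignored
        fp[bitno // 8] |= 1 << (bitno % 8)
    return bytes(fp)

features = [(30, 2), (1517923320, 2), (2245900962, 1), (2962078081, 1)]
folded_hex = fold(features, 32).hex()
print(folded_hex)  # prints 06000041
```

The four feature ids fold to bits 30, 24, 2, and 1, which is why exactly four bits are set.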

The fpc2fps default method is “superimpose”, which applies aspects of the Zatocoding method of Calvin Mooers and the RNG-seed of Daylight fingerprints to better emulate count fingerprints.

The key observation is that many binary fingerprints, like the folded circular Morgan fingerprints, set fewer than 5% of the bits, and the total count (the sum of count for each feature) for the original sparse count fingerprints is still small relative to the binary fingerprint size.

For example, of the 2.37 million Morgan fingerprints with radius=3 computed from ChEMBL 33, only 62,000 have a total count of 100 or more, and only 600 have a total count of 200 or more, with CHEMBL3298848 having the highest total count at 284.

On the other hand, the corresponding binary fingerprints are typically 1024 to 4096 bits. These many zeros could be used to encode count information.

The “superimpose” method can be described with the following pseudocode:

binary_fp = BinaryFingerprint(num_bits)
for (feature_id, feature_count) in count_fp.features:
    # Seed the RNG with the feature id so a feature always selects the same bits
    rng = RNG(feature_id)
    # Apply the optional upper limit to the count
    feature_count = min(feature_count, max_feature_count)
    for _ in range(feature_count * bits_per_count):
        bitno = rng.randrange(num_bits)
        binary_fp.SetBitToOn(bitno)

That sets count * bits_per_count output bits for each feature id, selected at random (with replacement) from an RNG seeded by the feature id, with a possible upper limit to the count. As bits_per_count and num_bits increase, the effect of collisions decreases, though no one has yet characterized its effectiveness.
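A runnable version of the same idea is sketched below. It uses Python's random.Random as a stand-in RNG, which is an assumption made only for illustration, so it will not reproduce the bit patterns that chemfp generates. The function name superimpose and its keyword defaults are also hypothetical.

```python
import random

def superimpose(features, num_bits, bits_per_count=1, max_feature_count=None):
    """Emulate counts by setting several pseudo-random bits per feature.

    Illustrative only: Python's random module is a stand-in for whatever
    RNG chemfp uses, so the selected bits differ from chemfp's output.
    """
    fp = bytearray((num_bits + 7) // 8)
    for feature_id, count in features:
        rng = random.Random(feature_id)  # same id -> same bit sequence
        if max_feature_count is not None:
            count = min(count, max_feature_count)
        for _ in range(count * bits_per_count):
            bitno = rng.randrange(num_bits)  # with replacement
            fp[bitno // 8] |= 1 << (bitno % 8)
    return bytes(fp)

def popcount(fp):
    return bin(int.from_bytes(fp, "little")).count("1")

features = [(30, 2), (1517923320, 2), (2245900962, 1), (2962078081, 1)]
fp = superimpose(features, 32)
once = superimpose([(30, 1)], 32)
twice = superimpose([(30, 2)], 32)
```

Because the RNG is seeded by the feature id, the bits set for a count of 1 are always a subset of the bits set for a count of 2 of the same feature. That nesting is what lets the binary Tanimoto respond to differences in counts.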

In the following you can see that five output bits are set, while the total feature count is 2+2+1+1 = 6:

% printf '30:2,1517923320:2,2245900962,2962078081\tCO2\n' | \
    chemfp fpc2fps --num-bits 32 --superimpose
#FPS1
#num_bits=32
#type=superimpose/1 num_bits=32
04010045      CO2

The “rdkit” method implements RDKit’s “count simulation” option. It takes C count bounds {c_j}, where 0 <= j < C, and works in two steps. First, it creates a dense count fingerprint with N / C counts by merging the sparse count fingerprint into the dense one: the dense count S at position p is the sum of all feature counts where i % (N/C) equals p. Second, for each position p and each bound c_j, it sets output bit p * C + j to 1 iff S >= c_j.

If you can figure out what I wrote here, well done! If not, see chemfp fpc2fps --help-methods for more details.

In the following you can see that six output bits are set, which corresponds to the total feature count of 2+2+1+1=6:

% printf '30:2,1517923320:2,2245900962,2962078081\tCO2\n' | \
    chemfp fpc2fps --num-bits 32 --rdkit --countBounds 1,2
#FPS1
#num_bits=32
#type=rdkit-count-sim/1 num_bits=32 countBounds=1,2
14000330      CO2
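The description of the “rdkit” method may be easier to follow as code. This sketch follows the two steps described above and is neither RDKit's nor chemfp's implementation. The function name count_simulation is hypothetical, and it assumes the same bit-to-hex layout as the output shown (bit b in byte b // 8, bit position b % 8).

```python
def count_simulation(features, num_bits, count_bounds):
    """Convert (feature_id, count) pairs using count bounds.

    Hypothetical helper for illustration, following the prose description.
    """
    num_bounds = len(count_bounds)        # C
    num_slots = num_bits // num_bounds    # N / C dense count positions
    # Step 1: merge the sparse counts into a dense count fingerprint.
    sums = [0] * num_slots
    for feature_id, count in features:
        sums[feature_id % num_slots] += count
    # Step 2: each dense count S sets bit (slot * C + j) when S >= c_j.
    fp = bytearray((num_bits + 7) // 8)
    for slot, total in enumerate(sums):
        for j, bound in enumerate(count_bounds):
            if total >= bound:
                bitno = slot * num_bounds + j
                fp[bitno // 8] |= 1 << (bitno % 8)
    return bytes(fp)

features = [(30, 2), (1517923320, 2), (2245900962, 1), (2962078081, 1)]
simulated_hex = count_simulation(features, 32, [1, 2]).hex()
print(simulated_hex)  # prints 14000330
```

With bounds of 1 and 2, a dense count of 1 sets one bit and a dense count of 2 or more sets two bits, giving 2+2+1+1 = 6 bits for this input.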

The “seq” method is for dense count fingerprints (that is, count fingerprints with at most a few hundred feature ids F, in the sequence 0, 1, ... F-1). The driving use case for “seq” encoding is a fingerprint type defined by a few dozen SMARTS patterns, where the count is the number of unique matches, where the count is typically small (eg, under 20), and where a relatively small upper bound (eg, 100) can be applied to each count.

These constraints mean the dense count fingerprint can be represented exactly as a binary fingerprint using unary encoding, which in turn means the generalized Jaccard index between two count fingerprints is identical to the Tanimoto similarity between two such binary fingerprints.

“Unary” encoding means that 1 distinct bit is set for each count, up to a maximum number W. For example, if W is 5 then the possible values, represented in binary, are “00000” for 0, “00001” for 1, “00011” for 2, “00111” for 3, “01111” for 4, and “11111” for 5 or more.

In the following, the first bin is 8 bits/1 byte long and contains the value 4, which is “00001111” or “0f” in hex. The second bin is 8 bits/1 byte long and contains the value 0, or “00” in hex. The third bin is 16 bits/2 bytes long and contains the value 3, or “00000000 00000111”, which is “0700” in chemfp’s hex fingerprint representation. The fourth bin is 16 bits/2 bytes long and contains the value 9, or “00000001 11111111”, which is “ff01”. The remaining values are 0:

% printf '0:4,2:3,3:9\tABC\n' | \
    chemfp fpc2fps --seq --sizes 8,8,16,16,8,8
#FPS1
#num_bits=64
#type=seq/1 num_bits=64 sizes=8,8,16,16,8,8
0f000700ff010000      ABC
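Unary encoding, and the claim that it preserves the similarity score exactly, can both be checked with a short sketch. The function names here are hypothetical and this is not chemfp's implementation. The counts used are the ones described in the text (4, 0, 3, and 9 for the first four bins).

```python
def seq_encode(counts, sizes):
    """Unary-encode a dense count fingerprint.

    counts maps feature id -> count; sizes[i] is the bit width of bin i.
    Hypothetical helper for illustration; not part of chemfp.
    """
    fp = bytearray((sum(sizes) + 7) // 8)
    offset = 0
    for feature_id, width in enumerate(sizes):
        n = min(counts.get(feature_id, 0), width)  # clip to the bin width
        for bitno in range(offset, offset + n):
            fp[bitno // 8] |= 1 << (bitno % 8)
        offset += width
    return bytes(fp)

def tanimoto(fp1, fp2):
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def generalized_jaccard(counts1, counts2):
    ids = set(counts1) | set(counts2)
    top = sum(min(counts1.get(i, 0), counts2.get(i, 0)) for i in ids)
    bottom = sum(max(counts1.get(i, 0), counts2.get(i, 0)) for i in ids)
    return top / bottom if bottom else 0.0

sizes = [8, 8, 16, 16, 8, 8]
abc = {0: 4, 2: 3, 3: 9}
other = {0: 2, 1: 1, 3: 12}
abc_hex = seq_encode(abc, sizes).hex()
print(abc_hex)  # prints 0f000700ff010000
```

As long as no count exceeds its bin width, tanimoto(seq_encode(abc, sizes), seq_encode(other, sizes)) equals generalized_jaccard(abc, other), here 11/20.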

The other two methods are “scaled” and “seq-scaled” which are like “superimpose” and “seq” except they support a conversion scale from the feature count to the actual count to use. This can be used to create a hybrid between the “superimpose” and “rdkit” methods or to approximate TF-IDF (Term Frequency-Inverse Document Frequency).

Use chemfp fpc2fps --help-methods for more details about these methods and their respective command-line options. Please contact me if you are researching alternative methods of count emulation, which I think is a good research topic.

fps2fpc¶

The chemfp fps2fpc subcommand converts binary fingerprints from an FPS or FPB file into a count fingerprint in FPC format. Given a binary fingerprint, the corresponding feature fingerprint is * if the fingerprint is empty, or a comma-separated list of the on-bit indices, in increasing order.

% printf '04010045\tXYZ\n' | chemfp fps2fpc
#FPC1
#type=fps2fpc/1
2,8,24,26,30  XYZ
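The conversion can be checked by hand with a few lines of Python. This is a sketch of the mapping described above, not chemfp's code, and the function names are hypothetical. It assumes bit b is stored in byte b // 8 at bit position b % 8, which matches the example.

```python
def on_bits(hex_fp):
    """Return the on-bit indices of a hex-encoded fingerprint, in order."""
    fp = bytes.fromhex(hex_fp)
    return [byteno * 8 + bit
            for byteno, byte in enumerate(fp)
            for bit in range(8)
            if (byte >> bit) & 1]

def to_fpc_field(hex_fp):
    """Format the on-bits as an FPC fingerprint field ("*" when empty)."""
    bits = on_bits(hex_fp)
    return ",".join(map(str, bits)) if bits else "*"

xyz_field = to_fpc_field("04010045")
print(xyz_field)  # prints 2,8,24,26,30
```

Each on-bit becomes a feature with the implicit count of 1, so no “:count” suffix is needed.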

Klekota-Roth fingerprints¶

The chemfp 5.0 release includes fast implementations of the 4860-bit Klekota-Roth substructure fingerprint for the OpenEye and RDKit toolkits, available in oe2fps and rdkit2fps using --KlekotaRoth. The corresponding chemfp fingerprint types are “KlekotaRoth-OpenEye/1” and “KlekotaRoth-RDKit/1”.

Klekota and Roth in “Chemical substructures that enrich for biological activity”, Bioinformatics, Volume 24, Issue 21, November 2008, Pages 2518–2525, doi:10.1093/bioinformatics/btn479 used a decision tree to identify privileged substructures associated with biological activity. The paper lists these substructures as 4860 SMARTS patterns.

I read the paper and thought the approach was both novel and interesting, with plenty of opportunity for someone to revisit it with new datasets. However, very few papers cite it, and the only implementation I could find was for the CDK.

In asking around, I learned that one stumbling block was its poor performance. A direct implementation which does 4860 SMARTS substructure tests per molecule can process about 47 molecules per second using RDKit, or about 6 hours per million molecules.

By comparison, rdkit2fps takes about 5.5 minutes to generate 1 million Morgan3 fingerprints, and 17 minutes to generate the 1 million 166 key MACCS fingerprints.

I developed a method which analyzes the 4860 patterns to find dependencies between them. For example, if pattern 4842 (“S”) does not match, then there is no reason to test for pattern 1148 (“[!#1][SH]”). The method includes a few additional fragments, like nn, which were found to improve the overall performance. It then generates a Python module which implements an optimized fingerprint generator.
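The pruning idea can be illustrated with a toy sketch. It is not the generated module and makes several loud assumptions: the function name and data layout are hypothetical, the dependency table is written by hand, and the matcher is a plain substring test on SMILES text rather than real SMARTS substructure matching.

```python
def compute_pattern_bits(mol, patterns, requires, matches):
    """Evaluate patterns, skipping any whose prerequisite did not match.

    patterns maps pattern id -> pattern; requires maps pattern id -> the
    id of a pattern which must match first. Hypothetical illustration.
    """
    results = {}
    num_tests = 0

    def test(pattern_id):
        nonlocal num_tests
        if pattern_id not in results:
            parent = requires.get(pattern_id)
            if parent is not None and not test(parent):
                results[pattern_id] = False  # skipped, no match possible
            else:
                num_tests += 1
                results[pattern_id] = matches(patterns[pattern_id], mol)
        return results[pattern_id]

    for pattern_id in patterns:
        test(pattern_id)
    return results, num_tests

# Toy stand-ins: substring tests on SMILES text, NOT real SMARTS matching.
toy_patterns = {4842: "S", 1148: "[SH]"}
toy_requires = {1148: 4842}

def toy_match(pattern, smiles):
    return pattern in smiles

no_sulfur, tests_without_s = compute_pattern_bits(
    "CCO", toy_patterns, toy_requires, toy_match)
with_thiol, tests_with_s = compute_pattern_bits(
    "CC[SH]", toy_patterns, toy_requires, toy_match)
```

For the molecule without sulfur only one test is performed instead of two. Across thousands of patterns, most of which depend on uncommon atoms or fragments, this kind of short-circuiting is where the savings come from.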

The faster implementation for RDKit takes about 21.5 minutes to generate a million fingerprints, which is about 17 times faster than the direct implementation, and only about 25% slower than the MACCS keys.

Note

If you develop fingerprints based on a large number of SMARTS strings and want to improve generation performance then contact me to see if my method is applicable.

New APIs¶

Chemfp 5.0 added several new API functions.

The function chemfp.get_metadata_from_source() reads the Metadata from a given source, which must be an FPS or FPB file (FPC is not currently supported). This method can read the metadata from a compressed FPB file without first decompressing the entire file.

>>> import chemfp
>>> chemfp.get_metadata_from_source("chembl_34.fpb.zst")
Metadata(num_bits=2048, num_bytes=256, type='RDKit-Morgan/1 radius=2
fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1',
aromaticity=None, sources=[], software='RDKit/2022.09.4', date=None)

The FingerprintArena.close() method now explicitly discards its reference to the internal fingerprint data storage, even for in-memory arenas. This makes the memory use lifetime more explicit, which was needed in shardsearch to ensure a large dataset was deallocated before loading a new large dataset. The arena context manager will close when exiting.

The new FingerprintArena.get_popcount_counts() method returns an array of length num_bits + 1 where element i contains the number of fingerprints with the popcount i. If the arena contains a popcount index (which requires the fingerprints to be in popcount order) then the array is computed directly from the index. If there is no popcount index then by default it computes the popcount for each fingerprint. If num_samples is an integer (and not None) then the counts come from sampling num_samples distinct fingerprints and computing their popcounts. Use seed to seed the underlying RNG.

>>> import chemfp
>>> arena = chemfp.load_fingerprints("maccs_50k1.fpb")
>>> len(arena)
50001
>>> arena.get_popcount_counts()
array('I', [0, 1, 0, 3, 6, 8, 12, 20, 9, 24, 24, 31, 33, 33, 48, 67,
62, 93, 92, 143, 164, 193, 247, 284, 328, 404, 441, 496, 534, 528,
583, 684, 704, 864, 845, 958, 1007, 1009, 1082, 1128, 1152, 1263,
1272, 1219, 1310, 1337, 1297, 1378, 1331, 1316, 1302, 1383, 1232,
1315, 1257, 1242, 1190, 1135, 1101, 1106, 1040, 961, 964, 940, 824,
759, 705, 642, 556, 534, 473, 409, 374, 324, 284, 250, 239, 216,
162, 191, 145, 112, 113, 81, 68, 74, 58, 47, 25, 20, 20, 20, 9, 8,
2, 4, 7, 6, 0, 3, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0])
>>> sum(arena.get_popcount_counts())
50001
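To make the meaning of the returned array concrete, here is a pure-Python sketch which computes the same kind of table from a list of byte-string fingerprints. It is not how chemfp computes it (chemfp can read the values directly from the popcount index), and the function name popcount_counts is hypothetical.

```python
from array import array

def popcount_counts(fingerprints, num_bits):
    """Count how many fingerprints have each popcount from 0 to num_bits.

    Hypothetical helper for illustration; not part of chemfp.
    """
    counts = array("I", [0] * (num_bits + 1))
    for fp in fingerprints:
        popcount = bin(int.from_bytes(fp, "little")).count("1")
        counts[popcount] += 1
    return counts

# Four 16-bit fingerprints with 0, 1, 3, and 16 bits set.
fps = [b"\x00\x00", b"\x01\x00", b"\x03\x80", b"\xff\xff"]
counts = popcount_counts(fps, 16)
```

The array has num_bits + 1 elements because a fingerprint may have anywhere from 0 to num_bits bits set, and the elements always sum to the number of fingerprints.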

The new SearchResult.get_indices_ids_and_scores() method returns a list of 3-element tuples containing the (index, id, score) for each hit, in the current order.

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chembl_34.fpb")
>>> query_fp = arena.fingerprints[1_000_000]
>>> result = arena.knearest_tanimoto_search_fp(query_fp, k=4)
>>> result.get_indices_ids_and_scores()
[(1000000, 'CHEMBL4456699', 1.0),
 (923024, 'CHEMBL4462517', 0.631578947368421),
 (854991, 'CHEMBL4564372', 0.6140350877192983),
 (856777, 'CHEMBL4590721', 0.5081967213114754)]

The FingerprintType.get_type() method has a new complete parameter. By default get_type returns the canonical string for the fingerprint type, which might exclude some settings with default parameters. (This generally occurs for newly added parameters, to preserve compatibility with the older canonical string.) If complete is True then the method returns the full type string:

>>> import chemfp
>>> chemfp.rdkit.atom_pair.get_type()
'RDKit-AtomPair/3 fpSize=2048 minDistance=1 maxDistance=30
countSimulation=1'
>>> chemfp.rdkit.atom_pair.get_type(complete=True)
'RDKit-AtomPair/3 fpSize=2048 minDistance=1 maxDistance=30
countSimulation=1 includeChirality=0 use2D=1 fromAtoms=None'

CDK¶

Chemfp 5.0 added support for new features in CDK 2.10 and 2.0.

Added support for the square planar, octahedral, and trigonal bipyramidal SmiFlavor flags added in CDK 2.10. They are ignored when using older versions of CDK.

Added version “2.10” Daylight, GraphOnly, and Hybridization fingerprinters. They support the “hashExplicitHydrogens” option, with a default of 0. (Explicit hydrogens are ignored.) The older “2.0” versions always included explicit hydrogens.

Added version “2.10” ExtendedFingerprinter. Internally it uses a DaylightFingerprinter with hashExplicitHydrogens=0. The CDK API does not have a way to configure this value.

Added a “prepare” option to the CDK SMILES, SDF, molfile, and InChI readers. If true (the default) it identifies rings and perceives aromaticity.

Other¶

Chemfp 5.0 added support for Python 3.13, NumPy 2.0+, and click 8.2.

There are also some performance improvements with FPS parsing and FPB generation.

FIX: FPB generation can build larger-than-RAM datasets by saving partial data to temporary files (a “spool”) then merging the results. The estimator for the spool size was incorrect, which would limit spool sizes to under 1 GB. The estimator has been fixed.

Added specialized byte Tanimoto implementations for different fingerprint sizes. This is an internal detail which improves simhistogram performance but does not affect the API. Use the “report-byte-tanimoto” option to enable or disable reporting of the selected implementation:

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chembl_34.fpb")
>>> chemfp.simhistogram(targets=arena, progress=False)
SimHistogramResult(<sampled upper-triangle, 25000 pairs (0.0000%),
seed=1375723649, 100 bins>, times=<total: 2.37 s>)
>>> from chemfp import bitops
>>> bitops.set_option("report-byte-tanimoto", 1)
>>> chemfp.simhistogram(targets=arena, progress=False)
Byte tanimoto method: size256 (popcnt_128_128) num_bits: 2048
arena1: 0x10b5d9100 (128 byte aligned) storage_len1: 256
arena2: 0x10b5d9100 (128 byte aligned) storage_len2: 256
SimHistogramResult(<sampled upper-triangle, 25000 pairs (0.0000%),
seed=3331938674, 100 bins>, times=<total: 657.78 ms>)

FIX: Fixed a progress bar bug when the input is an FPB file. Thanks to Pavel Polishchuk who reported this bug.