What’s new in chemfp 5.1
Version 5.1 was released on 2 April 2026.
The main additions to chemfp 5.1 are:
rdkit2fps’s countSimulation parameter supports a new “superimposed” value, which is better at approximating the count Tanimoto than RDKit’s built-in count simulation. [More …]
New EState, Gobbi2D, and LINGO fingerprint types. [More …]
There is a new count fingerprint API with its own CountFingerprint data type. The API is public, though still experimental. [More …]
There are minor changes to the FPC format, and to the fpc2fps converter. [More …]
Support for Python 3.14. [More …]
See below for more details.
Background
Most of the work for this release went into writing the paper “Superimposed Coding of Count Fingerprints to Binary Fingerprints.” The paper shows that if random superimposed coding is used to convert count fingerprints to binary fingerprints then a Tanimoto nearest-neighbor search of the binary fingerprints is a good approximation of the multiset Tanimoto of the original count fingerprints, and better than using RDKit’s count simulation method. (The multiset Tanimoto is also referred to as the MinMax kernel, among others.)
Information from that paper has informed changes and improvements to this release of chemfp.
That research started when I was writing the documentation for the chemfp fpc2fps subcommand added to chemfp 5.0. It implements several methods to convert sparse count fingerprints into binary. I noticed that the “superimposed” method was better than RDKit’s “countsim” at approximating the count Tanimoto.
After the chemfp 5.0 release, I explored this topic in more detail. Among other things, I learned that there are two forms of the count Tanimoto – the vector Tanimoto from the 1980s and the multiset Tanimoto from the early 2000s. Of the two, I prefer the multiset Tanimoto, which is what RDKit (and now chemfp) implements.
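To make the difference between the two forms concrete, here is a minimal sketch that computes both scores for count fingerprints stored as plain Python dictionaries. The function names are illustrative and are not part of chemfp. Returning 0.0 when both fingerprints are empty is an assumption of this sketch.

```python
def multiset_tanimoto(a, b):
    """Sum of the minimum counts divided by the sum of the maximum counts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0  # assumption: empty vs. empty scores 0.0


def vector_tanimoto(a, b):
    """Dot product divided by (|a|^2 + |b|^2 - dot product)."""
    dot = sum(count * b.get(k, 0) for k, count in a.items())
    den = sum(c * c for c in a.values()) + sum(c * c for c in b.values()) - dot
    return dot / den if den else 0.0  # assumption: empty vs. empty scores 0.0


a = {1: 5, 2: 1}
b = {1: 1, 2: 1}
print(multiset_tanimoto(a, b))  # (1+1)/(5+1) = 1/3
print(vector_tanimoto(a, b))    # 6/(26+2-6) = 3/11
```

The two scores agree for binary fingerprints, where every count is 1, and can differ once counts are larger than 1, as in this example.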
We can interpret count simulation as a way to convert a count fingerprint to a byte fingerprint such that the Tanimoto nearest-neighbor search of the byte fingerprints acts as an approximate nearest-neighbor search of the original count fingerprints. The k-recall@k score is a way to compare different approximation methods. Given the k-nearest neighbors in the approximate (binary) Tanimoto search, which fraction are also k-nearest neighbors in the exact (count) search?
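As a rough illustration of the k-recall@k score, the sketch below compares two ranked hit lists for one query. The function name and the identifiers are hypothetical, and the sketch ignores ties at the k-th position, which a careful evaluation would need to handle.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the approximate top-k that are also in the exact top-k.

    Both inputs are lists of identifiers sorted from most to least similar.
    """
    exact_top = set(exact_ids[:k])
    hits = sum(1 for id_ in approx_ids[:k] if id_ in exact_top)
    return hits / k


# Hypothetical hit lists for one query
exact = ["A", "B", "C", "D", "E"]    # ranked by count Tanimoto
approx = ["A", "C", "B", "F", "E"]   # ranked by binary Tanimoto
print(recall_at_k(exact, approx, 4))  # 3 of the 4 hits agree: 0.75
```

Averaging this value over many queries gives a single recall rate for a conversion method.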
I found that the recall rate for superimposed conversion was higher (in the 0.95 range) than RDKit’s count simulation (around 0.85) or simple folding (around 0.8). The chemfp 5.1 release is very much informed by the results of that research.
I also found that setting more than one bit per feature was not useful for normal bit densities. That is, using 2 bits per feature gave worse recall than 1 bit per feature. This can be compensated by doubling the number of bits, in which case the recall is slightly better. However, doubling the fingerprint size while using 1 bit per feature was better still.
This strongly suggests that RDKit’s default numBitsPerFeature=2 for the RDKit fingerprint should instead be numBitsPerFeature=1.
A paper about this research has been submitted to the Journal of Cheminformatics. A preprint is available as “Superimposed Coding of Count Fingerprints to Binary Fingerprints” from ChemRxiv.
Backwards-incompatible changes
Warning
Changed default value of numBitsPerFeature in rdkit2fpc.
The default value of numBitsPerFeature for the RDKit fingerprint generator is 2. The research described above strongly suggests that setting more than 1 bit per feature does not improve similarity. Therefore, in chemfp rdkit2fpc, the default value for numBitsPerFeature has been changed from 2 to 1. The numBitsPerFeature value is now always included in the metadata to reduce possible confusion between the RDKit and chemfp defaults.
For backwards-compatibility, the default value in rdkit2fps has not changed.
NOT YET: The chemfp 4.2 release notes mention that chemfp 5.0 will change so that progress bars aren’t shown unless there is a certain minimum delay. This change has not been made, but may be in the future.
NOT YET: The chemfp 4.2 release notes mention that the “npz” JSON metadata array format “will likely change”, to be more consistent with then-new simarray JSON. This has not yet happened but may in the future.
rdkit2fps “superimposed” count simulation
The rdkit2fps --countSimulation command-line
parameter now accepts “superimposed” as a value, in addition to
“0” (to do no count simulation), and “1” for RDKit’s count simulation
method.
Note
A valid chemfp license key is required to generate more than 50,000 superimposed count fingerprints in a single process.
This option is available for the RDK, Morgan, AtomPair and Torsion fingerprints.
% echo "c1ccccc1O phenol" | rdkit2fps --fpSize 128 --no-date
#FPS1
#num_bits=128
#type=RDKit-Morgan/2 fpSize=128 radius=3 useFeatures=0
#software=RDKit/2025.09.4 chemfp/5.1
20800008808000000700420010020400 phenol
%
% echo "c1ccccc1O phenol" |
rdkit2fps --fpSize 128 --countSimulation superimposed --no-date
#FPS1
#num_bits=128
#type=RDKit-Morgan/2 fpSize=128 radius=3 useFeatures=0 countSimulation=superimposed
#software=RDKit/2025.09.4 chemfp/5.1
26040850080a824c0800108001181000 phenol
These fingerprints set respectively 13 and 22 bits to 1. The corresponding count fingerprint using chemfp rdkit2fpc, and with a newline added for formatting, is:
% echo "c1ccccc1O phenol" | chemfp rdkit2fpc --numBitsPerFeature 2 --no-date
#FPC1
#num_bits=18446744073709551615
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2025.09.4 chemfp/5.1
26234434,98513984:3,251179073:2,742000539,859799282,864662311,951226070:2,
2763854213,2905660137,2985234959,3217380708,3218693969:5,3999906991:2 phenol
which has 13 distinct features and a total count of 22, which matches the number of bits set to 1 in the binary fingerprints.
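These numbers can be checked with a few lines of Python. This is a minimal sketch that only handles the fingerprint field as it appears in the output above (comma-separated “index” or “index:count” terms). It is not chemfp’s parser, and the helper names are made up for this example.

```python
def parse_fpc_field(field):
    """Parse 'index[:count],...' into a dict; a missing count means 1."""
    features = {}
    for term in field.split(","):
        index, _, count = term.partition(":")
        features[int(index)] = int(count) if count else 1
    return features


def popcount_hex(hex_fp):
    """Number of bits set to 1 in a hex-encoded fingerprint."""
    return bin(int(hex_fp, 16)).count("1")


field = ("26234434,98513984:3,251179073:2,742000539,859799282,864662311,"
         "951226070:2,2763854213,2905660137,2985234959,3217380708,"
         "3218693969:5,3999906991:2")
features = parse_fpc_field(field)
print(len(features), sum(features.values()))              # 13 22
print(popcount_hex("20800008808000000700420010020400"))  # 13
print(popcount_hex("26040850080a824c0800108001181000"))  # 22
```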
Note that rdkit2fpc requires using --numBitsPerFeature 2 to match
the fingerprints generated by rdkit2fps as the default values for those
two programs are different!
In case you are wondering, the RDKit count simulation generates a fingerprint with 18 bits, which is in between the other two methods:
% echo "c1ccccc1O phenol" | rdkit2fps --fpSize 128 --countSimulation 1 --no-date
#FPS1
#num_bits=128
#type=RDKit-Morgan/2 fpSize=128 radius=3 useFeatures=0 countSimulation=1
#software=RDKit/2025.09.4 chemfp/5.1
33011110100000307001000300100000 phenol
count fingerprint API
This release introduces an API for working with count fingerprints. It is still in development and subject to change, but is stable enough for the bold and the curious.
In chemfp 5.0 a count fingerprint was stored as a Python
dictionary. This was not part of the public API. With chemfp 5.1 it is
now its own CountFingerprint data type in the
chemfp.countops module, and available for wider use. Each count
fingerprint feature has a 64-bit unsigned index and a 32-bit unsigned
count. A fingerprint may have at most 1 million features.
One consequence of switching from a Python dictionary to a C data type is that reading an FPC file is twice as fast. This, combined with other optimizations, has led to a 2-3x performance improvement in fpc2fps when using the “fold” or “superimpose” methods.
To read count fingerprints, there are new functions in the
chemfp module. Use read_count_fingerprints() to read
from a file, and use read_count_fingerprints_from_string() to
read from a string.
>>> import chemfp
>>> reader = chemfp.read_count_fingerprints("Morgan3/sparse.fpc")
>>> next(reader)
('CHEMBL440245', CountFingerprint(#features=245))
>>> id, fp = _
>>> len(fp)
245
>>> fp[0]
(8819703, 1)
>>> list(fp)[:3]
[(8819703, 1), (10565946, 6), (21411075, 1)]
The chemfp.countops module contains functions to work with
count fingerprints.
There are functions to parse an FPC-encoded string into a
CountFingerprint, and vice versa, or to create the CountFingerprint from
features.
There are functions to work with the binary string from RDKit’s UInt
and ULong count fingerprints, including to parse the binary, to convert a CountFingerprint to an RDKit
UIntSparseIntVect or
ULongSparseIntVect
and to extract header information from the RDKit binary.
Here are some examples:
>>> from chemfp import countops
>>> fp = countops.CountFingerprint.from_features(
... [(1, 5), (8, 3), (10, 1)])
>>> str(fp)
'1:5,8:3,10'
>>> from rdkit import DataStructs
>>> rdk_fp = DataStructs.UIntSparseIntVect(
... countops.create_rdkit_binary_UIntSparseIntVect(fp, 128))
>>> rdk_fp
<rdkit.DataStructs.cDataStructs.UIntSparseIntVect object at 0x7d5a597575b0>
>>> dict(rdk_fp)
{1: 5, 8: 3, 10: 1}
You can also compute the Tanimoto between two count fingerprints.
>>> id1, fp1 = next(reader)
>>> id2, fp2 = next(reader)
>>> score = countops.count_tanimoto(fp1, fp2)
>>> print(f"score between {id1} and {id2}: {score}")
score between CHEMBL440249 and CHEMBL503643: 0.03484848484848485
as well as between two count dictionaries:
>>> countops.dict_tanimoto({1: 3, 2: 4}, {1: 2, 4: 1})
0.25
>>> countops.dict_tanimoto(dict(fp1), dict(fp2))
0.03484848484848485
If you are interested in working with count fingerprints directly, and would like to experiment with and provide feedback on the API, please contact me.
EState, LINGO, and Gobbi2D fingerprints
I’ve added three new fingerprint types to chemfp, in part due to the successful validation of superimposed count coding, and in part as a way to test the usefulness of the count fingerprint API. These are the EState fingerprints, the Gobbi2D fingerprints, and the LINGO fingerprints.
EState fingerprints
The documentation contains more details about generating EState count fingerprints with RDKit and generating superimposed EState byte fingerprints with RDKit or OEChem.
The EState count fingerprints are based on the substructure patterns from Hall and Kier, “Electrotopological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information”, J. Chem. Inf. Comput. Sci. (1995) 35, pp 1039-1045, doi: 10.1021/ci00028a014, and more specifically derived from the 79 SMARTS patterns used in the RDKit, with modifications to make them more appropriate for atom typing and more portable across chemistry toolkits.
The EState patterns are only used for atom typing. The individual and summed EState values are not computed. The output is a fingerprint with up to 79 indices, where the count at index i is the number of atoms matched by the i’th SMARTS pattern, with a maximum count of 1,000.
The --estate option for the chemfp rdkit2fpc command uses RDKit to generate the EState count
fingerprints in FPC format. The new type name is
“EState-RDKitCount/1”. It takes no parameters.
% echo "CC(=O)C XYZ" | chemfp rdkit2fpc --estate --no-date
#FPC1
#num_bits=79
#type=EState-RDKitCount/1
#software=RDKit/2025.03.3 chemfp/5.1
6:2,15,34 XYZ
The --estate option for the rdkit2fps command
uses RDKit to generate superimposed byte fingerprints using the EState
count fingerprints. The new type name is
“EStateSuperimposed-RDKit/1”. It supports the “fpSize” parameter,
which is the number of bits in the output byte fingerprint. The
default fpSize is 2048. An API shortcut to the byte fingerprint type
is available from the chemfp.rdkit_toolkit module as
“estate_superimposed”.
% echo "CC(=O)C XYZ" | rdkit2fps --estate --fpSize 128 --no-date
#FPS1
#num_bits=128
#type=EStateSuperimposed-RDKit/1 fpSize=128
#software=RDKit/2025.03.3 chemfp/5.1
00080000008000000001000000000200 XYZ
The --estate option for the oe2fps command uses
OpenEye’s OEChem to generate superimposed byte fingerprints using the
EState count fingerprints. The new type name is
“EStateSuperimposed-OpenEye/1”. It supports the “numbits” parameter,
which is the number of bits in the output byte fingerprint. The
default number is 2048. An API shortcut to the byte fingerprint type
is available from the chemfp.openeye_toolkit module as
“estate_superimposed”.
% echo "CC(=O)C XYZ" | oe2fps --estate --numbits 128 --no-date
#FPS1
#num_bits=128
#type=EStateSuperimposed-OpenEye/1 numbits=128
#software=OEChem/4.3.0.1 (20251120) chemfp/5.1
00080000008000000001000000000200 XYZ
The low-level EState atom typing functions are not yet part of the public API. Please contact me if you are interested in using them and can provide feedback.
Gobbi2D
The documentation contains more details about generating Gobbi2D count fingerprints with RDKit.
These are 2D pharmacophore fingerprints generated by RDKit’s Pharm2D module and its definitions for the pharmacophores in A. Gobbi and D. Poppinger, “Genetic optimization of combinatorial libraries”. Biotechnol. Bioeng., (1998), 61, pp. 47-54. doi:10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z.
% echo "NCCCC(=O)C ABC" | chemfp rdkit2fpc --gobbi2d --no-date
#FPC1
#num_bits=39972
#type=RDKit-Gobbi2DCount/1 minPointCount=2 maxPointCount=3 includeBondOrder=0 trianglePruneBins=1
#software=RDKit/2025.03.3 chemfp/5.1
115,150,157 ABC
These are interpretable using a ChemicalFeatures factory set up with the same parameters. In the default case above, this is easily available:
>>> from rdkit.Chem.Pharm2D import Gobbi_Pharm2D
>>> Gobbi_Pharm2D.factory.GetBitDescription(115)
'BG HA |0 3|3 0|'
>>> Gobbi_Pharm2D.factory.GetBitDescription(150)
'HA HA |0 3|3 0|'
>>> Gobbi_Pharm2D.factory.GetBitDescription(151)
'HA HA |0 4|4 0|'
It means there are three 2-point pharmacophores, with “BG” short for
“Basic Group”, “HA” short for “Hydrogen bond Acceptor”, and the
remainder of the string the binned distances in the distance matrix
for each pharmacophore point. See the fdef string in
Gobbi_Pharm2D
for the SMARTS definition for each pharmacophore.
LINGO fingerprints
The documentation contains more details about generating LINGO count fingerprints and generating superimposed LINGO byte fingerprints, and an example of superimposed LINGO similarity search of ChEMBL.
The LINGO fingerprints are based on the work of Vidal, Thormann, and Pons, “LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities”, J. Chem. Inf. Model. (2005) 45, pp 386-393, doi:10.1021/ci0496797.
Note
A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.
The expected input is a SMILES string, in canonical form. This is used to generate nmer substrings, each called a k-LINGO, where k is the length of the substring. If the input is “CC=O” and k=1 then there are 2 “C”, 1 “=”, and 1 “O” 1-LINGO terms. With k=2 the 2-LINGO terms are 1 “CC”, 1 “C=”, and 1 “=O”.
To convert a LINGO to the 64-bit feature integer in chemfp’s count fingerprints, the LINGO byte values are used directly, with each character position contributing one byte. The LINGO “C” is 67 because 67 is the ASCII value of “C”, and the LINGO “C=” is 17213, which is 67*256+61 where 61 is the ASCII value of “=”.
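The two steps above can be sketched in a few lines of Python. The function names are illustrative and are not chemfp’s API, and the sketch applies no SMILES normalization.

```python
from collections import Counter


def lingo_counts(smiles, k):
    """Count every length-k substring (k-LINGO) of the SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def lingo_to_index(lingo):
    """Interpret the ASCII bytes of a LINGO as a big-endian integer."""
    return int.from_bytes(lingo.encode("ascii"), "big")


print(dict(lingo_counts("CC=O", 1)))  # {'C': 2, '=': 1, 'O': 1}
print(dict(lingo_counts("CC=O", 2)))  # {'CC': 1, 'C=': 1, '=O': 1}
print(lingo_to_index("C"))            # 67
print(lingo_to_index("C="))           # 17213
```

With one byte per character, a LINGO of up to 8 characters fits in a 64-bit feature index.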
The new command-line tool chemfp lingo2fpc converts a SMILES file to LINGO count fingerprints. For examples, see the chemfp lingo2fpc introduction.
There are a few ways to generate a similarity score for LINGO fingerprints. The Vidal paper used the integral count Tanimoto to compare two LINGO holograms. Grant, Haigh, Pickup, Nicholls, and Sayle, “Lingos, Finite State Machines, and Fast Similarity Searching”, J. Chem. Inf. Model. (2006) 46, pp 1912-1918, doi:10.1021/ci6002152 suggested using the multiset/count Tanimoto.
Chemfp uses superimposed coding to convert count fingerprints to binary fingerprints, such that a Tanimoto similarity search of the binary fingerprints is a good approximation to the count Tanimoto of the original count fingerprints.
The new command-line tool chemfp lingo2fps converts a SMILES file to superimposed LINGO byte fingerprints. For examples of how to generate these fingerprints, see the Generating LINGO byte fingerprints section. The Superimposed LINGO similarity search section includes an example of how to search ChEMBL 36.
The input SMILES strings are not necessarily mapped directly to LINGO nmers. Vidal et al. proposed two normalizations: 1) convert all ring closure terms to “0”, and 2) convert “Br” to “R” and “Cl” to “L”. Grant et al. used the first of these but not the second.
Chemfp supports both normalizations but only does closure normalization by default. It also has options to select different nmer lengths. These details are stored in the type field of the output fingerprint metadata. Follow the above links for more details.
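For readers who want to experiment, here is a rough sketch of the two normalizations using a regular expression. It is not chemfp’s implementation. The function and parameter names are hypothetical, and two choices are assumptions of this sketch: bracket atoms are left untouched, and a “%nn” ring closure becomes a single “0”.

```python
import re

# Bracket atoms are matched first so that digits inside them (isotopes,
# charges, hydrogen counts) are not mistaken for ring closures.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|\d|Br|Cl")


def normalize_smiles(smiles, closures=True, halogens=False):
    """Apply the ring closure and/or Br/Cl normalizations to a SMILES."""
    def replace(match):
        token = match.group(0)
        if token.startswith("["):
            return token  # assumption: bracket atoms are left as-is
        if token == "Br":
            return "R" if halogens else token
        if token == "Cl":
            return "L" if halogens else token
        # assumption: a "%nn" closure is replaced by a single "0"
        return "0" if closures else token
    return _TOKEN.sub(replace, smiles)


print(normalize_smiles("c1ccccc1O"))                    # c0ccccc0O
print(normalize_smiles("Clc1ccccc1Br", halogens=True))  # Lc0ccccc0R
```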
FPC format and fpc2fps conversion changes
The FPC format specification has been changed so that a count of 0 is not allowed. Chemfp 5.0 supported it. That option was removed to make it easier to count the number of non-zero features in the fingerprint. If the fingerprint field is “*” then there are no features, otherwise it is one more than the number of commas.
The specification now explicitly includes “num_bits” in the metadata. In the FPS format this means the number of bits in the binary fingerprint. In the FPC format this means the number of feature indices in the count fingerprint. This is needed because each RDKit fingerprint also stores the total number of possible bits, and two fingerprints can be compared if and only if they have the same size. Without this information it is impossible to round-trip RDKit fingerprints to a file.
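The counting rule can be written as a one-line function. The function name is made up for this sketch, and it assumes the field has already been validated as FPC.

```python
def num_features(field):
    """Number of features in an FPC fingerprint field."""
    return 0 if field == "*" else field.count(",") + 1


print(num_features("*"))          # 0
print(num_features("6:2,15,34"))  # 3
```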
Chemfp has been updated to read and write the num_bits. Note that RDKit’s UIntSparseIntVect uses 2**32-1 to mean 2**32 bits, and RDKit’s ULongSparseIntVect uses 2**64-1 to mean 2**64 bits. These are special cases in RDKit. Chemfp does not currently adjust the size to reflect the actual number of bits possible.
The “fold” count conversion method in fpc2fps now supports a
--hash option. RDKit topological torsion indices are poorly
distributed after modulo folding. This is because
the first 9 bits (or 11 bits with --includeChirality 1) contain
type information for the first atom, the second 9 (or 11) bits
contain the type information for the second, and so on. Modulo
folding at 1024 bits means that only the first 10 bits are used. For
topological torsions this means only the first atom type and perhaps
one bit of the second type are used – the rest are ignored! Instead,
with --hash 1 the index is used to seed a pseudo-random number
generator to generate the value to hash. This gives a much better
distribution of hashed values.
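The folding problem is easy to reproduce with made-up numbers. The sketch below packs four hypothetical 9-bit atom-type codes into one index, following the layout described above, and compares modulo folding with a position drawn from a pseudo-random number generator seeded by the index. It illustrates the idea only. The atom-type values are invented, and chemfp’s actual hash function is not shown here.

```python
import random

NUM_BITS = 1024


def pack_torsion(a1, a2, a3, a4, width=9):
    """Pack four atom-type codes into one index, first atom in the low bits."""
    return a1 | (a2 << width) | (a3 << (2 * width)) | (a4 << (3 * width))


def fold_modulo(index, num_bits=NUM_BITS):
    """Plain modulo folding: keeps only the low-order bits of the index."""
    return index % num_bits


def fold_hashed(index, num_bits=NUM_BITS):
    """Seed a PRNG with the index and draw the bit position from it."""
    return random.Random(index).randrange(num_bits)


t1 = pack_torsion(17, 4, 33, 200)
t2 = pack_torsion(17, 4, 90, 5)   # same first atom, different third and fourth
print(fold_modulo(t1), fold_modulo(t2))  # 17 17
print(fold_hashed(t1), fold_hashed(t2))  # positions depend on the whole index
```

Modulo folding maps both torsions to the same bit because everything above the low 10 bits is discarded, while the seeded generator uses every bit of the index.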
The “superimpose” count conversion in fpc2fps now uses a maximum
count, with a default of 1000, as an upper bound for the feature
count. Without it, a fingerprint like “1:4294967295” caused the
converter to take a very long time to generate a byte fingerprint with
all bits set to 1. Use --max-count to change the limit.
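To show why an upper bound matters, here is a toy version of superimposed coding. It is not chemfp’s algorithm, and the function and parameter names are hypothetical. Each feature index seeds a pseudo-random number generator, and each unit of count sets one pseudo-randomly chosen bit, so the running time grows with the count unless it is capped.

```python
import random


def superimpose(features, num_bits=128, max_count=1000):
    """Toy superimposed coding of a {index: count} dict into an integer."""
    bits = 0
    for index, count in features.items():
        rng = random.Random(index)            # same index, same bit sequence
        for _ in range(min(count, max_count)):
            bits |= 1 << rng.randrange(num_bits)
    return bits


def popcount(value):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


fp = superimpose({1: 5, 8: 3, 10: 1})
print(popcount(fp))                          # at most 9, fewer if bits collide
print(popcount(superimpose({1: 4294967295})))  # fast because of the cap
```

In this toy version the bits for a count of 2 are always a subset of the bits for a count of 3 of the same feature, which is what lets a binary Tanimoto approximate the count Tanimoto.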
Note
While the superimpose conversion method in chemfp requires a valid license key to generate more than 50,000 fingerprints in the same process, there is a special exception for fpc2fps which allows unlimited conversion.
The “rdkit-count-sim” method name in fpc2fps was changed to “rdkit-countsim”, to match the use in the paper.
Other changes
A change in RDKit 2025.09.4 caused some of the SECFP fingerprints to change. Chemfp refers to these new fingerprints as “RDKit-SECFP/2” and the older ones as “RDKit-SECFP/1”.
Chemfp 5.1 added support for Python 3.14, NumPy 2.4, and click 8.2. The changes were minimal and limited to the test suite. The “license” field in the pyproject.toml file was updated to reflect a breaking change to the format.
Chemfp 5.1 is the last version of chemfp to support Python 3.9, which reached its end-of-life (that is, support by the Python core developers) on 31 October 2025.
If the environment variable CHEMFP_CYTHONIZE is “1” then setup.py will always cythonize the .pyx files to C, instead of using timestamps. This is useful if you’ve upgraded Cython and want to rebuild the cythonized C extensions.
The FPC parser in chemfp 5.0 used terms like “bit number” for the feature index. This was visible in the error messages like “Sparse bit number must be < 2**64”. These have been replaced with “feature index”, such as “Feature index must be < 2**64”.
Finally, there are a number of bug fixes, corrected typos, and other small changes that aren’t worth noting.