Command-line examples for sparse count fingerprints¶
The sections in this chapter show examples of using the chemfp command-line tools to generate and work with sparse count fingerprint files. Examples of using the command-line tools for binary fingerprints are in their own chapter.
Chemfp 5.0 added initial support for sparse count fingerprints. Previous versions only supported binary fingerprints. Even with version 5.0 sparse count fingerprint support is focused on command-line tools to generate and convert sparse count fingerprints to binary fingerprints.
In particular, chemfp 5.0 does not support direct similarity search of sparse count fingerprints. See below for examples of how to convert them to binary fingerprints for use with simsearch.
This chapter will start with the basics of count fingerprints, the FPC format, and some details about count fingerprint similarity before getting to examples of the command-line tools for generating count fingerprints with RDKit, converting count fingerprints to binary, and converting binary fingerprints to count.
Sparse fingerprints quick start¶
In this section you’ll learn how to generate Morgan count fingerprints for ChEMBL 36, convert them to binary fingerprints using the “superimpose” method, and search the results with simsearch. You will need a copy of chembl_36.sdf.gz from the ChEMBL 36 release.
This is a quick overview. The rest of this chapter goes into the details.
Chemfp 5.0 has initial support for sparse fingerprints. You can use chemfp rdkit2fpc to generate sparse count fingerprints in FPC format and chemfp fpc2fps to convert sparse count fingerprints to binary fingerprints, which can then be used for similarity search, clustering, and the other chemfp components and APIs.
ChEMBL 36 came out a couple of days ago, as I write this documentation. I’ll work with that dataset, which means downloading it:
% curl -LO https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_36/chembl_36.sdf.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 893M 100 893M 0 0 2053k 0 0:07:25 0:07:25 --:--:-- 2128k
The chemfp rdkit2fpc command uses RDKit to generate sparse count fingerprints using one of the four available RDKit sparse fingerprint generators. The default generates Morgan fingerprints with radius 3. I’ll save the output to “chembl_36_morgan3.fpc”.
% chemfp rdkit2fpc chembl_36.sdf.gz -o chembl_36_morgan3.fpc
chembl_36.sdf.gz: 100%|██████████| 936M/936M [15:30<00:00, 1.01Mbytes/s]
(I omitted the many RDKit warning messages.)
The FPC file starts with a header and then text-encoded sparse fingerprint records:
% head chembl_36_morgan3.fpc | head -6 | fold -w75
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#source=chembl_36.sdf.gz
#date=2025-09-23T12:05:42+00:00
8819703,10565946:6,21411075,74537039:3,98513984:3,106928295,155357245,20080
3840:2,202916677,203696813:2,205091775,209539530,275844649:3,348315680,3627
36814,416356657,426185235,481036053:3,552744685,584893129:2,588615134,60371
2107,614696445,632355422,705633911:2,725338437,732650486,733416567:4,739682
264:2,742000539,751208592:3,765252582,786270123:2,787341104,831463242,84020
3949,847336149:3,847957139:13,847961216:24,851463431:2,861569459:2,86466231
1:3,864942730:38,867780956,868515887:2,881868678:3,891329428:2,898782895,90
4435074,912632640:2,919730633,945676240,951226070:2,951664601,963360736,993
357377:2,1022303596:2,1064745320,1067601182,1091317056:2,1092185181,1100037
548:2,1112379768,1135998617:2,1141162250:4,1143249862,1156835409,1160986729
:6,1167322652:5,1167510932,1173125914:2,1177897542:2,1178235770,1182622762:
2,1241519671,1248171218,1256884789:4,1259710714:2,1276993226,1288546412:2,1
355146676:2,1362518133:23,1368548858:3,1453891735,1475008087:2,1506563592:2
,1506993418,1510328189:32,1510461303:7,1514288082:3,1533864325:2,1534122513
,1535166686,1582607016:5,1583799011:9,1618787606,1618797937:4,1685248591:2,
1693331843,1732175000,1739265633:2,1750892114:9,1754284222,1797524254:2,184
0891614:2,1842658656,1845470297,1858577693:2,1868611658,1884205411:2,189945
4565:2,1927060881,1931762473,1960485383,1962383056:3,2007823391:2,201508900
2,2015594738:2,2028220934,2031636797,2041434490:2,2056290811,2089752391,209
5932647,2096521477,2100423615,2117068077:2,2132511834:8,2133811111:5,214203
2900:2,2150151661,2191273952:17,2221177404,2225186000:2,2231929377:3,224527
3601:32,2245384272:41,2246699815:35,2246728737:22,2259647190:5,2264318846,2
289501320:2,2299784278:17,2319182106,2333272823:5,2343062145,2361265913:2,2
405469776:2,2423543607:5,2442018706,2448572767:3,2498288868:2,2520548245,25
34441460,2591432844:11,2592785365,2599973650,2604604876,2637439965:2,264906
3844,2654043257:2,2666857930:5,2678918872:3,2684694360,2697110228:2,2752034
647:2,2763854213,2806018737,2832976762,2843304388,2863307117:2,2867688340:2
,2892360967,2929652889,2939120473,2940239131,2941064859:2,2964009977,296896
8094:6,2976033787:6,2981789305,2989341071:7,3004333805,3008901642:3,3099695
679:2,3135357859:2,3143227007,3147457595,3149497025:2,3182041044:3,31821772
90:5,3201831218,3203925050,3217380708:9,3218693969:9,3234104871,3241680715,
3261096889,3262357651,3272226737:3,3284700855:2,3296404462:2,3303793604,331
4130824,3317330686,3328145258:4,3332377904,3338734523:2,3344893792,33661738
70,3392469258,3447215649:2,3466781987,3506165101,3510196525:5,3537119515:16
,3542456614:4,3556458277:2,3561287756,3684238839,3718957586,3791102067:2,38
21303249,3850635461:2,3916616716:5,3969756582,3985977119:2,3999906991:2,402
2716898,4031920000,4037464357,4057379760,4066851934:3,4070780135,4078658161
:2,4081006743,4086993724,4121755354,4124858218:4,4181883701,4222851645:23,4
223976160:18,4264485148,4274980665,4278941385:2 CHEMBL440245
Chemfp 5.0 does not support direct sparse count fingerprint similarity search, only an indirect search after converting the sparse count fingerprints to binary fingerprints and searching the binary fingerprints.
The chemfp fpc2fps command implements several methods to convert the sparse count fingerprints to (dense) binary fingerprints. The default is “superimpose”, which uses superimposed coding to distribute the feature id and its counts across the fingerprint.
% chemfp fpc2fps chembl_36_morgan3.fpc -o chembl_36_morgan3_superimpose.fpb
chembl_36_morgan3.fpc: 100%|███████| 2.37G/2.37G [00:36<00:00, 65.8Mbytes/s]
The result is an FPB file with fingerprints sorted by increasing popcount (the fingerprints with the fewest on-bits come first).
% fpcat chembl_36_morgan3_superimpose.fpb | head -5 | fold -w75 -s
#FPS1
#num_bits=2048
#type=RDKit-MorganCount/2 radius=3 useFeatures=0 | superimpose/1
num_bits=2048
#software=RDKit/2024.09.5 chemfp/5.0b2
000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000004000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000
CHEMBL4300465
Searching requires going through the same steps to generate binary query fingerprints before using them to search the targets. The following creates a SMILES file with two records, then uses a pipeline to generate the two corresponding binary fingerprints, in FPS format, for a k=10 nearest-neighbor similarity search.
% cat > queries.smi
c1ccccc1O phenol
C1=CC=C(C=C1)NC(=O)C2=CC(=CC(=C2)N)C(=O)NC3=CC=CC=C3 PubChem_C617508
% chemfp rdkit2fpc queries.smi | \
chemfp fpc2fps | \
simsearch chembl_36_morgan3_superimpose.fpb -k 10 --out csv
query_id,target_id,score
phenol,CHEMBL14060,1.0000000
phenol,CHEMBL16068,0.5172414
phenol,CHEMBL16070,0.5172414
phenol,CHEMBL116296,0.5172414
phenol,CHEMBL119405,0.5172414
phenol,CHEMBL5186653,0.5172414
phenol,CHEMBL9113,0.5172414
phenol,CHEMBL538,0.5172414
phenol,CHEMBL16200,0.5172414
phenol,CHEMBL2107497,0.5000000
PubChem_C617508,CHEMBL4526222,0.8061224
PubChem_C617508,CHEMBL4534018,0.7722772
PubChem_C617508,CHEMBL4571777,0.7647059
PubChem_C617508,CHEMBL4555047,0.7500000
PubChem_C617508,CHEMBL4557884,0.7428571
PubChem_C617508,CHEMBL4465008,0.6730769
PubChem_C617508,CHEMBL188775,0.6476190
PubChem_C617508,CHEMBL4438910,0.6393443
PubChem_C617508,CHEMBL1378530,0.6074766
PubChem_C617508,CHEMBL2158387,0.6036036
We can compare the search results with the folded Morgan3 fingerprints (folding ignores the
feature count) by converting the existing count fingerprints rather
than re-processing the structures. These fingerprints are identical to
the default (--morgan3) binary fingerprints generated
by rdkit2fps.
% chemfp fpc2fps -m fold chembl_36_morgan3.fpc -o chembl_36_morgan3_fold.fpb
chembl_36_morgan3.fpc: 100%|██████████| 2.37G/2.37G [00:45<00:00, 52.5Mbytes/s]
% chemfp rdkit2fpc queries.smi | \
chemfp fpc2fps -m fold | \
simsearch chembl_36_morgan3_fold.fpb -k 10 --out csv
query_id,target_id,score
phenol,CHEMBL14060,1.0000000
phenol,CHEMBL73380,0.4400000
phenol,CHEMBL537,0.4375000
phenol,CHEMBL24147,0.4210526
phenol,CHEMBL3769799,0.4166667
phenol,CHEMBL495708,0.3928571
phenol,CHEMBL280998,0.3888889
phenol,CHEMBL3039845,0.3809524
phenol,CHEMBL79759,0.3793103
phenol,CHEMBL224318,0.3793103
PubChem_C617508,CHEMBL4438910,0.7567568
PubChem_C617508,CHEMBL4526222,0.6363636
PubChem_C617508,CHEMBL475680,0.6000000
PubChem_C617508,CHEMBL39258,0.5957447
PubChem_C617508,CHEMBL4534018,0.5957447
PubChem_C617508,CHEMBL4557884,0.5957447
PubChem_C617508,CHEMBL4571777,0.5957447
PubChem_C617508,CHEMBL4555047,0.5833333
PubChem_C617508,CHEMBL277484,0.5777778
PubChem_C617508,CHEMBL394811,0.5714286
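To make "folding ignores the feature count" concrete, here is a minimal Python sketch. It assumes that folding maps each feature id to the bit position `feature_id % num_bits` and sets that bit no matter how large the count is. The modulo rule, the helper names `fold_features` and `binary_tanimoto`, and the feature ids are all assumptions made for illustration; this is not a description of how `chemfp fpc2fps -m fold` is implemented.

```python
def fold_features(features, num_bits=2048):
    """Fold {feature_id: count} into a set of on-bit positions.

    Assumption: bit position = feature_id % num_bits. Counts are discarded,
    so a feature seen 6 times sets the same single bit as one seen once.
    """
    return {fid % num_bits for fid, count in features.items() if count > 0}


def binary_tanimoto(bits1, bits2):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = len(bits1 | bits2)
    return len(bits1 & bits2) / union if union else 0.0


# Feature ids and counts are made up for this sketch.
query = {10: 6, 2058: 1, 4000: 2}    # 10 and 2058 collide: 2058 % 2048 == 10
target = {10: 1, 4000: 9, 70000: 1}

query_bits = fold_features(query)
target_bits = fold_features(target)
print(sorted(query_bits), sorted(target_bits))
print(binary_tanimoto(query_bits, target_bits))
```

Because counts are dropped and distinct feature ids can collide on the same bit, the folded similarity can differ noticeably from a count-based one, which is consistent with the different rankings in the two search outputs above.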
RDKit’s binary fingerprint generator also implements count simulation, which is a different way to incorporate counts in the generated fingerprints. This is available as rdkit2fps --countSimulation=1, or by converting the FPC file using the “rdkit-count-sim” (or “rdkit” for short) method:
% chemfp fpc2fps -m rdkit chembl_36_morgan3.fpc -o chembl_36_morgan3_count_sim.fpb
chembl_36_morgan3.fpc: 100%|██████████| 2.37G/2.37G [02:30<00:00, 15.8Mbytes/s]
% chemfp rdkit2fpc queries.smi | \
chemfp fpc2fps -m rdkit | \
simsearch chembl_36_morgan3_count_sim.fpb -k 10 --out csv
query_id,target_id,score
phenol,CHEMBL14060,1.0000000
phenol,CHEMBL5186653,0.4800000
phenol,CHEMBL320474,0.4642857
phenol,CHEMBL16068,0.4615385
phenol,CHEMBL16070,0.4615385
phenol,CHEMBL116296,0.4615385
phenol,CHEMBL119405,0.4615385
phenol,CHEMBL9113,0.4615385
phenol,CHEMBL538,0.4615385
phenol,CHEMBL16200,0.4615385
PubChem_C617508,CHEMBL4438910,0.7763158
PubChem_C617508,CHEMBL4526222,0.7375000
PubChem_C617508,CHEMBL4555047,0.7023810
PubChem_C617508,CHEMBL4534018,0.7023810
PubChem_C617508,CHEMBL4571777,0.7023810
PubChem_C617508,CHEMBL4557884,0.6860465
PubChem_C617508,CHEMBL4465008,0.6385542
PubChem_C617508,CHEMBL188775,0.6219512
PubChem_C617508,CHEMBL5009859,0.6210526
PubChem_C617508,CHEMBL1378530,0.5697674
I personally think the superimpose method is better at approximating the original count fingerprints than both the folded fingerprints and RDKit’s “count simulation”, but that analysis is preliminary.
What are sparse count fingerprints?¶
In this section you’ll learn about the basics of sparse count fingerprints and the chemfp nomenclature.
Chemfp started as a way to promote the FPS format as a text-based exchange format for binary fingerprints. It later added the FPB format, which is faster to load because it was designed to be read by machines, not humans.
By “binary fingerprint” I mean a fixed-length list of bits which are either 0 or 1 for a given bit position, and where the list is usually no more than a few thousand possible bit positions.
A count fingerprint uses a non-negative integer count for each position rather than a binary value.
For example, we know of 118 elements, so an element count fingerprint for a molecule could store the number of carbons at the sixth position, the number of oxygens at the eighth, and so on. A molecule like dibromoboranylformonitrile with molecular formula C14H29BBr2N2 might be represented as 29,0,0,0,1,14,2,0,…2,…0, starting with the 29 hydrogens.
These vectors are long, and most of those values are 0, which is why we use the more compact Hill system to represent the molecular formula with only the elements which are present, mapped to their count.
In chemfp usage, a count fingerprint contains zero or more features, where each feature has a feature id (a non-negative integer) and a feature count (also a non-negative integer). For example, feature id 6 might be the number of carbon atoms in a molecule, and feature id 8 might be the number of oxygen atoms.
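As a rough illustration of the difference between the dense vector and the feature id/feature count view, the following Python sketch stores the element counts from the C14H29BBr2N2 example both ways. The variable names `sparse` and `dense` are chosen for this sketch only; it says nothing about how chemfp stores fingerprints internally.

```python
# Element counts for C14H29BBr2N2, keyed by atomic number (the feature id).
# Illustration only; this is not chemfp's internal representation.
sparse = {1: 29, 5: 1, 6: 14, 7: 2, 35: 2}

# The equivalent dense vector has one slot per known element.
NUM_ELEMENTS = 118
dense = [0] * NUM_ELEMENTS
for atomic_number, count in sparse.items():
    dense[atomic_number - 1] = count  # index 0 holds hydrogen

num_features = len(sparse)          # feature ids with a non-zero count
total_count = sum(sparse.values())  # sum of all feature counts
print(num_features, total_count, dense[:8])
```

The sparse form needs five entries while the dense form needs 118, almost all of them zero. The gap grows quickly when the feature ids come from a much larger range.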
In practice, most count fingerprints are sparse, that is, the feature count for most feature ids for a record is zero. For example, RDKit’s sparse Morgan fingerprint generator can generate feature ids between 0 and 2⁶⁴-1 but when I processed ChEMBL 33 I found the average number of features per record was 71 (median of 68, maximum of 451), the average feature count was 1.5 (median of 1, maximum of 252), and the average total feature count per record was 105 (median of 98, maximum of 2424).
All of these numbers are far, far smaller than the 18 quintillion of 2⁶⁴, making for a very sparse fingerprint.
I tend to use “count fingerprint”, “sparse fingerprint” and “sparse count fingerprint” interchangeably, but that’s not quite correct. The difference between dense and sparse count fingerprints is mostly an implementation detail which affects memory use and performance. If you have short, dense, fixed-length fingerprints then you might look at vector search tools, which are optimized for this case.
When you see “count fingerprints” in this documentation, assume I mean sparse count fingerprint unless I specifically say “dense count fingerprints”.
The FPC format¶
In this section you’ll learn about the FPC format.
The FPC format is a text-based exchange format for sparse count fingerprints. Here is an example:
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-09-09T08:43:12+00:00
864666390:6,3753451792,3855292234:2 CHEMBL3185229
* empty
It is a line-oriented format with a header followed by sparse count fingerprint record lines. The header lines all start with “#”. The record lines never start with “#”. The header contains an optional format identification line followed by 0 or more metadata lines with information about the fingerprints.
Each header line is a “#name” = “value” pair. Some names have specified meanings, like how “#type” describes the fingerprint type.
Each record line contains tab-separated columns. The first column contains the fingerprint and the second the identifier. Any additional columns are currently ignored in chemfp.
The FPC format is almost identical to the FPS format for binary fingerprints except the
format identification line #FPS1 is replaced by #FPC1, the
number of bits metadata line (#num_bits) is not used, and the
fingerprint field uses a different encoding method.
If a fingerprint has at least one feature then the fingerprint is encoded as a comma-separated sequence of encoded features, ordered by increasing feature id. Each feature is encoded as the feature id followed by a colon followed by the count. The colon and count can be omitted if the count is 1.
In the above example, CHEMBL3185229 has three features:
feature 864666390 occurs six times;
feature 3753451792 occurs once (which is why the “:1” is not needed);
feature 3855292234 occurs twice.
If the fingerprint has no features then it is encoded as “*”, as shown in the second record above, which has an id of “empty” because I generated it using an input SDF structure containing no atoms.
The empty string is not a valid count fingerprint encoding in FPC format.
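The record encoding rules in this section can be sketched in a few lines of Python. This is a minimal illustration based only on the description here, not chemfp's own parser: the function names `parse_fpc_fingerprint`, `format_fpc_fingerprint`, and `parse_fpc_record` are hypothetical, and the code does very little validation (for example, it does not check that feature ids are in increasing order).

```python
def parse_fpc_fingerprint(text):
    """Decode the fingerprint field of an FPC record into {feature_id: count}."""
    if text == "*":
        return {}  # "*" encodes a fingerprint with no features
    if not text:
        raise ValueError("the empty string is not a valid FPC fingerprint")
    features = {}
    for term in text.split(","):
        feature_id, sep, count = term.partition(":")
        # A missing ":count" means the count is 1.
        features[int(feature_id)] = int(count) if sep else 1
    return features


def format_fpc_fingerprint(features):
    """Encode {feature_id: count} as an FPC fingerprint field."""
    if not features:
        return "*"
    return ",".join(
        str(fid) if count == 1 else f"{fid}:{count}"
        for fid, count in sorted(features.items())
    )


def parse_fpc_record(line):
    """Split a tab-separated record line into (features, identifier)."""
    columns = line.rstrip("\n").split("\t")
    return parse_fpc_fingerprint(columns[0]), columns[1]  # extra columns ignored


line = "864666390:6,3753451792,3855292234:2\tCHEMBL3185229\n"
features, identifier = parse_fpc_record(line)
print(identifier, features)
print(format_fpc_fingerprint(features))
```

Formatting the parsed features reproduces the original fingerprint field, which is a quick way to check that the omitted ":1" and the "*" special case are handled consistently.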
The two count Tanimoto similarities¶
In this section you’ll learn about the two ways people have extended the Tanimoto similarity to count fingerprints, and the choice chemfp uses.
I didn’t even know there were two ways until late August 2025, while updating the chemfp documentation for the 5.0 release. That caused me to dig into the history, which I’ll share here.
Vector Tanimoto¶
In 1986 Peter Willett, Vivienne Winterman, and David Bawden showed that the Tanimoto similarity of fragment screens, previously used to improve substructure search performance, performed well as a method to estimate molecular similarity. This was before the term “fingerprint” was even coined.
Their early papers defined the Tanimoto similarity for count fingerprints X and Y as Σxᵢyᵢ/(Σxᵢ²+Σyᵢ²-Σxᵢyᵢ), or when written using the vector dot product, X⋅Y/(X⋅X+Y⋅Y-X⋅Y). They used binary fragment screens, where xᵢ and yᵢ are either 0 or 1, giving the now familiar simplified equation C/(A+B-C), where A and B are the number of 1 terms in respectively X and Y, and C is the number of 1 terms common to both X and Y.
For example, given the vectors X = (2, 3, 0) and Y = (3, 1, 1) then
Σxᵢyᵢ = C = 2*3 + 3*1 + 0*1 = 9
Σxᵢ² = A = 2*2 + 3*3 + 0*0 = 13
Σyᵢ² = B = 3*3 + 1*1 + 1*1 = 11
vector Tanimoto = C/(A+B-C) = 9/(13+11-9) = 9/15 = 3/5
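The same arithmetic can be checked with a short Python function. This is a minimal sketch of the formula X⋅Y/(X⋅X+Y⋅Y-X⋅Y) for two equal-length count vectors; the name `vector_tanimoto` and the choice to return 0.0 when both vectors are all zeros are assumptions made for this example.

```python
def vector_tanimoto(x, y):
    """Vector Tanimoto X.Y / (X.X + Y.Y - X.Y) for equal-length count vectors."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    denominator = xx + yy - xy
    # Two all-zero vectors give 0/0; returning 0.0 is a convention
    # chosen for this sketch.
    return xy / denominator if denominator else 0.0


print(vector_tanimoto((2, 3, 0), (3, 1, 1)))  # 9/15
print(vector_tanimoto((1, 1, 0), (1, 0, 1)))  # binary case: C/(A+B-C) = 1/3
```

When every value is 0 or 1 the squares equal the values themselves, so the function reduces to the familiar binary Tanimoto.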
I’ve read those foundational papers several times over the years but it wasn’t until yesterday (2025-09-10 to be precise) that I noticed they also considered count fingerprints, along with the vector Tanimoto interpretation, in their evaluation.
Multiset Tanimoto¶
I prefer the multiset Tanimoto, and in an informal survey of 9 mostly RDKit users only 2 did not, both citing the long-time use of the vector definition in publications.
The binary Tanimoto can also be written in set notation as |X∩Y|/|X∪Y|, that is, the number of elements in common between X and Y divided by the number of elements in the union.
A multiset is a modification of a set where set elements also have a multiplicity, like X = {a, a, b, b, b} where “a” is present two times and “b” is present three times, and where set operations include multiplicity.
For example, given the multisets X = {a, a, b, b, b} and Y = {a, a, a, b, c}, X∩Y = {a, a, b} because “a” is present at least two times in both X and Y, “b” is present at least one time in both X and Y, while “c” is not included as it is not present in both X and Y. Similarly, X∪Y = {a, a, a, b, b, b, c} because “a” is most common in Y (three times), “b” is most common in X (three times), and “c” is most common in Y (one time), even though it’s not present in X. That means |X∩Y| = 3 and |X∪Y| = 7.
This gives a multiset Tanimoto similarity of 3/7.
When expressed as a count vector, the multiset Tanimoto is Σmin(xᵢ,yᵢ)/Σmax(xᵢ,yᵢ). It can also be written in C/(A+B-C) form where C = Σmin(xᵢ,yᵢ), A = Σxᵢ and B = Σyᵢ.
RDKit uses the multiset Tanimoto:
>>> from rdkit import DataStructs
>>> fp1 = DataStructs.IntSparseIntVect(3)
>>> fp1[0], fp1[1], fp1[2] = (2, 3, 0)
>>> fp2 = DataStructs.IntSparseIntVect(3)
>>> fp2[0], fp2[1], fp2[2] = (3, 1, 1)
>>> DataStructs.TanimotoSimilarity(fp1, fp2)
0.42857142857142855
>>> 3/7
0.42857142857142855
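The same value can be reproduced without RDKit using the standard library's `collections.Counter`, whose `&` and `|` operators implement multiset intersection (minimum of counts) and union (maximum of counts). The function name `multiset_tanimoto` and the 0.0 result for two empty multisets are choices made for this sketch.

```python
from collections import Counter


def multiset_tanimoto(x, y):
    """Multiset Tanimoto: sum of min counts divided by sum of max counts."""
    x, y = Counter(x), Counter(y)
    intersection = sum((x & y).values())  # Counter & keeps the minimum count
    union = sum((x | y).values())         # Counter | keeps the maximum count
    return intersection / union if union else 0.0


# The multiset example: X = {a, a, b, b, b}, Y = {a, a, a, b, c}
print(multiset_tanimoto("aabbb", "aaabc"))

# The same data as feature id -> count mappings, as in the RDKit example
print(multiset_tanimoto({0: 2, 1: 3}, {0: 3, 1: 1, 2: 1}))
```

Both calls give 3/7, matching the RDKit result and the C/(A+B-C) form with C = 3, A = 5, and B = 5.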
History¶
I was curious about the publication history of these two formulations so dug into the literature.
It’s pretty well known that what we call the Tanimoto similarity is the same as Paul Jaccard’s coefficient de communauté, introduced in the early 1900s and used to identify similarity between Alpine (and near-Alpine) regions based on distinct species found in each region. Less well known is that Jaccard’s papers never expressed his eponymous similarity in mathematical form. He shows the calculation for one pair in his 1901 paper, and a summary textual description in his 1907 paper, which was included in the 1912 English translation.
Jaccard was not the first. Indeed, history shows that many similarity indices have been independently discovered multiple times. As I learned from the Wikipedia entry for Jaccard index, it’s the same as Grove Karl Gilbert’s ratio of verification from 1884, which was part of the “Finley affair” in meteorology.
Going forward in time, in 1920, Henry Allan Gleason argued “Jaccard’s method fails to take account of the much greater importance of some abundant species, and the resulting error of computation may be obviated, in part at least, by weighting each species with its frequency index.” He proposed a quantitative interpretation of Jaccard’s formula, which is identical to the multiset Tanimoto. While I found some citations to his work, it seems rarely used. Gleason himself pointed out one flaw – size should also be a factor. In my interpretation, an elephant and an ant shouldn’t be weighed equally.
Gleason’s quantitative extension of Jaccard’s index is also referred to as the Ruzicka index after a 1958 paper by Milan Ružička, Anwendung mathematisch-statistischer Methoden in der Geobotanik (synthetische Bearbeitung von Aufnahmen), Biologia, Bratislava, 13 (1958), pp. 647-661. As far as I can tell they are identical.
In the late 1950s the IBM mathematician Taffee T. Tanimoto worked with the cassava research pioneer David J. Rogers of the New York Botanical Gardens to develop a computer-based method to classify plants. In 1958 Tanimoto wrote an article for an internal IBM publication, which is sometimes cited but hard to find. To get my copy I had to pay a library to send me a photocopy. In it he defines the “similarity coefficient between two objects bⱼ and bₕ with respect to the set of attributes A by sⱼₕ = N(Bⱼ ∩ Bₕ) / N(Bⱼ ∪ Bₕ)”. There is no reason to think this is anything other than yet another independent discovery, with one notable difference: it also appears to be the first time the Jaccard similarity was written in set notation.
Rogers and Tanimoto published their joint work in Science in 1960. It starts with a lovely discussion about plant taxonomy, followed by how they collected and encoded the data, and ending with how they carried out the clustering. I’ll let you guess who wrote which sections. It describes the similarity value s₁₂ as “the ratio of the number of attributes in common in cases 1 and 2 to the number of distinct attributes possessed by cases 1 and 2 (7)”, and lets the reader make the obvious extension to the general form sᵢⱼ for cases i and j.
In the 1970s, Adamson and Bush investigated a few approaches to classify small molecules. In their 1975 paper they generate two vector types based on the augmented atoms for each of a set of molecules with anesthetic activity. Vector (i) is a binary vector based on additive coding and vector (ii) a count vector. They cite six references containing possible similarity coefficients, including Rogers and Tanimoto (1960), and chose three (not the Tanimoto!) for their evaluation. They concluded that the simple distances for the binary vector and Euclidean distances for the count vector “give more satisfactory results than functions using probabilistic weighting or standardized distance.”
The Sheffield group revisited this work in the mid-1980s, in much more depth, leading to the publications by Willett, Winterman and Bawden. (Willett told me he had sat down with the calculations and was the first to realize that the Tanimoto gave the best results. The publication history doesn’t record that detail.) Willett and Winterman compared six weighting schemes and six similarity coefficients, including the Tanimoto of a binary vector encoding if a given augmented atom exists or not, and the vector Tanimoto of the corresponding count fingerprint, with “the cosine, Tanimoto and correlation coefficients giving rather better results”. However, these differences were small, and they write “the selection for some particular application might most usefully be carried out on grounds such as the ease of computation.”
The easiest method to compute is the Tanimoto, and the easiest values to use are the binary screens already generated and available as substructure screens, leading to the first published paper on online Tanimoto nearest-neighbor search. The method very quickly became a core tool in cheminformatics, with a large number of publications in the 1990s as people explored different similarity scores and clustering methods, applied to ever larger datasets.
As you see, count fingerprints were present even in the beginning, but quickly overshadowed by binary fingerprints. For almost two decades the vector Tanimoto was used as the extension of the binary Tanimoto to count fingerprints. The earliest formulation of the multiset Tanimoto that I’ve found is from 2004, which even describes it as the multiset Tanimoto, with no mention of the vector form. Maggiora and Shanmugasundaram based their approach on fuzzy set theory.
However, for most of the next 15 or so years the multiset Tanimoto, if given a name, was generally referred to as the MaxMin kernel, to distinguish it from the (vector) Tanimoto, due to a 2005 paper by S. Joshua Swamidass, Jonathan Chen, Jocelyne Bruand, Peter Phung, Liva Ralaivola and Pierre Baldi. (A few described the multiset Tanimoto as the Ruzicka similarity or as “the multiset Tanimoto” or simply “the Tanimoto”.)
The Swamidass et al. paper also showed that a count vector could be transformed into a binary vector such that the MaxMin (ie, multiset Tanimoto) similarity between the two count vectors is identical to the vector Tanimoto between the two corresponding binary vectors. If a maximum count Nᵢ is known for each vector position i then a given count nᵢ can be encoded in unary form with nᵢ ones followed by Nᵢ-nᵢ zeros. Which happens to be the additive coding that Adamson and Bush used in the 1970s!
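The unary encoding is easy to try out. The following Python sketch encodes two small count vectors with hand-picked maximum counts Nᵢ and compares the binary Tanimoto of the encoded vectors with the multiset Tanimoto of the original counts. The function names and the choice of Nᵢ values are assumptions for this example, and the code is an illustration of the idea rather than the procedure used in any of the cited papers.

```python
def unary_encode(counts, max_counts):
    """Encode each count n_i as n_i ones followed by N_i - n_i zeros."""
    bits = []
    for n, n_max in zip(counts, max_counts):
        if not 0 <= n <= n_max:
            raise ValueError("count must be between 0 and its maximum N_i")
        bits.extend([1] * n + [0] * (n_max - n))
    return bits


def binary_tanimoto(a, b):
    """C/(A+B-C) for two equal-length lists of 0/1 values."""
    common = sum(p & q for p, q in zip(a, b))
    total = sum(a) + sum(b) - common
    return common / total if total else 0.0


def multiset_tanimoto(x, y):
    """Sum of min counts divided by sum of max counts."""
    smallest = sum(min(p, q) for p, q in zip(x, y))
    largest = sum(max(p, q) for p, q in zip(x, y))
    return smallest / largest if largest else 0.0


x, y = (2, 3, 0), (3, 1, 1)
max_counts = (3, 3, 1)  # N_i for each position, picked by hand for this sketch

bx = unary_encode(x, max_counts)
by = unary_encode(y, max_counts)
print(bx)
print(by)
print(binary_tanimoto(bx, by), multiset_tanimoto(x, y))
```

The two similarities agree because, within each position's block of Nᵢ bits, the number of shared leading ones is min(xᵢ, yᵢ) and the number of bits set in either block is max(xᵢ, yᵢ).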
Papers published in the last few years (I write this in 2025) are increasingly using the multiset Tanimoto formulation, and even referring to it as “the Tanimoto” without qualifiers. This is almost certainly due to the popularity of the RDKit toolkit. The count fingerprint Tanimoto was committed to the code repository on 18 June 2009, and implemented as the multiset Tanimoto, building on “fuzzy” set operations committed on 22 September 2007.
My current interpretation is that the Sheffield group in the 1970s and 1980s was strongly influenced by mathematical taxonomy, like Sneath and Sokal’s book. Sneath and Sokal don’t describe anything like the vector Tanimoto, though they do discuss the need to normalize coefficients and the generalized Mahalanobis distance. I suspect the Sheffield group used the vector interpretation because it falls out naturally from well-known concepts like Euclidean distance and cosine similarity.
As the knowledge of the utility of fingerprint similarity spread, most people focused on binary fingerprints, where Tanimoto’s set formulation is equally relevant to population counts. I believe that experience led people who mostly had experience with binary fingerprints to extend the binary Tanimoto into the multiset formulation.
I wonder what might have happened had Sheffield been more influenced by ecology than taxonomy!
Since the set formulation for the Jaccard index came from Tanimoto, I think it’s appropriate to refer to the multiset Tanimoto as “the Tanimoto”, even though Gleason (and Ružička) developed it earlier while Tanimoto never did.
I will try to write “multiset Tanimoto” or “count Tanimoto” to prevent confusion, but may sometimes write “Tanimoto”.
Generate sparse count fingerprints with RDKit¶
In this section you’ll learn how to generate RDKit sparse count fingerprints in FPC format using the chemfp rdkit2fpc command-line tool.
The “rdkit2fpc” chemfp subcommand by default reads SMILES from stdin, uses RDKit to generate Morgan sparse fingerprints of radius 3, and writes the results to stdout in FPC format. The following example generates a fingerprint for carbon nitride:
% echo "N#CC#N carbon nitride" | chemfp rdkit2fpc
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-09-09T08:40:31+00:00
797079298:2,847433064:2,2245900962:2,2551483158:2,3389123367 carbon nitride
The command supports quite a few options. The following creates a SMILES file which is used as input to generate Morgan fingerprints of radius 1, without the date in the header.
% echo "N#CC#N carbon nitride" > C2N2.smi
% chemfp rdkit2fpc --radius 1 --no-date C2N2.smi
797079298:2,847433064:2,2245900962:2,2551483158:2 carbon nitride
Chemfp supports all of RDKit’s sparse fingerprint types –
--morgan for Morgan fingerprints (the default if nothing is
specified), --RDK for RDKit’s version of the Daylight fingerprint,
--pair for atom pairs, and --torsion for torsions:
% chemfp rdkit2fpc --morgan --no-date C2N2.smi
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#source=C2N2.smi
797079298:2,847433064:2,2245900962:2,2551483158:2,3389123367 carbon nitride
%
% chemfp rdkit2fpc --RDK --no-date C2N2.smi
#FPC1
#type=RDKit-CountFingerprint/3 minPath=1 maxPath=7
#software=RDKit/2024.09.5 chemfp/5.0
#source=C2N2.smi
986785516,1081682082,1586864133:2,2108965010:2,2784417258,3874252909:2,4247327457:2,4275705116 carbon nitride
%
% chemfp rdkit2fpc --pair --no-date C2N2.smi
#FPC1
#type=RDKit-AtomPairCount/3 minDistance=1 maxDistance=30
#software=RDKit/2024.09.5 chemfp/5.0
#source=C2N2.smi
820801,1328705:2,1328706:2,1329699 carbon nitride
%
% chemfp rdkit2fpc --torsion --no-date C2N2.smi
#FPC1
#type=RDKit-TorsionCount/4 torsionAtomCount=4
#software=RDKit/2024.09.5 chemfp/5.0
#source=C2N2.smi
10750025808 carbon nitride
If the output is not stdout (or if --progress is specified) then
rdkit2fpc will show the progress bar. Use --no-progress to disable
the progress bar. The following processes the ChEMBL 33 SD file to
save Morgan fingerprints with radius 3 in FPB format. Note that I’m
showing the partial progress so you can see what it looks like:
% rdkit2fps chembl_33.sdf.gz -o chembl_33_morgan3.fpb
chembl_33.sdf.gz: 77%|████████████████████████▌ | 595M/770M [08:27<02:40, 1.09Mbytes/s]
(I’ve omitted a number of RDKit warning and error messages like “Explicit valence … is greater than permitted”, “Warning: ambiguous stereochemistry”, and “not removing hydrogen atom without neighbors”.)
In this case it’s processed about 77% of the input file, as based on the read position in the gzip file. It’s taken 8 minutes and 27 seconds so far, and there’s an estimated 2 minutes and 40 seconds to go.
Here’s the final progress bar, once finished, showing it took about 11.5 minutes:
chembl_33.sdf.gz: 100%|█████████████████████████████| 770M/770M [11:26<00:00, 1.12Mbytes/s]
For more information about the command-line options, use chemfp rdkit2fpc --help.
Generate Morgan count fingerprints from ChEBI¶
In this section you’ll use the ChEBI structures to generate Morgan sparse count fingerprints. This will be used for the next few sections.
I tend to use ChEMBL for my examples but with over 2 million fingerprints the processing time is too long to be a good learning dataset. I’ll instead use ChEBI, which has under 200,000 structures.
To start, download the ChEBI structures from ChEBI_lite.sdf.gz and convert them to sparse Morgan fingerprints with radius 3.
% curl -O https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz
% chemfp rdkit2fpc ChEBI_lite.sdf.gz -o chebi_morgan3.fpc
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
....
Ahh, right. Most of the ChEBI records in the SDF don’t have a title
(the first line of the SDF record) and instead store the CHEBI id in
the <ChEBI ID> data item. I’ll use --id-tag to get the id from
there.
% chemfp rdkit2fpc ChEBI_lite.sdf.gz --id-tag "ChEBI ID" -o chebi_morgan3.fpc
... various warning lines omitted ...
ChEBI_lite.sdf.gz: 100%|██████████████████| 69.5M/69.5M [01:06<00:00, 1.04Mbytes/s]
%
% grep -v ^# chebi_morgan3.fpc | wc -l
191477
rdkit2fpc generated 191,477 fingerprints. Here’s what the first few lines look like:
% head -8 chebi_morgan3.fpc | expand | fold
#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#source=ChEBI_lite.sdf.gz
#date=2025-09-09T10:46:58+00:00
26234434:4,51975834,184434586,255902554,266675433,368531313,685125387,864662311:
5,951226070:2,970433791,994485099:3,1059975483,1071316524,1228528465,1395869423,
1547691028,1599425600,1634955530,1824088295,1842114921,1955172768,2060400725,210
3477448,2132620936,2283643469,2409446617,2549196227,2584737808,2679412098,269764
2734,2717067033,2766434492,2905660137,2912444370,2968968094,2976033787:2,3026394
695:3,3145415797,3189457552,3192617127,3217380708:7,3218693969:5,3349228160,3382
146383,3442795883,3509891539,3561006593,3643782586,3698257053,3880717808
CHEBI:90
10565946,364836735,516041383,864942730,1232350217,1611104173,1692438161,17842042
80,1861965050:3,2117068077,2119439498,2146244371,2246728737:3,2292935200,2865232
610,2968968094:3,2975316496,2976033787,2976816164:2,3217380708,3526198586,381839
8033,4025022058,4092462314,4273842364 CHEBI:165
515008442,533204632,667753088,864662311,864674487:2,864942730,1224372022,1298690
312,1338473746,1510328189,1535166686,2042559479,2167929383,2194347730,2214772272
,2245273601,2245384272:2,2246699815,2342113506:2,2807496773,2917680300,306650369
1,3812711892,3927890045,4003049590,4022716898 CHEBI:598
Chemfp does not implement direct similarity search of count fingerprints. Instead, at least for the chemfp 5.0 release, the fingerprints must first be converted to binary fingerprints for search. See the superimpose section below for an example.
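To make the record layout shown above concrete, here is a minimal sketch of a parser for a single FPC data line. It is not chemfp's parser. The function name parse_fpc_line is made up for illustration, and it assumes only what the examples in this chapter show: comma-separated feature ids, an optional ":count" suffix that defaults to 1 when omitted, then a tab and the record id.

```python
def parse_fpc_line(line):
    """Parse one FPC data line into (record id, {feature id: count}).

    Illustrative only: assumes "id[:count],id[:count],...<TAB>record id".
    """
    features_field, _, record_id = line.rstrip("\n").partition("\t")
    counts = {}
    for term in features_field.split(","):
        feature_id, sep, count = term.partition(":")
        # A feature without an explicit ":count" occurs once.
        counts[int(feature_id)] = int(count) if sep else 1
    return record_id, counts


record_id, counts = parse_fpc_line("65,67:10,129\tABC\n")
print(record_id, counts)  # ABC {65: 1, 67: 10, 129: 1}
```

Header lines (those starting with "#") are not handled here and would need to be skipped before calling the function.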
Count fingerprint search with the RDKit API¶
In this section you’ll learn how to use the RDKit API to do a BulkTanimotoSimilarity search and report the nearest k results. This will be used to establish the expected similarities for the upcoming sections on converting count fingerprints to binary fingerprints for similarity search.
Chemfp does not implement direct count similarity search, but RDKit does. We can use RDKit to generate the Morgan3 count fingerprints for ChEBI and do a nearest neighbor search. There are two phases to the program: fingerprint generation, and fingerprint search.
In the first phase, use Python’s gzip module to uncompress the ChEBI_lite.sdf.gz file, Chem.ForwardSDMolSupplier to get molecules from the Python file object, and a bit of Python code to get the id and Morgan3 fingerprint of each molecule for later use.
For the second phase, make an interactive program which gets the query id, looks up the fingerprint, uses BulkTanimotoSimilarity to compute similarity scores against all of the fingerprints, and a bit of Python code to sort the scores to report the nearest 20 matches.
This is chemfp documentation, not RDKit documentation, so I’ll simply present the code and let the comments explain the steps:
import gzip
import random
import time
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=3)
# Store parallel lists of the record ids and fingerprints.
ids = []
fps = [] # this list will be passed to BulkTanimotoSimilarity
# Mapping from record id to record index.
id_to_record_idx = {}
t1 = time.time()
# RDKit does not support reading compressed files.
#
# - One option is to decompress the file first, which would let us use
# the multi-threaded fingerprint generation in
# rdMolProcessing.GetFingerprintsForMolsInFile, which requires a
# filename, not a Python file object.
#
# - Or we let Python handle the decompression and pass a file
# handle to Chem.ForwardSDMolSupplier, like this:
with gzip.open("ChEBI_lite.sdf.gz") as infile:
    with Chem.ForwardSDMolSupplier(infile) as suppl:
        for mol in suppl:
            if mol is None:
                # Could not process
                continue
            # Get the id, generate the fingerprint, and save.
            id = mol.GetProp("ChEBI ID")
            fp = fpgen.GetSparseCountFingerprint(mol)
            id_to_record_idx[id] = len(ids)
            ids.append(id)
            fps.append(fp)
t2 = time.time()
print(f"Generated {len(ids)} fingerprints in {t2-t1:.1f} seconds.")
# Interactive k-nearest search.
print("Enter a query id, 'random' for a random id, or 'quit' to quit.")
k = 20
try:
    while 1:
        query_id = input("> Query id? ")
        # Some helpful commands
        if query_id == "quit":
            break
        if query_id == "random":
            print(f"Here is a random id: {random.choice(ids)}")
            continue
        # Look up the record index, if available, then get the fingerprint.
        query_idx = id_to_record_idx.get(query_id, None)
        if query_idx is None:
            print(f"** Not found: {query_id!r} **")
            continue
        query_fp = fps[query_idx]
        # Do the bulk search
        t1 = time.time()
        # This returns a score for each fingerprint. We want the k-nearest.
        scores = DataStructs.BulkTanimotoSimilarity(query_fp, fps)
        # We can let sort(reverse=True) sort the scores from largest
        # to smallest. It will break ties on the index, with the largest
        # index first. I want the first index to come first, so use -index.
        scores_and_negated_indices = [(score, -i) for i, score in enumerate(scores)]
        scores_and_negated_indices.sort(reverse=True)
        t2 = time.time()
        # Report the k-nearest hits.
        print(f"Search time: {t2-t1:.1f} seconds.")
        for target_score, neg_target_idx in scores_and_negated_indices[:k]:
            print(f"{query_id}\t{ids[-neg_target_idx]}\t{target_score:.4f}")
except EOFError:
    print() # Write a newline to get off the input() prompt.
print("Bye!")
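Sorting every score is simple, but it does more work than needed when only k hits are reported. As a sketch of an alternative (it is not part of the program above), heapq.nlargest from the Python standard library can select the top k directly. The helper name k_nearest is hypothetical. It uses the same (score, -index) ordering, so tied scores still come out with the smallest index first.

```python
import heapq


def k_nearest(scores, k):
    """Return (index, score) pairs for the k highest scores.

    Ties are broken in favor of the lowest index, like the
    (score, -index) sort used in the program above.
    """
    return heapq.nlargest(
        k, enumerate(scores), key=lambda pair: (pair[1], -pair[0])
    )


scores = [0.25, 1.0, 0.5, 0.5, 0.1]
print(k_nearest(scores, 3))  # [(1, 1.0), (2, 0.5), (3, 0.5)]
```

For 20 hits out of roughly 190,000 scores the difference is small compared to the time spent computing the scores, so this is mostly a matter of taste.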
It takes about a minute to read the structures and generate fingerprints before it gets to the prompt:
% python chebi_search.py
[09:35:08] WARNING: not removing hydrogen atom without neighbors
[09:35:08] WARNING: not removing hydrogen atom with dummy atom neighbors
... many warnings omitted ...
Generated 191477 fingerprints in 59.3 seconds.
Enter a query id, 'random' for a random id, or 'quit' to quit.
> Query id?
I’ll use phenol, which is CHEBI:15882, as a query:
> Query id? CHEBI:15882
Search time: 0.6 seconds.
CHEBI:15882 CHEBI:15882 1.0000
CHEBI:15882 CHEBI:17296 0.5172
CHEBI:15882 CHEBI:17578 0.5172
CHEBI:15882 CHEBI:28097 0.5172
CHEBI:15882 CHEBI:49819 0.5172
CHEBI:15882 CHEBI:3179 0.5172
CHEBI:15882 CHEBI:5115 0.5172
CHEBI:15882 CHEBI:5611 0.5172
CHEBI:15882 CHEBI:30396 0.5172
CHEBI:15882 CHEBI:30787 0.5172
CHEBI:15882 CHEBI:48372 0.5172
CHEBI:15882 CHEBI:48498 0.5172
CHEBI:15882 CHEBI:50526 0.5172
CHEBI:15882 CHEBI:51470 0.5172
CHEBI:15882 CHEBI:52245 0.5172
CHEBI:15882 CHEBI:137811 0.5172
CHEBI:15882 CHEBI:172874 0.5172
CHEBI:15882 CHEBI:17987 0.5000
CHEBI:15882 CHEBI:28902 0.5000
CHEBI:15882 CHEBI:37866 0.5000
Next I’ll do a couple of randomly chosen queries, then quit:
> Query id? random
Here is a random id: CHEBI:60990
> Query id? CHEBI:60990
Search time: 0.8 seconds.
CHEBI:60990 CHEBI:60990 1.0000
CHEBI:60990 CHEBI:60993 1.0000
CHEBI:60990 CHEBI:60999 0.6525
CHEBI:60990 CHEBI:18332 0.4530
CHEBI:60990 CHEBI:30659 0.4530
CHEBI:60990 CHEBI:30660 0.4530
CHEBI:60990 CHEBI:28253 0.4462
CHEBI:60990 CHEBI:15768 0.4330
CHEBI:60990 CHEBI:80689 0.4186
CHEBI:60990 CHEBI:60302 0.4167
CHEBI:60990 CHEBI:6446 0.4132
CHEBI:60990 CHEBI:6447 0.4098
CHEBI:60990 CHEBI:90872 0.3952
CHEBI:60990 CHEBI:80691 0.3945
CHEBI:60990 CHEBI:16122 0.3762
CHEBI:60990 CHEBI:131194 0.3667
CHEBI:60990 CHEBI:28774 0.3659
CHEBI:60990 CHEBI:11684 0.3659
CHEBI:60990 CHEBI:232361 0.3659
CHEBI:60990 CHEBI:90871 0.3654
> Query id? random
Here is a random id: CHEBI:95267
> Query id? CHEBI:95267
Search time: 0.7 seconds.
CHEBI:95267 CHEBI:95267 1.0000
CHEBI:95267 CHEBI:194983 0.3933
CHEBI:95267 CHEBI:125656 0.3585
CHEBI:95267 CHEBI:105618 0.3363
CHEBI:95267 CHEBI:121557 0.3279
CHEBI:95267 CHEBI:116541 0.3182
CHEBI:95267 CHEBI:92134 0.3158
CHEBI:95267 CHEBI:107761 0.3145
CHEBI:95267 CHEBI:92565 0.3140
CHEBI:95267 CHEBI:123434 0.3130
CHEBI:95267 CHEBI:195049 0.3118
CHEBI:95267 CHEBI:228582 0.3097
CHEBI:95267 CHEBI:194995 0.3061
CHEBI:95267 CHEBI:195078 0.3043
CHEBI:95267 CHEBI:92924 0.3036
CHEBI:95267 CHEBI:34377 0.3000
CHEBI:95267 CHEBI:120965 0.3000
CHEBI:95267 CHEBI:173662 0.3000
CHEBI:95267 CHEBI:93498 0.2991
CHEBI:95267 CHEBI:92108 0.2941
> Query id? quit
Bye!
CHEBI:60990 is 3,5-diiodotyrosyl-3,5-diiodotyrosine with SMILES
NC(Cc1cc(I)c(O)c(I)c1)C(=O)NC(Cc1cc(I)c(O)c(I)c1)C(=O)O.
CHEBI:95267 is
N-(2-bromo-4-methylphenyl)-5-pyridin-4-yl-2-thiophenamine with SMILES
Cc1ccc(Nc2ccc(-c3ccncc3)s2)c(Br)c1.
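For reference, here is a pure-Python sketch of the score being reported. It assumes that RDKit's Tanimoto for sparse count fingerprints is the multiset Tanimoto, that is, the sum of the per-feature minimum counts divided by the sum of the per-feature maximum counts. The function name and the dictionary representation are for illustration only and are not an RDKit or chemfp API. Returning 0.0 for two empty fingerprints is a choice made for this sketch.

```python
def multiset_tanimoto(counts1, counts2):
    """Multiset Tanimoto between two {feature id: count} dictionaries."""
    features = set(counts1) | set(counts2)
    min_sum = sum(min(counts1.get(f, 0), counts2.get(f, 0)) for f in features)
    max_sum = sum(max(counts1.get(f, 0), counts2.get(f, 0)) for f in features)
    # Assumption for this sketch: two empty fingerprints score 0.0.
    return min_sum / max_sum if max_sum else 0.0


a = {2: 1, 5: 11, 93: 3}
b = {2: 2, 5: 4, 220: 1}
# Minimum counts sum to 1 + 4 = 5; maximum counts sum to 2 + 11 + 3 + 1 = 17.
print(f"{multiset_tanimoto(a, b):.4f}")  # 0.2941
```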
Convert sparse count fingerprints to binary¶
In this section you’ll learn how to use the chemfp fpc2fps command-line tool to convert sparse count fingerprints in FPC format to binary fingerprints in FPS or FPB format. The last examples require the chebi_morgan3.fpc file created earlier.
There is no general-purpose method to convert count fingerprints to binary fingerprints, or at least no practical one. Why not? Chemfp’s FPC parser supports 2⁶⁴ feature ids and 2³² counts, resulting in a huge number of potential distinct fingerprints, which cannot all be expressed as a binary fingerprint with only a few tens of thousands of bits.
Instead, chemfp implements several different methods to convert sparse count fingerprints to binary fingerprints, which can then be processed by the rest of chemfp. See the superimpose section below for an example of using the default “superimpose” method for a similarity search of the ChEBI data set. The full list is:
superimpose - use random superimposed coding for the count
scaled - same as superimpose but with scaled counts
fold - use modulo reduction on the feature id, ignore counts
rdkit-count-sim or rdkit - use RDKit’s count simulation algorithm
seq - use unary encoding for dense count fingerprints
seq-scaled - same as seq but with scaled counts
Each method will be discussed in its own section below.
The classic solution is to “fold” the features based on the feature
id, and ignore the count. If the output fingerprint is “num_bits”
long, then let B be the feature id modulo num_bits and set output bit
B to 1. The following example converts a count fingerprint with three
features into a binary fingerprint using the fold method (-m
is shorthand for --method):
% printf "65,67:10,129\tABC\n" | chemfp fpc2fps -m fold --num-bits 64
#FPS1
#num_bits=64
#type=fold/1 num_bits=64
0a00000000000000 ABC
Both 65 and 129 modulo 64 are 1, and 67 modulo 64 is 3, so the first fingerprint byte should be “00001010” in binary, or “0a” in hex, which matches what we see in the output. (In chemfp, the bits of a byte are in big-endian order, with bit 0 on the right.)
The “#type” line shows that the input was folded to 64 bits. In upcoming sections you’ll see how the conversion method type is appended to any existing fingerprint type.
RDKit uses a method it calls “count simulation”, which is also
available in chemfp as -m rdkit-count-sim or -m rdkit for short:
% printf "65,67:10,129\tABC\n" | chemfp fpc2fps -m rdkit --num-bits 64
#FPS1
#num_bits=64
#type=rdkit-count-sim/1 num_bits=64 countBounds=1,2,4,8
30f0000000000000 ABC
The “#type” line includes the parameters needed to reproduce the conversion method, in this case showing RDKit’s default count bounds of 1, 2, 4, 8, which approximate a log-count in base 2.
These two examples read the count fingerprints from stdin and wrote the binary fingerprints to stdout. The fpc2fps program can also read from and write to filenames. The following reads from the chebi_morgan3.fpc file and writes the folded fingerprints in FPS format to stdout:
% chemfp fpc2fps -m fold chebi_morgan3.fpc | head -8 | expand | fold -w 72
#FPS1
#num_bits=2048
#type=RDKit-MorganCount/2 radius=3 useFeatures=0 | fold/1 num_bits=2048
#software=RDKit/2024.09.5 chemfp/5.0
#source=ChEBI_lite.sdf.gz
#date=2025-09-09T10:46:58+00:00
000001000000000000002000000000000000000080000010000000008000000000040000
000000080000000000020000000000000100000000000800000000000000000000000000
002000000000000001000100000000000000000000420000000000008000000000000204
000800000000004000000000000000000000000800001000000000000100000000002000
000000200000000000010002000200800000000000000000008000001008000004000000
000000040800000000000100020000000010000084000000000000000001000400000000
000040000080000000080002000000000000020000000200000000040000000000000400
00000000 CHEBI:90
000000000000000000000000000000000000000000000010000000000000000000000000
000000040000020000000000000000000000000000000000000000000000010001000000
000000000000000000040000800000000000000004000000000000000000000000000000
000000000000004000000000000000000000000900020000120000000000000000000000
000400000000000000000000000400040000000000200004000000001000000000000000
000000000000000000000000000000000000000000000000000400000000020000000000
000000000000000000000800000000000000000000000080000000000020000000000000
00000000 CHEBI:165
The “#type” line shows the parameters for both the original fingerprint generator and the conversion method, separated by the pipe symbol “|”. At present this information is only for record-keeping as chemfp does not yet parse this line.
The output fingerprints can also be written directly to FPB format, like this example using RDKit’s count simulation method:
% chemfp fpc2fps -m rdkit chebi_morgan3.fpc -o chebi_morgan3_countsim.fpb
chebi_morgan3.fpc: 100%|██████████████| 150M/150M [00:09<00:00, 16.0Mbytes/s]
As usual with the chemfp tools, it shows a progress bar if the output
is not stdout. Use --progress to always have a progress bar or
--no-progress to always disable the progress bar.
fpc2fps “fold” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert the count fingerprints to binary using the “fold” method, and see how the similarity search scores compare to count scores computed by RDKit. You will need the chebi_morgan3.fpc file created earlier.
How does folding work?¶
In the context of cheminformatics fingerprints, the term “fold” refers to a modulo operation. Given a feature id F and N bits in the fingerprint, B = F modulo N is the folded value for F, and bit B will be set to 1. This method came from the original Daylight fingerprints, which generated a hash for each path that was then folded to set a bit in the fingerprint.
Folding ignores the count.
The following shows a hypothetical Python implementation:
def fold(count_fingerprint, num_bits):
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, feature_count in count_fingerprint:
        output_fp.SetOnBit(feature_id % num_bits)
    return output_fp
Use the chemfp fpc2fps option -m fold to
fold count fingerprints to a given number of bits. The default is 2048
bits, while the following example uses 64 bits to convert an FPC file
into FPS format:
% printf "65,67:10,129\texample\n" | chemfp fpc2fps -m fold --num-bits 64
#FPS1
#num_bits=64
#type=fold/1 num_bits=64
0a00000000000000 example
Both 65 and 129 modulo 64 are 1, and 67 modulo 64 is 3, so the first fingerprint byte should be “00001010” in binary, or “0a” in hex, which matches what we see in the output. (In chemfp, the bits of a byte are in big-endian order, with bit 0 on the right.)
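The arithmetic above can be checked with a small runnable sketch. This is not chemfp's implementation. The function name fold_to_hex is made up, it assumes num_bits is a multiple of 8, and it follows the byte layout described above, where bit 0 is the least-significant bit of the first byte.

```python
def fold_to_hex(feature_ids, num_bits):
    """Fold feature ids into a num_bits-long fingerprint; return it as hex."""
    fp = bytearray(num_bits // 8)
    for feature_id in feature_ids:
        bitno = feature_id % num_bits
        # Bit 0 is the least-significant bit of byte 0.
        fp[bitno // 8] |= 1 << (bitno % 8)
    return fp.hex()


# The counts are ignored, so only the feature ids are passed in.
print(fold_to_hex([65, 67, 129], 64))  # 0a00000000000000
```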
Folding in action¶
The following folds the chebi_morgan3.fpc fingerprints to an FPS file (without a progress bar), extracts the fingerprints for CHEBI:15882, CHEBI:60990, and CHEBI:95267 to use as queries, then uses simsearch to find the k=10 nearest neighbors for each search, with the output written to stdout in tsv format.
% chemfp fpc2fps -m fold chebi_morgan3.fpc -o chebi_morgan3_fold.fps --no-progress
% grep -E "CHEBI:(15882|60990|95267)$" chebi_morgan3_fold.fps > fold_queries.fps
% simsearch -k 10 --queries fold_queries.fps chebi_morgan3_fold.fps --out tsv
query_id target_id score
CHEBI:15882 CHEBI:15882 1.0000000
CHEBI:15882 CHEBI:49819 0.5200000
CHEBI:15882 CHEBI:28902 0.4642857
CHEBI:15882 CHEBI:48285 0.4642857
CHEBI:15882 CHEBI:3179 0.4615385
CHEBI:15882 CHEBI:5115 0.4615385
CHEBI:15882 CHEBI:28097 0.4615385
CHEBI:15882 CHEBI:17296 0.4615385
CHEBI:15882 CHEBI:30396 0.4615385
CHEBI:15882 CHEBI:5611 0.4615385
CHEBI:60990 CHEBI:60993 1.0000000
CHEBI:60990 CHEBI:60990 1.0000000
CHEBI:60990 CHEBI:60999 0.8369565
CHEBI:60990 CHEBI:28253 0.5436893
CHEBI:60990 CHEBI:15768 0.4878049
CHEBI:60990 CHEBI:80691 0.4673913
CHEBI:60990 CHEBI:60987 0.4368932
CHEBI:60990 CHEBI:60989 0.4368932
CHEBI:60990 CHEBI:90871 0.4367816
CHEBI:60990 CHEBI:18332 0.4215686
CHEBI:95267 CHEBI:95267 1.0000000
CHEBI:95267 CHEBI:194983 0.3846154
CHEBI:95267 CHEBI:125656 0.3666667
CHEBI:95267 CHEBI:120965 0.3294118
CHEBI:95267 CHEBI:92134 0.3265306
CHEBI:95267 CHEBI:105618 0.3229167
CHEBI:95267 CHEBI:108072 0.3222222
CHEBI:95267 CHEBI:194883 0.3170732
CHEBI:95267 CHEBI:92924 0.3157895
CHEBI:95267 CHEBI:114486 0.3125000
I compared these to the count Tanimoto scores that RDKit computed. There was no overlap between the 9 nearest binary neighbors of CHEBI:15882 (phenol) and its 19 nearest count neighbors. 6 of the 9 binary nearest neighbors of CHEBI:60990 were in its 19 nearest count neighbors. 4 of the 9 binary nearest neighbors of CHEBI:95267 were in its 19 nearest count neighbors.
RDKit uses folding to generate binary Morgan fingerprints, which we can verify by asking rdkit2fps to generate the fingerprint for the phenol SMILES “c1ccccc1O”:
% echo "c1ccccc1O CHEBI:15882" | rdkit2fps --morgan3 --no-metadata
000000000000000002000000000000000000000000000000000000000000000000000000
000000000000000000000000200000000000000000000000000000000080000000000000
000000000000000000000000000000000000000000020000000000008000000000000000
000000000000000000000000000000000000000000000000000000000100000000000000
000000000080000000000000000000000000000000000000000000001000000000000000
000000000000000000000000000000000000000004000000000000000000000000000000
000040000000040000000000000000000000020000000000000000080000000000000000
00000000 CHEBI:15882
% grep 'CHEBI:15882$' fold_queries.fps
000000000000000002000000000000000000000000000000000000000000000000000000
000000000000000000000000200000000000000000000000000000000080000000000000
000000000000000000000000000000000000000000020000000000008000000000000000
000000000000000000000000000000000000000000000000000000000100000000000000
000000000080000000000000000000000000000000000000000000001000000000000000
000000000000000000000000000000000000000004000000000000000000000000000000
000040000000040000000000000000000000020000000000000000080000000000000000
00000000 CHEBI:15882
I don’t know about you, but I think it’s tedious and error-prone to read the hex values. As I only want to test for equivalence, I’ll generate MD5 checksums, which are much easier to compare:
% echo "c1ccccc1O CHEBI:15882" | rdkit2fps --morgan3 --no-metadata | md5sum
8f3dd5173bcd8c7e57467e2ac3bfbfc6 -
% grep 'CHEBI:15882$' fold_queries.fps | md5sum
8f3dd5173bcd8c7e57467e2ac3bfbfc6 -
fpc2fps “rdkit-count-sim” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert count fingerprints to binary using RDKit’s count simulation method, and see how the similarity search scores compare to count scores computed by RDKit. You will need the chebi_morgan3.fpc file created earlier.
How does count simulation work?¶
The RDKit toolkit implements a “count simulation” method based around
a list of count bounds, such as countBounds=[1,10,100]. In this
case, the fingerprint is segmented into bands where each band has three
bits. If the count for a band is at least 1 then the first bit is set
to 1. If the count is at least 10 then the second bit is set to 1. If
the count is at least 100 then the third bit is set to 1.
The effective fingerprint size is now 1/3 of the original size, while each bin contains an integer log, base 10, of the count, expressed in unary. This is equivalent to the additive coding used in the 1970s work of Adamson and Bush.
The bounds do not need to be successive powers. (I’m curious about useful alternatives.) They don’t even need to be in increasing order, though that flexibility has no useful consequences.
The fpc2fps command-line tool implements RDKit’s count simulation,
available as -m rdkit-count-sim or as the shorthand -m rdkit.
Here is an example:
% printf "2,5:11,93:3,220:44\tABC\n" | \
chemfp fpc2fps -m rdkit-count-sim --num-bits 32 --countBounds 1,4,12,20
#FPS1
#num_bits=32
#type=rdkit-count-sim/1 num_bits=32 countBounds=1,4,12,20
00017f00 ABC
This has four bounds giving a fingerprint with 32//4 = 8 bins.
The feature ids 2, 5, 93, and 220, modulo 8, are 2, 5, 5, and 4. These are the bin numbers. Bin 2 contains the count 1, bin 5 contains the count 11+3=14, and bin 4 contains the count 44, giving the dense count bins:
[0, 0, 1, 0, 44, 14, 0, 0] # dense count bins
Using the count bounds, bin 2 is encoded as “1000”, bin 4 is encoded as “1111”, and bin 5 is encoded as “1110”, while the others are encoded as “0000”, giving the little-endian bit pattern (where bit 0 is on the far left and bit 31 is on the far right):
0000_0000_1000_0000_1111_1110_0000_0000 # 32-bits, little endian
Chemfp uses little-endian bytes in a fingerprint, and big-endian bits in a byte, so every 8 bits needs to be flipped around:
0000_0000_0000_0001_0111_1111_0000_0000 # 32-bits, mixed endian
which in hex is “00017f00”, matching what was shown above.
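The worked example can be reproduced with a short sketch. This is an illustration of the steps described above, not RDKit's or chemfp's code. The function name is hypothetical, and it assumes num_bits is a multiple of 8 and of the number of bounds.

```python
def count_simulation_to_hex(counts, num_bits, count_bounds):
    """Encode a {feature id: count} dictionary with unary count bins."""
    num_bins = num_bits // len(count_bounds)
    # Step 1: fold the feature ids into dense count bins.
    bins = [0] * num_bins
    for feature_id, count in counts.items():
        bins[feature_id % num_bins] += count
    # Step 2: each bin gets one bit per bound, set when the bound is reached.
    fp = bytearray(num_bits // 8)
    for bin_no, total in enumerate(bins):
        for offset, bound in enumerate(count_bounds):
            if total >= bound:
                bitno = bin_no * len(count_bounds) + offset
                # Bit 0 is the least-significant bit of byte 0.
                fp[bitno // 8] |= 1 << (bitno % 8)
    return fp.hex()


counts = {2: 1, 5: 11, 93: 3, 220: 44}
print(count_simulation_to_hex(counts, 32, [1, 4, 12, 20]))  # 00017f00
```

The same function gives “30f0000000000000” for the earlier 64-bit example with features 65, 67:10, and 129 and the default bounds 1, 2, 4, 8.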
Count simulation in action¶
The following uses count simulation to convert the chebi_morgan3.fpc fingerprints to an FPS file (without a progress bar), extracts the fingerprints for CHEBI:15882, CHEBI:60990, and CHEBI:95267 to use as queries, then uses simsearch to find the k=10 nearest neighbors for each search, with the output written to stdout in tsv format.
% chemfp fpc2fps -m rdkit chebi_morgan3.fpc -o chebi_morgan3_countsim.fps --no-progress
% grep -E "CHEBI:(15882|60990|95267)$" chebi_morgan3_countsim.fps > countsim_queries.fps
% simsearch -k 10 --queries countsim_queries.fps chebi_morgan3_countsim.fps --out tsv
query_id target_id score
CHEBI:15882 CHEBI:15882 1.0000000
CHEBI:15882 CHEBI:49819 0.5200000
CHEBI:15882 CHEBI:28902 0.4642857
CHEBI:15882 CHEBI:48285 0.4642857
CHEBI:15882 CHEBI:3179 0.4615385
CHEBI:15882 CHEBI:5115 0.4615385
CHEBI:15882 CHEBI:28097 0.4615385
CHEBI:15882 CHEBI:17296 0.4615385
CHEBI:15882 CHEBI:30396 0.4615385
CHEBI:15882 CHEBI:5611 0.4615385
CHEBI:60990 CHEBI:60993 1.0000000
CHEBI:60990 CHEBI:60990 1.0000000
CHEBI:60990 CHEBI:60999 0.8369565
CHEBI:60990 CHEBI:28253 0.5436893
CHEBI:60990 CHEBI:15768 0.4878049
CHEBI:60990 CHEBI:80691 0.4673913
CHEBI:60990 CHEBI:60987 0.4368932
CHEBI:60990 CHEBI:60989 0.4368932
CHEBI:60990 CHEBI:90871 0.4367816
CHEBI:60990 CHEBI:18332 0.4215686
CHEBI:95267 CHEBI:95267 1.0000000
CHEBI:95267 CHEBI:194983 0.3846154
CHEBI:95267 CHEBI:125656 0.3666667
CHEBI:95267 CHEBI:120965 0.3294118
CHEBI:95267 CHEBI:92134 0.3265306
CHEBI:95267 CHEBI:105618 0.3229167
CHEBI:95267 CHEBI:108072 0.3222222
CHEBI:95267 CHEBI:194883 0.3170732
CHEBI:95267 CHEBI:92924 0.3157895
CHEBI:95267 CHEBI:114486 0.3125000
I compared these to the count Tanimoto scores that RDKit computed.
8 of the 9 nearest binary neighbors of CHEBI:15882 (phenol) were in the nearest 19 count neighbors, which is much better than the complete lack of overlap with the folding method!
There was a smaller improvement for the other two, larger structures. 7 of the 9 binary nearest neighbors of CHEBI:60990 were in its 19 nearest count neighbors (it was 6 for folding). 6 of the 9 binary nearest neighbors of CHEBI:95267 were in its 19 nearest count neighbors (it was 4 for folding).
I looked at the count fingerprints for the queries. Almost 90% of the feature counts are 1 (70%) or 2 (17%) so the count bounds do a good job of distinguishing between them by using “1000” or “1100”, resulting in an improved binary approximation to the multiset Tanimoto.
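The overlap counts are simple set intersections. As an illustration, this sketch uses the CHEBI numbers copied from the phenol results shown in this chapter: the 19 count neighbors from the RDKit search and the 9 count-simulation neighbors from simsearch, in both cases leaving out the query itself. The variable names are only for illustration.

```python
# 19 nearest count neighbors of CHEBI:15882 from the RDKit search.
count_neighbors = [
    17296, 17578, 28097, 49819, 3179, 5115, 5611, 30396, 30787, 48372,
    48498, 50526, 51470, 52245, 137811, 172874, 17987, 28902, 37866,
]
# 9 nearest binary neighbors of CHEBI:15882 using count simulation.
countsim_neighbors = [49819, 28902, 48285, 3179, 5115, 28097, 17296, 30396, 5611]

overlap = set(count_neighbors) & set(countsim_neighbors)
missing = set(countsim_neighbors) - set(count_neighbors)
print(f"{len(overlap)} of {len(countsim_neighbors)}; not found: {sorted(missing)}")
# 8 of 9; not found: [48285]
```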
RDKit has the option to use count simulation when generating binary Morgan fingerprints, which we can verify by asking rdkit2fps to generate that fingerprint for the phenol SMILES “c1ccccc1O” and comparing both the fingerprints themselves and their MD5 checksums.
% echo "c1ccccc1O CHEBI:15882" | rdkit2fps --morgan3 --countSimulation=1 --no-metadata
000000000000001000000000000000000000000000000000000000000000000033010000
000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000300000000000000000000000000000000000000003
000000000000000010000000000100000000000000000000000000000000000000000000
000000100000000000000000000000000000000000000000700000000000000000000100
000000000000000000000000000010000000000000000000001000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000
00000000 CHEBI:15882
% grep 'CHEBI:15882$' countsim_queries.fps
000000000000001000000000000000000000000000000000000000000000000033010000
000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000300000000000000000000000000000000000000003
000000000000000010000000000100000000000000000000000000000000000000000000
000000100000000000000000000000000000000000000000700000000000000000000100
000000000000000000000000000010000000000000000000001000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000
00000000 CHEBI:15882
% echo "c1ccccc1O CHEBI:15882" | rdkit2fps --morgan3 --countSimulation=1 \
--no-metadata | md5sum
f67dcc5f85f819cbf2037e71a9534ec4 -
% grep CHEBI:15882 countsim_queries.fps | md5sum
f67dcc5f85f819cbf2037e71a9534ec4 -
fpc2fps “superimpose” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert count fingerprints to binary using chemfp’s superimpose method, and see how the similarity search scores compare to count scores computed by RDKit. You will need the chebi_morgan3.fpc file created earlier.
How does superimpose work?¶
RDKit’s count simulation uses a unary encoding to map a value into a fixed-width bin. With the default “1,2,4,8” bound, the counts of 2 and 3 cannot be distinguished as both are encoded as “1100”. The same for the counts of 4, 5, 6, and 7, which are all encoded as “1110”.
Finer resolution is possible by increasing the number of bounds, but then the effective fingerprint size will decrease, causing more features to end up in the same dense count bin. Increasing the fingerprint size will restore accuracy but result in very sparse fingerprints.
We can use sparsity to our advantage. The ChEBI Morgan3 fingerprints have about 69 distinct features and an average count of 1.75, giving an average of about 120 possible binary descriptors, where each descriptor is a (feature id, count) pair. Why? If feature 9 is present 2 times then (9, 1) and (9, 2) are possible descriptors, because if feature 9 occurs twice then feature 9 must also occur at least once.
If each descriptor is consistently assigned to a randomly chosen bit then with 2048 bits the odds of two descriptors colliding is about 6%, which is pretty low.
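One way to arrive at a figure like that is to compute the chance that a given descriptor lands on a bit already used by at least one of the other descriptors, assuming each descriptor is assigned to a uniformly random bit independently of the others. This is a back-of-the-envelope sketch and not necessarily the calculation behind the number quoted above.

```python
num_bits = 2048
num_descriptors = 120  # about 69 features * 1.75 average count

# Probability that one given descriptor shares its bit with at least
# one of the other (num_descriptors - 1) descriptors.
p_collision = 1 - (1 - 1 / num_bits) ** (num_descriptors - 1)
print(f"{p_collision:.1%}")
```

Under these assumptions the result is a little under 6%.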
This is a form of randomized superimposed coding, whose roots lie in Calvin Mooers’ Zatocoding from the 1940s. In cheminformatics it’s best known as the heart of Daylight fingerprints. For each path, generate a hash value, use it to seed a random number generator, use the generator to get k values, and set the corresponding folded fingerprint bits to 1.
If k is constant then this is like a Bloom filter with the hash functions replaced by a random number generator. However, k does not need to be constant. In Zatocoding it was based on descriptor frequency and for Daylight fingerprints it was based on the path length (although most Daylight-like fingerprints, including RDKit’s, use a fixed k).
For chemfp’s “superimpose” method, k is the feature count, and a hypothetical Python implementation might look like:
def superimpose(count_fingerprint, num_bits):
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, feature_count in count_fingerprint:
        rng = RandomNumberGenerator(seed = feature_id)
        for i in range(feature_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
    return output_fp
In words, it uses the feature id as a seed for a random number generator, which is used to get “feature count” number of bit positions (some may be duplicates), all of which are set to 1.
The full implementation is a bit more complicated. Instead of setting
one bit per (feature id, specific count) pair (i.e., each descriptor)
it can set a fixed number of bits using --bits-per-count, and to
prevent very common features from saturating the fingerprint the count
can have an upper bound. The full method is more like:
def superimpose(count_fingerprint, num_bits, bits_per_count=1, max_count=2**32-1):
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, feature_count in count_fingerprint:
        rng = RandomNumberGenerator(seed = feature_id)
        for i in range(min(feature_count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
    return output_fp
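The hypothetical code above can be turned into something runnable by using Python's random.Random as a stand-in random number generator and a set of bit positions as the fingerprint. Both are assumptions made for this sketch. The generator chemfp actually uses is not described here, so the bit positions will not match chemfp's output.

```python
import random


def superimpose(counts, num_bits, bits_per_count=1, max_count=2**32 - 1):
    """Return the set of on-bit positions for a {feature id: count} dictionary."""
    on_bits = set()
    for feature_id, feature_count in counts.items():
        # Seeding with the feature id means the same feature always
        # produces the same sequence of bit positions.
        rng = random.Random(feature_id)
        for _ in range(min(feature_count, max_count) * bits_per_count):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


# A feature seen 2 times sets a subset of the bits set when it is seen 11 times.
small = superimpose({5: 2}, 2048)
large = superimpose({5: 11}, 2048)
print(small <= large)  # True
```

The subset relation is what lets a binary Tanimoto on the output approximate the count similarity: a shared feature with counts 2 and 11 contributes its first two bit positions to the intersection.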
The following example reads a sparse count fingerprint definition from stdin, then uses the superimpose method to generate a 256-bit fingerprint, written in FPS format to stdout.
% printf "2,5:11,93:3,220:31\tDEF\n" | \
chemfp fpc2fps -m superimpose --num-bits 256 --max-count 20
#FPS1
#num_bits=256
#type=superimpose/1 num_bits=256 max_count=20
0980000000000188000284088110240040808240000420046900405008000020 DEF
It’s too complicated to repeat the RNG calculations, but we can verify the number of on-bits is reasonable. The total number of descriptors is 1+11+3+min(31,20) = 35 so the output fingerprint will have at most 35 bits set. It actually has 32:
>>> from chemfp import bitops
>>> bitops.hex_popcount(
... "0980000000000188000284088110240040808240000420046900405008000020")
32
With 256 bits there’s a 1-in-256 chance that two randomly chosen bit positions will be the same. We can let Python select 35 values at random and see how many are distinct.
>>> import random
>>> len(set(random.choice(range(256)) for i in range(35)))
33
Do that 1,000 times then report the average and standard deviation:
>>> import statistics
>>> counts = [len(set(random.choice(range(256)) for i in range(35)))
...           for j in range(1000)]
>>> print("mean:", statistics.mean(counts), "stddev:", statistics.stdev(counts))
mean: 32.731 stddev: 1.372779488309584
32 distinct values is well within that range.
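The simulated mean can also be compared with a closed-form expectation. Assuming the 35 bit positions are chosen uniformly and independently, each of the 256 bits stays off with probability (1 - 1/256)^35, so the expected number of on-bits is 256 times one minus that value. This sketch only illustrates the formula:

```python
num_bits = 256
num_draws = 35

# A given bit stays off only if every one of the draws misses it.
p_bit_off = (1 - 1 / num_bits) ** num_draws
expected_on_bits = num_bits * (1 - p_bit_off)
print(round(expected_on_bits, 1))
```

The result is about 32.8, in line with the simulated mean.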
I tried generating the fingerprints with longer lengths. With a 512-bit fingerprint there were 31 on-bits. With a 2048-bit fingerprint there were 35 on-bits, which is the best we can expect.
Superimposed fingerprints in action¶
The following uses the superimpose method to convert the chebi_morgan3.fpc fingerprints to an FPS file (without a progress bar), extracts the fingerprints for CHEBI:15882, CHEBI:60990, and CHEBI:95267 to use as queries, then uses simsearch to find the k=10 nearest neighbors for each search, with the output written to stdout in tsv format. I have omitted specifying the “superimpose” method because that is the default method.
% chemfp fpc2fps chebi_morgan3.fpc -o chebi_morgan3_superimpose.fps --no-progress
% grep -E "CHEBI:(15882|60990|95267)$" chebi_morgan3_superimpose.fps > superimpose_queries.fps
% simsearch -k 10 --queries superimpose_queries.fps chebi_morgan3_superimpose.fps --out tsv
query_id target_id score
CHEBI:15882 CHEBI:15882 1.0000000
CHEBI:15882 CHEBI:52245 0.5714286
CHEBI:15882 CHEBI:3179 0.5172414
CHEBI:15882 CHEBI:5115 0.5172414
CHEBI:15882 CHEBI:17296 0.5172414
CHEBI:15882 CHEBI:28097 0.5172414
CHEBI:15882 CHEBI:5611 0.5172414
CHEBI:15882 CHEBI:30396 0.5172414
CHEBI:15882 CHEBI:30787 0.5172414
CHEBI:15882 CHEBI:49819 0.5172414
CHEBI:60990 CHEBI:60993 1.0000000
CHEBI:60990 CHEBI:60990 1.0000000
CHEBI:60990 CHEBI:60999 0.6521739
CHEBI:60990 CHEBI:30660 0.4601770
CHEBI:60990 CHEBI:18332 0.4601770
CHEBI:60990 CHEBI:30659 0.4601770
CHEBI:60990 CHEBI:28253 0.4409449
CHEBI:60990 CHEBI:15768 0.4315789
CHEBI:60990 CHEBI:60302 0.4310345
CHEBI:60990 CHEBI:80689 0.4274194
CHEBI:95267 CHEBI:95267 1.0000000
CHEBI:95267 CHEBI:194983 0.4022989
CHEBI:95267 CHEBI:125656 0.3762376
CHEBI:95267 CHEBI:123434 0.3738318
CHEBI:95267 CHEBI:92134 0.3577982
CHEBI:95267 CHEBI:105618 0.3545455
CHEBI:95267 CHEBI:195049 0.3483146
CHEBI:95267 CHEBI:121557 0.3418803
CHEBI:95267 CHEBI:120965 0.3368421
CHEBI:95267 CHEBI:92108 0.3362832
I compared these to the count Tanimoto scores that RDKit computed.
All 9 of the nearest binary neighbors of CHEBI:15882 (phenol) were in the nearest 15 count neighbors, which is better than the RDKit count simulation method, which only had 8 in the top 19, and of course better than the complete lack of overlap when using the folding method. In addition, 8 of the 9 binary fingerprints have the same Tanimoto score as the original count fingerprint Tanimoto.
This is why I prefer the multiset Tanimoto over the vector Tanimoto.
For the CHEBI:60990 query, all 9 nearest binary neighbors were also the 9 nearest count neighbors, though the order was not the same due to small differences in the score (eg, 0.4310 instead of 0.4167 for CHEBI:60302). By comparison, count simulation only had 7 of the nearest binary neighbors in the 19 nearest count neighbors.
The matches for the CHEBI:95267 query were not quite as good. All 9 nearest binary neighbors were in the nearest 19 count neighbors. The biggest score difference was with CHEBI:123434 with a binary Tanimoto of 0.3738 instead of a count Tanimoto of 0.3130. It’s still closer than the 6 of 9 for count simulation and 4 of 9 for folding.
I evaluated CHEBI:95267 with --bins-per-count 2. 8 of the 9
nearest binary neighbors were also in the nearest 9 count fingerprints
but the remaining one, CHEBI:116557, wasn’t one of the 19.
It will take time to figure out good superimpose parameters.
fpc2fps “scaled” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert count fingerprints to binary using chemfp’s “scaled” method. You will need the chebi_morgan3.fpc file created earlier.
How does scaled work?¶
The “scaled” method is a hybrid of the superimpose and RDKit count simulation methods. It uses a scale to convert each feature count to the repeat count to use. Like RDKit count simulation, this can be used to approximate a log-count transform, but instead of using a unary coding into a fixed-width bin like “rdkit-count-sim”, the transformed bits are superimposed across the entire fingerprint.
The scale is specified in --scale as a comma-separated list of
scaling terms mapping from a minimum count to a new repeat count. For
example, the string “1:1,4:2,16:3,64:4” specifies a table equivalent to:
def get_repeat_count(value: int) -> int:
    if value >= 64: return 4
    if value >= 16: return 3
    if value >= 4: return 2
    if value >= 1: return 1
    return 0
which computes int(log₄(min(count, 64))) + 1 when count is positive, otherwise 0. The repeat count is used as the input to the superimpose algorithm, like this:
def scaled(count_fingerprint, num_bits):
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in count_fingerprint:
        repeat = get_repeat_count(count)
        rng = RNG(feature_id)
        for _ in range(repeat):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
    return output_fp
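For readers who want to experiment, here is a runnable sketch of the two pieces above. It is not chemfp's implementation: parse_scale and scaled_bits are hypothetical helper names, and Python's random.Random stands in for chemfp's internal RNG, so the bit positions will not match what fpc2fps produces.

```python
import random

def parse_scale(scale):
    """Parse "1:1,4:2" into [(min_count, repeat), ...], largest min_count first."""
    terms = []
    for term in scale.split(","):
        min_count, repeat = term.split(":")
        terms.append((int(min_count), int(repeat)))
    return sorted(terms, reverse=True)

def get_repeat_count(table, count):
    """Return the repeat count for the largest min_count that is <= count."""
    for min_count, repeat in table:
        if count >= min_count:
            return repeat
    return 0

def scaled_bits(count_fingerprint, num_bits, table):
    """Superimpose `repeat` pseudo-random bits per feature (stand-in RNG)."""
    on_bits = set()
    for feature_id, count in count_fingerprint:
        rng = random.Random(feature_id)  # assumption: not chemfp's RNG
        for _ in range(get_repeat_count(table, count)):
            on_bits.add(rng.randrange(num_bits))
    return sorted(on_bits)

table = parse_scale("1:1,4:2,16:3,64:4")
print([get_repeat_count(table, c) for c in (0, 1, 3, 4, 15, 16, 63, 64, 1000)])
# prints [0, 1, 1, 2, 2, 3, 3, 4, 4]

# The feature ids and counts here are made up for illustration.
print(scaled_bits([(1234567, 1), (42, 20)], 256, table))
```

Seeding the generator with the feature id is what makes the same feature always map to the same bit positions, regardless of which molecule it came from.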
In addition, the scaled method supports a different scale for each feature id. This might be used to implement a form of TF-IDF (Term Frequency-Inverse Document Frequency). Contact me if you want to explore this option.
The per-feature-id scales are specified with --table as
“/”-separated terms where each term contains one or more
comma-separated feature ids, the string “->”, and a table definition.
For example, “123->1:1,10:2/456,789->1:1,8:2,64:3” uses the scale “1:1,10:2” for sparse feature bit 123 and the scale “1:1,8:2,64:3” for the sparse feature bits 456 and 789.
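As a sketch of how that syntax breaks down, the following Python parses a table string into a per-feature-id dictionary. The function name parse_table is hypothetical and this only mirrors the syntax as described here; it is not chemfp's own parser and does no error checking.

```python
def parse_table(table):
    """Return {feature_id: [(min_count, repeat), ...]} for a "/"-separated table."""
    scales = {}
    for term in table.split("/"):
        feature_ids, scale = term.split("->")
        pairs = [tuple(int(x) for x in item.split(":")) for item in scale.split(",")]
        # Every feature id listed before "->" shares the same scale.
        for feature_id in feature_ids.split(","):
            scales[int(feature_id)] = pairs
    return scales

scales = parse_table("123->1:1,10:2/456,789->1:1,8:2,64:3")
for feature_id, pairs in scales.items():
    print(feature_id, pairs)
# prints:
# 123 [(1, 1), (10, 2)]
# 456 [(1, 1), (8, 2), (64, 3)]
# 789 [(1, 1), (8, 2), (64, 3)]
```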
Scaled fingerprints in action¶
The following uses the scaled method to convert the chebi_morgan3.fpc fingerprints to an FPS file (without a progress bar), extracts the fingerprints for CHEBI:15882, CHEBI:60990, and CHEBI:95267 to use as queries, then uses simsearch to find the k=5 nearest neighbors for each search, with the output written to stdout in tsv format.
The scale is an approximation to log-base-1.5.
% chemfp fpc2fps chebi_morgan3.fpc -o chebi_morgan3_scaled.fps --no-progress \
-m scaled --scale "1:1,2:2,3:3,5:4,7:5,11:6,17:7,25:8,38:9,57:10,86:11,129:12,194:13,291:14"
% grep -E "CHEBI:(15882|60990|95267)$" chebi_morgan3_scaled.fps > scaled_queries.fps
% simsearch -k 5 --queries scaled_queries.fps chebi_morgan3_scaled.fps --out tsv
query_id target_id score
CHEBI:15882 CHEBI:15882 1.0000000
CHEBI:15882 CHEBI:52245 0.5555556
CHEBI:15882 CHEBI:28097 0.5000000
CHEBI:15882 CHEBI:17296 0.5000000
CHEBI:15882 CHEBI:49819 0.5000000
CHEBI:60990 CHEBI:60990 1.0000000
CHEBI:60990 CHEBI:60993 1.0000000
CHEBI:60990 CHEBI:60999 0.6869565
CHEBI:60990 CHEBI:15768 0.4761905
CHEBI:60990 CHEBI:28253 0.4695652
CHEBI:95267 CHEBI:95267 1.0000000
CHEBI:95267 CHEBI:194983 0.4000000
CHEBI:95267 CHEBI:125656 0.3510638
CHEBI:95267 CHEBI:92134 0.3333333
CHEBI:95267 CHEBI:123434 0.3300000
This isn’t expected to match the count Tanimoto scores, so I didn’t do that analysis.
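The thresholds in the --scale string used in this section appear to follow floor(1.5^k) for repeat counts k = 1 to 14. As a sketch under that assumption, this short Python snippet regenerates the same string, which may be a useful starting point for trying other log bases:

```python
# The threshold for repeat count k is floor(1.5**k). These powers of 1.5 are
# exactly representable as floats, so int() truncation is safe here.
terms = [f"{int(1.5 ** k)}:{k}" for k in range(1, 15)]
scale = ",".join(terms)
print(scale)
# prints 1:1,2:2,3:3,5:4,7:5,11:6,17:7,25:8,38:9,57:10,86:11,129:12,194:13,291:14
```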
fpc2fps “seq” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert dense count fingerprints to binary using chemfp’s “seq” method. This is one of the few places in the documentation where I will discuss dense count fingerprints.
How does seq work?¶
A dense count fingerprint contains feature ids 0, 1, 2, …, N-1 where N is at most a few hundred. For example, a fingerprint might be defined by descriptor counts, like the number of hydrogen bond acceptors, number of times a SMARTS substructure matches, or the number of aromatic bonds.
A given count can be unary encoded to B bits, for example, the values 0, 1, 2, and 3 can be encoded into 3 bits as “000”, “100”, “110”, and “111”, with “111” also used for larger counts.
If no count exceeds B then the binary Tanimoto of the unary encodings is the same as the count Tanimoto of the original fingerprints.
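A small check of that claim, as a sketch. It assumes “count Tanimoto” means the multiset form, sum(min)/sum(max); the helper names and the two dense count fingerprints are made up for illustration.

```python
from fractions import Fraction

def unary_bits(counts, bits_per_feature):
    """Set of on-bit positions for the unary encoding of a dense count list."""
    on_bits = set()
    for feature_id, count in enumerate(counts):
        start = feature_id * bits_per_feature
        on_bits.update(range(start, start + min(count, bits_per_feature)))
    return on_bits

def binary_tanimoto(a, b):
    return Fraction(len(a & b), len(a | b))

def multiset_tanimoto(x, y):
    return Fraction(sum(map(min, x, y)), sum(map(max, x, y)))

# Made-up dense count fingerprints; every count is <= 3.
fp1 = [0, 1, 2, 3]
fp2 = [1, 1, 3, 0]
print(multiset_tanimoto(fp1, fp2))                               # prints 3/8
print(binary_tanimoto(unary_bits(fp1, 3), unary_bits(fp2, 3)))   # prints 3/8
```

The two agree because the unary prefixes for a feature overlap in min(a, b) bits and together cover max(a, b) bits.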
Each feature id might use a different number of bits, based on the maximum useful count for each feature. The following uses 8 bits for each of 5 features:
% printf "0:5,1:3,2:0,4:10\tXYZ\n" | chemfp fpc2fps -m seq --sizes 8,8,8,8,8
#FPS1
#num_bits=40
#type=seq/1 num_bits=40 sizes=8,8,8,8,8
1f070000ff XYZ
The input contains 4 features. Feature id 0 occurs 5 times, which is unary encoded into 8 bits as the little-endian “11111000” or the big-endian “00011111”, which corresponds to the hex value “1f”.
Feature id 1 occurs 3 times, which is unary encoded into 8 bits as the little-endian “11100000” or the big-endian “00000111”, which corresponds to the hex value “07”.
Feature id 2 occurs 0 times, which is unary encoded into 8 bits as “00000000”, which is the hex value “00” in either endian.
Feature id 3 doesn’t exist, which corresponds to the hex value “00”.
Feature id 4 occurs 10 times, which is unary encoded into 8 bits as “11111111” because 10 fully saturates the 8 available bits. This corresponds to the hex value “ff”.
All together this gives the binary fingerprint “1f070000ff”, as shown.
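The walk-through above can be reproduced with a few lines of Python. This is only a sketch of the encoding as described, with a hypothetical function name (seq_encode), not chemfp's code:

```python
def seq_encode(counts, sizes):
    """Unary-encode sparse {feature_id: count} into hex, given per-feature sizes."""
    num_bits = sum(sizes)
    fp = bytearray((num_bits + 7) // 8)
    start = 0
    for feature_id, size in enumerate(sizes):
        # Missing features count as 0; large counts saturate at `size` bits.
        repeat = min(counts.get(feature_id, 0), size)
        for bitno in range(start, start + repeat):
            # Bit 0 of the fingerprint is the least-significant bit of byte 0.
            fp[bitno // 8] |= 1 << (bitno % 8)
        start += size
    return fp.hex()

print(seq_encode({0: 5, 1: 3, 2: 0, 4: 10}, [8, 8, 8, 8, 8]))
# prints 1f070000ff
```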
fpc2fps “scaled-seq” method¶
In this section you’ll learn how to use chemfp fpc2fps to convert dense count fingerprints to binary using chemfp’s “scaled-seq” method. This is one of the few places in the documentation where I will discuss dense count fingerprints.
How does scaled-seq work?¶
A dense count fingerprint contains feature ids 0, 1, 2, …, N-1 where N is at most a few hundred. For example, a fingerprint might be defined by descriptor counts, like the number of hydrogen bond acceptors, number of times a SMARTS substructure matches, or the number of aromatic bonds.
The “seq” method uses a unary encoding for the counts, up to the
number of available bits. The “scaled-seq” method uses a --table
to define a scale for each feature id.
For example, the following uses the scale “1:1,2:6” for feature id 0, “1:1” (which is a simple binary exists/does not exist) for feature ids 1 and 2, and the scale “1:1,2:4,9:6,20:8” for feature ids 3 and 4.
% printf "0:5,1:3,2:0,4:10\tXYZ\n" | chemfp fpc2fps -m scaled-seq \
--table "0->1:1,2:6/1,2->1:1/3,4->1:1,2:4,9:6,20:8"
#FPS1
#num_bits=24
#type=scaled-seq/1 num_bits=24 table=0->1:1,2:6/1,2->1:1/3,4->1:1,2:4,9:6,20:8
7f003f XYZ
The number of bits for a feature is the largest repeat count for its
corresponding scale, in this case, [6, 1, 1, 8, 8] giving a
fingerprint with 24 bits. (Use --num-bits to specify a larger
size.)
For feature id 0, the count 5 is scaled to 6 because 5 >= 2, and the scale’s count for 2 is 6. This is the little endian value “111111”.
For feature id 1 the count 3 is scaled to 1 because 3 >= 1, and the scale’s count for 1 is 1. This is the little endian value “1”.
For feature id 2 the count 0 is scaled to 0 because it’s not on the scale. (Scale counts must be at least 1.) This is the little endian value “0”.
For feature id 3, which is not present, the table specifies 8 bits, so it sets the bit pattern “00000000”.
For feature id 4, the count 10 is scaled to 6 because 10 >= 9 and the scale’s count for 9 is 6. This is the little endian value “11111100”.
The little-endian fingerprint is “11111110 00000000 11111100” where I’ve added spaces for the 8-bit boundaries. Chemfp uses big-endian bits in a byte and little-endian byte order, giving “01111111 00000000 00111111”, or “7f 00 3f” in hex, matching what was reported.
If you have a better idea for the table encoding scheme, let me know.
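The scaled-seq walk-through above can be sketched in Python as follows. The function names are hypothetical, the sketch assumes the table defines every feature id from 0 to N-1, and it is not chemfp's implementation.

```python
def parse_table(table):
    """Return {feature_id: [(min_count, repeat), ...]} for a "/"-separated table."""
    scales = {}
    for term in table.split("/"):
        feature_ids, scale = term.split("->")
        pairs = [tuple(int(x) for x in item.split(":")) for item in scale.split(",")]
        for feature_id in feature_ids.split(","):
            scales[int(feature_id)] = pairs
    return scales

def repeat_count(scale, count):
    """Repeat count for the largest min_count <= count, else 0."""
    for min_count, repeat in sorted(scale, reverse=True):
        if count >= min_count:
            return repeat
    return 0

def scaled_seq_encode(counts, table):
    scales = parse_table(table)
    bits = []
    for feature_id in range(max(scales) + 1):
        scale = scales[feature_id]  # assumes every feature id is in the table
        size = max(repeat for _, repeat in scale)
        repeat = repeat_count(scale, counts.get(feature_id, 0))
        bits.extend([1] * repeat + [0] * (size - repeat))
    fp = bytearray((len(bits) + 7) // 8)
    for bitno, bit in enumerate(bits):
        if bit:
            fp[bitno // 8] |= 1 << (bitno % 8)
    return fp.hex()

table = "0->1:1,2:6/1,2->1:1/3,4->1:1,2:4,9:6,20:8"
print(scaled_seq_encode({0: 5, 1: 3, 2: 0, 4: 10}, table))
# prints 7f003f
```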
Convert binary fingerprints to count¶
In this section you’ll learn how to convert binary fingerprints to count fingerprints using the chemfp fps2fpc command-line tool.
Chemfp stores binary fingerprints as a dense sequence of bits. Usually most are off-bits and only a few are on-bits. They can be represented as a sparse count fingerprint in FPC format as a list of bit positions for the on-bits.
The following uses fps2fpc to read a binary fingerprint record in FPS format from stdin and convert it to a sparse count fingerprint in FPC format, written to stdout:
% printf "0025ea\tID1\n" | chemfp fps2fpc
#FPC1
#type=fps2fpc/1
8,10,13,17,19,21,22,23 ID1
The first byte in hex is “00”, where all of the bits are off.
The second byte is “25” which has the big-endian bit pattern “0010 0101” (the space marks the 4-bit/nibble boundary). Because it’s big-endian, the bit positions are 7, 6, 5, 4, 3, 2, 1, 0, so the on-bits are at positions 5, 2, and 0. The second byte starts at bit 8, so adding that offset and switching to little-endian order gives positions 8, 10, and 13.
The third byte is “ea”, which has the big-endian pattern “1110 1010” so the on-bit positions are 7, 6, 5, 3, and 1. The third byte starts at bit 16 so adding that offset and switching to little-endian order gives positions 17, 19, 21, 22, 23.
Putting those together gives the list of on-bits as 8, 10, 13, 17, 19, 21, 22, and 23, which matches the fps2fpc output.
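The same byte-by-byte reasoning can be written as a short Python sketch (hex_to_on_bits is a hypothetical helper name, not part of chemfp):

```python
def hex_to_on_bits(hex_fp):
    """List the on-bit positions of a hex-encoded FPS fingerprint."""
    on_bits = []
    for byteno, byte in enumerate(bytes.fromhex(hex_fp)):
        for bit in range(8):
            # Bit 0 is the least-significant bit of each byte.
            if byte & (1 << bit):
                on_bits.append(byteno * 8 + bit)
    return on_bits

print(",".join(map(str, hex_to_on_bits("0025ea"))))
# prints 8,10,13,17,19,21,22,23
```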
The fps2fpc command can also read from binary fingerprint files. The following uses the Morgan2 fingerprints from ChEMBL, converted into FPB format.
% chemfp fps2fpc chembl_34.fpb | head
#FPC1
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1 | fps2fpc/1
#software=RDKit/2022.09.4
1264 CHEMBL17564
775 CHEMBL4300465
1952 CHEMBL4302507
497 CHEMBL1796997
426 CHEMBL4594411
1932 CHEMBL4597517
790 CHEMBL1098659
By default the fingerprints in an FPB file are sorted by the number of on-bits (also described as being sorted by population count). These structures set only one bit.
In case you are wondering, there are 45 fingerprints with only one feature. The 46th fingerprint is the first one that sets two bits:
% chemfp fps2fpc chembl_34.fpb | grep -v '^#' | grep -n , | head -1
46:452,1866 CHEMBL69710
Use --output or -o to write the count fingerprints to a named
file. By default this will show a progress bar, though the progress
through an FPB file isn’t as informative as through an FPS file. After
processing I’ll show the record for the last fingerprint, which has
the largest number of on-bits, and I’ll format it to be more presentable
(convert tabs to spaces, add a space after the commas, fold to last
space at or before column 75, then remove the space after the comma).
% chemfp fps2fpc chembl_34.fpb -o chembl_34_binary.fpc
chembl_34.fpb: 2409270 its [00:31, 76660.25 its/s]
% tail -1 chembl_34_binary.fpc | expand | sed 's/,/, /g' | fold -w75 -s | sed 's/, /,/g'
1,5,19,29,36,41,57,77,80,87,105,115,117,119,140,147,157,
188,191,193,194,196,197,203,220,227,231,235,242,253,283,294,
301,310,314,328,334,335,340,348,355,357,362,363,364,369,378,
382,383,387,389,407,431,435,446,458,460,464,474,507,512,554,
573,575,590,591,592,597,613,623,625,634,650,658,667,670,675,
684,695,708,713,714,719,724,725,733,736,739,745,763,764,782,
789,798,803,806,807,811,828,832,840,841,843,846,848,871,875,
878,881,886,890,894,895,917,919,926,931,935,943,968,977,979,
981,983,984,998,1004,1006,1009,1015,1019,1028,1034,1048,1052,
1057,1077,1085,1088,1104,1110,1113,1114,1118,1126,1141,1145,
1152,1163,1171,1182,1184,1190,1199,1212,1238,1240,1257,1263,
1283,1287,1289,1290,1292,1312,1322,1325,1345,1349,1351,1357,
1362,1365,1373,1380,1393,1407,1451,1452,1460,1479,1480,1504,
1506,1514,1516,1517,1520,1526,1527,1530,1531,1543,1544,1564,
1570,1573,1581,1589,1607,1618,1620,1624,1636,1640,1664,1689,
1712,1718,1719,1723,1736,1737,1738,1746,1750,1753,1754,1758,
1760,1769,1773,1783,1785,1788,1816,1822,1828,1832,1847,1873,
1876,1879,1888,1898,1911,1914,1916,1917,1937,1950,1956,1960,
2000,2015,2016,2018,2020,2042 CHEMBL5219064