Working with the chemfp command-line tools¶

The sections in this chapter show examples of using the chemfp command-line tools to generate and work with fingerprint files.

Generate fingerprint files from PubChem SD tags¶

In this section you’ll learn how to create a fingerprint file from an SD file which contains pre-computed CACTVS fingerprints. You do not need a chemistry toolkit for this section.

These files will be re-used in many parts of the documentation.

PubChem is a great resource of publically available chemistry information. The data is available for download (and ftp download). We’ll use some of their SD formatted files. Each record has a PubChem/CACTVS fingerprint field, which we’ll extract to generate an FPS file.

Start by downloading the files Compound_099000001_099500000.sdf.gz (from https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_099000001_099500000.sdf.gz ) and Compound_048500001_049000000.sdf.gz (from https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_048500001_049000000.sdf.gzs ). At the time of writing in June 2024 they contain 10,969 and 14,753 records, respectively. (I chose some of the smallest files so they would be easier to open and review.)

Next, convert the files into fingerprint files. On the command line do the following two commands:

sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_048500001_049000000.sdf.gz -o pubchem_targets.fps

You’ll see a progress bar for each command, which looks like:

Compound_099000001_099500000.sdf.gz: 100%|█████████████| 6.99M/6.99M [00:00<00:00, 47.4Mbytes/s]

Add the --no-progress option to turn off the progess bar, as in:

sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o \
     pubchem_queries.fps  --no-progress

Congratulations, that was it!

If you’re curious about what an FPS file looks like, here are the first 10 lines of pubchem_queries.fps, with some of the lengthy fingerprint lines replaced with an ellipsis:

#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#date=2024-06-19T08:14:14+00:00
07de0d00000000 ... 10200000000000000000     99000039
07de1c00020000 ... 00000000000000000000     99000230
074e1c00000000 ... 00000000000000000000     99001517
07de0c00000000 ... 00000400000000000000     99002251

How does this work? Each PubChem record contains the precomputed CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. Here’s what it looks like for record 99000039, which is the first record in Compound_099000001_099500000.sdf.gz:

> <PUBCHEM_CACTVS_SUBSKEYS>
AAADceB7sAAAAAAAAAAAAAAAAAAAAAAAAAA8YIAABYAAAACx9AAAHgAQAA
AADCjBngQ8wPLIEACoAzV3VACCgCA1AiAI2KG4ZNgIYPrA1fGUJYhglgDI
yccci4COAAAAAAQCAAAAAAAACAQAAAAAAAAAAA==

The --pubchem flag tells sdf2fps to get the value of that tag and decode it to get the fingerprint. It also adds a few metadata fields to the fingerprint file header.

The order of the FPS fingerprints are the same as the order of the corresponding record in the SDF. You can see that in the output, where 99000039 is the first record in the FPS fingerprints.

If you store records in an SD file then you almost certainly don’t use the same fingerprint encoding as PubChem. sdf2fps can decode from a number of encodings, like hex and base64. Use sdf2fps --help to see the list of available decoders.

The example uses -o to have sdf2fps write the output to a file instead of to stdout. By default, filenames ending in “.fps” are saved in FPS format. Use “.fps.gz” for the gzip-compressed FPS format and “.fps.zst” for the zstandard-compressed FPS format.

Filenames ending with “.fpb” are saved in FPB format. This is a binary format which is significantly faster to load.

Generate fingerprint files from CSV files¶

Chemfp supports CSV files on the command-line (with chemfp csv2fps) and in the Python API (with functions in the chemfp.csv_readers module and in the toolkit API).

For examples of using the CSV features see the chemfp 4.1 release notes. The relevant sections are:

csv background
csv2fps command (including “sniffing”)
chemfp.csv_readers module
New toolkit wrapper functions to read CSV files

k-nearest neighbor search¶

In this section you’ll learn how to search a fingerprint file to find the k-nearest neighbors. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit. (See Similarity search of ChEMBL by id for an example of k-nearest search using the chemfp.simsearch() API function.)

We’ll use the pubchem_queries.fps as the queries for a k=2 nearest neighor similarity search of the target file puchem_targets.gps:

simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps

That’s all! You should get output which starts:

#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2     99000039        48503376        0.878453        48503380        0.872928
2     99000230        48563034        0.858824        48731730        0.852273
2     99001517        48675145        0.569620        48662654        0.554217
2     99002251        48798046        0.810976        48625236        0.810651
2     99003537        48997075        0.903553        48997697        0.898477
2     99003538        48997075        0.903553        48997697        0.898477

Here’s how to interpret the output. The lines starting with ‘#’ are header lines. They contain metadata information describing the similarity search. You can see the search parameters, the name of the tool which did the search, and the filenames which went into the search.

After the ‘#’ header lines come the search results, with one query result per line. There are in the same order as the query fingerprints. Each result line contains tab-delimited columns. The first column is the number of hits. The second column is the query identifier used. The remaining columns contain the hit data, with alternating target id and its score.

For example, the first result line contains the 2 hits for the query 99000039. The first hit is the target id 48503376 with score 0.878453 and the second hit is 48503380 with score 0.872928. Since this is a k-nearest neighor search, the hits are sorted by score, starting with the highest score. Do be aware that ties are broken arbitrarily. There may be additional hits with the score 0.8728 which are not reported.

simsearch CSV output¶

In this section you’ll learn how to change the simsearch output format to CSV or TSV. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

The default simsearch output format is unique to chemfp, and therefore not so easy for other tools to parse directly. Use the --out option to change the output format to “csv” or “tsv” formats:

% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out csv | head -8
query_id,target_id,score
99000039,48503376,0.878453
99000039,48503380,0.872928
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,48675145,0.569620
99001517,48662654,0.554217
99002251,48798046,0.810976
% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out tsv | head -8
query_id      target_id       score
99000039      48503376        0.878453
99000039      48503380        0.872928
99000230      48563034        0.858824
99000230      48731730        0.852273
99001517      48675145        0.569620
99001517      48662654        0.554217
99002251      48798046        0.810976

These alternatives have one line for each hit, and no metadata.

NOTE: There are many variants of CSV and TSV output, especially with how to handle spaces and embedded commas and tabs. Chemfp uses the “excel” and “excel-tab” formats in Python’s csv module.

Threshold search¶

In this section you’ll learn how to search a fingerprint file to find all of the neighbors at or above a given threshold. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit. (See Similarity search of ChEMBL using a SMILES for an example of threshold search using the chemfp.simsearch() API function.)

Let’s do a threshold search and find all hits which are at least 0.85 similar to the queries:

simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps

The first 15 lines of output from this are:

#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.85
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
4     99000039        48732162        0.859551        48503380        0.872928        48503376        0.878453        48520532        0.854054
2     99000230        48563034        0.858824        48731730        0.852273
0     99001517
0     99002251
4     99003537        48566113        0.872449        48998000        0.853535        48997697        0.898477        48997075        0.903553
4     99003538        48566113        0.872449        48998000        0.853535        48997697        0.898477        48997075        0.903553
0     99005028

Take a look at the first result line, which contains the 4 hits for the query id 99000039. As before, the hit information alternates between the target ids and the target scores, but unlike the k-nearest search, the hits are not in a particular order. You can see that here where the scores are 0.859551, 0.872928, 0.878453, and 0.854054.

You might be wondering why I chose the 0.85 threshold. Quite simply, it was for presentation. With a threshold of 0.8, the first record has 43 hits, which requires rather an overwhelming 87 columns to show.

simsearch CSV output when no hits¶

In this section you’ll learn how to change the simsearch csv output behavior when a query has no hits. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

The CSV output format writes one output line for each query hit. What happens if a query has no hits? The previous section shows queries 99001517 and 99002251 have no hits with a threshold of 0.85, so let’s see what happens:

% simsearch --threshold 0.85 -q pubchem_queries.fps \
            pubchem_targets.fps --out csv | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,*,NaN
99002251,*,NaN
99003537,48566113,0.872449

You can see the queries with no hits get a synthetic output, by default with the target id “*” and the score of “NaN”. This makes it possible to identify queries with no hits.

Use --empty-target-id and --empty-score to change these:

% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps  \
    --out csv --empty-target-id MISSING --empty-score N/A | head
query_id,target_id,score
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,MISSING,N/A
99002251,MISSING,N/A
99003537,48566113,0.872449

Alternatively, use --no-include-empty to not generate an output line when there are not hits:

% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps  \
     --out csv --no-include-empty | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99003537,48566113,0.872449
99003537,48998000,0.853535
99003537,48997697,0.898477

Combined k-nearest and threshold search¶

In this section you’ll learn how to search a fingerprint file to find the k-nearest neighbors, where all of the hits must be at or above given threshold. You will need the fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

You can combine the -k and --threshold queries to find the k-nearest neighbors which are all at or above a given threshold:

This finds the nearest 2 structures, which all must be at least 0.7 similar to the query fingerprint. The output from the above starts:

#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.7
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2     99000039        48503376        0.878453        48503380        0.872928
2     99000230        48563034        0.858824        48731730        0.852273
0     99001517
2     99002251        48798046        0.810976        48625236        0.810651
2     99003537        48997075        0.903553        48997697        0.898477
2     99003538        48997075        0.903553        48997697        0.898477
2     99005028        48651160        0.828829        48848576        0.816667
2     99005031        48651160        0.828829        48848576        0.816667
2     99006292        48945841        0.965217        48737522        0.879310
2     99006293        48945841        0.965217        48737522        0.879310
0     99006597
2     99006753        48655580        0.931034        48662591        0.924855

Because this is a k-nearest search, the hits are sorted from highest score to lowest.

Saving simsearch results to “npz” format¶

In this section you’ll learn how to save the simsearch results to SciPy’s “npz” sparse binary format, instead of the default text-based output or the text-based CSV/TSV output.

As you saw earlier, the default simsearch output is in the text-oriented “simsearch” format, which looks like:

#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.85
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
4     99000039        48732162        0.859551        48503380        0.872928        48503376        0.878453        48520532        0.854054
2     99000230        48563034        0.858824        48731730        0.852273
0     99001517
0     99002251

This is a sparse matrix representation, because most values are below the threshold and therefore do not need to be included. The query id 99000039 only has 4 matches, even though pubchem_targets.fps has 14,753 fingerprints, because the remaining 14,749 fingerprints have a score of under 0.85.

While it’s easy to write a parser for this format, it’s even easier to use an existing parser, like the CSV or TSV output option.

However, that’s still text-oriented, which takes a while to parse, and it loses some precision (but not accuracy) because it tries to minimize the number of digits needed to distinguish between two different realizable scores.

Another option is to save the result in “npz” format, following the SciPy conventions for storing a sparse array. This is as simple as using a filename with the “.npz” extension:

% simsearch --threshold 0.85 -q pubchem_queries.fps \
      pubchem_targets.fps  -o pubchem_matches.npz
queries: 100%|███████████| 10969/10969 [00:00<00:00, 42721.52 fps/s]

This can be loaded into a SearchResults using chemfp.search.load_npz():

>>> import chemfp.search
>>> matches = chemfp.search.load_npz("pubchem_matches.npz")
>>> matches
SearchResults(#queries=10969, #targets=14753)
>>> matches.query_ids[:3]
array(['99116624', '99116625', '99116667'], dtype='<U8')
>>> matches.target_ids[:3]
array(['48942244', '48941399', '48940284'], dtype='<U8')
>>> matches.shape
(10969, 14753)
>>> matches[1048]
SearchResult(#hits=18)
>>> matches[1048].query_id
'99144558'
>>> matches[1048].target_ids
array(['48942244', '48941399', '48940284', ..., '48838025', '48730854',
       '48985180'], dtype='<U8')

which in turn can be converted into a SciPy “csr” (compressed sparse row matrix):

>>> matches.to_csr()
<10969x14753 sparse matrix of type '<class 'numpy.float64'>'
        with 69823 stored elements in Compressed Sparse Row format>

Alternatively, you can use SciPy to load the csr array directly:

>>> import scipy.sparse
>>> csr = scipy.sparse.load_npz("pubchem_matches.npz")
>>> csr
<10969x14753 sparse matrix of type '<class 'numpy.float64'>'
      with 69823 stored elements in Compressed Sparse Row format>

This SciPy option doesn’t give you access to the query or target ids, or the metadata about the query. Chemfp saves that information to the npz file, which when you get down to it is a zip file containing NumPy “npy” file with names that follow a common convention, which means it could (if you want to be really low-level) be accessed with the zipfile module:

>>> import zipfile
>>> z = zipfile.ZipFile("pubchem_matches.npz")
>>> for entry in z.infolist():
...  print(f"{entry.filename!r}, uncompressed size={entry.file_size}")
...
'chemfp.npy', uncompressed size=664
'indices.npy', uncompressed size=279420
'indptr.npy', uncompressed size=44008
'format.npy', uncompressed size=140
'shape.npy', uncompressed size=144
'data.npy', uncompressed size=558712
'query_ids.npy', uncompressed size=351136
'target_ids.npy', uncompressed size=472224

or with NumPy’s “load” function, which uses a dictionary-like interface for working with npz files:

>>> import numpy as np
>>> npz = np.load("pubchem_matches.npz")
>>> npz
NpzFile 'pubchem_matches.npz' with keys: chemfp, indices, indptr, format, shape...
>>> npz["target_ids"]
array(['48942244', '48941399', '48940284', ..., '48838025', '48730854',
       '48985180'], dtype='<U8')
>>> import pprint, json
>>> pprint.pprint(json.loads(str(npz["chemfp"])))
{'alpha': 1.0,
 'beta': 1.0,
 'format': 'multiple',
 'matrix_type': 'NxM',
 'method': 'Tversky',
 'num_bits': 881,
 'shape': [10969, 14753]}

NxN (self-similar) searches¶

In this section you’ll learn how to use the same fingerprints as both the queries and targets, that is, a self-similarity search. You will need the pubchem_queries.fps fingerprint file generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

Use the --NxN option if you want to use the same set of fingerprints as both the queries and targets. Using the pubchem_queries.fps from the previous sections:

simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps

This code is very fast because there are so few fingerprints. For larger files the --NxN will be about twice as fast and use half as much memory compared to:

simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps

In addition, the --NxN option excludes matching a fingerprint to itself (the diagonal term).

Using a toolkit to process the ChEBI dataset¶

In this section you’ll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, RDKit and CDK. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.)

We’ll work with data from ChEBI, which are “Chemical Entities of Biological Interest”. They distribute their structures in several formats, including as an SD file. For this section, download the “lite” version from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 226 the file contains 187,935 records and the compressed file is 65M.

ChEBI record titles don’t contain the id¶

Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 101,902 of the title lines are empty, 39,459 have the title “null”, 4,236 have the title ” ” (with a single space), 1,940 have the title “ChEBI”, 56 of them are labeled “Structure #1”, and the others are usually compound names like “fluoresone” and “bkas#30-CoA(4-)”, or identifiers like “Compound 92” or “145453870”.

For more than a decade I’ve asked ChEBI to fix this, and include the identifer in the title line, to no success. Perhaps you have more influence?

Instead, the record id is stored as value of the “ChEBI ID” tag, which looks like:

> <ChEBI ID>
CHEBI:90

By default the toolkit-based fingerprint generation tools use the title as the identifier, and print a warning and skip the record if the identifier is missing. Here’s an example with rdkit2fps:

% rdkit2fps ChEBI_lite.sdf.gz
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
... keeps on going ...
Error: Empty title in SD record after cleanup, file 'ChEBI_lite.sdf.gz', line 2261, record #34: first line is ' '. Skipping.
... keeps on going ...

These “Missing title” messages come from records where no identifier could be found and the “after cleanup” messages come after chemfp removes leading and trailing whitespace. The default is to report the problem to stderr, skip processing the record, and continue on to the next record. Use the --errors option to change the default behavior.

(If the first 100 records have no identifiers then the command-line tools will exit even if --errors is ignore. This is a safety mechanism. Let me know if it’s a problem.)

The `--id-tag` option¶

If the identifier isn’t in the title line but is in one of the SD data items, then use --id-tag option to specify of the name of the data tag containing the id. For this data set you’ll need to write it as:

The quotes are important because of the space in the tag name.

Here’s what that looks like:

% rdkit2fps ChEBI_lite.sdf.gz --id-tag "ChEBI ID" | head -8 | fold
#FPS1
#num_bits=2048
#type=RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7
#software=RDKit/2023.09.5 chemfp/4.2
#source=ChEBI_lite.sdf.gz
#date=2024-06-19T09:33:51+00:00
10208220141258c184490038b4124609db0030024a0765883c62c9e1288a1dc224de62f445743b8b
30ad542718468104d521a214227b29ba3822fbf20e15491802a051532cd10d902c39b02b51648981
9c87eb41142811026d510a890a711cb02f2090ddacd990c5240cc282090640103d0a0a8b460184f5
11114e2a8060200804529804532313bb03912d5e2857a6028960189e370100052c63474748a1c000
8079f49c484ca04c0d0bcb2c64b72401042a1f82002b097e852830e5898302021a1203e412064814
a598741c014e9210bc30ab180f0162029d4c446aa01c34850071e4ff037a60e732fd85014344f82a
344aa98398654481b003a84f201f518f      CHEBI:90
00000000080200412008000008000004000010100022008000400002000020100020006000800001
01000100080001000010000002002200000200000008000000400002100000000080000004401000
80200020800200002000001400022064000004244810000000000080000a80012002020004198002
00080200020020120040203001000802010100024211000004400000000100200003000001000100
0100021000a200601080002a00002020048004030000884084000008000002040200010800000000
2000010022000800002000020001400020800100025040000000200a080244000060008000000802
8100c801108000000041c00200800002      CHEBI:165

In addition to “ChEBI ID” there’s also a “ChEBI Name” tag which includes data values like “tropic acid” and “(+)-guaia-6,9-diene”. Every ChEBI record has a unique name so the names could also be used as the primary identifier instead of its id.

To use the ChEBI Name as the primary chemfp identifier, specify:

The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other bytes, so it’s no problem using those names directly.

Generate fingerprints with Open Babel¶

If you have the Open Babel Python library installed then you can use ob2fps to generate fingerprints. (See chemfp.ob2fps() for the equivalent Python function.)

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

This takes about 3.5 minutes on my 2020-era M1 MacBook Pro laptop to process all of the records, and generates messages like:

==============================
*** Open Babel Warning  in Expand
  Alias R was not chemically interpreted

==============================
*** Open Babel Warning  in ReadMolecule
  WARNING: Problem interpreting the valence field of an atom
The valence field specifies a valence 3 that is
less than the observed explicit valence 4.

==============================
*** Open Babel Warning  in ReadMolecule
  Failed to kekulize aromatic bonds in MOL file

==============================
*** Open Babel Warning  in ReadMolecule
  Invalid line: M RGP must only refer to pseudoatoms
M  RGP  2  12   1  15   2

The default generates FP2 fingerprints, so the above is the same as:

ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

ob2fps can generate several other types of fingerprints. (Use ob2fps --help for a list.) For example, to generate the Open Babel implementation of the MACCS definition specify:

ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

By default ob2fps shows a progress bar which looks like:

ChEBI_lite.sdf.gz: 90365 recs [02:21, 507.25 recs/s]

Use --no-progress to not use a progress bar.

The underlying Open Babel toolkit does not have a way to get the current location in the file, so the progress bar is only able to show the number of records processed, and not the percentage complete.

ob2fps has an alternative implementation which uses chemfp’s text toolkit to parse each record as a string, which is then passed to Open Babel. The “chemfp” implementation is able to get the current file position, letting ob2fps show a percentage complete progress bar. Use the -R option to set the “implementation” reader argument to “chemfp”, as in the following:

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi2.fps \
    -R implementation=chemfp

which shows:

ChEBI_lite.sdf.gz:  53%|█████▌        | 36.0M/67.8M [01:58<01:27, 362kbytes/s]

The chemfp implementation took few seconds longer than the native Open Babel implementation.

Generate fingerprints with OpenEye¶

If you have the OEChem Python library installed, with licenses for OEChem and OEGraphSim, then you can use oe2fps to generate fingerprints. (See chemfp.oe2fps() for the equivalent Python function.)

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps

This takes about 1 minute on my laptop and generates a number of warnings like “Stereochemistry corrected on atom number 17 of”, “Unsupported Sgroup information ignored”, and “Invalid stereochemistry specified for atom number 9 of”. Normally the record title comes after the “… of”, but the title is blank for most of the records.

As an historical note, old ChEBI releases include empty structure records, like CHEBI:147324. By default OEChem’s SDF reader skips empty structure records. If you really need those records, add the SuppressEmptyMolSkip flag to the default ‘flavor’ reader argument, like this:

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps \
    -R flavor=Default,SuppressEmptyMolSkip

The default settings generate OEGraphSim path fingerprint with the values:

numbits=4096 minbonds=0 maxbonds=5
   atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral

Each of these can be changed through command-line options. Use oe2fps --help for details.

oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition specify:

oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
     -o chebi_maccs.fps

Use oe2fps --help for a list of available oe2fps fingerprints or to see more configuration details.

By default oe2fps shows a progress bar which looks like:

ChEBI_lite.sdf.gz:  24%|██      | 13.0M/53.6M [00:14<00:50, 808kbytes/s]

Use --no-progress to not use a progress bar.

Generate fingerprints with RDKit¶

If you have the RDKit Python library installed then you can use rdkit2fps to generate fingerprints. (See chemfp.rdkit2fps() for the equivalent Python function.)

Based on the previous examples you probably guessed that the command-line is:

rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps

This takes about 4.4 minutes on my laptop, and RDKit did not generate fingerprints for 131 of the 187,935 structures. RDKit logs warning and error messages to stderr. They look like:

[14:06:39] WARNING: not removing hydrogen atom without neighbors
[14:06:39] Explicit valence for atom # 7 O, 3, is greater than permitted
[14:06:40] Explicit valence for atom # 0 He greater than permitted
[14:06:08]

****
Post-condition Violation
Element 'hv' not found
Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2021_09_4/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****

[14:06:08] Element 'hv' not found

For example, RDKit is careful to check that structures make chemical sense. It rejects 3-valent oxygens and refuses to process that those structures, which is the reason for the first line of that output.

The default generates RDKit’s path fingerprints with parameters:

minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1

Each of those can be changed through command-line options. See rdkit2fps rdkit2fps --help for details, where you’ll also see a list of the other available fingerprint types.

For example, to generate the RDKit implementation of the MACCS definition use:

rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
       -o chebi_maccs.fps

while the following generates the Morgan/circular fingerprint with radius 3:

rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz

By default rdkit2fps shows a progress bar which looks like:

ChEBI_lite.sdf.gz:  53%|████   | 35.8M/67.8M [02:18<02:04, 257kbytes/s]

Use --no-progress to not use a progress bar.

Generate fingerprints with CDK¶

If you have the CDK Java JAR file on your CLASSPATH and you have installed the JPype1 package (see the installation guide for help) then you can use cdk2fps to generate fingerprints. (See chemfp.cdk2fps() for the equivalent Python function.)

Based on the previous examples you probably guessed that the command-line is:

cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps

However, that would be incomplete. The CDK fingerprint type is “CDK-Daylight”, which generates Daylight-like fingerprints, and the documentation says that all hydrogens must be explicit.

This requires the -R option, to set the CDK reader argument “hydrogens” to “make-explicit”, which converts all implicit hydrogens to explicit, before generating the fingerprints:

cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
     -R hydrogens=make-explicit -o cdk_chebi.fps

(Use cdk2fps --help to see which fingerprint types need explicit hydrogens and which need implicit hydrogens.)

This takes about 3.4 minutes on my laptop, of which 1.6 seconds is spent in making explicit hydrogens (Without the reader argument it only takes 1.8 minutes), and CDK did not generate fingerprints for 14 of the 187,935 structures. CDK generated various warning and error messages, including:

org.openscience.cdk.io.iterator.IteratingSDFReader ERROR:
     Error while reading next molecule: Atom is not a member of this AtomContainer
org.openscience.cdk.config.IsotopeFactory ERROR:
     Could not find major isotope for: 88

Chemfp lets you choose an alternate error handler (see the next section), which can help you figure out which structures could not be processed. I’ll enable the report error handler:

cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps \
     --errors report

This generates 11 lines of the form:

Error: Cannot generate fingerprint: java.lang.NullPointerException,
  file 'ChEBI_lite.sdf.gz', record #8683. Skipping.

That means the record caused the CDK fingerprinter function to fail, by raising a Java NullPointerException, which chemfp catches and reports. For reference, that record is ChEBI ID CHEBI:5015. I reported it (in June 2024), and it was quickly fixed.

It’s a bit tricky to figure out that record #8683 is CHEBI:5015 because the record’s initial line number isn’t shown, in turn because it isn’t available from the CDK API. cdk2fps has an alternative implementation which uses chemfp’s text toolkit to parse each record as a string, which is then passed to the CDK. The “chemfp” implementation is able to report the current line number. It is enabled with the cdk.sdf.implementation value “chemfp”, like this:

cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi2.fps \
   --errors report -R hydrogens=make-explicit -R implementation=chemfp

This adds line number information to the report:

Error: Cannot generate fingerprint: java.lang.NullPointerException,
file 'ChEBI_lite.sdf.gz', line 556565, record #8683. Skipping.

This variant implementation took 3.17 minutes (1.7 minutes without making explicit hydrogens), so in this case was faster than CDK’s own SDF parser. It used to 50% slower (without adding hydrogens), so I don’t know what’s happened. In any case, let me know if you find it useful.

The default cdk2fps fingerprint type is CDK-Daylight with parameters:

size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0

Each of those can be changed through command-line options. See cdk2fps --help for details, where you’ll also see a list of the other available fingerprint types.

For example, to generate the CDK implementation of the MACCS definition use:

cdk2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
  -R hydrogens=make-explicit -o chebi_maccs.fps  --errors report

(CDK’s MACCS fingerprints also require explicit hydrogens.)

This generates 29 report lines of the form:

Error: Cannot generate fingerprint: java.lang.NullPointerException:
Aromaticity model requires implicit hydrogen count is set., file
'ChEBI_lite.sdf.gz', record #2935. Skipping.

By default cdk2fps shows a progress bar which looks like:

ChEBI_lite.sdf.gz:  25%|███         | 17.2M/67.8M [00:48<02:11, 384kbytes/s]

Use --no-progress to not use a progress bar.

Use structures as input to simsearch¶

In this section you’ll learn how to use structures as queries to simsearch queries instead of fingerprints. You’ll need a chemistry toolkit and a fingerprint file generated from that toolkit. This section assumes you have one of the chebi_maccs.fps ChEBI fingerprint files generated in the previous section. My example will use the RDKit-generated file created using:

rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
     -o chebi_maccs.fps

I’ll search the ChEBI data set for phenol with the SMILES “c1ccccc1O”. My chebi_maccs.fps starts:

% head -3 chebi_maccs.fps
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2

The type information gives a hint of how to generate a fingerprint query for that data set, which you can do manually using rdkit2fps:

% echo "c1ccccc1O phenol" | rdkit2fps --maccs166
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#date=2024-06-19T12:04:30
00000000000000000000000000000140004480101e    phenol

I could then use a pipe to pass the rdkit2fps output as input to simsearch:

% echo "c1ccccc1O phenol" | rdkit2fps --maccs166 | \
       simsearch chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1     phenol  CHEBI:15882     1.00000

That’s a bit clumsy, because I have to look at the fingerprint file and figure out which command-line tools and options to use.

A simpler way is to pass the structures directly to simsearch, either as a command-line option or from an input file. In the following I’ll pass in the SMILES using the `--query` parameter:

% simsearch chebi_maccs.fps --threshold 1.0 --query "c1ccccc1O phenol"
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1     phenol  CHEBI:15882     1.00000

How does it work? Chemfp opens the targets targets file to read the metadata section. It then uses the type string to figure out how to generate the fingerprints for that type, as well as figure out which toolkit to use for structure processing.

If the query structure is not a SMILES String then use --query-format to specify the format name. Use --query-id to specify the query id instead of using the id from the input record.

For example, the following uses the InChI for proline as the input, sets the query id to “proline”, and finds the two nearest neighbors in ChEBI:

% simsearch chebi_maccs.fps  --query-format inchi --query \
   'InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H2,(H,7,8)/t4-/m0/s1' \
   --query-id proline -k 2
#Simsearch/1
#num_bits=166
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
2     proline CHEBI:17203     1.00000 CHEBI:16313     1.00000

CHEBI:17203 is “L-proline” and CHEBI:16313 is “D-proline”.

The --query simsearch option takes the structure on the command-line. Use --queries to read queries from a file, which may be a fingerprint file or a structure file.

I’ll demonstrate with a SMILES file containing two records:

% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine

which I’ll use to search chebi_maccs.fps:

% simsearch --queries simple.smi chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#queries=simple.smi
#targets=chebi_maccs.fps
#query_source=simple.smi
#target_source=ChEBI_lite.sdf.gz
1     phenol  CHEBI:15882     1.00000
3     caffeine        CHEBI:27732     1.00000 CHEBI:178066    1.00000 CHEBI:190027    1.00000

See simsearch CSV output to learn how to have simsearch generate CSV or TSV output.

By default simsearch uses the filename to figure out the format type and compression. Use --query-format to specify a different format. For example, if neither --query nor --queries are specified then the default reads FPS queries from stdin. I’ll use --query-format sdf.gz to have it read gzip-compressed SD records from stdin, in this case from a PubChem file:

% cat Compound_099000001_099500000.sdf.gz | \
     chemfp simsearch --query-format sdf.gz chebi_maccs.fps -k 1 | head -10
#Simsearch/1
#num_bits=166
#type=Tanimoto k=1 threshold=0.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1     99000039        CHEBI:220837    0.87755
1     99000230        CHEBI:120636    0.84146
1     99001517        CHEBI:134851    0.74419
1     99002251        CHEBI:218232    0.79630

Make new fingerprints matching the type in an existing file¶

In this section you’ll learn how to generate fingerprints that match the fingerprint type of an existing file. You’ll need a chemistry toolkit and a fingerprint file generated from that toolkit. This section assumes you have one of the chebi_maccs.fps ChEBI fingerprint files generated in an earlier section.

In the previous section you learned how to use structures as input to simsearch; simsearch uses the target fingerprint file metadata to generate the appropriate fingerprints for each structure. What if you want to generate fingerprints appropriate for the target data set but don’t immediately want to use them for a search?

The --using FILENAME option to cdk2fps, oe2fps, rdkit2fps, and ob2fps opens the named fingerprint file to get the appropriate type. I’ll demonstrate with a simple SMILES file, where the fingerprint types comes from chebi_maccs.fps:

% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
% rdkit2fps simple.smi --using chebi_maccs.fps
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:10:51
00000000000000000000000000000140004480101e    phenol
000000003000000001d414d91323915380f138ea1f    caffeine

How do you know to use rdkit22fps instead of one of the other programs? You don’t, but all of the chemfp programs will try to process fingerprint types from other toolkits:

% ob2fps simple.smi --using chebi_maccs.fps
WARNING: ob2fps uses the openbabel toolkit but the --using file 'chebi_maccs.fps' specifies 'RDKit-MACCS166/2' which uses the rdkit toolkit.
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:11:12
00000000000000000000000000000140004480101e    phenol
000000003000000001d414d91323915380f138ea1f    caffeine

This cross-toolkit functionality is part of the long-term chemfp design.

If you know the chemfp fingerprint type string then you could pass that in on the command-line via the --type option, as in:

% ob2fps --type "OpenBabel-ECFP2 nBits=128" simple.smi
#FPS1
#num_bits=128
#type=OpenBabel-ECFP2/1 nBits=128
#software=OpenBabel/3.1.0 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:12:55
00080020000a00000c00000000000000      phenol
000c4448000000880404000221002040      caffeine

Alternate error handlers¶

In this section you’ll learn how to change the error handler for rdkit2fps using the --errors option.

By default the “<toolkit>2fps” programs “ignore” structures which could not be parsed into a molecule option. There are two other options. They can “report” more information about the failure case and keep on processing, or they can be “strict” and exit after reporting the error.

This is configured with the --errors option.

Here’s the rdkit2fps output using --errors report:

[12:21:03] WARNING: not removing hydrogen atom without neighbors
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Skipping.
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24338, record #381. Skipping.

The first two lines come from RDKit. The third line is from chemfp, reporting which record could not be parsed. (The record starts at line 24228 of the file.) The fourth line is another RDKit error message, and the last line is another chemfp error message.

Here’s the rdkit2fps output using --errors strict:

[12:24:24] WARNING: not removing hydrogen atom without neighbors
[12:24:24] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Exiting.

Because this is strict mode, processing exits at the first failure.

The ob2fps, cdk2fps and oe2fps tools implement the --errors option, but they aren’t as useful as rdkit2fps because the underlying APIs don’t give useful feedback to chemfp about which records failed. For example, the standard OEChem file reader automatically skips records that it cannot parse. Chemfp can’t report anything when it doesn’t know there was a failure.

The ob2fps and cdk2fps readers have an alternate SDF reader implementation, enabled when the “implementation” reader argument is “chemfp”, which uses chemfp’s own SDF record reader to read the records, then passes the result to the toolkit for further processing. The chemfp implementation supports somewhat better error reporting than the native toolkits. There is no equivalent for OEChem because chemfp’s record reader is so slow by comparison.

The default error handler in chemfp 1.1 was “strict”. In practice this proved more annoying than useful because most people want to skip the records which could not be processed. They would then contact me asking what was wrong, or doing some pre-processing to remove the failure cases.

One of the few times when it is useful is for records which contain no identifiers. When I changed the default from “strict” to “ignore” and tried to process ChEBI, I was confused at first about why the output file was so small. Then I realized that it’s because the many records without a title were skipped, and there was no feedback about skipping those records.

I changed the code so missing identifiers are always reported, even if the error setting is “ignore”. Missing identifiers will still stop processing if the error setting is “strict”.

chemfp’s two cross-toolkit substructure fingerprints¶

In this section you’ll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, RDKit, and CDK. (For OpenEye users, these fingerprints use the base OEChem library but do not use the separately licensed OEGraphSim library.)

chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other comes from the 881-bit PubChem/CACTVS substructure fingerprints.

The chemfp MACCS definition is called “rdmaccs” because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, while OpenEye has a completely independent implementation.)

Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits.

Open Babel:

% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenBabel/2
#software=OpenBabel/3.1.0 chemfp/4.2
#date=2024-06-19T12:18:16
00000000000000000000000000000140004480101e    phenol

OpenEye:

% echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenEye/2
#software=OEChem/2.3.0 (20191016) chemfp/3.5
#date=2021-01-27T14:46:03
00000000000000000000000000000140004480101e    phenol

RDKit:

% echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-RDKit/2
#software=RDKit/2021.09.4 chemfp/4.2
#date=2024-06-19T12:18:40
00000000000000000000000000000140004480101e    phenol

CDK:

% echo "c1ccccc1O phenol" | cdk2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-CDK/2
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:19:56
00000000000000000000000000000140004480101e    phenol

For more complex molecules it’s possible that different toolkits produce different fingerprint rdmaccs, even though the toolkits use the same SMARTS definitions. Each toolkit has a different understanding of chemistry. The most notable is the different definition of aromaticity, so the bit for “two or more aromatic rings” will be toolkit dependent.

There should be no reason to convert all hydrogens to explicit when using CDK’s rdmaccs implementation.

substruct fingerprints¶

chemp also includes a “substruct” substructure fingerprint. This is an 881 bit fingerprint derived from the PubChem/CACTVS substructure keys. They do not match the CACTVS fingerprints exactly, in part due to differences in ring perception. Some of the substruct bits will always be 0. With that caution in mind, if you want to try them out, use the --substruct option.

The term “substruct” is a horribly generic name. If you can think of a better one then let me know. Until chemfp 3.0 I said these fingerprints were “experimental”, in that I hadn’t fully validated them against PubChem/CACTVS and could not tell you the error rate. I still haven’t done that.

What’s changed is that I’ve found out over the years that people are using the substruct fingerprints, even without full validatation. That surprised me, but use is its own form of validation. I still would like to validate the fingerprints, but it’s slow, tedious work which I am not really interested in doing. Nor does it earn me any money. Plus, if the validation does lead to any changes, it’s easy to simply change the version number.

Here is an example using the CDK:

% echo "c1ccccc1O phenol" | cdk2fps --in smi --substruct | fold
#FPS1
#num_bits=881
#type=ChemFP-Substruct-CDK/1
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:22:43
010604000000000000000000000000000000000000000c0000000000000000800000005800001000
0010200109000c6001004000010004420000400000040400101100601011106444418848010e0024
03881019e00102000000000000000000000000000000000000000000000000        phenol

Change the “cdk2fps” as appropriate for the other toolkits.

The CDK “substruct” fingerprint should not require explicit hydrogens. If you are using CDK 2.9 or later then consider using the toolkit’s own PubChem fingerprinter, which should be a better fit to the CACTVS ring perception used in the PubChem fingerprints:

% echo "c1ccccc1O phenol" | cdk2fps --in smi --substruct | fold
#FPS1
#num_bits=881
#type=CDK-Pubchem/2.9
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:25:58
010604000000000000000000000000000000000000000c0000000000000000800000005800001000
0010200109000c6001004000010004420000400000040400101100601011106444418848010e0024
03881019e00102000000000000000000000000000000000000000000000000        phenol

For this simple phenol example, both give the same result.

(If you are using a CDK older than 2.9 then you will need to convert all hydrogens to explicit before generating --Pubchem fingerprints.)

Generate binary FPB files from a structure file¶

In this section you’ll learn how to generate an FPB file instead of an FPS file. You will need the ChEBI file from Using a toolkit to process the ChEBI dataset and a chemistry toolkit. The FPB format was introduced with chemfp-2.0.

Note

Several chemfp features, like creating FPB files, require a valid license key. If you are using chemfp under the Base License Agreement then contact sales@dalkescientific.com to purchase a license key or request an evaluation license.

The FPB format was designed so the fingerprints can be memory-mapped directly to chemfp’s internal data structures. This makes it very fast to load, but unlike the FPS format, it’s not so easy to write with your own code. You should think of the FPB format as an binary application format, for chemfp-based tools, while the FPS format is a text-based format for data exchange between diverse programs.

The easiest way to generate an FPB file from the command line is to use the “.fpb” extension instead of “.fps” or “.fps.gz”. Here are examples using each of the toolkits.

Open Babel:

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb

OpenEye:

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb

RDKit:

rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb

CDK:

cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fpb

The binary format isn’t human-readable. Use fpcat to see what’s inside:

% fpcat oe_chebi.fpb
#FPS1
#num_bits=4096
#type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.4.3 (20191016) chemfp/3.4
0000000 ... many zeros ...00000000000000      CHEBI:15378
0000000 ... many zeros ...00000000000000      CHEBI:16042
0000000 ... many zeros ...00000000000000      CHEBI:17792
           ....
182b528 ... many hex values ... a8c10c0c      CHEBI:60493

By default the fingerprints are ordered from smallest popcount to largest, which you can see in the output. A pre-ordered index is faster to search because the target popcounts are pre-computed and because it often reduces the search space.

If you want to preserve the input order then you’ll need to pipe the FPS output to fpcat and use its --preserve-order flag. See the next section for an example.

Convert between FPS and FPB formats¶

In this section you’ll learn how to convert an FPS file into an FPB file and back, and you’ll learn how to control the fingerprint ordering. You will need the FPS files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit. The FPB format was introduced with chemfp-2.0.

If you already have an FPS file then you can convert it directly into an FPB file, and without using a chemistry toolkit. The fpcat program converts from one format to the other.

In an earlier section I generated the files pubchem_queries.fps and pubchem_targets.fps . I’ll convert each to FPB format:

% fpcat pubchem_targets.fps -o pubchem_targets.fpb
% fpcat pubchem_queries.fps -o pubchem_queries.fpb

The FPB format is a binary format which is difficult to read directly. The easiest way to see what’s inside is to use fpcat. If you don’t specify an output filename then it sends the results to stdout in FPS format:

% fpcat pubchem_queries.fpb | head -5 | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
00028000e00000000000000000000000000000000000000000000000000000000000009840000000
0000c001000300000000000000000000000000000000000000000200000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000        99116624

The keen-eyed reader might have noticed that the conversion does not have a “source” or “date” field. I haven’t figured out if this is a bug. Should I keep the original date and structure file source, or use the current date and FPS file source? Let me know if this is important to you.

By default when fpcat generates an FPB file it reorders the fingerprints by population count and creates a popcount index. This improves the similarity search performance, but it means that the order of the FPB file is likely different than the original FPS format. You can get a sense of this by looking at the first fingerprint in the original pubchem_queries.fps file:

% grep -v # pubchem_queries.fps  | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000        99000039

and confirming that it isn’t the same as the first fingerprint in pubchem_queries.fpb.

If you want the FPB file to store the fingerprints in input order instead of the popcount order needed for optimized similarity search, then use the --preserve-order flag:

% fpcat pubchem_queries.fps --preserve-order -o input_order.fpb
% fpcat input_order.fpb | grep -v # | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000        99000039

On the flip side, fpcat by default preserves the input order when it creates FPS output. If you instead want to the output FPS file to be in popcount order then use the --reorder flag:

% fpcat --reorder pubchem_queries.fps | grep -v # | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000        99000039

Specify the fpcat output format¶

In this section you’ll learn how to specify the output format for fpcat using a command-line option instead of the filename extension. You will need the pubchem_queries.fpb file from Generate fingerprint files from PubChem SD tags.

If you do not specify an output filename then fpcat will output the fingerprints in FPS format to stdout. If you specify a filename then by default it will look at the extension to determine if the output should be an FPB (“.fpb”), FPS (“.fps”), or gzip or Zstandard compressed FPS (“.fps.gz” or “.fps.zst”) file. The FPS format is used for unrecognized extensions.

In a few rare cases you may want to use a format which doesn’t match the default. To be honest, the examples I can think of aren’t that realistic, but let’s suppose you want to output the contents of an FPB file to stdout in gzip’ed FPS format, and count the number of bytes in compressed output. I’ll use the use the –out flag to change the format to ‘fps.gz’ from the default of ‘fps’, then compare the resulting size with the uncompressed form:

% fpcat pubchem_queries.fpb --out fps | wc -c
2544890
% fpcat pubchem_queries.fpb --out fps.gz | wc -c
320828

It’s not that useful because you could pipe the uncompressed output to gzip, which is also likely faster:

% fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c
320819

In case you’re wondering, chemfp 3.4 added support for zstandard compression, if the “zstandard” Python module is available.

Chemfp cannot write an FPB file to stdout. In fact, the output file must be seek-able, which means it can’t be a named pipe either.

Alternate fingerprint file formats¶

In this section you’ll learn about chemfp’s support for other fingerprint file formats.

Chemfp started as a way to promote the FPS file format for fingerprint exchange. Chemfp 2.0 added the FPB format, which is a binary format designed around chemfp’s internal search data structure so it can be loaded quickly.

There are many other fingerprint formats. Perhaps the best known is the Open Babel FastSearch format. Two others are Dave Cosgrove’s flush format, and OpenEye’s “fpbin” format.

The chemfp_converters package contains utilities to convert between the chemfp formats and these other formats.

## Convert from/to Dave Cosgrove Flush format
% flush2fps drugs.flush
% fps2flush drugs.fps -o drugs.flush

## Convert from/to OpenEye's fpbin format
% fpbin2fps drugs.fpbin --moldb drugs.sdf
% fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin

## Convert from/to Open Babel's FastSearch format
% fs2fps drugs.fs --datafile drugs.sdf
% fps2fs drugs_openbabel_FP2.fps  --datafile drugs.sdf  -o drugs.fs

Of the three formats, the flush format is closest to the FPS data model. That is, it stores fingerprint records as an identifier and the fingerprint bytes. By comparison, the FastSearch and fpbin formats store the fingerprint bytes and an index into another file containing the structure and identifier. It’s impossible for chemfp to get the data it needs without reading both files.

Chemfp has special support for the flush format. If chemfp_converters is installed, chemfp will use it to read and write flush files nearly everywhere that it accepts FPS files. You can use it at the output to oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch, and as both input and output to fpcat. (You cannot use it as the simsearch targets because that code has been optimized for FPS and FPB search, and I haven’t spent the time to optimize flush file support.)

This means that if chemfp_converters is installed then you can use fpcat to convert between FPS, FPB, and and flush file formats. For examples:

% fpcat drugs.flush -o drugs.fps
% fpcat drugs.fps -o drugs.flush

In addition, you can use it at the API level in chemfp.open(), chemfp.load_fingerprints(), chemfp.open_fingerprint_writer(), and FingerprintArena.save().

Note that the flush format does not support the FPS metadata fields, like the fingerprint type, and it only support fingerprints which are a multiple of 32 bits long. Also, compressed flush files are not supported.

The FPB format¶

In this section you’ll learn about the FPB format.

The FPS format is a human-readable text format. It’s meant to be easy to create software to read or write FPS files so it can be used as a way to exchange fingerprint data between diffferent programs. The downside is it’s relatively slow to process. Chemfp can search about 1M 2048-bit FPS fingerprints in one second, or load about 250K 2048-bit fingerprints/second into memory in the same time.

The FPB format is a binary format which is much faster to load, though internally more complex. It can search 1M fingerprints in about 10 millisecond, and the load time mostly depends on the file system performance.

This makes a big difference in web development, where the web app might restart every time a file is edited, and in command-line tools, where the load time might be far greater than the analysis time.

Note though that the Base License Agreement does not permit people to create FPB files, and the chemfp binary distributions contain a license manager which restricts access to that feature without an authorized license key. See the chemfp licensing page for licensing options and for how to request a license key and/or quote for license.

Internally the FPB file contains an 8-byte signature followed by a set of chunks. Each chunk contains an 8-byte length field followed a 4-byte identifier followed by n bytes of data, where n is the value of the length field. Those familiar with PNG or other FourCC format will find this familiar.

Different chucks contain different types of data. For example, the “META” chunk contains the metadata, and the “AREN” chuck contains the fingerprint data, organized in a way that makes them easy to load into a chemfp arena, hence the name.

Licensed FPB files¶

While the Base License Agreement does not permit people to create an FPB file, it does allow people to load an FPB file, including FPB files generated by third-party tools.

However, the Base License Agreement also does not generally permit in-memory searches of fingerprint data sets with more than 50,000 fingerprints. If you try you’ll get a message like the following:

A valid chemfp license key is required to search an arena with more than 50,000 fingerprints.
The license key check failed. Neither CHEMFP_LICENSE nor CHEMFP_LICENSE_FILE environment variables are defined.
Email sales@dalkescientific.com to purchase a license key or request a demo license key.
Cannot run simsearch. Exiting.

The exception is if the FPB file is a licensed FPB file.

A licensed FPB file has an embedded license key (stored in the “CFPL” chunk) which, if valid, unlocks all of the license manager restrictions for using that file.

Get licensed FPB files containing ChEMBL 34 fingerprints¶

In this section you’ll learn how to get the ChEMBL 34 fingerprints as a chemfp licensed FPB file. For background you should read the previous section on the FPB format.

Short version: use the following to download and uncompress the file:

curl -O https://chemfp.com/datasets/chembl_34.fpb.gz
gunzip chembl_34.fpb.gz

If curl isn’t installed, try:

wget https://chemfp.com/datasets/chembl_34.fpb.gz

The result is the licensed FPB file chemfp_34.fpb.

Note

The “licensed” in “licensed FPB file” only refers to the presence of the embedded chemfp license key. The data in the file is distributed under the terms of the ChEMBL license and includes the required attribution.

Long version: The ChEMBL 34 distribution includes pre-computed RDKit Morgan2 fingerprints in FPS format. These can be searched directly, like:

% simsearch --query c1ccccc1CN -k 2 chembl_34.fps.gz --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#targets=chembl_34.fps.gz
#target_source=chembl_34.fps.gz
2     Query1  CHEMBL522       1.0000000       CHEMBL14186     0.9333333
open 0.01 read 0.00 search 2.61 output 0.00 total 2.79

or decompressed first, using “gunzip”, to skip the decompression overhead:

% gunzip chembl_34.fps.gz
% simsearch --query c1ccccc1CN -k 2 chembl_34.fps --times --no-metadata
2     Query1  CHEMBL522       1.0000000       CHEMBL14186     0.9333333
open 0.01 read 0.00 search 1.31 output 0.00 total 1.50

The chemfp project also distributes ChEMBL fingerprints which have been reformatted to the FPB format. Many of the examples in the documentation will use chembl_34.fpb. Here’s how to get that file.

Step 1: Download chembl_34.fpb.gz.

Step 2: Decompress it with:

gunzip chembl_34.fpb.gz

Step 3: (optional) View the license terms:

chemfp fpb_text chembl_34.fpb

Similarity search with the FPB format¶

In this section you’ll learn how to do a similarity search using an FPB file as the target. You will need ChEMBL 34 as a chemfp-licensed FPB file and you will need the RDKit chemistry toolkit.

The file chembl_34.fpb contains the ChEMBL-generated RDKit Morgan circular fingerprints for ChEMBL 34, reformatted in FPB format, and containing a license key which unlocks restrictions on using chemfp to work with that file.

All of the chemfp tools support FPB files as input or output formats, including simsearch. Here’s an example using first FPS format then FPB format (run three times each, reporting only the fastest time:

% time simsearch --query c1ccccc1CN -k 2 chembl_34.fps > /dev/null
2.073u 0.427s 0:01.57 158.5%  0+0k 0+0io 47pf+0w

% time simsearch --query c1ccccc1CN -k 2 chembl_34.fpb > /dev/null
0.560u 0.609s 0:00.34 341.1%  0+0k 0+0io 77pf+0w

These times are reported as “user time” (the CPU time spent by the program), “system time” (the time spent by the operating system, in this case, to read and transfer data from disk), and “wall clock time”, which is the overall elapsed time. In this case, it took 2.1 seconds to search the FPS file and 0.56 seconds to search the FPB file.

That’s a factor of 4, which is pretty good, but perhaps not that impressive.

Performance breakdown¶

That’s because the total time includes the time needed to load RDKit, parse the query SMILES, and generate the query fingerprint. To start, it takes 0.04 seconds for Python to start working:

% time python -c 'pass'
0.033u 0.009s 0:00.04 75.0%   0+0k 0+0io 0pf+0w

It takes Python about 0.25 seconds to load RDKit:

% time python -c 'from rdkit.Chem import rdMolDescriptors'
0.673u 0.651s 0:00.25 528.0%  0+0k 0+0io 24pf+0w

Finally, chemfp’s wrapper to the RDKit toolkit adds another 0.04 seconds:

% time python -c 'from chemfp import rdkit_toolkit, rdkit_types'
0.838u 0.621s 0:00.29 500.0%  0+0k 0+0io 18pf+0w

Note

The times depend very much on your operating system, the location and type of the file system, the file cache, and more.

To demonstrate this overhead, I’ll pre-compute the fingerprints so chemfp doesn’t need to import RDKit:

% cat benzylamine.smi
c1ccccc1CN benzylamine
% rdkit2fps benzylamine.smi -o benzylamine.fps \
     --using chembl_34.fpb --no-progress
% tail -1 benzylamine.fps | fold -w 70
0000000000000000000001000000000000000000000000000000000000000000000000
0000000000000000000000000020000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000200000000000008000000004000001000000
0000000000000800008000000000000000000000000000000000000000000000100000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000500400000000000000000000000000000200002000000000000000
0000000000000000000000        benzylamine

then use the fingerprint file as input:

% time simsearch --queries benzylamine.fps -k 2 chembl_34.fpb > /dev/null
0.091u 0.025s 0:00.16 68.7%   0+0k 0+0io 23pf+0w

The 0.25 seconds of startup time, plus the 0.16 seconds of search time, minus the doubly-counted Python startup time of 0.04 seconds, is 0.33 seconds, which is quite close to the time of 0.34 seconds I measured earlier.

To give a direct comparison, I’ll use the same query file to search the FPS file:

% time simsearch -q benzylamine.fps -k 2 chembl_34.fps > /dev/null
1.285u 0.154s 0:01.60 89.3%   0+0k 0+0io 8pf+0w

FPB search is now about 10 times faster than FPS search. And of that 0.16 seconds, 0.025 seconds or 15% is in file I/O.

Multi-core similarity search¶

In this section you’ll learn how chemfp parallelizes multiple queries and the various configuration options. You will need ChEMBL 34 as a chemfp-licensed FPB file, the PubChem file Compound_099000001_099500000.sdf.gz, and the RDKit toolkit.

Suppose you want the ChEMBL 34 record which is most common to each record in Compound_099000001_099500000.sdf.gz. This can be done with simsearch as a k=1 nearest-neighbor search, with the PubChem fingerprints as queries, and the ChEMBL file as targets.

This means converting the PubChem structure file into a fingerprint file using the same fingerprint type that ChEMBL uses:

rdkit2fps Compound_099000001_099500000.sdf.gz \
   --using chembl_34.fpb -o Compound_099000001_099500000.fps

Then do the search, saving the results to “results.txt”:

time simsearch -k 1 -q Compound_099000001_099500000.fps \
   chembl_34.fpb -o results.txt

Note

This will take a couple of minutes to run. While it’s doing that, perhaps take a look at the CPU load.

Simsearch support progress bars, which in this case shows the progress through the input file:

position:  47%|███████▊         | 2.71M/5.73M [00:50<00:55, 54.0k bytes/s]

Once simsearch finished, “time” reported:

704.387u 6.044s 1:49.96 646.0%        0+0k 0+0io 65630pf+0w

The “646.0%” refers to the total CPU time divided by the wall-clock time. It can be greater than 100% when using multiple cores in parallel, which is what happened here.

Converting large data sets to FPB format¶

In this section you’ll learn how to generate an FPB file on computers with relatively limited memory. To be realistic, this example uses the complete PubChem data set, and extracts the CACTVS/PubChem fingerprints which are in each record. You do not need a chemistry toolkit for this section.

The most direct way to extract the PubChem fingerprints from a PubChem distribution is to use sdf2fps:

sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb

This uses the default FPB writer options, which stores all of the fingerprints in memory, sorts them, and saves the result to the output file. This may use several times as much memory as the final FPB output size, which is a bit unfortunate if you want to generate a 7 GB FPB file on a 12 GB machine.

When I updated this section in June 2020, it took around 25GB of memory to create an FPB file with 102,768,482 PubChem fingerprints, and the final file was about 14GB. (I did not update this section in 2024 for the chemfp 4.2 release.)

(Note: see the next section for a two-stage solution that lets you parallelize fingerprint generation.)

The “*2fps” command-line tools do not have a way to change the default writer options, although fpcat does. The --max-spool-size option sets a rough upper bound to the amount of memory to use. When enabled, the writer breaks the input into parts and creates a temporary FPB file for each part. At the end, it merges the sorted data from the temporary FPB files to get the final FPB file. Be aware that the specified spool size is only approximate and is not a hard limit on the maximum amount of memory to use. You may need to experiment a bit if you have tight constraints, and this option might not be as useful as I thought it was.

The value must be a size in bytes, though suffixes like M or MB for megabyte and T or TB for terabyte are also allowed. These are in base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed between the number and the suffix, so “200MB” is okay but “200 MB” is not. The size must be at least 20 MB.

Here is an example of how to convert the CACTVS fingerprints from all of PubChem to an FPB file, using a relatively small limit of 200 MB:

sdf2fps --pubchem pubchem/Compound_*.sdf.gz | \
    fpcat --max-spool-size 200MB -o pubchem.fpb

This will take a while! The sdf2fps alone takes almost 45 minutes on a ca. 2017-era Haswell machine.

If I save the intermediate results to an FPS file then the in-memory fpcat conversion from FPS to FPB takes 5½ minutes and requires 25GB of memory.

With spool of 200MB, the conversion takes nearly 10 minutes. According to htop, the spooled conversion required, near the peak, 13.3G of virtual memory, a resident set size of 12G, and 10.6G of shared shared pages. The shared pages are from memory-mapping the intermediate FPB files, so this probably required only 2GB of real memory.

If I use a 1GB spool size, the conversion time decreases from 10 to 8 minutes, and uses about the same amount of peak memory.

The temporary files will be placed under the appropriate temporary directory for your operating system. If that disk isn’t large enough for the intermediate files then use the --tmpdir option of fpcat to specify an alternate directory:

fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb \
    --tmpdir /usr/tmp

Another option is to specify the directory location using the TMPDIR, TEMP, or TMP environment variables, which are resolved in that order. The details are described in the Python documentation for tempfile.tempdir.

Faster gzip decompression¶

In this section you’ll learn how to use an external program to improve the performance of reading gzip files. You will need PubChem file Compound_016000001_016500000.sdf.gz.

The PubChem file Compound_016000001_016500000.sdf.gz is one of the largest files from PubChem, measured by compressed size. The copy I have is 544M compressed and 3.9G uncompressed.

It takes about 10 seconds to extract the PubChem fingerprints from the compressed file:

% sdf2fps --pubchem Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 100%|█████████| 570M/570M [00:09<00:00, 57.2Mbytes/s]

In comparison, it takes about 5 seconds to extract the fingerprints from the uncompressed file:

% sdf2fps --pubchem Compound_016000001_016500000.sdf -o pubchem.fps
Compound_016000001_016500000.sdf: 100%|███████████████| 4.19G/4.19G [00:05<00:00, 770Mbytes/s]

For this case, gzip decompression adds 100% overhead to the processing step!

There are two sources for the overhead. First, a fair amount of chemfp’s internal gzip processing uses Python. Second, it’s single-threaded, that is, a block of text is decompressed, then passes to the FPS parser.

Chemfp has the option to run an external program to handle decompression. Use the environment “CHEMFP_GZCAT” to specify the program to use. If called with no arguments, it must read from stdin. If called with one argument, it must read from the named file. The decompressed data must be written to stdout, so it can be read by chemfp.

In most cases you’ll use “zcat” (for Linux-based OSes) or “gzcat” (for macOS).

Here’s an example:

% env CHEMFP_GZCAT=gzcat sdf2fps --pubchem \
   Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 443113 recs [00:08, 52785.22 recs/s]

(The progress output can’t give a progress bar because chemfp doesn’t know the current read position relative to the entire file.)

This took about 8 seconds, which is 2 seconds faster than usual, or only about 60% overhead.

An improvement of two seconds may not sound like much, but the 10% performance gain is nice when processing many gzip-compressed files.

Generate fingerprints in parallel and merge to FPB format¶

In this section you’ll learn how to merge multiple sorted fingerprints into a single FPB file.

The previous section used a single shell command to extract the PubChem/CACTVS fingerprints from PubChem and generate an FPB file. This is easy to write and understand, but more complex versions may be more appropriate.

For one, I have four cores on my desktop computer, and I want to use them to process the PubChem files in parallel. The previous section was only single threaded.

I have all my PubChem files in ~/pubchem/. For each “Compound_*.sdf.gz” file in that directory I want to extract the CACTVS/PubChem fingerprints and create an intermediate FPS file in the local directory. That’s equivalent to running the following commands:

sdf2fps --pubchem ~/pubchem/Compound_000000001_000500000.sdf.gz \\
     -o Compound_000000001_000500000.fps
sdf2fps --pubchem ~/pubchem/Compound_000500001_001000000.sdf.gz \\
     -o Compound_000500001_001000000.fps
 ... 291 more lines ...

except that I want to run four at a time.

This is what GNU Parallel was designed for. It’s a command-line tool which can parallelize the execution of other command-lines.

I’ll start by explaining the core command-line substitution pattern:

sdf2fps --pubchem {} -o {/..}.fps'

The {} will be replaced with a filename, and {/..} will be replaced with the base filename, without the directory path prefix or the two suffixes. That is, when {} is “/Users/dalke/pubchem/Compound_000000001_000500000.sdf.gz” then {/..} will be “Compound_000000001_000500000.fps”.

Since I want to generate an FPS file, I added the “.fps” as a suffix to the second substitution parameter.

I then tell GNU parallel which command-line to use, along with a few other parameters. Here’s the full line, which I split over two lines to make it more readable:

parallel --plus --no-notice --bar 'sdf2fps --pubchem {}
   -o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz

The --plus tells GNU parallel to recognize an expanded set of replacement strings. (“{/..}” is not part of the standard set of patterns.)

The --no-notice tells it to not display the message about citing GNU parallel in scientific papers.

The --bar enables a progress bar, which looks like this:

30% 88:205=11m17s /Users/dalke/pubchem/Compound_045500001_046000000.sdf.gz

This status line shows that processing is 30% complete, which is file 88 out of 205, and there’s an estimated 11 minutes and 17 seconds remaining.

Finally, the “:::” indicates that the remaining options are the list of parameters to pass to the command-line template for parallelization.

After about 21 minutes, using 4 CPUs on my laptop (with an effective scaling of 2.8), I now have a large number of FPS files, which I want to merge into a single FPB file. I’ll use fpcat:

fpcat --max-spool-size 1GB Compound*.fps -o pubchem.fpb

Unfortunately my laptop ran out of disk space, so I’ll just leave it a that; re-doing the same command on a server machine won’t provide you any new information.

Generate a full Tanimoto similarity array¶

In this section you’ll learn how to use the chemfp simarray command to generate the complete array containing the Tanimoto similarity between every fingerprint in a dataset with every other fingerprint, including itself. (See Use simarray to generate a NumPy array for an example using the chemfp.simarray() API function.)

The simsearch command generates a sparse matrix, because it was designed for task like “find the nearest 10 fingerprints” or “find fingerprints which are at least 0.4 similar.”

Many algorithms, like some of the sckit-learn clustering methods, require a complete NxN comparison matrix between every element and every other element, including with itself. While this is possible with simsearch, by using a threshold of 0.0, doing so adds some performance and space overhead.

Chemfp 4.2 added “simarray” functionality, to compute the entire matrix either as a NumPy array, or in a file format compatible with loading into NumPy.

The chemfp simarray subcommand by default generates this matrix in memory and saves it to NumPy’s npy format.

Important

You must install NumPy for this subcommand to work.

I’ll generate the full matrix using the 10,969 PubChem fingerprints <pubchem_fingerprints> extracted earlier, and available in “pubchem_queries.fps”:

% chemfp simarray pubchem_queries.fps -o pubchem_queries_NxN.npy
pubchem_queries.fps: 100%|██████████████| 2.54M/2.54M [00:00<00:00, 179Mbytes/s]
scores: 100%|████████████████████████████████| 60.2M/60.2M [00:00<00:00, 125M/s]

NumPy can read the “npy” file directly:

>>> import numpy
>>> arr = numpy.load("pubchem_queries_NxN.npy")
>>> arr[:3,:4]
array([[1.        , 0.44635193, 0.18627451, 0.5258216 ],
       [0.44635193, 1.        , 0.25543478, 0.53921569],
       [0.18627451, 0.25543478, 1.        , 0.25142857]])

The npy file actually stores up to four arrays, depending on how the data was generated. The second array stores the simarray details as JSON-encoded string, and the third and optional fourth store the target and query ids, respectively.

These can be read directly from NumPy:

>>> import numpy, pprint, json
>>> f = open("pubchem_queries_NxN.npy", "rb")
>>> numpy.load(f)[:3,:4]
array([[1.        , 0.44635193, 0.18627451, 0.5258216 ],
       [0.44635193, 1.        , 0.25543478, 0.53921569],
       [0.18627451, 0.25543478, 1.        , 0.25142857]])
>>> pprint.pprint(json.loads(str(numpy.load(f))))
{'dtype': 'float64',
 'format': 'multiple',
 'matrix_type': 'NxN',
 'method': 'Tanimoto',
 'metric': {'as_distance': False,
            'is_distance': False,
            'is_similarity': True,
            'name': 'Tanimoto'},
 'metric_description': 'Tanimoto similarity',
 'num_bits': 881,
 'shape': [10969, 10969]}
>>> numpy.load(f)[:4] # show the first four array identifiers
array(['99000039', '99000230', '99001517', '99002251'], dtype='<U8')

or loaded by using chemfp’s load_simarray() helper function:

>>> import chemfp
>>> simarr = chemfp.load_simarray("pubchem_queries_NxN.npy")
>>> simarr.out[:3, :4]
array([[1.        , 0.44635193, 0.18627451, 0.5258216 ],
       [0.44635193, 1.        , 0.25543478, 0.53921569],
       [0.18627451, 0.25543478, 1.        , 0.25142857]])
>>> simarr
SimarrayFileContent(metric=<Tanimoto similarity>,
out=<10969x10969 symmetric array of dtype float64>,
query_ids=target_ids=['99000039', '99000230', '99001517', ...])

Unlike simsearch, simarray does NOT change the record order - the fingerprints are kept in input order rather than sorting by popcount, which you can check by comparing the first few identifiers in the previous output to those in the FPS file:

% awk '/^[^#]/ {print $2}' pubchem_queries.fps | head -5
99000039
99000230
99001517
99002251
99003537

Why? I compared several simsearch implementations, including one which was optimized by taking advantage of popcount ordering. I saw no noticable difference. My working belief is that the problem is mostly limited by memory write performance, while simsearch is mostly limited by memory read performance. I don’t know enough about hardware to fully explore that.

In practice, preserving input order also makes it much easier to map between the input data set and the array, without having to figure out the reordering.

Lastly, for symmetric arrays use --no-lower-triangle so simarray does not compute the comparisons for the lower triangle, but instead leaves them as zeros. This gives faster performance but does not save any space.

Generate the array using another metric or datatype¶

In this section you’ll learn how simarray supports several different comparison types, and how some of the comparison types can be stored in several different data types.

By default simarray generate Tanimoto scores stored as 64-bit floating point number (also called “float64” or what C calls a “double”). Why Tanimoto? Because it’s by far the most popular similarity method in cheminformatics.

Simarray supports a few other comparison types using the --metric flag: “Tanimoto” similarity, “Dice” similarity, and “cosine” similarity, as well as the “Hamming” distance. In addition, with the --as-distance flag it will turn the similarity calculations into a distance computed as 1-similarity.

For example, the following computes the Jaccard distance (1-Tanimoto score):

chemfp simarray pubchem_queries.fps --metric Tanimoto --as-distance \
   -o distances.npy

If you are using 4096 bit fingerprints or smaller then every Tanimoto score can be represented uniquely with float32, which uses half as much memory as the default float64. (Even if you use larger fingerprints, you’ll need an awful lot of on-bits before it’s possible to find two different scores which have the same 32-bit representation.)

Use the --dtype option to change the array datatype. The “Tanimoto”, “Dice”, and “cosine” metrics support “float64”, “float32” and a scaled “uint16” which uses only the values 0 to 65535 (computed as int(score*65535)) for similarity or 65535-int(score*65535) for distance). The following computes the cosine distance as “uint16”, so identical fingerprints have a distance of 0:

chemfp simarray pubchem_queries.fps --metric cosine \
   --as-distance --dtype uint16 -o pubchem_cosine_dist.npy

Here’s what it looks like in NumPy:

>>> import numpy as np
>>> np.load("pubchem_cosine_dist.npy")[:3,:4]
array([[    0, 25064, 42640, 20252],
       [25064,     0, 36278, 19585],
       [42640, 36278,     0, 37078]], dtype=uint16)

The “Tanimoto” and “Dice” metrics also support the “rational64” and “rational32” data types, which use a NumPy structured data type where the first term (“p”) is the numerator and the second term (“q”) is the denominator. Rational64 uses two uint32 values and rational32 uses two uint16 values, which means rational32 uses the same space as float32 but with no precision loss.

These store the scores as rational values, in unreduced form:

>>> import numpy as np
>>> scores = np.load("pubchem_rational32.npy")
>>> scores[:3,:4]
array([[(174, 174), (104, 233), ( 38, 204), (112, 213)],
       [(104, 233), (163, 163), ( 47, 184), (110, 204)],
       [( 38, 204), ( 47, 184), ( 68,  68), ( 44, 175)]],
      dtype=[('p', '<u2'), ('q', '<u2')])
>>> scores["p"][:3,:4]
array([[174, 104,  38, 112],
       [104, 163,  47, 110],
       [ 38,  47,  68,  44]], dtype=uint16)
>>> scores["q"][:3,:4]
array([[174, 233, 204, 213],
       [233, 163, 184, 204],
       [204, 184,  68, 175]], dtype=uint16)

If the similarity score is 0/0 then the corresponding distance score is n/n where n is the fingerprint size.

The “Hamming” metric is only available as “uint16”. The Tversky similarity isn’t available in simarray, though it can be computed using one of the “abcd” metrics in the next section.

Generate a simarray using an “abcd” metric¶

In this section you’ll learn about simarray’s “abcd” data type, which the four individual components for comparing two fingerprints. These can be used to construct your own comparison value.

The “Willett”, “Sheffield”, and “Daylight” metrics all compute an “abcd” structured data type with four uint16 values, individually denoted “a”, “b”, “c”, and “d”. These can be used to create your own comparison function, and indeed the cheminformatics literature describes scores of these, with each formula described using these four values, like:

Baroni-Urbani/Buser:  (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)

Unfortunately, different authors use different conventions for what those values mean. Fortunately, there are only two main conventions, which I’ll call “Willett” and “Sheffield”, along with the “Daylight” convention in a distant third. Chemfp’s simarray supports all three.

1) The “Willett” convention is the most widely used in the field, including internally by chemfp. It dates from around 1986 with the seminal work by Peter Willett and his collaborators, and was the primary convention used by the Sheffield group until around 2008 (earlier if John Holliday is one of the authors).

It uses the following definitions:

“a” = the number of on-bits in the first fingerprint
“b” = the number of on-bits in the second fingerprint
“c” = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2

Note that a+b-c+d is the number of bits in the fingerprint.

2) The “Sheffield” convention was used in papers from the Sheffield group from 1973 up until around 1985, and by most papers from the Sheffield group (including by Peter Willett) after around 2008.

“a” = the number of on-bits in common
“b” = the number of on-bits unique to the first fingerprint
“c” = the number of on-bits unique to the second fingerprint
“d” = the number of off-bits in fp1 which are also off-bits in fp2

Note that a+b+c+d is the number of bits in the fingerprint.

3) The “Daylight” convention was introduced by John Bradshaw and used in the Daylight documentation and Daylight toolkit:

“a” = the number of on-bits unique to the first fingerprint
“b” = the number of on-bits unique to the second fingerprint
“c” = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2

Note that a+b+c+d is the number of bits in the fingerprint.

Suppose you want the Baroni-Urbani/Buser score mentioned above, which comes from a 2002 paper by Holliday, Hu, and Willett. You’ll need to read the paper to see it uses what chemfp calls the “Sheffield” convention (as a hint, Holliday is one of the authors), then ask simarray to compute the relevant matrix:

chemfp simarray pubchem_queries.fps --metric Sheffield \
  -o pubchem_abcd.npy --no-progress

The final computation requires a bit of NumPy help, first to define the function, then to load the array, and finally to compute the values:

>>> import numpy as np
>>> def baroni_urbani(abcd_arr):
...   a, b, c, d = abcd_arr["a"], abcd_arr["b"], abcd_arr["c"], abcd_arr["d"]
...   numerator = np.sqrt(a * d) + a
...   return numerator / (numerator + b + c)
...
>>>
>>> abcd_arr = np.load("pubchem_abcd.npy")
>>> abcd_arr[:2,:3]
array([[(174,   0,  0, 707), (104,  70, 59, 648), ( 38, 136, 30, 677)],
       [(104,  59, 70, 648), (163,   0,  0, 718), ( 47, 116, 21, 697)]],
      dtype=[('a', '<u2'), ('b', '<u2'), ('c', '<u2'), ('d', '<u2')])
>>>
>>> baroni_urbani(abcd_arr)[:2, :3]
array([[1.        , 0.5327464 , 0.54444826],
       [0.5327464 , 1.        , 0.62465185]], dtype=float32)

Simarray and arrays larger than memory¶

In this section you’ll learn how to use simarray to handle arrays which are larger than the available memory.

Chemfp’s simarray generally creates the full comparison array in memory then saves the result to NumPy’s “npy” format. What if you want to generate an even larger array?

Why? Perhaps you, like me, primarily work on a laptop and don’t want to close all your applications just to free up a few gigabytes of data, and are willing to have extra file I/O overhead instead. Or perhaps you have 40 TB of network drive space free and want to wait the day needed to compute a 3M x 3M similarity array before getting time on the “big memory” analysis machine with 50 TB of RAM.

The chemfp simarray command (but not the API!) supports the “bin” output, which saves the output array as the raw binary data, in row major format. It processes a window (by default, 2GB) of the output array and appends it to the output file before processing the next window.

The following example uses the “pubchem_queries.fps” and “pubchem_queries.fps” created in a previous section to generate the cosine similarity, as float32 values, between all queries and all targets:

% chemfp simarray --queries pubchem_queries.fps pubchem_targets.fps \
      --metric cosine --dtype float32 -o NxM.bin
pubchem_targets.fps: 100%|█████████████████ 3.42M/3.42M [00:00<00:00, 95.2Mbytes/s]
pubchem_queries.fps: 100%|███████████████████| 2.54M/2.54M [00:00<00:00, 103Mbytes/s]
scores: 100%|█████████████████████████████████████| 162M/162M [00:01<00:00, 153M/s]

The output file contains the raw array value, which NumPy can access as a memory-mapped file, if you know the shape and dtype:

>>> import chemfp
>>> len(chemfp.load_fingerprints("pubchem_queries.fps"))
10969
>>> len(chemfp.load_fingerprints("pubchem_targets.fps"))
14753
>>> import numpy as np
>>> arr = np.memmap("NxM.bin", shape=(10969, 14753), dtype=np.float32)
>>> arr[:3, :4]
memmap([[0.75586504, 0.65900594, 0.6916995 , 0.5689655 ],
        [0.56468934, 0.75383073, 0.61808264, 0.73629737],
        [0.35343105, 0.45178595, 0.46850318, 0.349345  ]], dtype=float32)

Alternatively, if you prefer to load the data into memory instead of using a memory-mapped array:

>>> import pathlib
>>> content = pathlib.Path("NxM.bin").read_bytes()
>>> arr = np.ndarray(buffer = content, shape=(10969, 14753), dtype=np.float32)
>>> arr[:3, :4]
array([[0.75586504, 0.65900594, 0.6916995 , 0.5689655 ],
       [0.56468934, 0.75383073, 0.61808264, 0.73629737],
       [0.35343105, 0.45178595, 0.46850318, 0.349345  ]], dtype=float32)

While this works, it’s a bit cumbersome to track the shape and data type, and the array does not contain the query or target ids.

Instead, use the --metadata-output option to have simarray save this data to an auxillary “npy” file:

% chemfp simarray --queries pubchem_queries.fps pubchem_targets.fps \
    --metric cosine --dtype float32 -o NxM.bin --metadata-output NxM_aux.npy
pubchem_targets.fps: 100%|████████████████| 3.42M/3.42M [00:00<00:00, 96.9Mbytes/s]
pubchem_queries.fps: 100%|█████████████████| 2.54M/2.54M [00:00<00:00, 103Mbytes/s]
scores: 100%|█████████████████████████████████████| 162M/162M [00:00<00:00, 163M/s]

This “npy” format is the same as the normal simarray npy output, except the first array, which normally contains the full set of comparisons, is instead zero-sized. It exists to store the relevant NumPy dtype. Here’s a low-level example:

>>> import numpy as np
>>> f = open("NxM_aux.npy", "rb")
>>> np.load(f)
array([], shape=(0, 0), dtype=float32)
>>> import pprint, json
>>> pprint.pprint(json.loads(str(np.load(f))))
{'dtype': 'float32',
 'format': 'multiple',
 'matrix_type': 'NxM',
 'method': 'cosine',
 'metric': {'as_distance': False,
            'is_distance': False,
            'is_similarity': True,
            'name': 'cosine'},
 'metric_description': 'cosine similarity',
 'num_bits': 881,
 'shape': [10969, 14753]}
>>> np.load(f)[:3] # the first three target ids
array(['48500020', '48500053', '48500091'], dtype='<U8')
>>> np.load(f)[:3] # the first three query ids
array(['99000039', '99000230', '99001517'], dtype='<U8')

However, it’s much easier to use chemfp’s load_simarray(), which accepts both the binary file and the auxillary npy file:

>>> import chemfp
>>> simarr = chemfp.load_simarray("NxM.bin", metadata_source="NxM_aux.npy")
>>> simarr
SimarrayFileContent(metric=<cosine similarity>, out=<10969x14753 array
of dtype float32>, query_ids=['99000039', '99000230', '99001517', ...],
target_ids=['48500020', '48500053', '48500091', ...])

If the data source is a bin file (based on the specified format of “bin” or the filename extension “.bin”) then the array data by default is memory-mapped, because the array is expected to be large, and the memory-map is read-only (in “r” mode).

Use mmap_mode to change how this works. “r+” means read-write mode, “c” means “copy-on-write”, and None means to load the data into memory.

Uf the data source is a npy file then the array data by default is read into memory. The other mmap options, like “c”, can also be used with npy files.

Butina on the command-line¶

In this section you’ll learn how to use the chemfp butina subcommand for Butina clustering. (See Butina clustering for an example using the chemfp.butina() API function. The chemfp 4.1 release notes have additional examples.)

Butina clustering (often called Taylor-Butina clustering, though Taylor didn’t apply the algorithm to clustering) first generates the NxN Tanimoto similarity array at a given threshold then generates clusters. It orders the fingerprints by neighbor count, so the fingerprints with the most neighbors come first. It selects the first ranked fingerprint, which is the first centroid, and includes its neighbors as part of the cluster. These fingerprints are removed from future consideration. The selection process then repeats until done, with a number of variations for how to define “done” and how to handle singletons.

The Butina algorithm is available as a subcommand of chemfp. Using a small set of fingerprints as an example:

% cat distinct.fps
1200ea4311018100      id1
894dc00a870b0800      id2
82b180a0943d0254      id3
d04a450013d8a479      id4
8025719c2c907c2e      id5
918125cf96a2a360      id6
6b0d4215ca47fc03      id7
2eb000ce2e4337ef      id8
849a6de873554c59      id9
b67121e96f48a365      id10
2ab0f5b64cb4b8cf      id11
e7fedcefa4e7560e      id12

I’ll cluster at a threshold of 0.41:

% chemfp butina distinct.fps -t 0.41
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=158559300 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i     center_id       count   members
1     id8     5       id8     1.0000  id10    0.4651  id12    0.4400  id11    0.4130  id5     0.3256
2     id9     2       id9     1.0000  id4     0.4103
3     id1     1       id1     1.0000
4     id2     1       id2     1.0000
5     id6     1       id6     1.0000
6     id7     1       id7     1.0000
7     id3     1       id3     1.0000

This output follows the usual chemfp output format style where the first line describes the format, followed by header lines starting with ‘#’, then the output data proper, using tab-separated columns.

The default uses a variable number of columns, with one line per cluster. The first column is the column id (an integer), the second is the fingerprint id for the center fingerprint, and the third is the number of elements in the cluster (including the center).

After the count are the id and score of every fingerprint in the cluster, starting with the cluster center. These can be excluded using --no-members:

% chemfp butina distinct.fps -t 0.41 --no-members --seed 158559300
#Centroid/1 include-members=0
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=158559300 \
       false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i     center_id       count
1     id8     5
2     id9     2
3     id1     1
4     id2     1
5     id6     1
6     id7     1
7     id3     1

The “type” header shows some of the available options, like how to break ties or what to do with “false singletons” (these are cluster centers which no longer have neighbors because they were all assigned to another cluster). See chemfp butina --help for details.

Pruning the number of Butina clusters¶

Use --num-clusters or simply -n option to prune the clusters down to a smaller value. In the following I’ve specified I want only 3 clusters, so four of the clusters will be pruned, and their fingerprints moved to other clusters:

% chemfp butina distinct.fps -t 0.41 --num-clusters 3
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=3257902857 false-singletons=follow-neighbor num-clusters=3 rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i     center_id       count   members
1     id8     8       id8     1.0000  id10    0.4651  id12    0.4400  id11    0.4130  id5     0.3256  id3     0.2381  id9     0.2917  id4     0.1702
2     id6     2       id6     1.0000  id1     0.2353
3     id7     2       id7     1.0000  id2     0.2973

Alternate Butina output formats¶

In this section you’ll learn about some of the Butina output formats and options. This uses the distinct.fps file from the previous section with 12 fingerprints.

The butina command includes a few options which control the output format, like the following which does not include the metadata (the lines starting with “#”), and which only outputs the cluster centers. It also specifies the initial seed for random number generation:

% chemfp butina distinct.fps -t 0.41 --no-members --no-metadata --seed 19573998
i     center_id       count
   id8     5
   id9     2
   id3     1
   id7     1
   id1     1
   id2     1
   id6     1

The --out option supports several alternative formats. The “flat” variant of the centroid format writes one output line for each fingerprint, with the centroid id in the first column, the fingerprint id in the second, the cluster element label in the third, and the score in the fourth:

% chemfp butina distinct.fps -t 0.41 --out flat --seed 19573998
#Centroid-flat/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
centroid      id      type    score
5     id1     CENTER  1.0000
6     id2     CENTER  1.0000
3     id3     CENTER  1.0000
2     id4     MEMBER  0.4103
1     id5     MEMBER  0.3256
7     id6     CENTER  1.0000
4     id7     CENTER  1.0000
1     id8     CENTER  1.0000
2     id9     CENTER  1.0000
1     id10    MEMBER  0.4651
1     id11    MEMBER  0.4130
1     id12    MEMBER  0.4400

The -no-rename option tells the “flat” output to use the full internal labels for each fingerprint, which in this case shows that “id5” was a false singleton moved to cluster 1, and “id1”, “id2”, “id3”, and “id4” were moved by the pruning step:

% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
   --seed 3601395311 --no-metadata --out flat --no-rename
centroid      id      type    score
   id1     MERGED  0.2353
   id2     MERGED  0.2973
   id3     MERGED  0.2381
   id4     MERGED  0.1702
   id5     MOVED_FALSE_SINGLETON   0.3256
   id6     CENTER  1.0000
   id7     CENTER  1.0000
   id8     CENTER  1.0000
   id9     MERGED  0.2917
   id10    MEMBER  0.4651
   id11    MEMBER  0.4130
   id12    MEMBER  0.4400

The “csv” and “tsv” formats are comma- and tab-separated formats which are a hybrid between the “centroid” and “flat” formats. They have one output line for each fingerprint, ordered by cluster number (largest first) and further ordered by score, with the cluster center always first.

The output fields are the cluster id, the fingerprint id, the number of elements in the cluster, the fingerprint type (“CENTER”, “MEMBER”, and more if using --no-rename), and the score.

% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
    --seed 3601395311 --out csv
cluster,id,count,type,score
1,id8,8,CENTER,1.0000
1,id10,8,MEMBER,0.4651
1,id12,8,MEMBER,0.4400
1,id11,8,MEMBER,0.4130
1,id5,8,MEMBER,0.3256
1,id3,8,MEMBER,0.2381
1,id4,8,MEMBER,0.1702
1,id9,8,MEMBER,0.2917
2,id6,2,CENTER,1.0000
2,id1,2,MEMBER,0.2353
3,id7,2,CENTER,1.0000
3,id2,2,MEMBER,0.2973

The output was formatted using Python’s csv module, with the “excel” and “excel-tab” dialects, respectively, which means the results should be easy to import into Excel. The writer may use special quoting rules should the identifier contain a comma, newline, or other special character.

Butina parameter tuning with npz files¶

In this section you’ll learn how to use an npz file to store an NxN similarity array so does not need to be re-computed as you tune the Butina parameters.

The Butina parameters almost certainly require some tuning to find the appropriate threshold, and to decide if cluster pruning is appropriate.

The Butina algorithm is quite fast. It takes just a few seconds, in the default case, to cluster the NxN similarity matrix from the 2+ million fingerprints in ChEMBL. Most of the time is spent formatting the output correctly!

On the other hand, even with chemfp it can take 30 minutes or even a couple of hours to generate the matrix in the first place - making tuning a practice in patience!

This is where simarray’s “npz” format comes in handy. You can generate the NxN matrix once, using the lower-bound similarity threshold for your search space, and save the result to an npz file. This pre-computed matrix can then be used as input to Butina clustering. All chemfp need to do then is count the number of entries at or above the Butina threshold, which is must faster then computing milions of Tanimoto calculations.

(Note: currently only a “no-diagonal” NxN matrix is supported. A future version may allow a full NxN matrix. Also, the entries should be ordered by decreasing score. If they are not in that order, chemfp will reorder them.)

You can use the simsearch command to generate the npz file but it’s easier to use the --save-matrix option of the Butina command. In the following I’ll write the output to /dev/null and I’ll show timing details for the individual steps:

% chemfp butina --threshold 0.4 benzodiazepine.fps.gz \
   --save-matrix benzodiazepine.npz --times -o /dev/null
load_fingerprints 0.06 NxN 0.50 save_matrix 2.79 cluster 0.00
  rescore 0.00 output 0.01 total 3.43

What took the longest was the time to write the matrix to an npz file. The relatively low threshold of 0.4, combined with the high similarity of the benzodiazepines, mean there are 9,541,836 non-zero entries out of 153,400,610 possible non-zero values (the diagonals are not included), giving a density of 6.2%.

I can now cluster with the saved NxN matrix instead of using fingerprints, first, using all of the entries in the matrix:

% chemfp butina benzodiazepine.npz --times -o /dev/null
load_matrix 0.25 cluster 0.00 output 0.01 total 0.34

and then using --butina-threshold to change the minimum similarity threshold for the Butina algorithm:

% chemfp butina benzodiazepine.npz --times -o /dev/null \
   --butina-threshold 0.5
load_matrix 0.25 cluster 0.00 output 0.02 total 0.35

Number of Butina clusters in ChEMBL¶

In this subsection I’ll show an example of using the same simsearch npz result as input to Butina, to show how the number of clusters depends on the Butina threshold.

Those timing numbers probably don’t look that impressive for the small benzodiazepine data set. I’ll switch to “chembl_34_60.npz”, which contains the NxN similarity matrix for CheMBL 30 at threshold of 0.60 using RDKit’s default Morgan2 fingerprints, and see how the number of clusters changes for for 0.60, 0.61, 0.62, … 0.70.

This uses the licensed FPB file chembl_34.fpb, which includes an embedded chemfp license key so it can be used even under the terms of chemfp’s base license agrement.

The first step is to compute the matrix for the lowest desired Butina threshold of 0.6. This can be done as part of the first Butina calculation (using --save-matrix), but I’ll do it with the simarray command-line program:

% simsearch --NxN chembl_34.fpb --threshold 0.6 \
      --ordering decreasing-score -o chembl_34_60.npz
queries: 100%|██████████████████| 2409270/2409270 [5:04:10<00:00, 132.01 fps/s]

5 hours on my laptop, while doing other things, ain’t too shabby.

I use the --ordering option to improve the Butina clustering performance by sorting the hits so those with high similarity scores come before those with lower similarity scores. Otherwise, the hits in a simsearch threshold do not have a specific order (it depends on the fingerprint ordering and internal chemfp implementation details.)

The Butina algorithm needs the count of all neighbors within a specified similarity. This can be done in O(N) time if the hits are unsorted, but in O(log(N)) time if in decreasing-score order by using a binary search. You probably realize that a comparison sort requires O(N log(N)) time to sort the hits, which is worse then O(N). It’s a good thing then that chemfp uses a radix sort, with O(N) performance for similarity scores.

With the --ordering option, the simsearch output is appropriately sorted, so the Butina analysis does not need to re-sort it each time.

Now that the simsearch array is (finally) generated, I’ll run Butina clustering for a threshold of 0.6 and get the last line. I’ll also have it report some internal timings:

% chemfp butina chembl_34_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 2.55 cluster 0.22 output 2.36 total 5.50
428230        CHEMBL4992339   1       CHEMBL4992339   1.0000000

The output order is a bit odd because the times line written to stderr occurs before the stdout output lines piped through tail, but easy enough to figure out.

This says it took about 2.5 seconds to read the file, 0.2 seconds to cluster, and 2.4 seconds to write the 428,230 lines of output.

The clustering time is quite quick - it takes 10 times longer to read the array and to format and print the output! If you really wanted to get the cluster count you could write a program using the chemfp API, but it’s quicker to run the commands and wait the extra few seconds:

% chemfp butina chembl_34_60.npz --butina-threshold 0.60 --no-progress | tail -1
428398        CHEMBL2228156   1       CHEMBL2228156   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.61 --no-progress | tail -1
462162        CHEMBL300581    1       CHEMBL300581    1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.62 --no-progress | tail -1
491495        CHEMBL5016448   1       CHEMBL5016448   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.63 --no-progress | tail -1
524751        CHEMBL3601495   1       CHEMBL3601495   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.64 --no-progress | tail -1
557533        CHEMBL549663    1       CHEMBL549663    1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.65 --no-progress | tail -1
593861        CHEMBL4648644   1       CHEMBL4648644   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.66 --no-progress | tail -1
633683        CHEMBL3197889   1       CHEMBL3197889   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.67 --no-progress | tail -1
675880        CHEMBL4908000   1       CHEMBL4908000   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.68 --no-progress | tail -1
718190        CHEMBL5009625   1       CHEMBL5009625   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.69 --no-progress | tail -1
766726        CHEMBL4553024   1       CHEMBL4553024   1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.70 --no-progress | tail -1
811376        CHEMBL4562579   1       CHEMBL4562579   1.0000000

I used --no-progress to disable the progress bars, which otherwise appear after 0.5 seconds. You can use --progress or --no-progress to always enable or diable them, or set the environment variable “CHEMFP_PROGRESS” to “1” or “0”.

I also have the comparison matrix in unordered form (technically it’s an implementation-dependent ordering), which lets us see the Butina warning message that the arbitrary array needs resorting, and see how sorting adds an additional 1.2 seconds to the analysis:

% chemfp butina chembl_34_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 2.42 cluster 0.20 output 2.29 total 5.33
428488        CHEMBL4584568   1       CHEMBL4584568   1.0000000
% chemfp butina chembl_34_60_arbitrary.npz --butina-threshold 0.6 --times | tail -1
Warning: Matrix not in 'decreasing-score' order. Re-sorting.
load_matrix 2.57 sort 1.19 cluster 0.20 output 2.33 total 6.77
428398        CHEMBL4943518   1       CHEMBL4943518   1.0000000

The output differences are due to how chemfp’s Butina implementation handles ties. For example, if two different fingerprints have the same number of neighbors, which should be picked as the next centroid? Chemfp by default randomizes the order, so the results will likely be different for each run. Use --tiebreaker to specify that the first or last centroid instead.

Sphere exclusion¶

In this section you’ll learn how to use sphere exclusion to sample diverse fingerprints from a data set. You will need a copy of chembl_34.fpb. (See Sphere exclusion for an example of sphere exclusion using the chemfp.spherex() API function.)

Sphere exclusion (see Hudson et al. https://doi.org/10.1002/qsar.19960150402) is a form of diversity sampling. It picks an initial fingerprint at random then remove it and all fingerprints within some threshold of similairty form it (that is, all fingerprints within the sphere set by the threshold).

The sphere exclusion prevents future picks will not be similar to previous picks, which is why this is a form of diversity selection.

Sphere exclusion is available on the command-line using the chemfp spherex subcommand. By default it uses a threshold of 1.0 (so only exact matches are excluded) and it keeps picking until all of the fingerprints have been picked.

The following example picks 10 fingerprints using a similarity threshold of 0.4:

% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4
#Spherex/1
#type=spherex threshold=0.4 num-picks=10 randomize=1 seed=616321679
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T12:06:11+00:00
i     pick_id count
1     CHEMBL465089    57
2     CHEMBL249957    131
3     CHEMBL4560417   526
4     CHEMBL549096    172
5     CHEMBL427558    1494
6     CHEMBL360197    46
7     CHEMBL47199     111
8     CHEMBL3655944   197
9     CHEMBL89751     79
10    CHEMBL2087345   82

This format follows chemfp’s usual form, with a metadata section (the lines starting with a #) followed by a data section. The first line describes the format type , followed by metadata lines containing key/value pairs about the search. For example, the “type” line describes the applicable sphere exclusion parameters, which can be used to reproduce the results.

In particular, spherex initially randomizes the fingerprint order. If the seed isn’t specified then it will use Python’s random module to get the initial seed, which is reported in the type. The following specifies the --seed, to show that the result is reproducible:

% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 --seed 616321679 \
         --no-metadata
i     pick_id count
   CHEMBL465089    57
   CHEMBL249957    131
   CHEMBL4560417   526
   CHEMBL549096    172
   CHEMBL427558    1494
   CHEMBL360197    46
   CHEMBL47199     111
   CHEMBL3655944   197
   CHEMBL89751     79
  CHEMBL2087345   82

I used --no-metadata to omit the header section. There’s also --no-randomize, which will start with the first fingerprint in the data set. Since the fingerprints are sorted by popcount, this means fingerprints with the fewest set bits will be picked first, which hasn’t

The data comes after the header, in a tab-separated column format. The first line after the metadata section contains the column titles, which in this case are “i” (the pick number), “pick_id” (the id for the picked fingerprint) and “count” (the number of fingerprints in the exclusion sphere, including the pick itself).

Sphere exclusion output options¶

In this section you will learn about some of the spherex options which affect output, like writing output in CSV format and changing what values are reported.

Chemfp uses its own output format so it can includes metadata in the output. You can use --no-metadata, as in the previous section, to omit that section, which leaves simple tab-separated format.

However, if you need an Excel-compatible tab-separated or format then you must specify the “tsv” output format, because the identifer may contain a special character, like a double quote (or comma for the “csv” format), which has a special meaning for Excel:

% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 \
    --seed 616321679 --out tsv
pick_id       count
CHEMBL465089  57
CHEMBL249957  131
CHEMBL4560417 526
CHEMBL549096  172
CHEMBL427558  1494
CHEMBL360197  46
CHEMBL47199   111
CHEMBL3655944 197
CHEMBL89751   79
CHEMBL2087345 82

Use the “centroid” output format so the output uses the same format as the Butina output:

% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 \
       --seed 616321679 --out centroid
#Centroid/1 include-members=0
#type=spherex threshold=0.4 num-picks=10 randomize=1 seed=616321679
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T13:58:23+00:00
i     center_id       count
1     CHEMBL465089    57
2     CHEMBL249957    131
3     CHEMBL4560417   526
4     CHEMBL549096    172
5     CHEMBL427558    1494
6     CHEMBL360197    46
7     CHEMBL47199     111
8     CHEMBL3655944   197
9     CHEMBL89751     79
10    CHEMBL2087345   82

The --include-members option includes all of the sphere members in the output. For the “spherex” and “centroid” output these are all on one line, after the other columns, alternating between hit id and hit score:

% chemfp spherex chembl_34.fpb -n 4 --threshold 0.75 --seed 4 \
       --include-members --no-metadata
i     pick_id count   hits
1     CHEMBL1643627   1       CHEMBL1643627   1.0000000
2     CHEMBL1703947   2       CHEMBL1703947   1.0000000       CHEMBL1304130   0.7800000
3     CHEMBL400879    1       CHEMBL400879    1.0000000
4     CHEMBL3954540   3       CHEMBL3954540   1.0000000       CHEMBL3936358   0.7500000       CHEMBL3964946   0.7500000

For the “tsv” and “csv” output each match is written on its own line, with the pick id, the hit id, and the hit score:

% chemfp spherex chembl_34.fpb -n 4 --threshold 0.75 --seed 4 \
   --include-members --out csv
pick_id,hit_id,hit_score
CHEMBL1643627,CHEMBL1643627,1.0000000
CHEMBL1703947,CHEMBL1703947,1.0000000
CHEMBL1703947,CHEMBL1304130,0.7800000
CHEMBL400879,CHEMBL400879,1.0000000
CHEMBL3954540,CHEMBL3954540,1.0000000
CHEMBL3954540,CHEMBL3936358,0.7500000
CHEMBL3954540,CHEMBL3964946,0.7500000

Sphere exclusion with initial picks, and saving candidates¶

In this section you’ll learn how to specify the initial sphere exclusion picks, and how to save the remaining unpicked fingerprints. You will need a copy of chembl_34.fpb.

Sometimes you know which fingerprints to use a cluster centers, perhaps because of some other selection method, or because there are specific regions you want to exclude before processing. One option is to specify a --pick-id for each id to process:

% chemfp spherex chembl_34.fpb -n 5 -t 0.4 --seed 4 \
   --pick-id CHEMBL113 --pick-id CHEMBL3977198
#Spherex/1
#type=spherex num-initial-picks=2 threshold=0.4 num-picks=5 randomize=1 seed=4
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T13:32:33+00:00
i     pick_id count
1     CHEMBL113       420
2     CHEMBL3977198   1019
3     CHEMBL1715446   36
4     CHEMBL1741912   320
5     CHEMBL5193521   73

The other is to list the initial pick ids in a file, with one id per line, and specify the filename with --pick-id-file:

% cat centers.txt
CHEMBL113
CHEMBL3977198
% chemfp spherex chembl_34.fpb -n 5 -t 0.4 --seed 4 \
     --pick-id-file centers.txt --out csv
pick_id,count
CHEMBL113,420
CHEMBL3977198,1019
CHEMBL1715446,36
CHEMBL1741912,320
CHEMBL5193521,73

After sphere picking finishes, the remaining candidate fingerprints can be saved to the --save-candidates file, which lets you use sphere exclusion to carve out the specified ids, and all fingerprints similar to those ids, though make sure that -n matches the number of initial picks.

Directed sphere exclusion¶

In this section you’ll learn how to do directed sphere exclusion. You will need a copy of chembl_34.fpb and, one of the examples requires the RDKit. (See Directed sphere exclusion for an example of directed sphere exclusion using the chemfp.spherex() API function.)

Regular (undirected) sphere exclusion selects each pick at random from the remaining fingerprints. Directed sphere exclusion uses some other method to select the next pick.

The DISE paper (see Gobbi and Lee, https://doi.org/10.1021/ci025554v) suggests directing the picks based on their similarity to three reference drug structures, that is, pick the fingerprint which is closest to the first structure (sildenafil citrate), with ties broken by similarity to the second (isradipine), and then third (benazepril hydrochloride), with any remaining ties broken at random.

This biases selection towards those drug-like structures, while using sphere exclusion to improve diversity over similarity ranking.

Use the --dise flag for directed sphere exclusion following the DISE paper (note: this requires RDKit, to generate the appropriate fingerprints from the reference structure SMILES):

% chemfp spherex --dise chembl_34.fpb -n 5 -t 0.4 --seed 4 --out csv
pick_id,count
CHEMBL1737,318
CHEMBL195236,44
CHEMBL4539785,34
CHEMBL2017055,17
CHEMBL460085,43

Use --dise-references to specify your own reference fingerprints or reference structures.

Alternatively use a ranking file (with --ranks) to specify the rank order directly. This may in one of three styles, based on the filename extension or value of --rank-format.

The “tsv” format contains two tab-separated columns, where the first is the identifier and the second is the rank score (which must be a non-negative integer or float):

% cat ranks.tsv
CHEMBL113     123
CHEMBL2017055 888
CHEMBL1737    999
CHEMBL4539785 888
% chemfp spherex --ranks ranks.tsv chembl_34.fpb -n 5 -t 0.4 \
       --seed 4 --out csv
CHEMBL113,420
CHEMBL2017055,147
CHEMBL4539785,87
CHEMBL1737,135
CHEMBL17564,1

(Use --ranks-has-header to skip the first line of the ranks file, in case it contains a header.)

If no rank is specified, the value of --ranks-default is used, which in turn defaults to 4294967295.

The “txt” ranking format contains one identifier per line, which will be processed in order. This is subtly different from the --pick-id-file because the pick file will always pick a fingerprint, even if it is in the excluded sphere of another fingerprint, and because of processing differences between unpicked and unranked fingerprints.

Finally, the ranks may be a fingerprint file (eg, in “fps” or “fpb”) format, in which the fingerprints are ranked by the order in the file, so the first fingerprint come first, etc.

MaxMin diversity selection¶

In this section you’ll learn how to do MaxMin diversity selection using the chemfp maxmin command-line tool. You will need a copy of chembl_34.fpb. (See Select diverse fingerprints with MaxMin for an example of MaxMin using the chemfp.maxmin() API function.)

MaxMin (see Ashton, et al., doi:10.1002/qsar.200290002) is a method for selecting diverse fingerprints. It starts with an initial fingerprint selected from a set of candidate fingerprints. (See below for more about the initial pick.) It then finds a candidate fingerprint which is the least similar to the initial pick, that is, a fingerprint from the set of fingerprints with the lowest Tanimoto similarity to the initial pick. This candidate is added to the picked fingerprints.

The third pick is a fingerprint from the set of fingerprints which have the lowest similarity to either the initial pick and the second pick. (This algorithm is called “MaxMin” because it’s looking for a candidate fingerprint whose maximum similarity to any of the picked fingerprints is the minimum.)

This process repeats until no candidates are left, or enough fingerprints have been picked, or the minimum similarity threshold is too high, or the user gives up and stops the program.

Here are 5 of the most diverse compounds in ChEMBL 34:

% chemfp maxmin chembl_34.fpb -n 5 --times
#Diversity/1
#num_bits=2048
#type=maxmin threshold=1.0 num-picks=5 all-equal=0 randomize=1 seed=4185699608
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T14:45:26+00:00
i     pick_id score
1     CHEMBL4297424   0.0000000
2     CHEMBL4541723   0.0000000
3     CHEMBL2006665   0.0000000
4     CHEMBL1201084   0.0000000
5     CHEMBL1200727   0.0000000
T_init: 0.04 T_pick: 14.93 #picks: 5 picks/s: 0.33 T_total: 15.00

They all have a similarity score of 0.0 with each other, and the total calculation took 15 seconds, nearly all of which was spent in the MaxMin pick algorithm. (As we’ll see in a bit, nearly all of that was spent making the first pick.)

The output format, like most chemfp formats, is line-oriented, with a header section containing lines starting with a “#” followed by a data section containing lines which do not start with a “#”.

The key/value metadata in the header describes the MaxMin processing. For example, the “type” line describes the applicable MaxMin parameters, which can be used to reproduce the results, and the “candidates” line gives a hint about the source of the candidate fingerprints.

The data lines are tab-separated columns. In this case there are three columns, whose column titles are in the first data line. The first column (“i”) contains the pick index, the second (“pick_id”) the identifier for the picked fingerprint, and the third (“score”) the maximum similarity score between the pick and all previous picks.

By definition the first pick always has a score of 0.0, and the score column will never decrease.

The times in this output (the last line, which was send to stderr), report the time to initialize the internal MaxMin data structure, the total time for picking, the average time for picking, and the total MaxMin time (minus the Python startup time).

Note

You should filter out all unreasonable fingerprints before using MaxMin, because the most diverse compounds in datasets like ChEMBL are often least drug-like compounds.

The initial MaxMin pick¶

In this section you’ll learn how MaxMin does the initial pick, how to specify the initial, and learn about a handy rule-of-thumb. You will need a copy of chembl_34.fpb.

The --pick-time option adds a column to the MaxMin output with the total elapsed time for each pick:

% chemfp maxmin chembl_34.fpb -n 5 --pick-time --out tsv --times
pick_id       score   pick_time
CHEMBL1796997 0.0000000       4.15
CHEMBL3181975 0.0000000       4.15
CHEMBL1909275 0.0000000       4.16
CHEMBL4298415 0.0000000       4.16
CHEMBL4302507 0.0000000       4.16
T_init: 0.02 T_pick: 4.16 #picks: 5 picks/s: 1.20 T_total: 4.20

This says MaxMin took 4.2 seconds, of which 4.15 seconds were needed for the first pick.

Why did the first one take so long?

The MaxMin algorithm starts with an initial seed. Many algorithms pick this seed at random, which is indeed faster. To see what I mean, the following uses the --pick-index option to select the initial pick by index, where the special value -1 means to select the index at random:

% chemfp maxmin chembl_34.fpb -n 5 --pick-index -1 \
   --pick-time --out tsv --times
pick_id       score   pick_time
CHEMBL5219064 0.0000000       1.46
CHEMBL3988898 0.0000000       1.46
CHEMBL1200731 0.0000000       1.46
CHEMBL4298415 0.0000000       1.46
CHEMBL1098659 0.0000000       1.46
T_init: 0.06 T_pick: 1.46 #picks: 5 picks/s: 3.42 T_total: 1.56

That’s 2.7 seconds faster. So what is chemfp doing in those 2.7 seconds?

It’s very unlikely that a randomly picked seed will be a most diverse fingerprint (a fingerprint whose maximum similarity to every other fingerprint is the lowest). Instead, chemfp uses a novel algorithm, called “heapsweep”, to find a most diverse fingerprint, which takes time.

As an interesting observation, I’ve found that the most diverse fingerprint is often the one with the fewest bits set, so if the fingerprints are sorted by popcount (as they are by default in an FPB file) then using --pick-index 0 is a good first pick - better than a random pick, even if not guaranteed to be a best pick.

Take CHEMBL1796997 as an example, which took 4.15 seconds to identify as an initial pick. That’s the 4th fingerprint in chembl_34.fpb, while the first one is CHEMBL17564:

% fpcat chembl_34.fpb --no-metadata | \
      awk '/CHEMBL1796997$/ {print NR; exit}'
4
% fpcat chembl_34.fpb --no-metadata | awk '{print $2; exit}'
CHEMBL17564

It’s hard to tell if one initial pick is better than the other, because the MaxMin algorithm is an approximate method. At each step the next pick is arbitrarily selected from possibly many candidates with an equal MaxMin score to the existing picks (eg, at random, or based on the fingerprint index), because using a local selection method like this is faster than a method like the full heapsweep algorithm, which globally optimizing the next pick.

To get a sense for the difference, I’ll get the score for the 1,000th pick, using each of the ids, using the following:

% chemfp maxmin -n 1000 --pick-id CHEMBL17564 chembl_34.fpb | tail -1
1000  CHEMBL2074904   0.1458333

After 5 runs the sorted scores for CHEMBL17564 (the first fingerprint) are: 0.1454545, 0.1454545, 0.1456311, 0.1456311, and 0.1458333.

For CHEMBL1796997 they are 0.1456311, 0.1458333, 0.1458333, 0.1460674, and 0.1461538, which are slightly higher.

For a --pick-index -1 (a randomly selected initial seed), the sorted scores after 5 runs are 0.1454545, 0.1458333, 0.1458333, 0.1459854, and 0.1463415, which is about the same as using heapsweep for the initial pick.

On the other hand, if we want a sort of “null space”, that is, a set of fingerprints which are all 0.0 to each other, using something like:

% chemfp maxmin -t 0.0 --pick-id CHEMBL17564 chembl_34.fpb | tail -1
50    CHEMBL1233550   0.0000000

then after 10 runs we find a maximum of 56 for CHEMBL17564, a maximum of 58 for CHEMBL1796997, and a maximum of 52 for a randomly selected initial seed.

It can be quite hard to make sense of these results, and I’m not convinced I fully understand it myself.

MaxMin diversity selection including references¶

In this section you’ll learn how to use MaxMin to pick diverse fingerprints from a set of candidates which are also diverse from a set of references. You will need a copy of chembl_34.fpb, and a copy of chembl_33.fpb from the same download page. (See Use MaxMin with references for an example of MaxMin with references using the chemfp.maxmin() API function.)

The MaxMin algorithm can be applied to select diverse compounds from a vendor catalog (“the candidates”) which are also diverse from a local collection (“the references”).

For example, what fingerprint in ChEMBL 34 is the most diverse from any other fingerprint in ChEMBL 34 and is also diverse from any fingerprint in ChEMBL 33?

% chemfp maxmin -n 5 --references chembl_33.fpb chembl_34.fpb \
      --out tsv --pick-time --times
pick_id       score   pick_time
CHEMBL5308483 0.1666667       4.00
CHEMBL1235452 0.2000000       7.58
CHEMBL5282109 0.2222222       15.50
CHEMBL5267320 0.2307692       20.25
CHEMBL5271563 0.2380952       25.16
T_init: 0.03 T_pick: 25.16 #picks: 5 picks/s: 0.20 T_total: 25.20

To verify the first one, the following gets CHEMBL5308483 from ChEMBL 34 then does a k=1 similarity search with ChEMBL 33 to find that, yes, the maximum similarity score is only 0.1666667:

% fpcat chembl_34.fpb | awk '/CHEMBL5308483$/ {print; exit}' | \
        simsearch -k 1 chembl_33.fpb --no-metadata
1     CHEMBL5308483   CHEMBL3185131   0.1666667

Heapsweep diversity selection¶

In this section you’ll learn how to use the chemfp heapsweap command-line tool for global diversity picking. You will need a copy of chembl_34.fpb. (See Select diverse fingerprints with Heapsweep for an example of heapsweep using the chemfp.heapsweep() API function.)

The heapsweep algorithm is a method for diversity selection which successively picks a fingerprint with the smallest maximum similarity to all other fingerprints. Unlike the MaxMin algorithm, which is a local method which compares the candidates to the previously picked candidates, heapsweep which compares the candidates to all other fingerprints.

According to heapsweep, the 10 globally most diverse compounds in ChEMBL 34 are:

% chemfp heapsweep -n 10 chembl_34.fpb
#Diversity/1
#num_bits=2048
#type=heapsweep threshold=1.0 num-picks=10 all-equal=0 randomize=1 seed=281934854
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-25T11:10:47+00:00
i     pick_id score
1     CHEMBL1796997   0.0769231
2     CHEMBL4297424   0.0769231
3     CHEMBL4300465   0.0833333
4     CHEMBL1201290   0.0833333
5     CHEMBL1200673   0.0909091
6     CHEMBL4298559   0.0909091
7     CHEMBL1201326   0.1000000
8     CHEMBL1233877   0.1000000
9     CHEMBL4297653   0.1000000
10    CHEMBL2146126   0.1111111

The order might differ between runs because of ties, like how CHEMBL1796997 and CHEMBL4297424 have the same global score of 0.0769231. The heapsweep implementation starts with a few random sweeps, where each sweep selects a fingerprint at random and finds the score to every other fingerprint. The maximum of these scores are used to initialize the heapsweep search queue, which means the queue order depends on the randomly selected sweeps, which affects the output order.

The output format follows the usual chemfp convention of a header, where the first line identifies the format and the remaining header lines contain key/value pairs describing the search parameters and other metadata. The data lines are tab separated columns. The first data line contains the column headers, which in this case are “i” for the pick index, “pick_id” for the picked fingerprint id, and “score” for the global score.

We can verify that the nearest neighbor to CHEMBL1796997 has a similarity score of 0.0769231 using the following:

% fpcat chembl_34.fpb | awk '/CHEMBL1796997$/ {print; exit}' | \
      simsearch -k 2 chembl_34.fpb --no-metadata
2     CHEMBL1796997   CHEMBL1796997   1.0000000       CHEMBL1992013   0.0769231

It’s much more difficult to verify that it’s indeed the lowest score, without generating all other scores.

The heapsweep scores are global scores, and cannot be directly compared to the MaxMin scores, which are only computed relative to the previous picks (or 0.0 by definition for the first MaxMin pick).

The heapsweep algorithm is significantly slower than MaxMin. For example, it took nearly 13 minutes for my laptop to find the first 200 most diverse fingerprints in ChEMBL 34, and nearly 34 minutes for the first 1,000:

% chemfp heapsweep -n 1000 chembl_34.fpb --pick-time
  ...
i     pick_id score   pick_time
  ... only showing every 100th line ...
 CHEMBL4298435   0.2500000       502.10
 CHEMBL3188899   0.2702703       768.25
 CHEMBL1522189   0.2857143       992.69
 CHEMBL1163792   0.2957746       1185.38
 CHEMBL4747791   0.3037975       1340.15
 CHEMBL2136537   0.3125000       1523.69
 CHEMBL4463753   0.3188406       1666.25
 CHEMBL4062050   0.3246753       1802.26
 CHEMBL2164954   0.3333333       1960.64
CHEMBL2007107   0.3333333       2019.47

while the equiavlent for maxmin takes 13 seconds for the entire run - that’s more than 150x faster!:

% chemfp maxmin -n 1000 chembl_34.fpb --pick-time
    ...
 i    pick_id score   pick_time
    ... only showing every 100th line ...
CHEMBL1878565   0.0684932       6.10
 CHEMBL1096422   0.0957447       7.62
 CHEMBL1927792   0.1111111       8.78
 CHEMBL543312    0.1186441       9.51
 CHEMBL3337642   0.1250000       10.20
 CHEMBL80560     0.1309524       10.85
 CHEMBL1321270   0.1363636       11.54
 CHEMBL575927    0.1397849       12.08
 CHEMBL1214299   0.1428571       12.58
CHEMBL4749940   0.1458333       13.07

I don’t think you should use heapsweep as a general-purpose diversity selection algorithm.

In practice, most people want picks which are highly diverse from each other. My experiments show the intra-group diversity of the MaxMin picks (because it uses a heuristic optimized for intra-group diversity) is higher than the heapsweep picks (which includes global information).

Furthermore, heapsweep is a single-threaded implementation with a lot of random memory lookups, which are a lot slower than the sequential lookups of simsearch or simarry, both of which are parallelized. It takes simarray less than a day to compute the entire NxN ChEMBL matrix, so if you really want the ordered ranking for a large number of fingerprints then don’t hesitate to go ahead and compute everything.

On the other hand, heapsweep is very useful for picking a globally most diverse fingerprint as the initial seed to MaxMin, because that only takes a few seconds. That’s why chemfp maxmin uses it to pick the initial seed, if none is specified.

If there are global heapsweep ties then MaxMin will choose the initial seed arbitrarily from them. If you want to explore how MaxMin changes based on the seed, you can use heapsweep to report all ties by asking for one pick (-n 1), as well as all other picks with the same score (--all-equal):

% chemfp heapsweep -n 1 --all-equal chembl_34.fpb --out tsv
pick_id       score
CHEMBL1796997 0.0769231
CHEMBL4297424 0.0769231

Now that you know the ids, use chemfp maxmin --pick-id $ID .. to specify each one as the initial seed.

Structure file format translation¶

In this section you’ll learn how to use chemfp translate to convert from one structure file format to another.

Chemfp has a cross-toolkit API for reading a stream of toolkit-specific molecules from a structure file. These are used to generate fingerprint from a structure file. It also has a cross-toolkit API for writing molecule to a structure file. It was originally developed to add a fingerprint to a tag data field of an SD record, and later developed as a more general-purpose writer.

The chemfp translate command-line tool puts the reader and writer together to make a structure file converter. The following converts a SMILES file record to an InChI record, and then to an SDF record:

% echo "C methane" | chemfp translate --out inchi
InChI=1S/CH4/h1H4 methane
methane
     RDKit

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$

By default it uses the “rdkit” toolkit if available, otherwise “openbabel”, otherwise “openeye”, otherwise “cdk”. Use the -T option to specify one or more toolkit names, as a comma-separated list, to change the order. The following uses the CDK to convert the methane input into SDF V3000 output:

% echo "C methane" | chemfp translate -T cdk --out sdf3k
methane
  CDK     0625241606

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 END ATOM
M  V30 END CTAB
M  END
$$$$

The default is to read a SMILES file from stdin and write a SMILES file to stdout. Input and output filenames can be specified on the command-line, in which case the extensions are used to identify the format, with SMILES as the default. Use --in and --out to specify the formats directly.

For example, the following converts the ChEBI SDF to a SMILES. The SDF title line does not contain a record identifier so it uses the --id-tag to tell chemfp that the “ChEBI ID” data tag value contains the title:

% chemfp translate ChEBI_lite_3star.sdf.gz --id-tag "ChEBI ID" -o chebi_lite.smi
  ... lots of RDKit processing messages ...
% head -2 chebi_lite.smi
Oc1cc(O)c2c(c1)O[C@H](c1ccc(O)c(O)c1)[C@H](O)C2 CHEBI:90
CC1(C)C(=O)[C@@]2(C)CC[C@@H]1C2 CHEBI:165

Structure translation reader and writer args¶

In this section you’ll learn how to use the -R and -W options to set the reader and writer arguments to chemfp translate.

Many of the structure readers and writers support configuration options. For example, the SMILES readers have a “delimiter” option which affects how to parse the fields in a line. The default is “to-eol”, which use the first terms for the SMILES or CXSMILES, and the rest of the line as the title, which in the following is “water molecule”:

% echo "O water molecule" | chemfp translate
O water molecule

Use the -R option to set reader arguments, as “key=value” pairs. The “whitespace” delimiter argument says to split the remaining terms on the line by whitespace, and use the first term as the id, which in the following is only “water”:

% echo "O water molecule" | chemfp translate -R delimiter=whitespace
O water

Many of the options are toolkit and format specific. For example, the CDK SMILES reader has a “hydrogens” reader argument which, if “make-explicit”, will make all implicit hydrogens become explicit hydrogens:

% echo "O water" | chemfp translate -T cdk
O water
% echo "O water" | chemfp translate -T cdk -R hydrogens=make-explicit
O([H])[H] water

Writers may have their own arguments, which are specified with -W. As a simple one, the InChI writers by default include the record id after the InChI string:

% echo "C methane" | chemfp translate -T openbabel --out inchi
InChI=1S/CH4/h1H4 methane

Use the “include_id” writer argument to configure that option, with “1” for on (the default) and “0” for off:

% echo "C methane" | chemfp translate -T openbabel --out inchi -W include_id=0
InChI=1S/CH4/h1H4

As a fun one, Open Babel’s “ascii” output format writes a structure diagram using ASCII characters. The default width is 79 characters, but I can specify an alternative using a Open Babel “options” string (formatted for use by OBConversion::SetOptions), which sets the width to 20:

% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | \
           chemfp translate -T openbabel --out ascii -W 'options=h"20"'

             O

             ||             /
             ||            |
             \_            |
            _/ \_        N
          _/     \_    __  \
      HN /         \__/     \
                   ||        \
       |           ||       / /
       |           ||      / /
       |          _/_     / /
      _\_       _/   \___  /
   __//  \_    /         N
O  /       \ N

             |
             |
             |

Note that the double quoting (with single quotes containing double quotes) is used to prevent the shell from parsing the double quote that Open Babel needs when parsing the options string.

Structure translation with two toolkits¶

In this section you’ll learn how to have chemfp translate use one toolkit to read molecules from source one format then convert them via an intermediate format so they can be written by another toolkit.

This section is mostly for the fun of being able to do it. I would like to know if you actually find it useful.

Suppose you wanted to convert a Maestro record into CML. The RDKit supports Maestro files, but not CML, which is only supported by Open Babel and CDK.

One way, which isn’t so useful, is to use chemfp translate to convert the file (in this case a test file from the RDKit test suite) into SMILES:

% chemfp translate -T rdkit NCI_aids_few.mae | head -5
CCC1=[O+][Cu@]2([O+]=C(CC)CC(CC)=[O+]2)[O+]=C(CC)C1 48
C(=C\c1ccccc1)\C1=[O+][Cu@@]2([O+]=C(/C=C/c3ccccc3)CC(c3ccccc3)=[O+]2)[O+]=C(c2ccccc2)C1 78
CC(=O)N1c2ccccc2Sc2c1ccc1ccccc21 128
Nc1ccc(/C=C/c2ccc(N)cc2[S@](=O)(=O)O)c([S@@](=O)(=O)O)c1 163
O=[S@@](=O)(O)CC[S@@](=O)(=O)O 164

then pipe the SMILES output so CDK can translate the result to CML:

% chemfp translate -T rdkit NCI_aids_few.mae | \
       chemfp translate -T cdk -o out.cml
% head -5 out.cml
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="C" formalCharge="0" hydrogenCount="3"/>
    <atom id="a2" elementType="C" formalCharge="0" hydrogenCount="2"/>

I say isn’t so useful because the input contains coordinates which were lost through the conversion via SMILES. Better would be to use SDF, preferably as V3000, to preserve as much as possible. I’ll do this by changing the --out of the first program and --in of the second to “sdf3k”:

% chemfp translate -T rdkit NCI_aids_few.mae --out sdf3k | \
       chemfp translate -T cdk --in sdf3k -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="1" elementType="C" x3="1.397061" y3="-2.611162" z3="-1.31402"
formalCharge="0" hydrogenCount="2"/>
    <atom id="2" elementType="C" x3="0.617928" y3="-3.334828" z3="-0.196596"
formalCharge="0" hydrogenCount="3"/>

You likely also noticed the atom order has changed, as a consequence of SMILES canonicalization and/or the way RDKit generated the SMILES spanning tree.

Another issue is error reporting. The first toolkit could fail to parse an input record, or generate the intermediate record, and the second toolkit fail to parse that intermediate record or generate the final output record. If the first toolkit skip records it can’t process, then the record numbers for the first and second toolkit will differ, because the second toolkit knows nothing about the skipped records. This results in potentially confusing error messages.

Take for example this SMILES file, which I’ll convert to InChI. Note that “Q” is not a valid SMILES and “*” cannot be represented in InChI:

% cat test.smi
C first
Q second
* third
O fourth
% chemfp translate -T rdkit test.smi --out sdf3k | chemfp translate -T cdk --in sdf3k --out inchi
[21:08:23] SMILES Parse Error: syntax error while parsing: Q
[21:08:23] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Error: RDKit cannot parse the SMILES 'Q', file 'test.smi', line 2, record #2: first line is 'Q second'. Skipping.
Error: CDK cannot create the InChI string (input title='third'), file '<stdin>', record #2. Skipping.
InChI=1S/CH4/h1H4 first
InChI=1S/H2O/h1H2 fourth

The two “SMILES Parse Error” lines come from the RDKit, from the C++ layer, when parsing the second record.

The “Error: RDKit cannot parse” line come from chemfp, because by default it gives a report about any processing errors, including location information, then goes on to the next reocrd. In this case it says there was an error in record #2 (on line 2) because the SMILES “Q” could not be parsed.

The “Error: CDK cannot create” line also comes from chemfp, again because it reports any processing errors, in this case because it can’t convert the “*” into an InChI. But see how it says the error is again in record #2? That’s because the second input record was skipped, so the CDK stage is actually only processing input records 1, 3, and 4.

The last two lines are the InChIs for the methane and water molecules. Thes were sent to stdout while the other messages were sent to stderr.

Instead, the translate command has built-in support for this two-stage processing, like the following:

% chemfp translate -T rdkit -U cdk test.smi --out inchi
[21:11:50] SMILES Parse Error: syntax error while parsing: Q
[21:11:50] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Error: RDKit cannot parse the SMILES 'Q', file 'test.smi', line 2, record #2: first line is 'Q second'. Skipping.
Error: CDK cannot create the InChI string (input title='third'), file 'test.smi', line 3, record #3: first line is '* third'. Skipping.
InChI=1S/CH4/h1H4 first
InChI=1S/H2O/h1H2 fourth

The -U option specifies the second toolkit to use, so this uses the RDKit toolkit for input and the CDK toolkit for output. By default uses “sdf3k” as the intermediate file format, so the MAE to CML conversion is simply:

% chemfp translate -T rdkit -U cdk NCI_aids_few.mae -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="C" x3="1.3971" y3="-2.6112" z3="-1.314"
formalCharge="0" hydrogenCount="2"/>
    <atom id="a2" elementType="C" x3="0.6179" y3="-3.3348" z3="-0.1966"
formalCharge="0" hydrogenCount="3"/>

Use --via option to change the intermediate format; the following uses the canonical isomeric SMILES, so the coordinates are no longer present:

% chemfp translate -T rdkit -U cdk --via smi NCI_aids_few.mae -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="C" formalCharge="0" hydrogenCount="3"/>
    <atom id="a2" elementType="C" formalCharge="0" hydrogenCount="2"/>

Working with the chemfp command-line tools¶

Generate fingerprint files from PubChem SD tags¶

Generate fingerprint files from CSV files¶

k-nearest neighbor search¶

simsearch CSV output¶

Threshold search¶

simsearch CSV output when no hits¶

Combined k-nearest and threshold search¶

Saving simsearch results to “npz” format¶

NxN (self-similar) searches¶

Using a toolkit to process the ChEBI dataset¶

ChEBI record titles don’t contain the id¶

The --id-tag option¶

Generate fingerprints with Open Babel¶

Generate fingerprints with OpenEye¶

Generate fingerprints with RDKit¶

Generate fingerprints with CDK¶

Use structures as input to simsearch¶

Make new fingerprints matching the type in an existing file¶

Alternate error handlers¶

chemfp’s two cross-toolkit substructure fingerprints¶

substruct fingerprints¶

Generate binary FPB files from a structure file¶

Convert between FPS and FPB formats¶

Specify the fpcat output format¶

Alternate fingerprint file formats¶

The FPB format¶

Licensed FPB files¶

Get licensed FPB files containing ChEMBL 34 fingerprints¶

Similarity search with the FPB format¶

Performance breakdown¶

Multi-core similarity search¶

Converting large data sets to FPB format¶

Faster gzip decompression¶

Generate fingerprints in parallel and merge to FPB format¶

Generate a full Tanimoto similarity array¶

Generate the array using another metric or datatype¶

Generate a simarray using an “abcd” metric¶

Simarray and arrays larger than memory¶

Butina on the command-line¶

Pruning the number of Butina clusters¶

Alternate Butina output formats¶

Butina parameter tuning with npz files¶

Number of Butina clusters in ChEMBL¶

Sphere exclusion¶

Sphere exclusion output options¶

Sphere exclusion with initial picks, and saving candidates¶

Directed sphere exclusion¶

MaxMin diversity selection¶

The initial MaxMin pick¶

MaxMin diversity selection including references¶

Heapsweep diversity selection¶

Structure file format translation¶

Structure translation reader and writer args¶

Structure translation with two toolkits¶

The `--id-tag` option¶