Command-line examples for binary fingerprints
The sections in this chapter show examples of using the chemfp command-line tools to generate and work with binary fingerprint files. Examples of using the command-line tools for sparse count fingerprints are in its own chapter.
csv background
This section provides background about what “csv” means in a chemfp context. It sets the stage for the next sections.
The term “csv” comes from the initialism for “Comma-Separated Values, but is also used to describe several other related formats, like tab-separated values, so is something interpreted as “Character-Separated Values”. I will use to mean the files that can be parsed with Python’s csv module .
In chemfp, the “csv” or “excel” dialect is a comma-separated format following the quoting rules that Excel uses. (Quoting rules are used to, for example, allow a comma in a field without it being intepreted as a separated.) Both “csv” and “excel” are implemented using Python’s “excel” dialect.
In chemfp, the “tsv” or “excel-tab” dialect is a tab-separated format following the quoting rules that Excel uses. Both “tsv” and “excel-tab” are implemented using Python’s “excel-tab” dialect.
There is also a “unix” dialect, implemented using Python’s “unix_dialect” dialect, which is another comma-separated dialect.
Python’s csv module lets you specify your own csv dialect, with options for which delimiter character to use, how escaping and quoting work, and whether or not initial spaces should be ignored. These are:
delimiter: The character used to separate fields
escapechar: On reading, the escapechar removes any special meaning from the following character
quotechar: A one-character string used to quote fields containing special characters
quoting: Controls if the reader should recognize quoting
skipinitialspace: If True, ignore leading spaces in a field.
For details see the Python documentation,
csv2fps command intro
In this section you’ll learn about the chemfp csv2fps command to read structures or fingerprints from a csv file.
CSV files are very common, with comma-separated and tab-separated as the most common. They can be both easy and tricky to parse because there are so many variations of what “csv” means.
The csv2fps command supports extracting identifiers and structures from a CSV file to generate fingerprints and generate a fingerprint file. The following reads a comma-separated CSV file from stdin, uses the first column as the identifier and the second as the SMILES, then uses RDKit to generate the MACCS key fingerprint, with the output sent to stdout in FPS format:
% echo "ethane,CC" | chemfp csv2fps --type RDKit-MACCS166 --no-header
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-09-23T08:30:48+00:00
000000000000000000000000000000000000108000 ethane
The --no-header is needed because the default assumes the first
line is a header containing column titles, while this example input
does not have a header.
The columns can be referred to by index or column title. For example, if the input is the three lines
CAS,smiles,name
74-84-0,CC,ethane
74-98-6,CCC,propane
then the following uses the “name” column for the id and column 2 (column numbers start at 1) for the molecule:
% printf "CAS,smiles,name\n74-84-0,CC,ethane\n74-98-6,CCC,propane\n" | \
chemfp csv2fps --id-col name --mol-col 2 --type RDKit-MACCS166
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2024.09.5 chemfp/5.0b2
#date=2025-09-23T08:41:05+00:00
000000000000000000000000000000000000108000 ethane
000000000000000000020000000002000000108400 propane
It’s possible to have a column header “2” and a column title “2” which
is not the second column, so there’s a special notation to refer to
columns by “@name” or “#index”. (To specify column “@A” use “@@A” ,
for the 2nd column use “#2”, for column named “#2” use “@#2”.) The
following uses --no-metadata to exclude the metadata header from
the FPS output.
% printf "CAS,smiles,name\n74-84-0,CC,ethane\n74-98-6,CCC,propane\n" | \
chemfp csv2fps --id-col '@CAS' --mol-col '#2' --type RDKit-MACCS166 --no-metadata
000000000000000000000000000000000000108000 74-84-0
000000000000000000020000000002000000108400 74-98-6
Instead of structures, the CVS file may contain pre-computed
fingerprints. The following reads a tab-separated input (“tsv”
dialect) where the first column, “microfp”, contains a 14-bit
fingerprint in --binary format, where the bits are encoded with
the characters “0” and “1” and the first bit comes first. The output
contains the bits in chemfp’s mixed-endian format, with the
“#num_bits” metadata to indicate that the original fingerprint have
only 14 bits, not the 16 implied by the hex-encoding. (The additional
bits are zero-padded.)
% printf "microfp\tname\n01011001000100\tABC\n" | \
chemfp csv2fps --dialect tsv --fp-col 1 --binary --id-col 2 --no-date
#FPS1
#num_bits=14
9a08 ABC
The microfingerprint “01011001000100” get padded to 16 bits as “0101 1001 0001 0000” (with spaces at the 4-bit boundaries for clarity), converted to the chemfp bit order as “1001 1010 0000 1000”, and then to hex as “9a08”.
There are many CSV variants. The default --dialect is “csv”, which
is the same as “excel”. The other two options are “tsv” (also
available as “excel-tab”) and “unix”.
The initial dialect can be further modified with other options, like
--separator. Certain characters are difficult to write on the
command-line so the --separator, --quotechar and
--escapechar also accept the aliases “tab”, “backslash”, “space”,
“quote”, “doublequote”, “singlequote”, or “bang”.
(There is also a special “sniff” dialect which uses Python’s “Sniffer” to read the start of the file and uses a number of heuristics to determine the dialect parameters, and to determine if the first line is a header.
My experience has been a bit iffy, and its success strongly depends on the dataset you have. It’s best used with chemfp csv2fps --describe to get an initial idea of the actual format than to use it as the default dialect.)
MolPort csv file
In this section you’ll learn about the tab-delimited format which MolPort uses in their downloadable database. You will need to contact them to get access to the file.
Some datasets are in csv format with a structure (usually SMILES or InChI) stored in one of the columns, and an identifier stored in another. One such example is the MolPort distribution, which I’ll use for these examples.
It is a tab-delimited format with a header line containing titles, followed by the data. Here’s an example, which unfortunately overflows the available display space, so I’ve trimmed it the first 70 or so characters:
% gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | head -2
SMILES SMILES_CANONICAL MOLPORTID STANDARD_INCHI INCHIKEY...
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 Cl.CCC(C(=O)OC1CC2CCC(C...
That’s still enough to see there are a couple of SMILES fields, a MolPort id, and an InChI. You can also see it’s a tab-separated file, though I’ve converted them to multiple spaces for this output.
There are a lot of tools for working with CSV files, including xan, and qsv. Those are very fast and flexible, with ways to filter, join, manipulate, and visualize the data. For example, the following line for the Bash shell reads the first two records (three lines) from the gzip-compressed file fulldb_smiles-009-000-000–009-499-999.txt.gz and “flattens” them so they are easier to read:
$ xan flatten -d $'\t' -l 2 --cols 75 fulldb_smiles-009-000-000--009-499-999.txt.gz
Row n°0
───────────────────────────────────────────────────────────────────────────
SMILES Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1
SMILES_CANONICAL Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)c1ccccc1
MOLPORTID MolPort-009-675-630
STANDARD_INCHI InChI=1S/C18H25NO2.ClH/c1-3-17(13-7-5-4-6-8-13)18(20)21-16
-11-14-9-10-15(12-16)19(14)2;/h4-8,14-17H,3,9-12H2,1-2H3;1H
INCHIKEY VEUYDINLFCGWRM-UHFFFAOYSA-N
AVAILABILITY in stock
Row n°1
───────────────────────────────────────────────────────────────────────────
SMILES CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1
SMILES_CANONICAL CC(NC(=O)C(Cc1ccccc1)NC(C)=O)c1ccccc1
MOLPORTID MolPort-009-675-631
STANDARD_INCHI InChI=1S/C19H22N2O2/c1-14(17-11-7-4-8-12-17)20-19(23)18(21
-15(2)22)13-16-9-5-3-6-10-16/h3-12,14,18H,13H2,1-2H3,(H,20,23)(H,21,22)
INCHIKEY BAHWBLGZSGIKQC-UHFFFAOYSA-N
AVAILABILITY in stock
That used the Bash’s ANSI-C quoting to insert the tab character. There are other ways to insert the tab character, depending on your shell and/or terminal emulator.
The following “select”s only the SMILES and MOLPORTID columns (from a tab-delimited file). The output is comma-separated so I’ll use “fmt” to reformat to tab-separated, then show the first three output lines:
% xan select SMILES,MOLPORTID -d $'\t' fulldb_smiles-009-000-000--009-499-999.txt.gz | \
xan fmt -t $'\t' | head -3
SMILES MOLPORTID
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 MolPort-009-675-630
CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1 MolPort-009-675-631
While it’s not hard to use one of these tools or your own script or one-linear to convert a CSV file to a SMILES file, it’s likely simpler to let chemfp read it using the chemfp csv2fps command.
csv2fps --describe
In this section you’ll learn how to use chemfp csv2fps to describe how it will interpret a CSV file.
This section uses the MolPort CSV “fulldb_smiles-009-000-000–009-499-999.txt.gz”. Because that is a long name I will store it in the shell variable “FILENAME”.
When getting started with a new CSV file, use the special
--describe option to have chemfp describe what it thinks the file
contains. By default it parses the file as the “csv” dialect, that is
comma-delimited, with a header. It tries to show the titles and
columns for the first data line:
% chemfp csv2fps --describe $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: csv
delimiter: ',' (default)
quotechar: '"' (default)
escapechar: None (default)
doublequote: True (default)
skipinitialspace: False (default)
quoting: minimal
has_header: True
Titles:
#1: 'SMILES\tSMILES_CANONICAL\tMOLPORTID\tSTANDARD_INCHI\tINCHIK ... '
Columns for the first row:
#1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1\tCl.CCC(C(=O)OC1CC ... '
#2: '14-17H'
#3: '3'
#4: '9-12H2'
#5: '1-2H3;1H\tVEUYDINLFCGWRM-UHFFFAOYSA-N\tin stock'
Long strings are trimmed, and replace with the “…”, so they easily fit into an 80 column display.
The output fields are printed using Python’s string encoding. The many “\t” characters is strong evidence that the file is tab-separated.
I can use the “sniff” dialect to have Python’s csv module use heuristics to identify the dialect:
% chemfp csv2fps --describe --dialect sniff $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect:
delimiter: '\t'
quotechar: '"' (default)
escapechar: None (default)
doublequote: False
skipinitialspace: False (default)
quoting: minimal
has_header: True
Titles:
#1: 'SMILES'
#2: 'SMILES_CANONICAL'
#3: 'MOLPORTID'
#4: 'STANDARD_INCHI'
#5: 'INCHIKEY'
#6: 'AVAILABILITY'
Columns for the first row:
#1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1'
#2: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)c1ccccc1'
#3: 'MolPort-009-675-630'
#4: 'InChI=1S/C18H25NO2.ClH/c1-3-17(13-7-5-4-6-8-13)18(20)21-16- ... '
#5: 'VEUYDINLFCGWRM-UHFFFAOYSA-N'
#6: 'in stock'
or I can specify the “tsv” dialect, which gives the same titles and same columns for the first line.
Use --no-header if the CSV file does not start with a header. For
this file that option produces the following:
% chemfp csv2fps --describe --dialect tsv $FILENAME --no-header
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: tsv
delimiter: '\t'
quotechar: '"' (default)
escapechar: None (default)
doublequote: True (default)
skipinitialspace: False (default)
quoting: minimal
has_header: False
Titles: not available
Columns for the first row:
#1: 'SMILES'
#2: 'SMILES_CANONICAL'
#3: 'MOLPORTID'
#4: 'STANDARD_INCHI'
#5: 'INCHIKEY'
#6: 'AVAILABILITY'
csv character encoding
The “character set” describes how the text in the CSV file is encoded as a sequence of bytes. By default csv2fps assumes the CSV file is UTF-8 encoded.
This may be a problem with some programs which write data in another encoding, like “utf16” and “cp1252” for Windows or “latin1” for older Unix tools.
Use --encoding to specify the input encoding, which must be one of
the supported Python encoding.
Use --encoding-errors to describe how to handle input which could
not be decoded. This must be one of the supported Python encoding
handlers
“strict”, “ignore”, “replace”, or “backslashreplace”.
csv2fps column specifiers
In this section you’ll learn how to use chemfp csv2fps to describe how it will interpret a CSV file.
This section uses the MolPort CSV “fulldb_smiles-009-000-000–009-499-999.txt.gz”. Because that is a long name I will store it in the shell variable “FILENAME”.
The default for the csv2fps command is to parse the file in “csv” dialect (or “tsv” if the file ends in “.tsv”), treat the first line as a header, get the identifers from the first column and SMILES from the second column, and generate RDKit Morgan2 fingerprints. I know the file is tab-delimited (see the previous section), so I’ll specify “tsv” format and show the first two lines of fingerprint output:
% chemfp csv2fps $FILENAME -d tsv | head -8
#FPS1
#num_bits=2048
#type=RDKit-Morgan/2 fpSize=2048 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2025-09-23T10:16:56+00:00
0200100000000000000081000000800000000800001000080000000002000000000000
0048000000100004000000002020000000000000000000080000000000000000000000
4000000000000000000000040000000080000000000000000000100000008800000040
0000000000000000000040800000000000000000000008000000040202000001000000
4000000200000000008000000000000000000000001040000020000000002000100000
0000000000000000000000000000000000040100000020000000000002800000000000
0000000000000000400000000000000000000000000000000200000040300000000802
0000000000000000100000 Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1
0200200001020000008001000000200000000000000000020000001002002000000000
0800000000000000000000000020000000100000000000000080000020000000000000
0020000000000000000000040000000000000000000001000000000000408000000000
0000020000000000000000000000000000000000000002000000000204000001004000
0000000001000000008000000000000000000000000000000000000000012000100000
0000000000000000000000000000000000000000000020000000000108000000000000
0080000000000000400400000000008000000000000000010200010000200002000800
0000000000400000000000 CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1
This ended up interpreting the “‘SMILES” column as the id, and the “SMILES_CANONICAL” column as the SMILES, which is why csv2fps is able to generate output, though likely most people want the id from the MOLPORTID column.
Use --id-column (or --id-col for short) to specify the id
column to use. In the following I’ll specify it by title, and I’ll
disable the header lines (the “metadata”) using the --no-metadata,
so the output contains just the first two lines of fingerprint data:
% chemfp csv2fps -d tsv $FILENAME --id-col MOLPORTID --no-metadata | head -2 | fold
02001000000000000000810000008000000008000010000800000000020000000000000048000000
10000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000100000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002800000000000000000000000000040000000000000000000000000000000020000004030
00000008020000000000000000100000 MolPort-009-675-630
02002000010200000080010000002000000000000000000200000010020020000000000800000000
00000000000000002000000010000000000000008000002000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000001200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000800000000000000001020001000020
00020008000000000000400000000000 MolPort-009-675-631
To specify a column title the file must have a header.
You can also specify the column index by column number, starting
from 1. I’ll use --molecule-column (or --mol-col for short) to
read the SMILES from column #2 (“SMILES_CANONICAL”), and
--id-col to read the SMILES from column #3:
% chemfp csv2fps -d tsv $FILENAME --id-col 3 --mol-column 2 --no-metadata | head -2 | fold
02001000000000000000810000008000000008000010000800000000020000000000000048000000
10000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000100000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002800000000000000000000000000040000000000000000000000000000000020000004030
00000008020000000000000000100000 MolPort-009-675-630
02002000010200000080010000002000000000000000000200000010020020000000000800000000
00000000000000002000000010000000000000008000002000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000001200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000800000000000000001020001000020
00020008000000000000400000000000 MolPort-009-675-631
If the specified column name is an integer then it is treated as an index, with 1 indicating the first column (0 and negative values are out of range).
If the first character of the column specifier is an “@” then the rest of the specifier is used as the column title. For example, if the second column is titled “10” then use “@10” as the column specifier. If the column title is “@A” then use “@@A” as the column specifier.
If the first character of the column specifier is a “#” then the rest of the specifier must be an integer, which is interpreted as the column index. For example, use “#3” to match the third column. If the column title starts with a “#” then use the “@” prefix (eg, if the column is “#entry” then specify it as “@#entry”).
structure formats with embedded identifiers
The previous examples read the id from one column, and the molecule from another, as a SMILES or InChI string. Sometimes the column contains a more complex record, like a SMILES record with both a SMILES string and identifier, or even as an embedded SDF. (The CSV format supports multi-line column entries.)
Use the --id-from-molecule option (or --id-from-mol for short)
to have csv2fps get the identifier from the parsed molecule record.
For a SMILES or InChI record this defaults to the rest of the column
entry. This can be changed with the --delimiter option.
If the column entry is an SDF record then by default the identifier
comes from the title line. Use --id-tag to get the identifier from
the named data value.
The following two examples use the SMILES record “c1ccccc1O benzene” which contains the SMILES string and an identifier. That is the first of three columns, where the other two are the corresponding CAS and ChEMBL ids.
The first uses column 1 for both the molecule input (as specified) and as the identifier (the default), which is why the fingerprint output has the id “c1ccccc1 benzene” – it’s the first column.
% echo "c1ccccc1 benzene,71-43-2,ChEMBL277500" | \
chemfp csv2fps --type "RDKit-Morgan fpSize=128" --mol-col 1 \
--no-header --no-metadata
20000000000000000100030000000000 c1ccccc1 benzene
The second specifies --id-from-mol to get the identifier from the
molecule, so column 1 is used for molecule input, which uses
“c1ccccc1” as the structure and “benzene” as the structure’s id, which
then becomes the fingerprint id.
% echo "c1ccccc1 benzene,71-43-2,ChEMBL277500" | \
chemfp csv2fps --type "RDKit-Morgan fpSize=128" --mol-col 1 \
--id-from-mol --no-header --no-metadata
20000000000000000100030000000000 benzene
csv2fps input structure format
In this section you’ll learn how to have chemfp csv2fps read input structures in a format other than SMILES.
This section uses the MolPort CSV “fulldb_smiles-009-000-000–009-499-999.txt.gz”. Because that is a long name I will store it in the shell variable “FILENAME”.
By default csv2fps parses the molecule column as SMILES (in “smi”
format) . Use --format to specify another format. For example, the
MolPort file column STANDARD_INCHI contains an InChI string, and the
INCHIKEY column contains the corresponding InChIKey. I’ll read the
structures from the former column, and the id from the latter, and
have chemfp parse the molecule as “inchi”:
% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
--format inchi --no-metadata | head -2 | fold
02001000000000000000810000008000000008000010000800000000020000000000000048000000
10000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000100000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002800000000000000000000000000040000000000000000000000000000000020000004030
00000008020000000000000000100000 VEUYDINLFCGWRM-UHFFFAOYSA-N
02000000010000000080010000000000000000000000000000000000020000000010000800000000
00000000000000002000000010000000000000000000000840000000000000000004000000000000
00000000000000040000210020000010000000008000000000000002000000004000000000000000
00000000000000020000000002040080010000000000004000000000008000000000000000000000
00000000000000000000200010000000000010000000000100000000000000000100000000200000
00000000000000000000000000000008000440040800080000000001000000020001020000000200
00000008000000000000000000000000 BAHWBLGZSGIKQC-UHFFFAOYSA-N
If you leave out the format, or specify “smi” format, then it will try to interpret the InChI strings as SMILES, which will fail. As a safety mechanism, csv2fps will exit if none of the molecules in the first 100 records can be parsed. That looks like:
% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
--format smi --no-metadata | head -2 | fold
... many error lines removed ...
[11:56:09] SMILES Parse Error: syntax error while parsing: InChI=1S/C10H8C
l2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)
[11:56:09] SMILES Parse Error: Failed parsing SMILES 'InChI=1S/C10H8Cl2N4/
c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)' for in
put: 'InChI=1S/C10H8Cl2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5
H,(H3,13,14,15,16)'
Error: Each of the first 100 records contained structure errors.
Final error: RDKit cannot parse the SMILES 'InChI=1S/C10H8Cl2N4/c11-6-3-1-
2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)', column 'STANDAR
D_INCHI' (#4), record 100, line 101, 'fulldb_smiles-009-000-000--009-499-9
99.txt.gz'.
Exiting.
Use the -R command-line option to specify toolkit- and
format-specific reader arguments. As a short-cut to enable or disable
CXSMILES extension support, use --cxsmiles (enabled by default)
and --no-cxsmiles.
csv2fps processing errors
csv2fps can run into two main types of problems when processing a row of the file: 1) the row may not have enough colums, and 2) the toolkit may be unable to parse the molecule column or the fingerprint column.
The --errors option specifies how to handle molecule and
fingerprint parse errors. The default is “report” for molecules and
“strict” for fingerprints. The “strict” handler prints an error
message and stops processing. The “report” handler prints an error
message and keeps on processing. The “keep” handler simply keeps on
processing.
The --csv-errrors option specifies how to handle CSV parse
errors. The default is “strict”, which requires the id and molecule or
fingerprint column to exist, or print an error message and exit if
not. It can also be “report” and “keep”.
Specify csv2fps fingerprint type
In this section you’ll learn how to specify the cvs2fps fingerprint type.
The examples so far generated RDKit Morgan fingerprints. To use an
alternative name specify a chemfp type string using --type or use
the --with option to get the fingerprints from the metadata of a
fingerprint file.
The following uses Open Babel to generate FP2 fingerprints:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type OpenBabel-FP2 | \
head -8 | fold
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/3.1.0 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:21:04
00260200002002000600060000020000001000100000000000000080010400001000500080000001
100a001820010100010200000020c004000000020000a20808100800000004400000880002800005
0090015000003008c008080000080000400000200000000010000000400008000007010000011000
0000000140000000 MolPort-009-675-630
00060000004004200b0006000001000c0000009000000000000000c0000800001800408300000000
00862038000100800102000008000000000000020000000000000400000400000020000002880006
00b0004000003008c0080900800800000c1000220000000000000000000000000006010400011000
0032800000000000 MolPort-009-675-631
The following uses CDK to generate CDK’s own implementation of the 881-bit PubChem fingerprints:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type CDK-Pubchem | \
head -8 | fold
#FPS1
#num_bits=881
#type=CDK-Pubchem/2.9
#software=CDK/2.11 chemfp/5.0
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#source=
#date=2025-09-23T10:40:42+00:00
075e0c002000000000000000000000000080060000003c0200006000000000800000007800000000
00b03c8719604c10c10020001140044b100040000004000010118010001110044c01a90861040064
03801111e01913077101000000000000000000000000000000000000000000 MolPort-009-675-
630
The chemfp type string may contain fingerprint parameters. The following uses OEChem to generate OEGraphSim circular fingerprints, folded to 128 bit, instead of the default of 4096 bits:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 \
--type "OpenEye-Circular numbits=128" | head -8 | fold
#FPS1
#num_bits=128
#type=OpenEye-Circular/2 numbits=128 minradius=0 maxradius=5 atype=Arom|AtmNum|C
hiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:24:44
905a04205c4473c607f95c06123b2026 MolPort-009-675-630
a07873c1407c5c7617830d8428910170 MolPort-009-675-631
Finally, here’s an example using --with to get the fingerprint
file from a file:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --using reference.fps | \
head -7 | fold
#FPS1
#num_bits=4096
#type=OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 atype=Arom|AtmNum|
Chiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:29:18
00000000000000000000000000000000000000000000080000000010000000000000000800000000
00200000000040000000000000000000000000000000000000000000000008000004000001080000
00000200000000000000000200000000000000000000100000000000002000200000000000010800
00008000000000000000000000000000000000008000040000000000040000000000000000002000
00008000020000000000000000000000000000040000000000000000000000000000000000800000
00000000000080000000000000000000000000000000000000000000000000000000000000000000
00800000000002000000000000000020002000000100000000080000000000000000000000000000
00000000000000000000000000000000000020000000000000000000000000000000080000000000
00000000000080008000000000000008000000002400000000000000000000000000000000020000
00000000000000004000000100100000000048000801840000000000001000000000020000000000
00000000000000000000000000000000000000000000000000000000000000000000000000001000
20008000000000000000000000200000000000000000000000000000000000000400000000000000
0020000804000000000000000000000000000000000000000000000000000000 MolPort-
009-675-630
Extract fingerprints from a CSV file
In this section you’ll learn how to use csv2fps to extract fingerprints from a CSV file.
Instead of storing a molecule, the CSV file might contain fingerprint data, represented in one of various encodings like hex or base64.
If this is the case, use --fingerprint-column (or --fp-column
or --fp-col for short) to indicate that the CSV file contains
fingerprint data, and to indicate which column contains that data.
csv2fps supports the same decoder options as sdf2fps.
The default, --hex, treats the fingerprint as hex-encoded using
chemfp’s mixed-endian order:
% printf "id,fp\nXYZ,1A3B0E\n" | chemfp csv2fps --fp-col 2 --no-date
#FPS1
#num_bits=24
1a3b0e XYZ
% printf "id,fp\nXYZ,1A3B0E\n" | chemfp csv2fps --fp-col 2 --hex --no-date
#FPS1
#num_bits=24
1a3b0e XYZ
The “hex-msb” and “hex-lsb” use most-significant and least-significant bit orders, respectively, which are converted to the chemfp ordering.
% printf "id,fp\nXYZ,1A3B0E\n" | chemfp csv2fps --fp-col 2 --hex-msb --no-metadata
0e3b1a XYZ
% printf "id,fp\nXYZ,1A3B0E\n" | chemfp csv2fps --fp-col 2 --hex-lsb --no-metadata
58dc70 XYZ
Use chemfp csv2fps --help for the full list of fingerprint decoder options.
k-nearest neighbor search
In this section you’ll learn how to search a fingerprint file to find
the k-nearest neighbors. You will need the FPS fingerprint files
generated in Generate fingerprint files from PubChem SD tags but you do not need a
chemistry toolkit. (See Similarity search of ChEMBL by id for an
example of k-nearest search using the chemfp.simsearch() API
function.)
We’ll use the pubchem_queries.fps as the queries for a k=2 nearest neighor similarity search of the target file puchem_targets.gps:
simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps
That’s all! You should get output which starts:
#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2 99000039 48503376 0.878453 48503380 0.872928
2 99000230 48563034 0.858824 48731730 0.852273
2 99001517 48675145 0.569620 48662654 0.554217
2 99002251 48798046 0.810976 48625236 0.810651
2 99003537 48997075 0.903553 48997697 0.898477
2 99003538 48997075 0.903553 48997697 0.898477
Here’s how to interpret the output. The lines starting with ‘#’ are header lines. They contain metadata information describing the similarity search. You can see the search parameters, the name of the tool which did the search, and the filenames which went into the search.
After the ‘#’ header lines come the search results, with one query result per line. There are in the same order as the query fingerprints. Each result line contains tab-delimited columns. The first column is the number of hits. The second column is the query identifier used. The remaining columns contain the hit data, with alternating target id and its score.
For example, the first result line contains the 2 hits for the query 99000039. The first hit is the target id 48503376 with score 0.878453 and the second hit is 48503380 with score 0.872928. Since this is a k-nearest neighor search, the hits are sorted by score, starting with the highest score. Do be aware that ties are broken arbitrarily. There may be additional hits with the score 0.8728 which are not reported.
simsearch CSV output
In this section you’ll learn how to change the simsearch output format to CSV or TSV. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.
The default simsearch output format is unique to chemfp, and therefore
not so easy for other tools to parse directly. Use the --out option
to change the output format to “csv” or “tsv” formats:
% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out csv | head -8
query_id,target_id,score
99000039,48503376,0.878453
99000039,48503380,0.872928
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,48675145,0.569620
99001517,48662654,0.554217
99002251,48798046,0.810976
% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out tsv | head -8
query_id target_id score
99000039 48503376 0.878453
99000039 48503380 0.872928
99000230 48563034 0.858824
99000230 48731730 0.852273
99001517 48675145 0.569620
99001517 48662654 0.554217
99002251 48798046 0.810976
These alternatives have one line for each hit, and no metadata.
NOTE: There are many variants of CSV and TSV output, especially with how to handle spaces and embedded commas and tabs. Chemfp uses the “excel” and “excel-tab” formats in Python’s csv module.
Threshold search
In this section you’ll learn how to search a fingerprint file to find
all of the neighbors at or above a given threshold. You will need the
FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you
do not need a chemistry toolkit. (See
Similarity search of ChEMBL using a SMILES for an example of threshold
search using the chemfp.simsearch() API function.)
Let’s do a threshold search and find all hits which are at least 0.85 similar to the queries:
simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps
The first 15 lines of output from this are:
#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.85
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
4 99000039 48732162 0.859551 48503380 0.872928 48503376 0.878453 48520532 0.854054
2 99000230 48563034 0.858824 48731730 0.852273
0 99001517
0 99002251
4 99003537 48566113 0.872449 48998000 0.853535 48997697 0.898477 48997075 0.903553
4 99003538 48566113 0.872449 48998000 0.853535 48997697 0.898477 48997075 0.903553
0 99005028
Take a look at the first result line, which contains the 4 hits for the query id 99000039. As before, the hit information alternates between the target ids and the target scores, but unlike the k-nearest search, the hits are not in a particular order. You can see that here where the scores are 0.859551, 0.872928, 0.878453, and 0.854054.
You might be wondering why I chose the 0.85 threshold. Quite simply, it was for presentation. With a threshold of 0.8, the first record has 43 hits, which requires rather an overwhelming 87 columns to show.
simsearch CSV output when no hits
In this section you’ll learn how to change the simsearch csv output behavior when a query has no hits. You will need the FPS fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.
The CSV output format writes one output line for each query hit. What happens if a query has no hits? The previous section shows queries 99001517 and 99002251 have no hits with a threshold of 0.85, so let’s see what happens:
% simsearch --threshold 0.85 -q pubchem_queries.fps \
pubchem_targets.fps --out csv | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,*,NaN
99002251,*,NaN
99003537,48566113,0.872449
You can see the queries with no hits get a synthetic output, by default with the target id “*” and the score of “NaN”. This makes it possible to identify queries with no hits.
Use --empty-target-id and --empty-score to change these:
% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps \
--out csv --empty-target-id MISSING --empty-score N/A | head
query_id,target_id,score
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,MISSING,N/A
99002251,MISSING,N/A
99003537,48566113,0.872449
Alternatively, use --no-include-empty to not generate an output
line when there are not hits:
% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps \
--out csv --no-include-empty | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99003537,48566113,0.872449
99003537,48998000,0.853535
99003537,48997697,0.898477
Combined k-nearest and threshold search
In this section you’ll learn how to search a fingerprint file to find the k-nearest neighbors, where all of the hits must be at or above given threshold. You will need the fingerprint files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.
You can combine the -k and --threshold queries to find the
k-nearest neighbors which are all at or above a given threshold:
simsearch -k 2 --threshold 0.7 -q pubchem_queries.fps pubchem_targets.fps
This finds the nearest 2 structures, which all must be at least 0.7 similar to the query fingerprint. The output from the above starts:
#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.7
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2 99000039 48503376 0.878453 48503380 0.872928
2 99000230 48563034 0.858824 48731730 0.852273
0 99001517
2 99002251 48798046 0.810976 48625236 0.810651
2 99003537 48997075 0.903553 48997697 0.898477
2 99003538 48997075 0.903553 48997697 0.898477
2 99005028 48651160 0.828829 48848576 0.816667
2 99005031 48651160 0.828829 48848576 0.816667
2 99006292 48945841 0.965217 48737522 0.879310
2 99006293 48945841 0.965217 48737522 0.879310
0 99006597
2 99006753 48655580 0.931034 48662591 0.924855
Because this is a k-nearest search, the hits are sorted from highest score to lowest.
Saving simsearch results to “npz” format
In this section you’ll learn how to save the simsearch results to SciPy’s “npz” sparse binary format, instead of the default text-based output or the text-based CSV/TSV output.
As you saw earlier, the default simsearch output is in the text-oriented “simsearch” format, which looks like:
#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.85
#software=chemfp/4.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
4 99000039 48732162 0.859551 48503380 0.872928 48503376 0.878453 48520532 0.854054
2 99000230 48563034 0.858824 48731730 0.852273
0 99001517
0 99002251
This is a sparse matrix representation, because most values are below the threshold and therefore do not need to be included. The query id 99000039 only has 4 matches, even though pubchem_targets.fps has 14,753 fingerprints, because the remaining 14,749 fingerprints have a score of under 0.85.
While it’s easy to write a parser for this format, it’s even easier to use an existing parser, like the CSV or TSV output option.
However, that’s still text-oriented, which takes a while to parse, and it loses some precision (but not accuracy) because it tries to minimize the number of digits needed to distinguish between two different realizable scores.
Another option is to save the result in “npz” format, following the SciPy conventions for storing a sparse array. This is as simple as using a filename with the “.npz” extension:
% simsearch --threshold 0.85 -q pubchem_queries.fps \
pubchem_targets.fps -o pubchem_matches.npz
queries: 100%|███████████| 10969/10969 [00:00<00:00, 42721.52 fps/s]
This can be loaded into a SearchResults using
chemfp.search.load_npz():
>>> import chemfp.search
>>> matches = chemfp.search.load_npz("pubchem_matches.npz")
>>> matches
SearchResults(#queries=10969, #targets=14753)
>>> matches.query_ids[:3]
array(['99116624', '99116625', '99116667'], dtype='<U8')
>>> matches.target_ids[:3]
array(['48942244', '48941399', '48940284'], dtype='<U8')
>>> matches.shape
(10969, 14753)
>>> matches[1048]
SearchResult(#hits=18)
>>> matches[1048].query_id
'99144558'
>>> matches[1048].target_ids
array(['48942244', '48941399', '48940284', ..., '48838025', '48730854',
'48985180'], dtype='<U8')
which in turn can be converted into a SciPy “csr” (compressed sparse row matrix):
>>> matches.to_csr()
<10969x14753 sparse matrix of type '<class 'numpy.float64'>'
with 69823 stored elements in Compressed Sparse Row format>
Alternatively, you can use SciPy to load the csr array directly:
>>> import scipy.sparse
>>> csr = scipy.sparse.load_npz("pubchem_matches.npz")
>>> csr
<10969x14753 sparse matrix of type '<class 'numpy.float64'>'
with 69823 stored elements in Compressed Sparse Row format>
This SciPy option doesn’t give you access to the query or target ids, or the metadata about the query. Chemfp saves that information to the npz file, which when you get down to it is a zip file containing NumPy “npy” file with names that follow a common convention, which means it could (if you want to be really low-level) be accessed with the zipfile module:
>>> import zipfile
>>> z = zipfile.ZipFile("pubchem_matches.npz")
>>> for entry in z.infolist():
... print(f"{entry.filename!r}, uncompressed size={entry.file_size}")
...
'chemfp.npy', uncompressed size=664
'indices.npy', uncompressed size=279420
'indptr.npy', uncompressed size=44008
'format.npy', uncompressed size=140
'shape.npy', uncompressed size=144
'data.npy', uncompressed size=558712
'query_ids.npy', uncompressed size=351136
'target_ids.npy', uncompressed size=472224
or with NumPy’s “load” function, which uses a dictionary-like interface for working with npz files:
>>> import numpy as np
>>> npz = np.load("pubchem_matches.npz")
>>> npz
NpzFile 'pubchem_matches.npz' with keys: chemfp, indices, indptr, format, shape...
>>> npz["target_ids"]
array(['48942244', '48941399', '48940284', ..., '48838025', '48730854',
'48985180'], dtype='<U8')
>>> import pprint, json
>>> pprint.pprint(json.loads(str(npz["chemfp"])))
{'alpha': 1.0,
'beta': 1.0,
'format': 'multiple',
'matrix_type': 'NxM',
'method': 'Tversky',
'num_bits': 881,
'shape': [10969, 14753]}
NxN (self-similar) searches
In this section you’ll learn how to use the same fingerprints as both the queries and targets, that is, a self-similarity search. You will need the pubchem_queries.fps fingerprint file generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.
Use the --NxN option if you want to use the same set of
fingerprints as both the queries and targets. Using the
pubchem_queries.fps from the previous sections:
simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps
This code is very fast because there are so few fingerprints. For
larger files the --NxN will be about twice as fast and use half as
much memory compared to:
simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps
In addition, the --NxN option excludes matching a fingerprint to
itself (the diagonal term).
Using a toolkit to process the ChEBI dataset
In this section you’ll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, RDKit and CDK. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.)
We’ll work with data from ChEBI, which are “Chemical Entities of Biological Interest”. They distribute their structures in several formats, including as an SD file. For this section, download the “lite” version from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 226 the file contains 187,935 records and the compressed file is 65M.
ChEBI record titles don’t contain the id
Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 101,902 of the title lines are empty, 39,459 have the title “null”, 4,236 have the title “ “ (with a single space), 1,940 have the title “ChEBI”, 56 of them are labeled “Structure #1”, and the others are usually compound names like “fluoresone” and “bkas#30-CoA(4-)”, or identifiers like “Compound 92” or “145453870”.
For more than a decade I’ve asked ChEBI to fix this, and include the identifer in the title line, to no success. Perhaps you have more influence?
Instead, the record id is stored as value of the “ChEBI ID” tag, which looks like:
> <ChEBI ID>
CHEBI:90
By default the toolkit-based fingerprint generation tools use the title as the identifier, and print a warning and skip the record if the identifier is missing. Here’s an example with rdkit2fps:
% rdkit2fps ChEBI_lite.sdf.gz
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
Error: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
... keeps on going ...
Error: Empty title in SD record after cleanup, file 'ChEBI_lite.sdf.gz', line 2261, record #34: first line is ' '. Skipping.
... keeps on going ...
These “Missing title” messages come from records where no identifier
could be found and the “after cleanup” messages come after chemfp
removes leading and trailing whitespace. The default is to report the
problem to stderr, skip processing the record, and continue on to the
next record. Use the --errors option to change the default
behavior.
(If the first 100 records have no identifiers then the command-line
tools will exit even if --errors is ignore. This is a safety
mechanism. Let me know if it’s a problem.)
The --id-tag option
If the identifier isn’t in the title line but is in one of the SD data
items, then use --id-tag option to specify of the name of the data
tag containing the id. For this data set you’ll need to write it as:
--id-tag "ChEBI ID"
The quotes are important because of the space in the tag name.
Here’s what that looks like:
% rdkit2fps ChEBI_lite.sdf.gz --id-tag "ChEBI ID" | head -8 | fold
#FPS1
#num_bits=2048
#type=RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7
#software=RDKit/2023.09.5 chemfp/4.2
#source=ChEBI_lite.sdf.gz
#date=2024-06-19T09:33:51+00:00
10208220141258c184490038b4124609db0030024a0765883c62c9e1288a1dc224de62f445743b8b
30ad542718468104d521a214227b29ba3822fbf20e15491802a051532cd10d902c39b02b51648981
9c87eb41142811026d510a890a711cb02f2090ddacd990c5240cc282090640103d0a0a8b460184f5
11114e2a8060200804529804532313bb03912d5e2857a6028960189e370100052c63474748a1c000
8079f49c484ca04c0d0bcb2c64b72401042a1f82002b097e852830e5898302021a1203e412064814
a598741c014e9210bc30ab180f0162029d4c446aa01c34850071e4ff037a60e732fd85014344f82a
344aa98398654481b003a84f201f518f CHEBI:90
00000000080200412008000008000004000010100022008000400002000020100020006000800001
01000100080001000010000002002200000200000008000000400002100000000080000004401000
80200020800200002000001400022064000004244810000000000080000a80012002020004198002
00080200020020120040203001000802010100024211000004400000000100200003000001000100
0100021000a200601080002a00002020048004030000884084000008000002040200010800000000
2000010022000800002000020001400020800100025040000000200a080244000060008000000802
8100c801108000000041c00200800002 CHEBI:165
In addition to “ChEBI ID” there’s also a “ChEBI Name” tag which includes data values like “tropic acid” and “(+)-guaia-6,9-diene”. Every ChEBI record has a unique name so the names could also be used as the primary identifier instead of its id.
To use the ChEBI Name as the primary chemfp identifier, specify:
--id-tag "ChEBI Name"
The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other bytes, so it’s no problem using those names directly.
Generate fingerprints with Open Babel
If you have the Open Babel Python library installed then you can use
ob2fps to generate fingerprints. (See
chemfp.ob2fps() for the equivalent Python function.)
ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
This takes about 3.5 minutes on my 2020-era M1 MacBook Pro laptop to process all of the records, and generates messages like:
==============================
*** Open Babel Warning in Expand
Alias R was not chemically interpreted
==============================
*** Open Babel Warning in ReadMolecule
WARNING: Problem interpreting the valence field of an atom
The valence field specifies a valence 3 that is
less than the observed explicit valence 4.
==============================
*** Open Babel Warning in ReadMolecule
Failed to kekulize aromatic bonds in MOL file
==============================
*** Open Babel Warning in ReadMolecule
Invalid line: M RGP must only refer to pseudoatoms
M RGP 2 12 1 15 2
The default generates FP2 fingerprints, so the above is the same as:
ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
ob2fps can generate several other types of fingerprints. (Use ob2fps --help for a list.) For example, to generate the Open Babel implementation of the MACCS definition specify:
ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
By default ob2fps shows a progress bar which looks like:
ChEBI_lite.sdf.gz: 90365 recs [02:21, 507.25 recs/s]
Use --no-progress to not use a progress bar.
The underlying Open Babel toolkit does not have a way to get the current location in the file, so the progress bar is only able to show the number of records processed, and not the percentage complete.
ob2fps has an alternative implementation which uses chemfp’s text
toolkit to parse each record as a string, which is then passed to Open
Babel. The “chemfp” implementation is able to get the current file
position, letting ob2fps show a percentage complete progress bar. Use
the -R option to set the “implementation” reader argument to
“chemfp”, as in the following:
ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi2.fps \
-R implementation=chemfp
which shows:
ChEBI_lite.sdf.gz: 53%|█████▌ | 36.0M/67.8M [01:58<01:27, 362kbytes/s]
The chemfp implementation took few seconds longer than the native Open Babel implementation.
Generate fingerprints with OpenEye
If you have the OEChem Python library installed, with licenses for
OEChem and OEGraphSim, then you can use oe2fps to
generate fingerprints. (See chemfp.oe2fps() for the
equivalent Python function.)
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps
This takes about 1 minute on my laptop and generates a number of warnings like “Stereochemistry corrected on atom number 17 of”, “Unsupported Sgroup information ignored”, and “Invalid stereochemistry specified for atom number 9 of”. Normally the record title comes after the “… of”, but the title is blank for most of the records.
As an historical note, old ChEBI releases include empty structure records, like CHEBI:147324. By default OEChem’s SDF reader skips empty structure records. If you really need those records, add the SuppressEmptyMolSkip flag to the default ‘flavor’ reader argument, like this:
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps \
-R flavor=Default,SuppressEmptyMolSkip
The default settings generate OEGraphSim path fingerprint with the values:
numbits=4096 minbonds=0 maxbonds=5
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
Each of these can be changed through command-line options. Use oe2fps --help for details.
oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition specify:
oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-o chebi_maccs.fps
Use oe2fps --help for a list of available oe2fps fingerprints or to see more configuration details.
By default oe2fps shows a progress bar which looks like:
ChEBI_lite.sdf.gz: 24%|██ | 13.0M/53.6M [00:14<00:50, 808kbytes/s]
Use --no-progress to not use a progress bar.
Generate fingerprints with RDKit
If you have the RDKit Python library installed then you can use
rdkit2fps to generate fingerprints. (See
chemfp.rdkit2fps() for the equivalent Python function.)
Based on the previous examples you probably guessed that the command-line is:
rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps
This takes about 4.4 minutes on my laptop, and RDKit did not generate fingerprints for 131 of the 187,935 structures. RDKit logs warning and error messages to stderr. They look like:
[14:06:39] WARNING: not removing hydrogen atom without neighbors
[14:06:39] Explicit valence for atom # 7 O, 3, is greater than permitted
[14:06:40] Explicit valence for atom # 0 He greater than permitted
[14:06:08]
****
Post-condition Violation
Element 'hv' not found
Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2021_09_4/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[14:06:08] Element 'hv' not found
For example, RDKit is careful to check that structures make chemical sense. It rejects 3-valent oxygens and refuses to process that those structures, which is the reason for the first line of that output.
The default generates RDKit’s path fingerprints with parameters:
minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
Each of those can be changed through command-line options. See rdkit2fps rdkit2fps --help for details, where you’ll also see a list of the other available fingerprint types.
For example, to generate the RDKit implementation of the MACCS definition use:
rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-o chebi_maccs.fps
while the following generates the Morgan/circular fingerprint with radius 3:
rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz
By default rdkit2fps shows a progress bar which looks like:
ChEBI_lite.sdf.gz: 53%|████ | 35.8M/67.8M [02:18<02:04, 257kbytes/s]
Use --no-progress to not use a progress bar.
Generate fingerprints with CDK
If you have the CDK Java JAR file on your CLASSPATH and you have
installed the JPype1 package (see the installation guide for help) then you can use cdk2fps to
generate fingerprints. (See chemfp.cdk2fps() for the equivalent
Python function.)
Based on the previous examples you probably guessed that the command-line is:
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps
However, that would be incomplete. The CDK fingerprint type is “CDK-Daylight”, which generates Daylight-like fingerprints, and the documentation says that all hydrogens must be explicit.
This requires the -R option, to set the CDK reader argument
“hydrogens” to “make-explicit”, which converts all implicit hydrogens
to explicit, before generating the fingerprints:
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-R hydrogens=make-explicit -o cdk_chebi.fps
(Use cdk2fps --help to see which fingerprint types need explicit hydrogens and which need implicit hydrogens.)
This takes about 3.4 minutes on my laptop, of which 1.6 seconds is spent in making explicit hydrogens (Without the reader argument it only takes 1.8 minutes), and CDK did not generate fingerprints for 14 of the 187,935 structures. CDK generated various warning and error messages, including:
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR:
Error while reading next molecule: Atom is not a member of this AtomContainer
org.openscience.cdk.config.IsotopeFactory ERROR:
Could not find major isotope for: 88
Chemfp lets you choose an alternate error handler (see the next section), which can help you
figure out which structures could not be processed. I’ll enable the
report error handler:
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps \
--errors report
This generates 11 lines of the form:
Error: Cannot generate fingerprint: java.lang.NullPointerException,
file 'ChEBI_lite.sdf.gz', record #8683. Skipping.
That means the record caused the CDK fingerprinter function to fail, by raising a Java NullPointerException, which chemfp catches and reports. For reference, that record is ChEBI ID CHEBI:5015. I reported it (in June 2024), and it was quickly fixed.
It’s a bit tricky to figure out that record #8683 is CHEBI:5015 because the record’s initial line number isn’t shown, in turn because it isn’t available from the CDK API. cdk2fps has an alternative implementation which uses chemfp’s text toolkit to parse each record as a string, which is then passed to the CDK. The “chemfp” implementation is able to report the current line number. It is enabled with the cdk.sdf.implementation value “chemfp”, like this:
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi2.fps \
--errors report -R hydrogens=make-explicit -R implementation=chemfp
This adds line number information to the report:
Error: Cannot generate fingerprint: java.lang.NullPointerException,
file 'ChEBI_lite.sdf.gz', line 556565, record #8683. Skipping.
This variant implementation took 3.17 minutes (1.7 minutes without making explicit hydrogens), so in this case was faster than CDK’s own SDF parser. It used to 50% slower (without adding hydrogens), so I don’t know what’s happened. In any case, let me know if you find it useful.
The default cdk2fps fingerprint type is CDK-Daylight with parameters:
size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0
Each of those can be changed through command-line options. See cdk2fps --help for details, where you’ll also see a list of the other available fingerprint types.
For example, to generate the CDK implementation of the MACCS definition use:
cdk2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-R hydrogens=make-explicit -o chebi_maccs.fps --errors report
(CDK’s MACCS fingerprints also require explicit hydrogens.)
This generates 29 report lines of the form:
Error: Cannot generate fingerprint: java.lang.NullPointerException:
Aromaticity model requires implicit hydrogen count is set., file
'ChEBI_lite.sdf.gz', record #2935. Skipping.
By default cdk2fps shows a progress bar which looks like:
ChEBI_lite.sdf.gz: 25%|███ | 17.2M/67.8M [00:48<02:11, 384kbytes/s]
Use --no-progress to not use a progress bar.
Use structures as input to simsearch
In this section you’ll learn how to use structures as queries to
simsearch queries instead of fingerprints. You’ll need a chemistry
toolkit and a fingerprint file generated from that toolkit. This
section assumes you have one of the chebi_maccs.fps ChEBI
fingerprint files generated in the previous section. My example will
use the RDKit-generated file created using:
rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-o chebi_maccs.fps
I’ll search the ChEBI data set for phenol with the SMILES “c1ccccc1O”.
My chebi_maccs.fps starts:
% head -3 chebi_maccs.fps
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
The type information gives a hint of how to generate a fingerprint query for that data set, which you can do manually using rdkit2fps:
% echo "c1ccccc1O phenol" | rdkit2fps --maccs166
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#date=2024-06-19T12:04:30
00000000000000000000000000000140004480101e phenol
I could then use a pipe to pass the rdkit2fps output as input to simsearch:
% echo "c1ccccc1O phenol" | rdkit2fps --maccs166 | \
simsearch chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
That’s a bit clumsy, because I have to look at the fingerprint file and figure out which command-line tools and options to use.
A simpler way is to pass the structures directly to simsearch, either
as a command-line option or from an input file. In the following I’ll
pass in the SMILES using the `--query` parameter:
% simsearch chebi_maccs.fps --threshold 1.0 --query "c1ccccc1O phenol"
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
How does it work? Chemfp opens the targets targets file to read the metadata section. It then uses the type string to figure out how to generate the fingerprints for that type, as well as figure out which toolkit to use for structure processing.
If the query structure is not a SMILES String then use
--query-format to specify the format name. Use --query-id to
specify the query id instead of using the id from the input record.
For example, the following uses the InChI for proline as the input, sets the query id to “proline”, and finds the two nearest neighbors in ChEBI:
% simsearch chebi_maccs.fps --query-format inchi --query \
'InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H2,(H,7,8)/t4-/m0/s1' \
--query-id proline -k 2
#Simsearch/1
#num_bits=166
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
2 proline CHEBI:17203 1.00000 CHEBI:16313 1.00000
CHEBI:17203 is “L-proline” and CHEBI:16313 is “D-proline”.
The --query simsearch option takes the structure on the
command-line. Use --queries to read queries from a file, which may
be a fingerprint file or a structure file.
I’ll demonstrate with a SMILES file containing two records:
% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
which I’ll use to search chebi_maccs.fps:
% simsearch --queries simple.smi chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.2
#queries=simple.smi
#targets=chebi_maccs.fps
#query_source=simple.smi
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
3 caffeine CHEBI:27732 1.00000 CHEBI:178066 1.00000 CHEBI:190027 1.00000
See simsearch CSV output to learn how to have simsearch generate CSV or TSV output.
By default simsearch uses the filename to figure out the format type
and compression. Use --query-format to specify a different
format. For example, if neither --query nor --queries are
specified then the default reads FPS queries from stdin. I’ll use
--query-format sdf.gz to have it read gzip-compressed SD records
from stdin, in this case from a PubChem file:
% cat Compound_099000001_099500000.sdf.gz | \
chemfp simsearch --query-format sdf.gz chebi_maccs.fps -k 1 | head -10
#Simsearch/1
#num_bits=166
#type=Tanimoto k=1 threshold=0.0
#software=chemfp/4.2
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 99000039 CHEBI:220837 0.87755
1 99000230 CHEBI:120636 0.84146
1 99001517 CHEBI:134851 0.74419
1 99002251 CHEBI:218232 0.79630
Make new fingerprints matching the type in an existing file
In this section you’ll learn how to generate fingerprints that match
the fingerprint type of an existing file. You’ll need a chemistry
toolkit and a fingerprint file generated from that toolkit. This
section assumes you have one of the chebi_maccs.fps ChEBI
fingerprint files generated in an earlier section.
In the previous section you learned how to use structures as input to simsearch; simsearch uses the target fingerprint file metadata to generate the appropriate fingerprints for each structure. What if you want to generate fingerprints appropriate for the target data set but don’t immediately want to use them for a search?
The --using FILENAME option to cdk2fps, oe2fps, rdkit2fps, and
ob2fps opens the named fingerprint file to get the appropriate
type. I’ll demonstrate with a simple SMILES file, where the
fingerprint types comes from chebi_maccs.fps:
% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
% rdkit2fps simple.smi --using chebi_maccs.fps
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:10:51
00000000000000000000000000000140004480101e phenol
000000003000000001d414d91323915380f138ea1f caffeine
How do you know to use rdkit22fps instead of one of the other programs? You don’t, but all of the chemfp programs will try to process fingerprint types from other toolkits:
% ob2fps simple.smi --using chebi_maccs.fps
WARNING: ob2fps uses the openbabel toolkit but the --using file 'chebi_maccs.fps' specifies 'RDKit-MACCS166/2' which uses the rdkit toolkit.
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2021.09.4 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:11:12
00000000000000000000000000000140004480101e phenol
000000003000000001d414d91323915380f138ea1f caffeine
This cross-toolkit functionality is part of the long-term chemfp design.
If you know the chemfp fingerprint type string then you could pass
that in on the command-line via the --type option, as in:
% ob2fps --type "OpenBabel-ECFP2 nBits=128" simple.smi
#FPS1
#num_bits=128
#type=OpenBabel-ECFP2/1 nBits=128
#software=OpenBabel/3.1.0 chemfp/4.2
#source=simple.smi
#date=2024-06-19T12:12:55
00080020000a00000c00000000000000 phenol
000c4448000000880404000221002040 caffeine
Alternate error handlers
In this section you’ll learn how to change the error handler for
rdkit2fps using the --errors option.
By default the “<toolkit>2fps” programs “ignore” structures which could not be parsed into a molecule option. There are two other options. They can “report” more information about the failure case and keep on processing, or they can be “strict” and exit after reporting the error.
This is configured with the --errors option.
Here’s the rdkit2fps output using --errors report:
[12:21:03] WARNING: not removing hydrogen atom without neighbors
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Skipping.
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24338, record #381. Skipping.
The first two lines come from RDKit. The third line is from chemfp, reporting which record could not be parsed. (The record starts at line 24228 of the file.) The fourth line is another RDKit error message, and the last line is another chemfp error message.
Here’s the rdkit2fps output using --errors strict:
[12:24:24] WARNING: not removing hydrogen atom without neighbors
[12:24:24] Explicit valence for atom # 12 N, 4, is greater than permitted
Error: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Exiting.
Because this is strict mode, processing exits at the first failure.
The ob2fps, cdk2fps and oe2fps tools implement the --errors
option, but they aren’t as useful as rdkit2fps because the underlying
APIs don’t give useful feedback to chemfp about which records
failed. For example, the standard OEChem file reader automatically
skips records that it cannot parse. Chemfp can’t report anything when
it doesn’t know there was a failure.
The ob2fps and cdk2fps readers have an alternate SDF reader implementation, enabled when the “implementation” reader argument is “chemfp”, which uses chemfp’s own SDF record reader to read the records, then passes the result to the toolkit for further processing. The chemfp implementation supports somewhat better error reporting than the native toolkits. There is no equivalent for OEChem because chemfp’s record reader is so slow by comparison.
The default error handler in chemfp 1.1 was “strict”. In practice this proved more annoying than useful because most people want to skip the records which could not be processed. They would then contact me asking what was wrong, or doing some pre-processing to remove the failure cases.
One of the few times when it is useful is for records which contain no identifiers. When I changed the default from “strict” to “ignore” and tried to process ChEBI, I was confused at first about why the output file was so small. Then I realized that it’s because the many records without a title were skipped, and there was no feedback about skipping those records.
I changed the code so missing identifiers are always reported, even if the error setting is “ignore”. Missing identifiers will still stop processing if the error setting is “strict”.
chemfp’s two cross-toolkit substructure fingerprints
In this section you’ll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, RDKit, and CDK. (For OpenEye users, these fingerprints use the base OEChem library but do not use the separately licensed OEGraphSim library.)
chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other comes from the 881-bit PubChem/CACTVS substructure fingerprints.
The chemfp MACCS definition is called “rdmaccs” because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, while OpenEye has a completely independent implementation.)
Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits.
Open Babel:
% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenBabel/2
#software=OpenBabel/3.1.0 chemfp/4.2
#date=2024-06-19T12:18:16
00000000000000000000000000000140004480101e phenol
OpenEye:
% echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenEye/2
#software=OEChem/2.3.0 (20191016) chemfp/3.5
#date=2021-01-27T14:46:03
00000000000000000000000000000140004480101e phenol
RDKit:
% echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-RDKit/2
#software=RDKit/2021.09.4 chemfp/4.2
#date=2024-06-19T12:18:40
00000000000000000000000000000140004480101e phenol
CDK:
% echo "c1ccccc1O phenol" | cdk2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-CDK/2
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:19:56
00000000000000000000000000000140004480101e phenol
For more complex molecules it’s possible that different toolkits produce different fingerprint rdmaccs, even though the toolkits use the same SMARTS definitions. Each toolkit has a different understanding of chemistry. The most notable is the different definition of aromaticity, so the bit for “two or more aromatic rings” will be toolkit dependent.
There should be no reason to convert all hydrogens to explicit when using CDK’s rdmaccs implementation.
substruct fingerprints
chemp also includes a “substruct” substructure fingerprint. This is an
881 bit fingerprint derived from the PubChem/CACTVS substructure
keys. They do not match the CACTVS fingerprints exactly, in part due
to differences in ring perception. Some of the substruct bits will
always be 0. With that caution in mind, if you want to try them out,
use the --substruct option.
The term “substruct” is a horribly generic name. If you can think of a better one then let me know. Until chemfp 3.0 I said these fingerprints were “experimental”, in that I hadn’t fully validated them against PubChem/CACTVS and could not tell you the error rate. I still haven’t done that.
What’s changed is that I’ve found out over the years that people are using the substruct fingerprints, even without full validatation. That surprised me, but use is its own form of validation. I still would like to validate the fingerprints, but it’s slow, tedious work which I am not really interested in doing. Nor does it earn me any money. Plus, if the validation does lead to any changes, it’s easy to simply change the version number.
Here is an example using the CDK:
% echo "c1ccccc1O phenol" | cdk2fps --in smi --substruct | fold
#FPS1
#num_bits=881
#type=ChemFP-Substruct-CDK/1
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:22:43
010604000000000000000000000000000000000000000c0000000000000000800000005800001000
0010200109000c6001004000010004420000400000040400101100601011106444418848010e0024
03881019e00102000000000000000000000000000000000000000000000000 phenol
Change the “cdk2fps” as appropriate for the other toolkits.
The CDK “substruct” fingerprint should not require explicit hydrogens. If you are using CDK 2.9 or later then consider using the toolkit’s own PubChem fingerprinter, which should be a better fit to the CACTVS ring perception used in the PubChem fingerprints:
% echo "c1ccccc1O phenol" | cdk2fps --in smi --substruct | fold
#FPS1
#num_bits=881
#type=CDK-Pubchem/2.9
#software=CDK/2.9 chemfp/4.2
#date=2024-06-19T12:25:58
010604000000000000000000000000000000000000000c0000000000000000800000005800001000
0010200109000c6001004000010004420000400000040400101100601011106444418848010e0024
03881019e00102000000000000000000000000000000000000000000000000 phenol
For this simple phenol example, both give the same result.
(If you are using a CDK older than 2.9 then you will need to convert
all hydrogens to explicit before generating --Pubchem
fingerprints.)
Generate binary FPB files from a structure file
In this section you’ll learn how to generate an FPB file instead of an FPS file. You will need the ChEBI file from Using a toolkit to process the ChEBI dataset and a chemistry toolkit. The FPB format was introduced with chemfp-2.0.
Note
Several chemfp features, like creating FPB files, require a valid license key. If you are using chemfp under the Base License Agreement then contact sales@dalkescientific.com to purchase a license key or request an evaluation license.
The FPB format was designed so the fingerprints can be memory-mapped directly to chemfp’s internal data structures. This makes it very fast to load, but unlike the FPS format, it’s not so easy to write with your own code. You should think of the FPB format as an binary application format, for chemfp-based tools, while the FPS format is a text-based format for data exchange between diverse programs.
The easiest way to generate an FPB file from the command line is to use the “.fpb” extension instead of “.fps” or “.fps.gz”. Here are examples using each of the toolkits.
Open Babel:
ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb
OpenEye:
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb
RDKit:
rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb
CDK:
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fpb
The binary format isn’t human-readable. Use fpcat to see what’s inside:
% fpcat oe_chebi.fpb
#FPS1
#num_bits=4096
#type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.4.3 (20191016) chemfp/3.4
0000000 ... many zeros ...00000000000000 CHEBI:15378
0000000 ... many zeros ...00000000000000 CHEBI:16042
0000000 ... many zeros ...00000000000000 CHEBI:17792
....
182b528 ... many hex values ... a8c10c0c CHEBI:60493
By default the fingerprints are ordered from smallest popcount to largest, which you can see in the output. A pre-ordered index is faster to search because the target popcounts are pre-computed and because it often reduces the search space.
If you want to preserve the input order then you’ll need to pipe the
FPS output to fpcat and use its --preserve-order
flag. See the next section for an example.
Convert between FPS and FPB formats
In this section you’ll learn how to convert an FPS file into an FPB file and back, and you’ll learn how to control the fingerprint ordering. You will need the FPS files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit. The FPB format was introduced with chemfp-2.0.
If you already have an FPS file then you can convert it directly into an FPB file, and without using a chemistry toolkit. The fpcat program converts from one format to the other.
In an earlier section I generated the files pubchem_queries.fps and pubchem_targets.fps . I’ll convert each to FPB format:
% fpcat pubchem_targets.fps -o pubchem_targets.fpb
% fpcat pubchem_queries.fps -o pubchem_queries.fpb
The FPB format is a binary format which is difficult to read directly. The easiest way to see what’s inside is to use fpcat. If you don’t specify an output filename then it sends the results to stdout in FPS format:
% fpcat pubchem_queries.fpb | head -5 | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
00028000e00000000000000000000000000000000000000000000000000000000000009840000000
0000c001000300000000000000000000000000000000000000000200000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000 99116624
The keen-eyed reader might have noticed that the conversion does not have a “source” or “date” field. I haven’t figured out if this is a bug. Should I keep the original date and structure file source, or use the current date and FPS file source? Let me know if this is important to you.
By default when fpcat generates an FPB file it reorders the fingerprints by population count and creates a popcount index. This improves the similarity search performance, but it means that the order of the FPB file is likely different than the original FPS format. You can get a sense of this by looking at the first fingerprint in the original pubchem_queries.fps file:
% grep -v # pubchem_queries.fps | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000 99000039
and confirming that it isn’t the same as the first fingerprint in pubchem_queries.fpb.
If you want the FPB file to store the fingerprints in input order
instead of the popcount order needed for optimized similarity search,
then use the --preserve-order flag:
% fpcat pubchem_queries.fps --preserve-order -o input_order.fpb
% fpcat input_order.fpb | grep -v # | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000 99000039
On the flip side, fpcat by default preserves the input order when it
creates FPS output. If you instead want to the output FPS file to be
in popcount order then use the --reorder flag:
% fpcat --reorder pubchem_queries.fps | grep -v # | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000 99000039
Specify the fpcat output format
In this section you’ll learn how to specify the output format for fpcat using a command-line option instead of the filename extension. You will need the pubchem_queries.fpb file from Generate fingerprint files from PubChem SD tags.
If you do not specify an output filename then fpcat will output the fingerprints in FPS format to stdout. If you specify a filename then by default it will look at the extension to determine if the output should be an FPB (“.fpb”), FPS (“.fps”), or gzip or Zstandard compressed FPS (“.fps.gz” or “.fps.zst”) file. The FPS format is used for unrecognized extensions.
In a few rare cases you may want to use a format which doesn’t match the default. To be honest, the examples I can think of aren’t that realistic, but let’s suppose you want to output the contents of an FPB file to stdout in gzip’ed FPS format, and count the number of bytes in compressed output. I’ll use the use the –out flag to change the format to ‘fps.gz’ from the default of ‘fps’, then compare the resulting size with the uncompressed form:
% fpcat pubchem_queries.fpb --out fps | wc -c
2544890
% fpcat pubchem_queries.fpb --out fps.gz | wc -c
320828
It’s not that useful because you could pipe the uncompressed output to gzip, which is also likely faster:
% fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c
320819
In case you’re wondering, chemfp 3.4 added support for zstandard compression, if the “zstandard” Python module is available.
% fpcat pubchem_queries.fpb --out fps.zst | wc -c
298283
Chemfp cannot write an FPB file to stdout. In fact, the output file must be seek-able, which means it can’t be a named pipe either.
Alternate fingerprint file formats
In this section you’ll learn about chemfp’s support for other fingerprint file formats.
Chemfp started as a way to promote the FPS file format for fingerprint exchange. Chemfp 2.0 added the FPB format, which is a binary format designed around chemfp’s internal search data structure so it can be loaded quickly.
There are many other fingerprint formats. Perhaps the best known is the Open Babel FastSearch format. Two others are Dave Cosgrove’s flush format, and OpenEye’s “fpbin” format.
The chemfp_converters package contains utilities to convert between the chemfp formats and these other formats.
## Convert from/to Dave Cosgrove Flush format
% flush2fps drugs.flush
% fps2flush drugs.fps -o drugs.flush
## Convert from/to OpenEye's fpbin format
% fpbin2fps drugs.fpbin --moldb drugs.sdf
% fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin
## Convert from/to Open Babel's FastSearch format
% fs2fps drugs.fs --datafile drugs.sdf
% fps2fs drugs_openbabel_FP2.fps --datafile drugs.sdf -o drugs.fs
Of the three formats, the flush format is closest to the FPS data model. That is, it stores fingerprint records as an identifier and the fingerprint bytes. By comparison, the FastSearch and fpbin formats store the fingerprint bytes and an index into another file containing the structure and identifier. It’s impossible for chemfp to get the data it needs without reading both files.
Chemfp has special support for the flush format. If chemfp_converters is installed, chemfp will use it to read and write flush files nearly everywhere that it accepts FPS files. You can use it at the output to oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch, and as both input and output to fpcat. (You cannot use it as the simsearch targets because that code has been optimized for FPS and FPB search, and I haven’t spent the time to optimize flush file support.)
This means that if chemfp_converters is installed then you can use fpcat to convert between FPS, FPB, and and flush file formats. For examples:
% fpcat drugs.flush -o drugs.fps
% fpcat drugs.fps -o drugs.flush
In addition, you can use it at the API level in chemfp.open(),
chemfp.load_fingerprints(),
chemfp.open_fingerprint_writer(), and
FingerprintArena.save().
Note that the flush format does not support the FPS metadata fields, like the fingerprint type, and it only support fingerprints which are a multiple of 32 bits long. Also, compressed flush files are not supported.
The FPB format
In this section you’ll learn about the FPB format.
The FPS format is a human-readable text format. It’s meant to be easy to create software to read or write FPS files so it can be used as a way to exchange fingerprint data between diffferent programs. The downside is it’s relatively slow to process. Chemfp can search about 1M 2048-bit FPS fingerprints in one second, or load about 250K 2048-bit fingerprints/second into memory in the same time.
The FPB format is a binary format which is much faster to load, though internally more complex. It can search 1M fingerprints in about 10 millisecond, and the load time mostly depends on the file system performance.
This makes a big difference in web development, where the web app might restart every time a file is edited, and in command-line tools, where the load time might be far greater than the analysis time.
Note though that the Base License Agreement does not permit people to create FPB files, and the chemfp binary distributions contain a license manager which restricts access to that feature without an authorized license key. See the chemfp licensing page for licensing options and for how to request a license key and/or quote for license.
Internally the FPB file contains an 8-byte signature followed by a set of chunks. Each chunk contains an 8-byte length field followed a 4-byte identifier followed by n bytes of data, where n is the value of the length field. Those familiar with PNG or other FourCC format will find this familiar.
Different chucks contain different types of data. For example, the “META” chunk contains the metadata, and the “AREN” chuck contains the fingerprint data, organized in a way that makes them easy to load into a chemfp arena, hence the name.
Licensed FPB files
While the Base License Agreement does not permit people to create an FPB file, it does allow people to load an FPB file, including FPB files generated by third-party tools.
However, the Base License Agreement also does not generally permit in-memory searches of fingerprint data sets with more than 50,000 fingerprints. If you try you’ll get a message like the following:
A valid chemfp license key is required to search an arena with more than 50,000 fingerprints.
The license key check failed. Neither CHEMFP_LICENSE nor CHEMFP_LICENSE_FILE environment variables are defined.
Email sales@dalkescientific.com to purchase a license key or request a demo license key.
Cannot run simsearch. Exiting.
The exception is if the FPB file is a licensed FPB file.
A licensed FPB file has an embedded license key (stored in the “CFPL” chunk) which, if valid, unlocks all of the license manager restrictions for using that file.
Get licensed FPB files containing ChEMBL 34 fingerprints
In this section you’ll learn how to get the ChEMBL 34 fingerprints as a chemfp licensed FPB file. For background you should read the previous section on the FPB format.
Short version: use the following to download and uncompress the file:
curl -O https://chemfp.com/datasets/chembl_34.fpb.gz
gunzip chembl_34.fpb.gz
If curl isn’t installed, try:
wget https://chemfp.com/datasets/chembl_34.fpb.gz
The result is the licensed FPB file chemfp_34.fpb.
Note
The “licensed” in “licensed FPB file” only refers to the presence of the embedded chemfp license key. The data in the file is distributed under the terms of the ChEMBL license and includes the required attribution.
Long version: The ChEMBL 34 distribution includes pre-computed RDKit Morgan2 fingerprints in FPS format. These can be searched directly, like:
% simsearch --query c1ccccc1CN -k 2 chembl_34.fps.gz --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.2
#targets=chembl_34.fps.gz
#target_source=chembl_34.fps.gz
2 Query1 CHEMBL522 1.0000000 CHEMBL14186 0.9333333
open 0.01 read 0.00 search 2.61 output 0.00 total 2.79
or decompressed first, using “gunzip”, to skip the decompression overhead:
% gunzip chembl_34.fps.gz
% simsearch --query c1ccccc1CN -k 2 chembl_34.fps --times --no-metadata
2 Query1 CHEMBL522 1.0000000 CHEMBL14186 0.9333333
open 0.01 read 0.00 search 1.31 output 0.00 total 1.50
The chemfp project also distributes ChEMBL fingerprints which have
been reformatted to the FPB format.
Many of the examples in the documentation will use chembl_34.fpb.
Here’s how to get that file.
Step 1: Download chembl_34.fpb.gz.
Step 2: Decompress it with:
gunzip chembl_34.fpb.gz
Step 3: (optional) View the license terms:
chemfp fpb_text chembl_34.fpb
Similarity search with the FPB format
In this section you’ll learn how to do a similarity search using an FPB file as the target. You will need ChEMBL 34 as a chemfp-licensed FPB file and you will need the RDKit chemistry toolkit.
The file chembl_34.fpb contains the ChEMBL-generated RDKit Morgan
circular fingerprints for ChEMBL 34, reformatted in FPB format, and
containing a license key which unlocks restrictions on using chemfp to
work with that file.
All of the chemfp tools support FPB files as input or output formats, including simsearch. Here’s an example using first FPS format then FPB format (run three times each, reporting only the fastest time:
% time simsearch --query c1ccccc1CN -k 2 chembl_34.fps > /dev/null
2.073u 0.427s 0:01.57 158.5% 0+0k 0+0io 47pf+0w
% time simsearch --query c1ccccc1CN -k 2 chembl_34.fpb > /dev/null
0.560u 0.609s 0:00.34 341.1% 0+0k 0+0io 77pf+0w
These times are reported as “user time” (the CPU time spent by the program), “system time” (the time spent by the operating system, in this case, to read and transfer data from disk), and “wall clock time”, which is the overall elapsed time. In this case, it took 2.1 seconds to search the FPS file and 0.56 seconds to search the FPB file.
That’s a factor of 4, which is pretty good, but perhaps not that impressive.
Performance breakdown
That’s because the total time includes the time needed to load RDKit, parse the query SMILES, and generate the query fingerprint. To start, it takes 0.04 seconds for Python to start working:
% time python -c 'pass'
0.033u 0.009s 0:00.04 75.0% 0+0k 0+0io 0pf+0w
It takes Python about 0.25 seconds to load RDKit:
% time python -c 'from rdkit.Chem import rdMolDescriptors'
0.673u 0.651s 0:00.25 528.0% 0+0k 0+0io 24pf+0w
Finally, chemfp’s wrapper to the RDKit toolkit adds another 0.04 seconds:
% time python -c 'from chemfp import rdkit_toolkit, rdkit_types'
0.838u 0.621s 0:00.29 500.0% 0+0k 0+0io 18pf+0w
Note
The times depend very much on your operating system, the location and type of the file system, the file cache, and more.
To demonstrate this overhead, I’ll pre-compute the fingerprints so chemfp doesn’t need to import RDKit:
% cat benzylamine.smi
c1ccccc1CN benzylamine
% rdkit2fps benzylamine.smi -o benzylamine.fps \
--using chembl_34.fpb --no-progress
% tail -1 benzylamine.fps | fold -w 70
0000000000000000000001000000000000000000000000000000000000000000000000
0000000000000000000000000020000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000200000000000008000000004000001000000
0000000000000800008000000000000000000000000000000000000000000000100000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000500400000000000000000000000000000200002000000000000000
0000000000000000000000 benzylamine
then use the fingerprint file as input:
% time simsearch --queries benzylamine.fps -k 2 chembl_34.fpb > /dev/null
0.091u 0.025s 0:00.16 68.7% 0+0k 0+0io 23pf+0w
The 0.25 seconds of startup time, plus the 0.16 seconds of search time, minus the doubly-counted Python startup time of 0.04 seconds, is 0.33 seconds, which is quite close to the time of 0.34 seconds I measured earlier.
To give a direct comparison, I’ll use the same query file to search the FPS file:
% time simsearch -q benzylamine.fps -k 2 chembl_34.fps > /dev/null
1.285u 0.154s 0:01.60 89.3% 0+0k 0+0io 8pf+0w
FPB search is now about 10 times faster than FPS search. And of that 0.16 seconds, 0.025 seconds or 15% is in file I/O.
Multi-core similarity search
In this section you’ll learn how chemfp parallelizes multiple queries and the various configuration options. You will need ChEMBL 34 as a chemfp-licensed FPB file, the PubChem file Compound_099000001_099500000.sdf.gz, and the RDKit toolkit.
Suppose you want the ChEMBL 34 record which is most common to each record in Compound_099000001_099500000.sdf.gz. This can be done with simsearch as a k=1 nearest-neighbor search, with the PubChem fingerprints as queries, and the ChEMBL file as targets.
This means converting the PubChem structure file into a fingerprint file using the same fingerprint type that ChEMBL uses:
rdkit2fps Compound_099000001_099500000.sdf.gz \
--using chembl_34.fpb -o Compound_099000001_099500000.fps
Then do the search, saving the results to “results.txt”:
time simsearch -k 1 -q Compound_099000001_099500000.fps \
chembl_34.fpb -o results.txt
Note
This will take a couple of minutes to run. While it’s doing that, perhaps take a look at the CPU load.
Simsearch support progress bars, which in this case shows the progress through the input file:
position: 47%|███████▊ | 2.71M/5.73M [00:50<00:55, 54.0k bytes/s]
Once simsearch finished, “time” reported:
704.387u 6.044s 1:49.96 646.0% 0+0k 0+0io 65630pf+0w
The “646.0%” refers to the total CPU time divided by the wall-clock time. It can be greater than 100% when using multiple cores in parallel, which is what happened here.
Converting large data sets to FPB format
In this section you’ll learn how to generate an FPB file on computers with relatively limited memory. To be realistic, this example uses the complete PubChem data set, and extracts the CACTVS/PubChem fingerprints which are in each record. You do not need a chemistry toolkit for this section.
The most direct way to extract the PubChem fingerprints from a PubChem distribution is to use sdf2fps:
sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb
This uses the default FPB writer options, which stores all of the fingerprints in memory, sorts them, and saves the result to the output file. This may use several times as much memory as the final FPB output size, which is a bit unfortunate if you want to generate a 7 GB FPB file on a 12 GB machine.
When I updated this section in June 2020, it took around 25GB of memory to create an FPB file with 102,768,482 PubChem fingerprints, and the final file was about 14GB. (I did not update this section in 2024 for the chemfp 4.2 release.)
(Note: see Generate fingerprints in parallel and merge to FPB format for a two-stage solution that lets you parallelize fingerprint generation.)
The “*2fps” command-line tools do not have a way to change the
default writer options, although fpcat does. The
--max-spool-size option sets a rough upper bound to the amount of
memory to use. When enabled, the writer breaks the input into parts
and creates a temporary FPB file for each part. At the end, it merges
the sorted data from the temporary FPB files to get the final FPB
file. Be aware that the specified spool size is only approximate and
is not a hard limit on the maximum amount of memory to use. You may
need to experiment a bit if you have tight constraints, and this
option might not be as useful as I thought it was.
The value must be a size in bytes, though suffixes like M or MB for megabyte and T or TB for terabyte are also allowed. These are in base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed between the number and the suffix, so “200MB” is okay but “200 MB” is not. The size must be at least 20 MB.
Here is an example of how to convert the CACTVS fingerprints from all of PubChem to an FPB file, using a relatively small limit of 200 MB:
sdf2fps --pubchem pubchem/Compound_*.sdf.gz | \
fpcat --max-spool-size 200MB -o pubchem.fpb
This will take a while! The sdf2fps alone takes almost 45 minutes on a ca. 2017-era Haswell machine.
If I save the intermediate results to an FPS file then the in-memory fpcat conversion from FPS to FPB takes 5½ minutes and requires 25GB of memory.
With spool of 200MB, the conversion takes nearly 10 minutes. According
to htop, the spooled conversion required, near the peak, 13.3G of
virtual memory, a resident set size of 12G, and 10.6G of shared shared
pages. The shared pages are from memory-mapping the intermediate FPB
files, so this probably required only 2GB of real memory.
If I use a 1GB spool size, the conversion time decreases from 10 to 8 minutes, and uses about the same amount of peak memory.
The temporary files will be placed under the appropriate temporary
directory for your operating system. If that disk isn’t large enough
for the intermediate files then use the --tmpdir option of fpcat
to specify an alternate directory:
fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb \
--tmpdir /usr/tmp
Another option is to specify the directory location using the TMPDIR, TEMP, or TMP environment variables, which are resolved in that order. The details are described in the Python documentation for tempfile.tempdir.
There is a limitation to the current implementation. When it merges the identifiers from the temporary files it stores all of the identifiers in memory. Assuming the identifiers are short, it takes a bit over 32 GB of RAM to generate an FPB file with 1 billion fingerprints.
Chemfp should support more than 1 billion fingerprints, but I have not tested it. Internally it uses 32-bit integers so does not support more than 4.2 billion fingerprints, and I wouldn’t be surprised if there is a problem between signed and unsigned values at 2.1 billion fingerprints.
Faster gzip decompression
In this section you’ll learn how to use an external program to improve the performance of reading gzip files. You will need PubChem file Compound_016000001_016500000.sdf.gz.
The PubChem file Compound_016000001_016500000.sdf.gz is one of the largest files from PubChem, measured by compressed size. The copy I have is 544M compressed and 3.9G uncompressed.
It takes about 10 seconds to extract the PubChem fingerprints from the compressed file:
% sdf2fps --pubchem Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 100%|█████████| 570M/570M [00:09<00:00, 57.2Mbytes/s]
In comparison, it takes about 5 seconds to extract the fingerprints from the uncompressed file:
% sdf2fps --pubchem Compound_016000001_016500000.sdf -o pubchem.fps
Compound_016000001_016500000.sdf: 100%|███████████████| 4.19G/4.19G [00:05<00:00, 770Mbytes/s]
For this case, gzip decompression adds 100% overhead to the processing step!
There are two sources for the overhead. First, a fair amount of chemfp’s internal gzip processing uses Python. Second, it’s single-threaded, that is, a block of text is decompressed, then passes to the FPS parser.
Chemfp has the option to run an external program to handle decompression. Use the environment “CHEMFP_GZCAT” to specify the program to use. If called with no arguments, it must read from stdin. If called with one argument, it must read from the named file. The decompressed data must be written to stdout, so it can be read by chemfp.
In most cases you’ll use “zcat” (for Linux-based OSes) or “gzcat” (for macOS).
Here’s an example:
% env CHEMFP_GZCAT=gzcat sdf2fps --pubchem \
Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 443113 recs [00:08, 52785.22 recs/s]
(The progress output can’t give a progress bar because chemfp doesn’t know the current read position relative to the entire file.)
This took about 8 seconds, which is 2 seconds faster than usual, or only about 60% overhead.
An improvement of two seconds may not sound like much, but the 10% performance gain is nice when processing many gzip-compressed files.
bzip2, xv, and other compression formats
In this section you’ll learn how to read from a compressed file which is not gzip or Zstandard compressed.
While gzip and Zstandard are the most common compression formats I come across, there are others. The previous section showed how to use the “zcat” or “gzcat” command-line tool as an external decompressor for gzip-compressed files.
Chemfp actually does have native support for bzip2, based on Python’s built-in bz2 module.
% file Compound_016000001_016500000.smi.bz2
Compound_016000001_016500000.smi.bz2: bzip2 compressed data, block size = 900k
% rdkit2fps --no-metadata Compound_016000001_016500000.smi.bz2 | head -1 | fold
00004008102040002000000020001000000820000100008100000000000a00100000000000400004
20000000000000040004000000000000000000004005000100000000000010400000000000000000
000400000080008080600200000000000000000080080001000000000048000820000000c0000000
00000000100000000080000002100000010000000000000601000000008000000000000000000000
28000000000000000020000010000000000000000100000000000000001200000000020420000000
0000000001000000000000000000004c000440000000000000000081020000100001020000000020
01000000109000040100002000000000 16000001
However, it’s not well tested. (Let me know if you are using it.) The better tested code path is to tell chemfp the format is in “smi.gz” format, and set the CHEMFP_GZCAT environment variable to “bzcat”, so the bzcat command is used to decompress the file, like this:
% env CHEMFP_GZCAT=bzcat rdkit2fps --in smi.gz --no-metadata \\
Compound_016000001_016500000.smi.bz2 | head -1 | fold
00004008102040002000000020001000000820000100008100000000000a00100000000000400004
20000000000000040004000000000000000000004005000100000000000010400000000000000000
000400000080008080600200000000000000000080080001000000000048000820000000c0000000
00000000100000000080000002100000010000000000000601000000008000000000000000000000
28000000000000000020000010000000000000000100000000000000001200000000020420000000
0000000001000000000000000000004c000440000000000000000081020000100001020000000020
01000000109000040100002000000000 16000001
The same mechanism can be used to decompress other formats. The following example uses the “xzcat” command as an external program to read an xz-compressed file:
% file Compound_016000001_016500000.smi.xz
Compound_016000001_016500000.smi.xz: XZ compressed data, checksum CRC64
% env CHEMFP_GZCAT=xzcat rdkit2fps --in smi.gz --no-metadata \\
Compound_016000001_016500000.smi.xz | head -1 | fold
00004008102040002000000020001000000820000100008100000000000a00100000000000400004
20000000000000040004000000000000000000004005000100000000000010400000000000000000
000400000080008080600200000000000000000080080001000000000048000820000000c0000000
00000000100000000080000002100000010000000000000601000000008000000000000000000000
28000000000000000020000010000000000000000100000000000000001200000000020420000000
0000000001000000000000000000004c000440000000000000000081020000100001020000000020
01000000109000040100002000000000 16000001
Generate fingerprints in parallel and merge to FPB format
In this section you’ll learn how to merge multiple sorted fingerprints into a single FPB file.
The previous section used a single shell command to extract the PubChem/CACTVS fingerprints from PubChem and generate an FPB file. This is easy to write and understand, but more complex versions may be more appropriate.
For one, I have four cores on my desktop computer, and I want to use them to process the PubChem files in parallel. The previous section was only single threaded.
I have all my PubChem files in ~/pubchem/. For each
“Compound_*.sdf.gz” file in that directory I want to extract the
CACTVS/PubChem fingerprints and create an intermediate FPS file in the
local directory. That’s equivalent to running the following commands:
sdf2fps --pubchem ~/pubchem/Compound_000000001_000500000.sdf.gz \\
-o Compound_000000001_000500000.fps
sdf2fps --pubchem ~/pubchem/Compound_000500001_001000000.sdf.gz \\
-o Compound_000500001_001000000.fps
... 291 more lines ...
except that I want to run four at a time.
This is what GNU Parallel was designed for. It’s a command-line tool which can parallelize the execution of other command-lines.
I’ll start by explaining the core command-line substitution pattern:
sdf2fps --pubchem {} -o {/..}.fps'
The {} will be replaced with a filename, and {/..} will be
replaced with the base filename, without the directory path prefix or
the two suffixes. That is, when {} is
“/Users/dalke/pubchem/Compound_000000001_000500000.sdf.gz” then
{/..} will be “Compound_000000001_000500000.fps”.
Since I want to generate an FPS file, I added the “.fps” as a suffix to the second substitution parameter.
I then tell GNU parallel which command-line to use, along with a few other parameters. Here’s the full line, which I split over two lines to make it more readable:
parallel --plus --no-notice --bar 'sdf2fps --pubchem {}
-o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz
The --plus tells GNU parallel to recognize an expanded set of
replacement strings. (“{/..}” is not part of the standard set of
patterns.)
The --no-notice tells it to not display the message about citing
GNU parallel in scientific papers.
The --bar enables a progress bar, which looks like this:
30% 88:205=11m17s /Users/dalke/pubchem/Compound_045500001_046000000.sdf.gz
This status line shows that processing is 30% complete, which is file 88 out of 205, and there’s an estimated 11 minutes and 17 seconds remaining.
Finally, the “:::” indicates that the remaining options are the list of parameters to pass to the command-line template for parallelization.
After about 21 minutes, using 4 CPUs on my laptop (with an effective scaling of 2.8), I now have a large number of FPS files, which I want to merge into a single FPB file. I’ll use fpcat:
fpcat --max-spool-size 1GB Compound*.fps -o pubchem.fpb
Unfortunately my laptop ran out of disk space, so I’ll just leave it a that; re-doing the same command on a server machine won’t provide you any new information.
GDB-13 fingerprints in one FPB
In this section you’ll learn about the performance problems that can occur with very large (~1 billion) fingerprint dataset sets.
Chemfp can evaluate over 200 million fingerprints per second, so searching 1 billion fingerprints should only take about 5 seconds, right?
Alas, there are some practical problems when you try to do this.
To start with, the 200 million/second performance is only if the fingerprints are in RAM. Storing 1 billion 1024-bit uncompressed fingerprints takes 120 GB (to say nothing of the space needed for the identifiers). Certainly there are compute servers with terabytes of memory, but few desktops and fewer laptops have that much RAM.
Chemfp can create and use FPB files which are larger than memory. Previous sections showed how to generate fingerprints in parallel and to create large FPB. By default chemfp uses a read-only memory-map to access the FPB content, which lets the operating system move data into and out of memory as needed.
For example, I downloaded GDB-13, which has 13 files, named 1.smi up to 13.smi. Each file contains one SMILES string per line. The file 5.smi contains SMILES strings for all enumerated structures with 5 atoms. There are 977,468,301 SMILES strings across all the files, which is a bit odd as the web page says there are “977 468 314 structures”, which is 13 larger. My number comes from the following, which also gives a preview of a performance issue when using almost 1 billion records.
% time wc -l {1,2,3,4,5,6,7,8,9,10,11,12,13}.smi
1 1.smi
3 2.smi
12 3.smi
43 4.smi
158 5.smi
953 6.smi
6041 7.smi
39589 8.smi
272598 9.smi
1915091 10.smi
13900390 11.smi
107061450 12.smi
854271972 13.smi
977468301 total
real 2m6.340s
user 0m4.247s
sys 0m35.586s
I wrote the Python program below to read each SMILES from the 13 input files, add identifiers (liike “5_150” to mean the 150th SMILES in the file 5.smi) and generate 978 SMILES files from “shard_0000.smi” to “shard_0977.smi”, each containing 1,000,000 records, except the last with only 468,301.
The term “shard” here means that the file contains a subset of the data. All shards together make up the entire data set.
# I call this "gdb_split.py"
import sys
OUTPUT_SHARD_SIZE = 1_000_000
output_shard = 0
outfile = None
output_lineno = 0
## Helper function to show status information to stderr.
# Clear the line with "\r" to move to the start of line,
# followed by spaces to replace old text, followed by "\r"
# to move to the start of line again.
_clear_str = ""
def status(msg):
global _clear_str
sys.stderr.write(_clear_str)
sys.stderr.write(msg)
_clear_str = "\r" + (" " * len(msg)) + "\r"
sys.stderr.flush()
# Process 1.smi to 13.smi
for input_shard in range(1, 14):
input_filename = f"{input_shard}.smi"
with open(input_filename) as infile:
# Read the SMILES from each line in the file
for lineno, line in enumerate(infile, 1):
smiles = line.rstrip()
# Create a SMILES record using a made-up identifier
output_record = f"{smiles} {input_shard}_{lineno}\n"
# Open the next output shard file if one isn't in use.
if outfile is None:
output_filename = f"shard_{output_shard:04d}.smi"
outfile = open(output_filename, "w")
output_lineno = 0
output_shard += 1
# Write the record and give some progress feedback
outfile.write(output_record)
output_lineno += 1
if lineno % 100_000 == 0:
status(
f"{input_filename}:{lineno} -> "
f"{output_filename}:{output_lineno}")
# Stop writting to this output shard if too large.
if output_lineno == OUTPUT_SHARD_SIZE:
outfile.close()
outfile = None
if outfile is not None:
outfile.close()
status("")
print(f"Created {output_shard} shards.")
I then used the techniques from the previous section to generate 1024-bit Morgan3 fingerprints in parallel (it was fun to see all my cores working away), to generate the files “shard_0000.smi.fps” to “shard_0977.smi.fps”. I’ll leave out the details.
The last step, actually making the FPB file, wasn’t as much fun. I first ran out of disk space because I filled up the temp directory:
% time fpcat --max-spool-size 2GB shard_*.fps -o gdb13_0_8.fpb
Traceback (most recent call last):
File "/home/dalke/bin/fpcat", line 8, in <module>
sys.exit(run_fpcat())
^^^^^^^^^^^
File "/home/dalke/chemfp/cli/__init__.py", line 145, in run_fpcat
return add_toplevel_options(get_fpcat_main())()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.... most of the Python traceback omitted ....
File "/home/dalke/chemfp/fpb_io.py", line 444, in _write_arena_chunk
output.write(arena[start:end])
File "/usr/lib/python3.12/tempfile.py", line 638, in func_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device
real 2m34.133s
user 1m25.939s
sys 0m28.916s
The default temporary directory for my system is /tmp, which has only 16 GB of space. Oops! I’ll create a local directory for fpcat to use, on a disk with 284G free, and try again:
% mkdir tmp
% time fpcat --tmpdir tmp --max-spool-size 2GB shard_*.fps -o gdb13.fpb
Killed
real 89m34.465s
user 27m15.216s
sys 10m29.036s
The “Killed” is because fpcat ran out of memory. This
machine has 32 GB of RAM with 31 GB free, and 0.5 GB of swap space.
Even though the --max-spool-size 2GB keeps memory use to between 4
and 8 GB for most of the processing (and I should probably have used
4GB), there’s a step at the end when the identifiers are loaded into
memory, along with the index. Unfortunately, I had some other programs
running which were using about 10 GB of RAM, causing memory to fill
up.
I turned the computer off and on again, which seemed to fix things.
The more fundamental problem is that when I designed the FPB file writer I expected that fingerprint data might exceed available memory, but did not consider that identifiers might do the same. Fixing this is on the to-do list.
The 977 million fingerprints from GDB-13 is just small enough to run with 31 GB of space. While watching htop I saw fpcat using 29 GB of memory, and it may have needed a bit more when I wasn’t looking. (A few months ago I tried 1 billion fingerprints, and ran out of memory.)
% time fpcat --tmpdir tmp --max-spool-size 2GB shard_*.fps -o gdb13.fpb
real 125m17.111s
user 39m5.964s
sys 11m42.337s
Yes, that’s 2 hours of real time and only 50 minutes of user space+system time, which tells me that this spent a lot of time waiting for the file system. I’ll see about improving that (and adding a progress indicator) when I revisit the FPB writer.
The gdb13.fpb file is 148 GB or about 162 bytes per record. (Of that, 128 bytes are for the 1024-bit fingerprint, 12 for the identifier, and the rest for the hash table and other chemfp overhead.)
% ls -lh gdb13.fpb
-rw-rw-r-- 1 dalke dalke 148G Sep 19 13:53 gdb13.fpb
Chemfp can memory-map the FPB file, so the fingerprint records are accessible even though 148 GB is far larger than the available 31-or-so GB. The following uses the Python API to get the fingerprint ids with the least and most on-bits set, and the start of the fingerprint at index 500 million.
>>> import chemfp
>>> arena = chemfp.load_fingerprints("gdb13.fpb")
>>> len(arena)
977468254
>>> arena.ids[0]
'1_1'
>>> arena.ids[-1]
'13_338610750'
>>> arena.fingerprints[500_000_000].hex()[:70]
'0000000012000010000001000008000081082000000000000018000000010004000000'
Or, from the command-line using chemfp simhistogram, here’s a 20-bin histogram of the distribution of Tanimoto scores between 25,000 randomly selected pairs out of the 478 quadrillion possible pairs:
% time chemfp simhist gdb13.fpb --bins 20 --seed 12345
histogram: 100%|████████████████████████████████████| 25000/25000 [05:17<00:00, 78.84/s]
#simhistogram/1 identity-bin=0
#type=sample matrix-type=upper-triangular bins=20 num-samples=25000 seed=12345
#size=477722093300170131 N=977468254
#identical=0
#average=0.078 min=0.053 max=0.103
#targets=gdb13.fpb
start end count percent
0.00 0.05 4744 18.976
0.05 0.10 14408 57.632
0.10 0.15 5303 21.212
0.15 0.20 515 2.060
0.20 0.25 28 0.112
0.25 0.30 2 0.008
0.30 0.35 0 0.000
0.35 0.40 0 0.000
0.40 0.45 0 0.000
0.45 0.50 0 0.000
0.50 0.55 0 0.000
0.55 0.60 0 0.000
0.60 0.65 0 0.000
0.65 0.70 0 0.000
0.70 0.75 0 0.000
0.75 0.80 0 0.000
0.80 0.85 0 0.000
0.85 0.90 0 0.000
0.90 0.95 0 0.000
0.95 1.00 0 0.000
Let’s search the FPB for the 10 nearest neighbors to
Cn1cnc2c1c(=O)[nH]c(=O)n2C, with tsv output:
% time simsearch -k 10 --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
--out tsv gdb13.fpb
query_id target_id score
Query1 13_555746559 1.0000000
Query1 13_555746560 1.0000000
Query1 13_827949621 0.4285714
Query1 13_555753216 0.4230769
Query1 12_63930155 0.4117647
Query1 13_479073810 0.4117647
Query1 13_376667801 0.4081633
Query1 12_70580805 0.4038462
Query1 13_538327509 0.4038462
Query1 13_555747551 0.4000000
real 14m44.610s
user 0m29.152s
sys 3m59.085s
That’s pretty slow. Okay, very slow. Even at only 100 million fingerprints per second it should only take about 10 seconds to search this dataset. What’s going on?
Memory-mapping means chemfp can access the data, but it doesn’t mean access will be fast. If the fingerprint data isn’t in memory, the operating system will need to load it from the disk, which in this case is a 9 TB hard disk with spinning rust, which I use for large datasets. Of the almost 14.75 minutes needed, chemfp used 30 seconds and the system used 4 minutes, leaving about 10 minutes just waiting for the disk.
Simply reading that data from the disk takes time. The following shows that it takes 16.5 minutes to compute the MD5 checksum of the entire file.
% time md5sum gdb13.fpb
db8987431a718249b8d18ed6d464afb9 gdb13.fpb
real 16m32.556s
user 5m36.947s
sys 1m27.945s
This is 45 seconds slower than the Tanimoto similarity search. It’s hard to make a direct comparison because chemfp uses a memory-map with some jumps in the access patterns while md5sum does a linear read, but one thing to remember is that BitBound search means chemfp does not need to read the entire file to do a complete search.
If it takes 15 minutes to read the data, and presumably at least the same to write that amount of data, then fpcat needed to read the data, save it to intermediate spool files, and load it again to merge and save the final file, which shows how at least 1 hour of the 2 hours needed to create gdb13.fpb was just for file I/O.
Single query shardsearch
In this section you’ll learn how to use chemfp shardsearch command for a nearest-neighbor search of a single query against multiple target files. You will need the GDB-13 fingerprints in FPS format created earlier.
In the previous section you saw how it was (just barely) possible to generate and use a 148 GB FPB file even on a machine with 32 GB of RAM. It wasn’t all that easy though.
Instead of creating one large FPB file, what about creating many smaller ones, perhaps 98 FPB files each with 10 million fingerprints (except the last, of course)? I made the subdirectory “10M_shards” then did:
fpcat shard_000*.smi.fps -o 10M_shards/shard_00.fpb
fpcat shard_001*.smi.fps -o 10M_shards/shard_01.fpb
fpcat shard_002*.smi.fps -o 10M_shards/shard_02.fpb
... all the way up to ...
fpcat shard_097*.smi.fps -o 10M_shards/shard_97.fpb
(Rather, I made a Python program which generated these shell commands, saved the output to a file as a script, then ran the script and waited.) The result looks like:
% ls 10M_shards/*.fpb
10M_shards/shard_00.fpb 10M_shards/shard_33.fpb 10M_shards/shard_66.fpb
10M_shards/shard_01.fpb 10M_shards/shard_34.fpb 10M_shards/shard_67.fpb
... many lines removed ..
10M_shards/shard_31.fpb 10M_shards/shard_64.fpb 10M_shards/shard_97.fpb
10M_shards/shard_32.fpb 10M_shards/shard_65.fpb
It’s of course possible to write a script to call simsearch once for each FPB shard and merge the result. A likely
easier alternative is to use chemfp’s shardsearch subcommand, which is designed for that task. It
implements most of the simsearch command-line options, including
passing in a --query SMILES.
% time chemfp shardsearch -k 10 --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
--out tsv 10M_shards/shard_*.fpb
query_id target_id score
Query1 13_555746559 1.0000000
Query1 13_555746560 1.0000000
Query1 13_827949621 0.4285714
Query1 13_555753216 0.4230769
Query1 12_63930155 0.4117647
Query1 13_479073810 0.4117647
Query1 13_376667801 0.4081633
Query1 12_70580805 0.4038462
Query1 13_538327509 0.4038462
Query1 13_555747551 0.4000000
real 14m49.140s
user 0m31.871s
sys 3m31.437s
This used the fingerprint type information from the first shard (shard_00.fpb) to convert the SMILES string into an RDKit Morgan fingerprint, then used that fingerprint for a multi-part similarity search. It found the 10-nearest neighbors in shard_00.fpb and the 10th-best score T. It then found the 10-nearest neighbors in shard_01.fpb which are at least T similar to the query. It then merged these two lists together, found the 10-nearest, and the (possibly new) 10th-best score.
This process repeats for each shard, with the final merged list written as output in the requested format.
Even though it may sound like a lot of work, the elapsed real time is only 5 seconds slower than searching the single-file FPB, and that may be due to machine variability. Since we’re pretty sure disk I/O is the main limit of overall performance, what this is really telling us is that both approaches read the same amount of data.
I can easily double-check because the shardsearch merge search I just
described is multi-threaded. It can process P shards in parallel,
with a shared list (initially empty) of the current best scores. When
a thread is free, it gets the current 10th-best score T and the shard
to process. It does the 10-nearest search and then updates the shard
list with the shard results.
I’ll have shardsearch use first 2 then 3 threads in the pool, using
first the --num-shards-in-parallel option and then its shorthand
form -P:
% time chemfp shardsearch -P 2 -k 10 --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
--out tsv 10M_shards/shard_*.fpb
query_id target_id score
Query1 13_555746559 1.0000000
Query1 13_555746560 1.0000000
Query1 13_827949621 0.4285714
Query1 13_555753216 0.4230769
Query1 12_63930155 0.4117647
Query1 13_479073810 0.4117647
Query1 13_376667801 0.4081633
Query1 12_70580805 0.4038462
Query1 13_538327509 0.4038462
Query1 13_555747551 0.4000000
real 32m42.817s
user 0m35.133s
sys 3m38.704s
% time chemfp shardsearch -P 3 -k 10 --query 'Cn1cnc2c1c(=O)[nH]c(=O)n2C' \
--out tsv 10M_shards/shard_*.fpb
query_id target_id score
Query1 13_555746559 1.0000000
Query1 13_555746560 1.0000000
Query1 13_827949621 0.4285714
Query1 13_555753216 0.4230769
Query1 12_63930155 0.4117647
Query1 13_479073810 0.4117647
Query1 13_376667801 0.4081633
Query1 12_70580805 0.4038462
Query1 13_538327509 0.4038462
Query1 13_555747551 0.4000000
real 36m11.288s
user 0m36.407s
sys 3m40.840s
Using a thread pool more than doubled the time, almost certainly because the different threads were trying to access different parts of the disk, causing the read head to jump around.
Note
The --num-shards-in-parallel option does not
seem that useful, and will likely be removed in a future
version of chemfp. Let me know if you find good a use
case.
Approximating count fingerprints with binary fingerprints
In this section you’ll learn how to use the rdkit2fps command to generate “superimposed” and “countsim” binary fingerprints using RDKit’s count fingerprints. These binary fingerprints are better at approximating the count Tanimoto than the default method. You will need a copy of chembl_36.sdf.gz from the ChEMBL 36 release.
RDKit’s fingerprint types, like Morgan and atom pairs, can also be used to generate count fingerprints. The command-line tool chemfp rdkit2fpc will generate them in FPC format, but people more often use them as binary fingerprints.
For example, I’ll compare the Morgan2 count fingerprint for decane with count simulation turned off (which is the default for Morgan fingerprints), turned on (which uses RDKit’s built-in count simulation method), and chemfp’s “superimposed” method (I’ve folded the FPC output across multiple lines for clarity):
% echo "CCCCCCCCCC C10" | \
chemfp rdkit2fpc --morgan2 --no-metadata
161963127:4,1173125914:2,1510461303:6,1685248591:2,2245384272:8,
2246728737:2,3542456614:2,3745584548:2 C10
% echo "CCCCCCCCCC C10" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation 0
0000000442000000008001000000800000000000100000000000000000000000 C10
% echo "CCCCCCCCCC C10" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation 1
00000000000000300f000000000300003000030300000000000000f000000000 C10
% echo "CCCCCCCCCC C10" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation superimposed
c800000000000001800001800005004912000200844000000c4300000002b400 C10
The “161963127:4” in the FPC output says the feature with index
161963127 occurs 4 times. There are 8 features in the FPC output. This
will set up to 8 bits in the binary fingerprint when using
--countSimulation 0. Close inspection of the output shows only 7
bits were set, which means two of the original features set the same
output bit.
RDKit’s count simulation method (which chemfp refers to as “countsim”) by default uses a 4-bit unary representation of the binary log, so a count of 1 sets 1 bit, a count of 2 or 3 sets 2 bits, a count of 4, 5, 6, or 7 sets 3 bits, and a count of 8 or larger sets 4 bits. That means the decane Morgan2 can set up to 18 bits in the output to 1, which is what occured here.
Chemfp’s “superimposed” method sets up to count bits for each feature count, as determined by successive calls to a pseudo-random number generator seeded with the feature id. That means the decane Morgan2 can set up to 28 bits in the output to 1, which is what occurred here.
How do the binary Tanimoto scores for the different conversion methods compare to the count Tanimoto? I’ll use nonane as the comparison structure and generate the corresponding fingerprints:
% echo "CCCCCCCCC C9" |\
chemfp rdkit2fpc --morgan2 --no-metadata
161963127:3,1173125914:2,1510461303:5,1685248591:2,2245384272:7,
2246728737:2,3542456614:2,3745584548:2 C9
% echo "CCCCCCCCC C9" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation 0
0000000442000000008001000000800000000000100000000000000000000000 C9
% echo "CCCCCCCCC C9" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation 1
000000000000003007000000000300003000030300000000000000f000000000 C9
% echo "CCCCCCCCC C9" | \
chemfp rdkit2fps --morgan2 --no-metadata --fpSize 256 --no-metadata --countSimulation superimposed
c800000000000001000001800001004912000200804000000c4300000002b400 C9
I’ll first compute the expected count Tanimoto, which is 0.89:
>>> from chemfp import countops
>>> fp1 = countops.parse_fpc(
... "161963127:4,1173125914:2,1510461303:6,1685248591:2,2245384272:8,"
... "2246728737:2,3542456614:2,3745584548:2")
>>> fp2 = countops.parse_fpc(
... "161963127:3,1173125914:2,1510461303:5,1685248591:2,2245384272:7,"
... "2246728737:2,3542456614:2,3745584548:2")
>>> countops.count_tanimoto(fp1, fp2)
0.8928571428571429
I’ll then compare it to each of the converted bit fingerprints:
>>> from chemfp import bitops
>>> no_simulation_score = bitops.hex_tanimoto(
... "0000000442000000008001000000800000000000100000000000000000000000",
... "0000000442000000008001000000800000000000100000000000000000000000")
>>> no_simulation_score
1.0
>>> countsim_score = bitops.hex_tanimoto(
... "00000000000000300f000000000300003000030300000000000000f000000000",
... "000000000000003007000000000300003000030300000000000000f000000000")
>>> countsim_score
0.9444444444444444
>>> superimposed_score = bitops.hex_tanimoto(
... "c800000000000001800001800005004912000200844000000c4300000002b400",
... "c800000000000001000001800001004912000200804000000c4300000002b400")
>>> superimposed_score
0.8928571428571429
In this case the superimposed Tanimoto matched the original count Tanimoto exactly, and the countsim score was a better approximation than using no count simulation at all. In my preprint “Superimposed Coding of Count Fingerprints to Binary Fingerprints I show that superimposed coding is in general the best of these three encoding, if the goal is to approximate near-neighbor search of count fingerprints using a near-neighbor search with (sparse) binary fingerprints.
Let’s try that on a real dataset. The following generates 2048-bit superimposed Morgan3 fingerprints from chembl_36.sdf.gz:
% rdkit2fps --countSimulation superimposed chembl_36.sdf.gz \
-o chembl_36_morgan3_superimposed.fpb --progress
... many warning messages omitted ..
chembl_36.sdf.gz: 100%|████████████████████| 936M/936M [07:58<00:00, 1.96Mbytes/s]
Note
A valid chemfp license key is needed for both FPB generation and for generating more than 50,000 superimposed fingerprints in the same process.
I’ll search for the 5-nearest neighbors of dodecane, where the
--using option gets the fingerprint type information from the FPB
file, rather than needing to specify the appropriate parameters on the
command-line:
% echo "CCCCCCCCCCCC C12" | \
rdkit2fps --using chembl_36_morgan3_superimposed.fpb | \
simsearch chembl_36_morgan3_superimposed.fpb -k 5 --out csv
query_id,target_id,score
C12,CHEMBL30959,1.0000000
C12,CHEMBL135694,0.9130435
C12,CHEMBL132474,0.9047619
C12,CHEMBL135488,0.8400000
C12,CHEMBL134537,0.8095238
Not surprisingly, the perfect match CHEMBL30959 is dodecane itself. The next four matches are tridecane CHEMBL135694, undecane CHEMBL132474, tetradecane CHEMBL135488, and decane CHEMBL134537, which matches the order I expect from the count Tanimoto.
With no count simulation, dodecane finds itself, but can’t distinguish dodecane from other structures:
% rdkit2fps --progress chembl_36.sdf.gz -o chembl_36_morgan3.fpb
... many warning messages omitted ..
chembl_36.sdf.gz: 100%|█████████████| 936M/936M [07:41<00:00, 2.03Mbytes/s]
% echo "CCCCCCCCCCCC C12" | \
rdkit2fps --using chembl_36_morgan3.fpb | \
simsearch chembl_36_morgan3.fpb -k 5 --out csv
query_id,target_id,score
C12,CHEMBL30959,1.0000000
C12,CHEMBL132474,1.0000000
C12,CHEMBL134994,1.0000000
C12,CHEMBL135694,1.0000000
C12,CHEMBL257490,1.0000000
With RDKit’s count simulation, dodecane and tridecane (CHEMBL135694) have identical scores, with pentadecane (CHEMBL1234557) and tetradecane (CHEMBL135488) tied for next place:
% rdkit2fps --progress chembl_36.sdf.gz -o chembl_36_morgan3_countsim.fpb \
--countSimulation 1
... many warning messages omitted ..
chembl_36.sdf.gz: 100%|█████████████| 936M/936M [08:06<00:00, 1.92Mbytes/s]
% echo "CCCCCCCCCCCC C12" | \
rdkit2fps --using chembl_36_morgan3_countsim.fpb | \
simsearch chembl_36_morgan3_countsim.fpb -k 5 --out csv
query_id,target_id,score
C12,CHEMBL30959,1.0000000
C12,CHEMBL135694,1.0000000
C12,CHEMBL1234557,0.9655172
C12,CHEMBL135488,0.9655172
C12,CHEMBL134994,0.9333333
This admittedly cherry-picked example supports my belief that superimposed coding produces binary fingerprints where the binary Tanimoto is an effective approximation of the count Tanimoto of the original count fingerprints.
Generate a full Tanimoto similarity array
In this section you’ll learn how to use the chemfp simarray
command to generate the complete array containing the Tanimoto
similarity between every fingerprint in a dataset with every other
fingerprint, including itself. (See Use simarray to generate a NumPy array for an
example using the chemfp.simarray() API function.)
The simsearch command generates a sparse matrix, because it was designed for task like “find the nearest 10 fingerprints” or “find fingerprints which are at least 0.4 similar.”
Many algorithms, like some of the sckit-learn clustering methods, require a complete NxN comparison matrix between every element and every other element, including with itself. While this is possible with simsearch, by using a threshold of 0.0, doing so adds some performance and space overhead.
Chemfp 4.2 added “simarray” functionality, to compute the entire matrix either as a NumPy array, or in a file format compatible with loading into NumPy.
The chemfp simarray subcommand by default generates this matrix in memory and saves it to NumPy’s npy format.
Important
You must install NumPy for this subcommand to work.
I’ll generate the full matrix using the 10,969 PubChem fingerprints extracted earlier, and available in “pubchem_queries.fps”:
% chemfp simarray pubchem_queries.fps -o pubchem_queries_NxN.npy
pubchem_queries.fps: 100%|██████████████| 2.54M/2.54M [00:00<00:00, 179Mbytes/s]
scores: 100%|████████████████████████████████| 60.2M/60.2M [00:00<00:00, 125M/s]
NumPy can read the “npy” file directly:
>>> import numpy
>>> arr = numpy.load("pubchem_queries_NxN.npy")
>>> arr[:3,:4]
array([[1. , 0.44635193, 0.18627451, 0.5258216 ],
[0.44635193, 1. , 0.25543478, 0.53921569],
[0.18627451, 0.25543478, 1. , 0.25142857]])
The npy file actually stores up to four arrays, depending on how the data was generated. The second array stores the simarray details as JSON-encoded string, and the third and optional fourth store the target and query ids, respectively.
These can be read directly from NumPy:
>>> import numpy, pprint, json
>>> f = open("pubchem_queries_NxN.npy", "rb")
>>> numpy.load(f)[:3,:4]
array([[1. , 0.44635193, 0.18627451, 0.5258216 ],
[0.44635193, 1. , 0.25543478, 0.53921569],
[0.18627451, 0.25543478, 1. , 0.25142857]])
>>> pprint.pprint(json.loads(str(numpy.load(f))))
{'dtype': 'float64',
'format': 'multiple',
'matrix_type': 'NxN',
'method': 'Tanimoto',
'metric': {'as_distance': False,
'is_distance': False,
'is_similarity': True,
'name': 'Tanimoto'},
'metric_description': 'Tanimoto similarity',
'num_bits': 881,
'shape': [10969, 10969]}
>>> numpy.load(f)[:4] # show the first four array identifiers
array(['99000039', '99000230', '99001517', '99002251'], dtype='<U8')
or loaded by using chemfp’s load_simarray() helper function:
>>> import chemfp
>>> simarr = chemfp.load_simarray("pubchem_queries_NxN.npy")
>>> simarr.out[:3, :4]
array([[1. , 0.44635193, 0.18627451, 0.5258216 ],
[0.44635193, 1. , 0.25543478, 0.53921569],
[0.18627451, 0.25543478, 1. , 0.25142857]])
>>> simarr
SimarrayFileContent(metric=<Tanimoto similarity>,
out=<10969x10969 symmetric array of dtype float64>,
query_ids=target_ids=['99000039', '99000230', '99001517', ...])
Unlike simsearch, simarray does NOT change the record order - the fingerprints are kept in input order rather than sorting by popcount, which you can check by comparing the first few identifiers in the previous output to those in the FPS file:
% awk '/^[^#]/ {print $2}' pubchem_queries.fps | head -5
99000039
99000230
99001517
99002251
99003537
Why? I compared several simsearch implementations, including one which was optimized by taking advantage of popcount ordering. I saw no noticable difference. My working belief is that the problem is mostly limited by memory write performance, while simsearch is mostly limited by memory read performance. I don’t know enough about hardware to fully explore that.
In practice, preserving input order also makes it much easier to map between the input data set and the array, without having to figure out the reordering.
Lastly, for symmetric arrays use --no-lower-triangle so simarray
does not compute the comparisons for the lower triangle, but instead
leaves them as zeros. This gives faster performance but does not save
any space.
Generate the array using another metric or datatype
In this section you’ll learn how simarray supports several different comparison types, and how some of the comparison types can be stored in several different data types.
By default simarray generate Tanimoto scores stored as 64-bit floating point number (also called “float64” or what C calls a “double”). Why Tanimoto? Because it’s by far the most popular similarity method in cheminformatics.
Simarray supports a few other comparison types using the --metric
flag: “Tanimoto” similarity, “Dice” similarity, and “cosine”
similarity, as well as the “Hamming” distance. In addition, with the
--as-distance flag it will turn the similarity calculations into a
distance computed as 1-similarity.
For example, the following computes the Jaccard distance (1-Tanimoto score):
chemfp simarray pubchem_queries.fps --metric Tanimoto --as-distance \
-o distances.npy
If you are using 4096 bit fingerprints or smaller then every Tanimoto score can be represented uniquely with float32, which uses half as much memory as the default float64. (Even if you use larger fingerprints, you’ll need an awful lot of on-bits before it’s possible to find two different scores which have the same 32-bit representation.)
Use the --dtype option to change the array datatype. The
“Tanimoto”, “Dice”, and “cosine” metrics support “float64”, “float32”
and a scaled “uint16” which uses only the values 0 to 65535 (computed
as int(score*65535)) for similarity or 65535-int(score*65535) for
distance). The following computes the cosine distance as “uint16”, so
identical fingerprints have a distance of 0:
chemfp simarray pubchem_queries.fps --metric cosine \
--as-distance --dtype uint16 -o pubchem_cosine_dist.npy
Here’s what it looks like in NumPy:
>>> import numpy as np
>>> np.load("pubchem_cosine_dist.npy")[:3,:4]
array([[ 0, 25064, 42640, 20252],
[25064, 0, 36278, 19585],
[42640, 36278, 0, 37078]], dtype=uint16)
The “Tanimoto” and “Dice” metrics also support the “rational64” and “rational32” data types, which use a NumPy structured data type where the first term (“p”) is the numerator and the second term (“q”) is the denominator. Rational64 uses two uint32 values and rational32 uses two uint16 values, which means rational32 uses the same space as float32 but with no precision loss.
These store the scores as rational values, in unreduced form:
>>> import numpy as np
>>> scores = np.load("pubchem_rational32.npy")
>>> scores[:3,:4]
array([[(174, 174), (104, 233), ( 38, 204), (112, 213)],
[(104, 233), (163, 163), ( 47, 184), (110, 204)],
[( 38, 204), ( 47, 184), ( 68, 68), ( 44, 175)]],
dtype=[('p', '<u2'), ('q', '<u2')])
>>> scores["p"][:3,:4]
array([[174, 104, 38, 112],
[104, 163, 47, 110],
[ 38, 47, 68, 44]], dtype=uint16)
>>> scores["q"][:3,:4]
array([[174, 233, 204, 213],
[233, 163, 184, 204],
[204, 184, 68, 175]], dtype=uint16)
If the similarity score is 0/0 then the corresponding distance score is n/n where n is the fingerprint size.
The “Hamming” metric is only available as “uint16”. The Tversky similarity isn’t available in simarray, though it can be computed using one of the “abcd” metrics in the next section.
Generate a simarray using an “abcd” metric
In this section you’ll learn about simarray’s “abcd” data type, which the four individual components for comparing two fingerprints. These can be used to construct your own comparison value.
The “Willett”, “Sheffield”, and “Daylight” metrics all compute an “abcd” structured data type with four uint16 values, individually denoted “a”, “b”, “c”, and “d”. These can be used to create your own comparison function, and indeed the cheminformatics literature describes scores of these, with each formula described using these four values, like:
Baroni-Urbani/Buser: (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)
Unfortunately, different authors use different conventions for what those values mean. Fortunately, there are only two main conventions, which I’ll call “Willett” and “Sheffield”, along with the “Daylight” convention in a distant third. Chemfp’s simarray supports all three.
1) The “Willett” convention is the most widely used in the field, including internally by chemfp. It dates from around 1986 with the seminal work by Peter Willett and his collaborators, and was the primary convention used by the Sheffield group until around 2008 (earlier if John Holliday is one of the authors).
It uses the following definitions:
“a” = the number of on-bits in the first fingerprint
“b” = the number of on-bits in the second fingerprint
“c” = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2
Note that a+b-c+d is the number of bits in the fingerprint.
2) The “Sheffield” convention was used in papers from the Sheffield group from 1973 up until around 1985, and by most papers from the Sheffield group (including by Peter Willett) after around 2008.
“a” = the number of on-bits in common
“b” = the number of on-bits unique to the first fingerprint
“c” = the number of on-bits unique to the second fingerprint
“d” = the number of off-bits in fp1 which are also off-bits in fp2
Note that a+b+c+d is the number of bits in the fingerprint.
3) The “Daylight” convention was introduced by John Bradshaw and used in the Daylight documentation and Daylight toolkit:
“a” = the number of on-bits unique to the first fingerprint
“b” = the number of on-bits unique to the second fingerprint
“c” = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2
Note that a+b+c+d is the number of bits in the fingerprint.
Suppose you want the Baroni-Urbani/Buser score mentioned above, which comes from a 2002 paper by Holliday, Hu, and Willett. You’ll need to read the paper to see it uses what chemfp calls the “Sheffield” convention (as a hint, Holliday is one of the authors), then ask simarray to compute the relevant matrix:
chemfp simarray pubchem_queries.fps --metric Sheffield \
-o pubchem_abcd.npy --no-progress
The final computation requires a bit of NumPy help, first to define the function, then to load the array, and finally to compute the values:
>>> import numpy as np
>>> def baroni_urbani(abcd_arr):
... a, b, c, d = abcd_arr["a"], abcd_arr["b"], abcd_arr["c"], abcd_arr["d"]
... numerator = np.sqrt(a * d) + a
... return numerator / (numerator + b + c)
...
>>>
>>> abcd_arr = np.load("pubchem_abcd.npy")
>>> abcd_arr[:2,:3]
array([[(174, 0, 0, 707), (104, 70, 59, 648), ( 38, 136, 30, 677)],
[(104, 59, 70, 648), (163, 0, 0, 718), ( 47, 116, 21, 697)]],
dtype=[('a', '<u2'), ('b', '<u2'), ('c', '<u2'), ('d', '<u2')])
>>>
>>> baroni_urbani(abcd_arr)[:2, :3]
array([[1. , 0.5327464 , 0.54444826],
[0.5327464 , 1. , 0.62465185]], dtype=float32)
Simarray and arrays larger than memory
In this section you’ll learn how to use simarray to handle arrays which are larger than the available memory.
Chemfp’s simarray generally creates the full comparison array in memory then saves the result to NumPy’s “npy” format. What if you want to generate an even larger array?
Why? Perhaps you, like me, primarily work on a laptop and don’t want to close all your applications just to free up a few gigabytes of data, and are willing to have extra file I/O overhead instead. Or perhaps you have 40 TB of network drive space free and want to wait the days needed to compute a 3M x 3M similarity array because that’s quicker than getting time on the “big memory” analysis machine with 50 TB of RAM.
The chemfp simarray command (but not the API!) supports the “bin”
output, which saves the output array as the raw binary data, in row
major format. It processes a window (by default, 2GB) of the output
array and appends it to the output file before processing the next
window.
The following example uses the “pubchem_queries.fps” and “pubchem_queries.fps” created in a previous section to generate the cosine similarity, as float32 values, between all queries and all targets:
% chemfp simarray --queries pubchem_queries.fps pubchem_targets.fps \
--metric cosine --dtype float32 -o NxM.bin
pubchem_targets.fps: 100%|█████████████████ 3.42M/3.42M [00:00<00:00, 95.2Mbytes/s]
pubchem_queries.fps: 100%|███████████████████| 2.54M/2.54M [00:00<00:00, 103Mbytes/s]
scores: 100%|█████████████████████████████████████| 162M/162M [00:01<00:00, 153M/s]
The output file contains the raw array value, which NumPy can access as a memory-mapped file, if you know the shape and dtype:
>>> import chemfp
>>> len(chemfp.load_fingerprints("pubchem_queries.fps"))
10969
>>> len(chemfp.load_fingerprints("pubchem_targets.fps"))
14753
>>> import numpy as np
>>> arr = np.memmap("NxM.bin", shape=(10969, 14753), dtype=np.float32)
>>> arr[:3, :4]
memmap([[0.75586504, 0.65900594, 0.6916995 , 0.5689655 ],
[0.56468934, 0.75383073, 0.61808264, 0.73629737],
[0.35343105, 0.45178595, 0.46850318, 0.349345 ]], dtype=float32)
Alternatively, if you prefer to load the data into memory instead of using a memory-mapped array:
>>> import pathlib
>>> content = pathlib.Path("NxM.bin").read_bytes()
>>> arr = np.ndarray(buffer = content, shape=(10969, 14753), dtype=np.float32)
>>> arr[:3, :4]
array([[0.75586504, 0.65900594, 0.6916995 , 0.5689655 ],
[0.56468934, 0.75383073, 0.61808264, 0.73629737],
[0.35343105, 0.45178595, 0.46850318, 0.349345 ]], dtype=float32)
While this works, it’s a bit cumbersome to track the shape and data type, and the array does not contain the query or target ids.
Instead, use the --metadata-output option to have simarray save
this data to an auxillary “npy” file:
% chemfp simarray --queries pubchem_queries.fps pubchem_targets.fps \
--metric cosine --dtype float32 -o NxM.bin --metadata-output NxM_aux.npy
pubchem_targets.fps: 100%|████████████████| 3.42M/3.42M [00:00<00:00, 96.9Mbytes/s]
pubchem_queries.fps: 100%|█████████████████| 2.54M/2.54M [00:00<00:00, 103Mbytes/s]
scores: 100%|█████████████████████████████████████| 162M/162M [00:00<00:00, 163M/s]
This “npy” format is the same as the normal simarray npy output, except the first array, which normally contains the full set of comparisons, is instead zero-sized. It exists to store the relevant NumPy dtype. Here’s a low-level example:
>>> import numpy as np
>>> f = open("NxM_aux.npy", "rb")
>>> np.load(f)
array([], shape=(0, 0), dtype=float32)
>>> import pprint, json
>>> pprint.pprint(json.loads(str(np.load(f))))
{'dtype': 'float32',
'format': 'multiple',
'matrix_type': 'NxM',
'method': 'cosine',
'metric': {'as_distance': False,
'is_distance': False,
'is_similarity': True,
'name': 'cosine'},
'metric_description': 'cosine similarity',
'num_bits': 881,
'shape': [10969, 14753]}
>>> np.load(f)[:3] # the first three target ids
array(['48500020', '48500053', '48500091'], dtype='<U8')
>>> np.load(f)[:3] # the first three query ids
array(['99000039', '99000230', '99001517'], dtype='<U8')
However, it’s much easier to use chemfp’s load_simarray(),
which accepts both the binary file and the auxillary npy file:
>>> import chemfp
>>> simarr = chemfp.load_simarray("NxM.bin", metadata_source="NxM_aux.npy")
>>> simarr
SimarrayFileContent(metric=<cosine similarity>, out=<10969x14753 array
of dtype float32>, query_ids=['99000039', '99000230', '99001517', ...],
target_ids=['48500020', '48500053', '48500091', ...])
If the data source is a bin file (based on the specified format of
“bin” or the filename extension “.bin”) then the array data by default
is memory-mapped, because the array is expected to be large, and the
memory-map is read-only (in “r” mode).
Use mmap_mode to change how this works. “r+” means read-write mode,
“c” means “copy-on-write”, and None means to load the data into
memory.
Uf the data source is a npy file then the array data by default is read into memory. The other mmap options, like “c”, can also be used with npy files.
Histogram generation
In this section you’ll learn how to use the chemfp simhistogram subcommand to compute a histogram of the Tanimoto scores of a set of fingerprints with itself, and how read the default output.
Sometimes it’s insightful to generate a histogram of the pairwise similarities in a dataset, or between two datasets, or both. It may, for example, give a quick overall sense of the relative similiarty between two datasets.
Let’s dive in by generating a histogram with 10 bins from Tanimoto scores in ChEMBL:
% chemfp simhistogram --bins 10 chembl_34.fpb
#simhistogram/1 identity-bin=0
#type=sample matrix-type=upper-triangular bins=10 num-samples=25000 seed=3650178697
#size=2902289761815 N=2409270
#identical=0
#average=0.11 min=0.06 max=0.16
#targets=chembl_34.fpb
start end count percent
0.0 0.1 10303 41.212
0.1 0.2 14052 56.208
0.2 0.3 619 2.476
0.3 0.4 21 0.084
0.4 0.5 3 0.012
0.5 0.6 2 0.008
0.6 0.7 0 0.000
0.7 0.8 0 0.000
0.8 0.9 0 0.000
0.9 1.0 0 0.000
The output contains a header section followed by the histogram data. The first line describes the format and the following header lines contains metadata about the query parameters and results.
In this case the “#type” line says the histogram with 10 bins was
generated from a sample of 25,000 Tanimoto scores (and not all of the
scores!). The samples were generated from the upper triangle of the
input fingerprints with itself. The sampler is initialized with a
seed, which in this case came from Python’s RNG because the --seed
wasn’t specified on the command-line.
The “#size” line says the full upper-triangle comparison matrix contains 2.9 trillion similarity pairs, computed as N*(N-1)/2 from the N=2.4 million fingerprints in the input.
The “#identical” line contains the number of scores which are exactly 1.0, which is 0 in this case.
The “#average” line shows the average score is around 0.11, though it could be anywhere from 0.06 to 0.16. This ambiguity is because the average value is computed by post-processing the histogram, and not from adding up the actual scores. If the 21 counts in the range 0.3 to 0.4 above were evenly distributed then the average would be close to 0.35, but looking at the output it’s likely that those counts are skewed more towards 0.3. What we know for certain is they can be no lower than 0.3 and no higher than 0.4. Applying the same logic to all bins tells us the lowest possible average is 0.06 and the highest possible average is 0.16, even if the average is likely close to 0.11.
The final metadata line, “#targets”, describes where the target fingerprints came from, which in this case is the filename.
The histogram data contains four tab-separated columns. The first data line contains the column titles and the remaining lines contain the bin data, in increasing order.
The “start” and “end” columns contains the left and right bin edges. The “count” column contains the number of scores in the bin, where start edge <= score < end edge, except the last line, where bin start edge <= score <= end edge. The “percent” column is the count divided by the total number of scores evaluated. (If there are no scores then the percentages are all 0.0.)
Histogram identity bin
In this section you’ll learn about the simhistogram --identity-bin
option and why it exists. You’ll also learn about the “csv” and “tsv”
output formats.
Chemfp reports the number of identical fingerprint pairs partially because it can reveal a problem with how the fingerprints were prepared. For example, if the dataset contains many chiral forms of the same structure, and the fingerprint type is not sensitive to chirality, then the number of identical pairs might reveal the need to remove chirality and de-duplicate the input structures.
It may also highlight a poor choice of fingerprint, like using Morgan1 fingerprints to characterize proteins.
Even the best fingerprint type may give identical fingerprints to different compounds. For example, CHEMBL32423, CHEMBL33593, CHEMBL32984, CHEMBL33359, CHEMBL24795, CHEMBL32734, CHEMBL32761, CHEMBL32604, CHEMBL32978, CHEMBL33424, CHEMBL32323, CHEMBL32366, CHEMBL33908, CHEMBL32547, CHEMBL35160, CHEMBL33084, CHEMBL36072, CHEMBL36526, CHEMBL35676, CHEMBL36362, CHEMBL36185, CHEMBL263738, CHEMBL286218, CHEMBL284810, CHEMBL285172, CHEMBL287121, CHEMBL291012, CHEMBL538749, CHEMBL542138, CHEMBL543096, CHEMBL544749, CHEMBL544520, CHEMBL544283, CHEMBL554184, CHEMBL36576, CHEMBL287690, and CHEMBL289451 are 37 records from ChEMBL 34 with different canonical SMILES but identical Morgan2 fingerprints.
When I look at a histogram distribution from a clustering of similar structures, I find myself wondering if the counts in the final bin are because the fingerprints are actually similar to each other, or an artifact because the fingerprint type cannot distinguish between the underlying structures.
I find it useful to include the number of identical scores as part of the analysis and visualization.
To give an example, I’ll use chemfp’s Python API to select all 165 fingerprints in ChEMBL 34 which are at least 0.5 similar to CHEMBL538749 and save the results to “cluster.fps”
>>> import chemfp
>>> arena = chemfp.load_fingerprints("chembl_34.fpb")
>>> query_fp = arena.get_fingerprint_by_id("CHEMBL538749")
>>> hits = chemfp.simsearch(query_fp=query_fp, targets=arena, threshold=0.5)
>>> len(hits)
165
>>> indices = indices=hits.get_indices()
>>> arena.copy(indices=indices).save("cluster.fps")
(I agree that there should be a simpler way to do this, like an option for simsearch to save the hit fingerprints to a file. That may be in a future chemfp release.)
Now back to the command-line to generate a histogram with 10 bins. I’ll also use “chemfp simhist”, which is the shorthand way to write “chemfp simhistogram”:
% chemfp simhist cluster.fps --bins 10
#simhistogram/1 identity-bin=0
#type=all matrix-type=upper-triangular bins=10
#size=13530 N=165
#identical=2762
#average=0.79 min=0.75 max=0.83
#targets=cluster.fps
start end count percent
0.0 0.1 0 0.000
0.1 0.2 0 0.000
0.2 0.3 0 0.000
0.3 0.4 329 2.432
0.4 0.5 315 2.328
0.5 0.6 3561 26.319
0.6 0.7 1731 12.794
0.7 0.8 192 1.419
0.8 0.9 5 0.037
0.9 1.0 7397 54.671
This histogram was based on the full set of 13,530 pairs. There are 329 scores below the 0.5 threshold because while all of the fingerprints are at least 0.5 similar to the query, they are not as similar to each other.
You can see there are two peaks – one in the 0.5-0.6 bin, and the other in the 0.9-1.0 bin. The latter has 7397 pairs, of which (from the “#identical” line) 2762 have a score of 1.0, leaving 4635 in the rest of the bin.
An alternative is to add --identity-bin, which put the 1.0 scores
into the its own bin in the range 1.0-1.0, like this:
% chemfp simhist cluster.fps --bins 10 --identity-bin
#simhistogram/1 identity-bin=1
#type=all matrix-type=upper-triangular bins=10
#size=13530 N=165
#identical=2762
#average=0.79 min=0.75 max=0.83
#targets=cluster.fps
start end count percent
0.0 0.1 0 0.000
0.1 0.2 0 0.000
0.2 0.3 0 0.000
0.3 0.4 329 2.432
0.4 0.5 315 2.328
0.5 0.6 3561 26.319
0.6 0.7 1731 12.794
0.7 0.8 192 1.419
0.8 0.9 5 0.037
0.9 1.0 4635 34.257
1.0 1.0 2762 20.414
You might also have noticed that the first line, which describes the format, now has “identity-bin=1”, which indicates the output format is slightly different.
So far you’ve seen simhistogram’s default output format, containing
metadata in addition to the histogram data. There are two alternative
output formats, “csv” and “tsv”, for comma-separated and tab-separated
values, respectively. It can be specified using --out, or if the
--output /` ‘-o`’ output filename ends with “.csv” or “.tsv”, as in
the following:
% chemfp simhist cluster.fps --bins 10 -o cluster.tsv
% cat cluster.tsv
start end count percent
0.0 0.1 0 0.000
0.1 0.2 0 0.000
0.2 0.3 0 0.000
0.3 0.4 329 2.432
0.4 0.5 315 2.328
0.5 0.6 3561 26.319
0.6 0.7 1731 12.794
0.7 0.8 192 1.419
0.8 0.9 5 0.037
0.9 1.0 7397 54.671
%
% chemfp simhist cluster.fps --bins 10 --identity-bin --out csv
start,end,count,percent
0.0,0.1,0,0.000
0.1,0.2,0,0.000
0.2,0.3,0,0.000
0.3,0.4,329,2.432
0.4,0.5,315,2.328
0.5,0.6,3561,26.319
0.6,0.7,1731,12.794
0.7,0.8,192,1.419
0.8,0.9,5,0.037
0.9,1.0,4635,34.257
1.0,1.0,2762,20.414
The “csv” example includes the --identity-bin in the output, which
makes it easier to export the data to, for example, a spreadsheet in
case you need that additional information.
Increasing histogram accuracy
In this section you’ll learn how to change the sampling size and to compute the histogram using the full similarity matrix.
By default simhistogram processes up to 25,000 pairs to generate the similarity histogram, either by computing all scores if there are 25,000 pairs or less, or by random sampling (without replacement) if there are more than 25,000 pairs.
The previous section sampled 25,000 pairs out of 2.9 trillion possible
pairs in ChEMBL34 to generate the histogram. Is that enough? I’ll try
with 1 million samples, which takes long enough that the progress bar
kicks in (use --no-progresss to disable the progress bar):
% chemfp simhistogram --bins 10 chembl_34.fpb --num-samples 1_000_000_000
histogram: 100%|████████████████████████████| 1.00G/1.00G [01:09<00:00, 14.3M/s]
#simhistogram/1 identity-bin=0
#type=sample matrix-type=upper-triangular bins=10 num-samples=1000000000 seed=1378451041
#size=2902289761815 N=2409270
#identical=132
#average=0.11 min=0.06 max=0.16
#targets=chembl_34.fpb
start end count percent
0.0 0.1 414037891 41.404
0.1 0.2 561414295 56.141
0.2 0.3 23589416 2.359
0.3 0.4 822266 0.082
0.4 0.5 98995 0.010
0.5 0.6 25308 0.003
0.6 0.7 7866 0.001
0.7 0.8 2647 0.000
0.8 0.9 911 0.000
0.9 1.0 405 0.000
Even with 1 billion samples (40 times the original 25,000), the average, min, and max are unchanged, and the bin percentages are not much different. The number of digits in the precision is based on the sample size. Also, with 1 billion samples there’s enough to find 405 pairs with a score of at least 0.9, of which 132 have a score of 1.0.
Random samples can seem magical in how few are needed for reasonable accuracy.
If you want more accuracy, use more samples to reduce the variance, and more bins to give tighter bounds to the average. The following uses 1 million samples and 1,000 bins, and I’ll select the “#average” line from the output:
% chemfp simhistogram --bins 1000 chembl_34.fpb --num-samples 1_000_000 \
--no-progress | grep average
#average=0.1108 min=0.1103 max=0.1113
I ran it a few times and saw average values ranging from 0.1107 to 0.1111.
If you really, really want more accuracy, use --all to generate a
histogram from the all similarity scores. To process all pairs in
ChEMBL34 on my laptop it takes about 4.5 hours into 10,000 bins:
% chemfp simhistogram --bins 10000 chembl_34.fpb --all | grep average
histogram: 100%|██████████████████████████████| 2.90T/2.90T [3:55:15<00:00, 206M/s]
#average=0.11075 min=0.11070 max=0.11080
That took 4 hours on my 2020 MacBook Pro M1 laptop. Not bad! Next time I do this I’ll try 100,000 bins to get another digit of precision. Though, really, sampling is powerful! 1 billion samples and 100,000 bins consistently gives an average of 0.110747 ± 2 in the last position, and takes only a minute. Do you need more precision than that?
If you process a big data set like this, take a look at CPU usage. You
might notice simhistogram using more than one core. On my laptop it’s
using 8 OpenMP threads with a load average of 6.5. Use the
--num-threads/-j command-line option to set the number of
threads to use, or set the OMP_NUM_THREADS environment variable to
change the default number of threads for all OpenMP programs.
The keen-eyed among you might noticed it processed over 200 million fingerprints per second, while sampling processes about 14 million per second. This is because 1) the sampling algorithm which determines the pairs to evaluate adds overhead, and 2) the full search processes fingerprints in sequential order, which is a good fit to how computer hardware works, while sampling is less predictable, so more time is spent waiting to get fingerprint data from RAM rather than significantly faster cache memory.
This means if you have 100,000 pairs (not fingerprints!) then it may
be faster to compute the --all set than to sample 25,000 of
them. I say “may” because the sampling performance depends on quite a
few implementation and hardware details. Rather than hand-waving,
here’s some concrete numbers which shows the break-even point is
around 3600 fingerprints, which is 6.5 million pairs:
# When is it faster to generate a histogram from all N*(N-1)/2
# fingerprint pairs than from a sample of 25,000 pairs?
import time
import chemfp
arena = chemfp.load_fingerprints("chembl_34.fpb")
# label the columns
print(" N N*(N-1)/2 T_sample T_full (times in ms)")
for N in (225, 500, 750, 1000, 1500, 2000, 3000, 3600, 4000, 5000, 6000):
T_sample = T_full = 365*24*60*60 # effective infinity
# Determine the fastest time of 10 different subsets
for _ in range(10):
# Create a subset of N fingerprints
subset = arena.sample(N)
t1 = time.time()
# Generate a histogram using 25,000 samples
chemfp.simhistogram(targets=subset, progress=False)
t2 = time.time()
# Generate a histogram using all N*(N-1)/2 pair.
chemfp.simhistogram(targets=subset, num_samples=-1, progress=False)
t3 = time.time()
T_sample = min(T_sample, t2-t1)
T_full = min(T_full, t3-t2)
# Report for this N
print(f"{N:4d} {N*(N-1)//2:8d} {T_sample*1000:5.2f} {T_full*100:5.2f}")
Here is the output from one of my runs:
N N*(N-1)/2 T_sample T_full (times in ms)
225 25200 1.72 0.02
500 124750 2.46 0.06
750 280875 2.32 0.12
1000 499500 2.90 0.20
1500 1124250 2.58 0.47
2000 1999000 2.76 0.80
3000 4498500 2.13 1.78
3600 6478200 2.60 2.55
4000 7998000 2.32 3.16
5000 12497500 2.51 4.95
6000 17997000 2.55 7.11
Histogram between two data sets
In this section you’ll learn how to generate a histogram between two datasets.
The previous simhistogram examples generated a similarity histogram from pairs of fingerprint in a single dataset. Simhistogram can also generate a histogram from similarities between two datasets, where one fingerprint comes from the first dataset and the other comes from the second.
Note
This documentation was written during the transition from ChEBI 1 to ChEBI 2. Some of the URLs may now be invalid.
I’ll use the ChEBI RESTful API to download three subsets: retinoid (CHEBI:26537), purine (CHEBI:26401), and aminopurine (CHEBI:22527). The following uses the curl command-line tool though I’ve omitted the progress bar for clarity, and assumes a bash (or bash-like) shell.
% CHEBI_CHILDREN=https://www.ebi.ac.uk/chebi/backend/api/public/ontology/all_children_in_path/
% QUERY_ARGS="relation=is_a&has_structure=true&download=true"
% curl -o retinoid.tsv "$CHEBI_CHILDREN?$QUERY_ARGS&entity=CHEBI:26537"
% curl -o purine.tsv "$CHEBI_CHILDREN?$QUERY_ARGS&entity=CHEBI:26401"
% curl -o aminopurine.tsv "$CHEBI_CHILDREN?$QUERY_ARGS&entity=CHEBI:22527"
% wc -l retinoid.tsv purine.tsv aminopurine.tsv
82 retinoid.tsv
1264 purine.tsv
849 aminopurine.tsv
2195 total
Here are the first two lines of retinoid.tsv, with tabs expanded to spaces then folded to 80 columns to fit into this documentation.
% head -2 retinoids.tsv | expand | fold
chebi_accession name ascii_name stars formula charge mass monoisot
opicmass inchikey inchi smiles definition
CHEBI:132173 <i>all</i>-<i>trans</i>-7,8-dihydroretinol all-trans-7,8-di
hydroretinol 3 C20H32O 0 288.475 288.24532 XEMSPUZLYVPKPX-S
HGBQBHBSA-N InChI=1S/C20H32O/c1-16(8-6-9-17(2)13-15-21)11-12-19-18(3)10-7-14
-20(19,4)5/h6,8-9,13,21H,7,10-12,14-15H2,1-5H3/b9-6+,16-8+,17-13+ CC1=C(CC
/C(C)=C/C=C/C(C)=C/CO)C(C)(C)CCC1 A retinoid obtained by formal hydrogenat
ion across the 7,8-double bond of <i>all</i>-<i>trans</i>-retinol
It’s a tab-separated file with columns “chembi_accession”, “name”, “ascii_name”, and more. The key ones for later are “chebi_accession” and “smiles”, which for the first record (on the second line) are CHEBI:132173 and “CC1=C(CC/C(C)=C/C=C/C(C)=C/CO)C(C)(C)CCC1”.
The next step is to create RDKit Morgan fingerprints for each of the TSV files. I’ll use csv2fps to process the SMILES structure in the “smiles” column, and use the “chebi_accession” column for the record identifier.
% chemfp csv2fps --type RDKit-Morgan retinoid.tsv -o retinoid.fps \
--id-column chebi_accession --mol-column smiles --no-progress
% chemfp csv2fps --type RDKit-Morgan purine.tsv -o purine.fps \
--id-column chebi_accession --mol-column smiles --no-progress
% chemfp csv2fps --type RDKit-Morgan aminopurine.tsv -o aminopurine.fps \
--id-column chebi_accession --mol-column smiles --no-progress
If you do that you’ll see about 195 error messages for purine.tsv and 39 for aminopurine.tsv, all of which look like the following:
[15:06:05] Can't kekulize mol. Unkekulized atoms: 5 6 7 8 9 10 11 12 13
Error: RDKit cannot parse the SMILES 'NC(=O)CNc1ncnc2ncnc12', column
'smiles' (#11), record 3, line 4, 'purine.tsv'. Skipping.
I confirmed that RDKit was not able to parse the SMILES from the those columns. It’s a bit strange that the corresponding SDF file, available for download, contains a “<SMILES>” data item with a different SMILES which RDKit can read. I’ve contacted ChEBI for insight. Perhaps it’s fixed by the time you test this out?
So, how many fingerprints are in each file? Chemfp doesn’t (yet?) have a command-line tool for this, but with the FPS file I can count the non-header lines:
% grep -v ^# retinoid.fps | wc -l
81
% grep -v ^# purine.fps | wc -l
1068
% grep -v ^# aminopurine.fps | wc -l
809
I’ll compare the aminopurine and purine records, which should be quite similar as about 80 the purines are aminopurines. To do this I’ll generate a histogram with 10 bins:
% chemfp simhist --queries aminopurine.fps purine.fps --bins 10 --no-progress
#simhistogram/1 identity-bin=0
#type=sample matrix-type=NxM bins=10 num-samples=25000 seed=2095750783
#size=864012 queries=809 targets=1068
#identical=132
#average=0.56 min=0.51 max=0.61
#queries=aminopurine.fps
#targets=purine.fps
start end count percent
0.0 0.1 1627 6.508
0.1 0.2 2521 10.084
0.2 0.3 2015 8.060
0.3 0.4 1740 6.960
0.4 0.5 244 0.976
0.5 0.6 1594 6.376
0.6 0.7 3855 15.420
0.7 0.8 8223 32.892
0.8 0.9 2683 10.732
0.9 1.0 498 1.992
This histogram was generated from 25,000 samples of the 864,012 possible Tanimoto scores where one fingerprint cam from the 809 queries and the other fingerprint came from the 1068 targets.
Of these, 132 fingerprints were identical, and the peak was between 0.7 and 0.8 similarity, which for Morgan fingerprints indicates a pretty high similarity, as we expected.
By comparison, here’s a histogram between the retinoid and purine records:
% chemfp simhist --queries retinoid.fps purine.fps --bins 10 --no-progress
#simhistogram/1 identity-bin=0
#type=sample matrix-type=NxM bins=10 num-samples=25000 seed=3781269869
#size=86508 queries=81 targets=1068
#identical=0
#average=0.05 min=0.00 max=0.10
#queries=retinoid.fps
#targets=purine.fps
start end count percent
0.0 0.1 24731 98.924
0.1 0.2 269 1.076
0.2 0.3 0 0.000
0.3 0.4 0 0.000
0.4 0.5 0 0.000
0.5 0.6 0 0.000
0.6 0.7 0 0.000
0.7 0.8 0 0.000
0.8 0.9 0 0.000
0.9 1.0 0 0.000
Wow, what a difference!
Histogram between two datasets, including input histograms
The previous section showed how to generate a histogram between two datasets. In this section you’ll learn how to also include the symmetric histograms for the input query and target fingerprints.
Let’s go back to the purine vs. aminopurine comparison, where the
second is a proper subset of the first. I’ll use 20 bins from
--all pairs. I want to compare the distribution of the inter-group
comparison with the distribution of each intra-group comparison, so
I’ll also specify --include-inputs so simhistogram also computes
the histograms for the aminopurines against itself, and the purines
against itself.
% chemfp simhist --queries aminopurine.fps purine.fps --bins 20 \
--all --include-inputs --no-progress
#simhistogram/1 identity-bin=0
#type=all matrix-type=NxM bins=20
#size=864012 queries=809 targets=1068
#identical=3945
#average=0.558 min=0.533 max=0.583
#queries-type=all matrix-type=upper-triangular bins=20
#queries-size=326836 N=809
#queries-identical=1568
#queries-average=0.658 min=0.633 max=0.683
#targets-type=all matrix-type=upper-triangular bins=20
#targets-size=569778 N=1068
#targets-identical=1596
#targets-average=0.484 min=0.459 max=0.508
#queries=aminopurine.fps
#targets=purine.fps
start end count percent queries_count queries_percent targets_count targets_percent
0.00 0.05 16633 1.925 3020 0.924 15569 2.732
0.05 0.10 39497 4.571 11952 3.657 31639 5.553
0.10 0.15 44877 5.194 13449 4.115 34254 6.012
0.15 0.20 41054 4.752 5546 1.697 38492 6.756
0.20 0.25 41182 4.766 4710 1.441 40639 7.132
0.25 0.30 30579 3.539 353 0.108 34152 5.994
0.30 0.35 35856 4.150 8 0.002 38967 6.839
0.35 0.40 25921 3.000 23 0.007 28904 5.073
0.40 0.45 7201 0.833 16 0.005 9868 1.732
0.45 0.50 1847 0.214 111 0.034 3487 0.612
0.50 0.55 9493 1.099 4100 1.254 6528 1.146
0.55 0.60 46508 5.383 22633 6.925 24523 4.304
0.60 0.65 41601 4.815 20516 6.277 21521 3.777
0.65 0.70 88731 10.270 44299 13.554 44690 7.843
0.70 0.75 155404 17.986 77700 23.773 77891 13.670
0.75 0.80 129543 14.993 64770 19.817 64895 11.390
0.80 0.85 67473 7.809 33736 10.322 33798 5.932
0.85 0.90 24207 2.802 12102 3.703 12122 2.127
0.90 0.95 10296 1.192 5142 1.573 5159 0.905
0.95 1.00 6109 0.707 2650 0.811 2680 0.470
The --include-inputs output contains additional metadata lines
with information about the overall query-query comparisons and the
target-target comparisons, and columns with the corresponding
histogram bin information, both as raw counts and as scaled
percentages.
The total number of pairs is different for each sample so it’s easier to compare the percentages than the raw counts. For example, all three histograms peak in the range 0.70-0.75, and it’s pretty easy to see the queries-queries histogram (the aminopurines) has a higher peak than the other two.
The reported averages in the header shows the aminopurine intra-group average (the queries) is 0.658, the purine intra-group average (the targets) is 0.484, and the aminopurine-purine average is 0.558. This tells me the aminopurines cluster more tightly than purines, which makes sense.
To visualize the data, I copy&pasted it into a spreadsheet then manually configured it to generate the following:
What I see are three bimodal distributions. The values between 0.0 and 0.4 look like a random background distribution, with little similarity between the two other than the “purine” or “aminopurine” core, likely as part of a larger molecule.
The values around 0.75 show the aminopurines are more compact than the purines, and the aminopurine x purine comparison suggests the purine self-similarity is primarily due to the aminopurine self-similarity.
If you just want the table data in TSV or CSV without the headers then
use --out tsv or --out csv, or save the --output/-o to
a filename with the extension “tsv” or “csv”, respectively. For
example, the following shows the first 10 lines (the column headers
and the first 9 bins) out of the default 100 bins, generated using all
of the pairs between the aminopurine and purine data sets, as
comma-separated values (csv):
% chemfp simhist --queries aminopurine.fps purine.fps --all --out csv | head
0.00,0.01,24,0.003
0.01,0.02,237,0.027
0.02,0.03,2891,0.335
0.03,0.04,8331,0.964
0.04,0.05,5150,0.596
0.05,0.06,4029,0.466
0.06,0.07,5180,0.600
0.07,0.08,7874,0.911
0.08,0.09,11067,1.281
Butina on the command-line
In this section you’ll learn how to use the chemfp butina subcommand for Butina clustering. (See
Butina clustering for an example using the chemfp.butina() API
function. The chemfp 4.1 release notes have additional examples.)
Butina clustering (often called Taylor-Butina clustering, though Taylor didn’t apply the algorithm to clustering) first generates the NxN Tanimoto similarity array at a given threshold then generates clusters. It orders the fingerprints by neighbor count, so the fingerprints with the most neighbors come first. It selects the first ranked fingerprint, which is the first centroid, and includes its neighbors as part of the cluster. These fingerprints are removed from future consideration. The selection process then repeats until done, with a number of variations for how to define “done” and how to handle singletons.
The Butina algorithm is available as a subcommand of chemfp. Using a small set of fingerprints as an example:
% cat distinct.fps
1200ea4311018100 id1
894dc00a870b0800 id2
82b180a0943d0254 id3
d04a450013d8a479 id4
8025719c2c907c2e id5
918125cf96a2a360 id6
6b0d4215ca47fc03 id7
2eb000ce2e4337ef id8
849a6de873554c59 id9
b67121e96f48a365 id10
2ab0f5b64cb4b8cf id11
e7fedcefa4e7560e id12
I’ll cluster at a threshold of 0.41:
% chemfp butina distinct.fps -t 0.41
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=158559300 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i center_id count members
1 id8 5 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256
2 id9 2 id9 1.0000 id4 0.4103
3 id1 1 id1 1.0000
4 id2 1 id2 1.0000
5 id6 1 id6 1.0000
6 id7 1 id7 1.0000
7 id3 1 id3 1.0000
This output follows the usual chemfp output format style where the first line describes the format, followed by header lines starting with ‘#’, then the output data proper, using tab-separated columns.
The default uses a variable number of columns, with one line per cluster. The first column is the column id (an integer), the second is the fingerprint id for the center fingerprint, and the third is the number of elements in the cluster (including the center).
After the count are the id and score of every fingerprint in the
cluster, starting with the cluster center. These can be excluded using
--no-members:
% chemfp butina distinct.fps -t 0.41 --no-members --seed 158559300
#Centroid/1 include-members=0
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=158559300 \
false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i center_id count
1 id8 5
2 id9 2
3 id1 1
4 id2 1
5 id6 1
6 id7 1
7 id3 1
The “type” header shows some of the available options, like how to break ties or what to do with “false singletons” (these are cluster centers which no longer have neighbors because they were all assigned to another cluster). See chemfp butina --help for details.
Pruning the number of Butina clusters
Use --num-clusters or simply -n option to prune the clusters
down to a smaller value. In the following I’ve specified I want only 3
clusters, so four of the clusters will be pruned, and their
fingerprints moved to other clusters:
% chemfp butina distinct.fps -t 0.41 --num-clusters 3
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=3257902857 false-singletons=follow-neighbor num-clusters=3 rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
i center_id count members
1 id8 8 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256 id3 0.2381 id9 0.2917 id4 0.1702
2 id6 2 id6 1.0000 id1 0.2353
3 id7 2 id7 1.0000 id2 0.2973
Alternate Butina output formats
In this section you’ll learn about some of the Butina output formats and options. This uses the distinct.fps file from the previous section with 12 fingerprints.
The butina command includes a few options which control the output format, like the following which does not include the metadata (the lines starting with “#”), and which only outputs the cluster centers. It also specifies the initial seed for random number generation:
% chemfp butina distinct.fps -t 0.41 --no-members --no-metadata --seed 19573998
i center_id count
1 id8 5
2 id9 2
3 id3 1
4 id7 1
5 id1 1
6 id2 1
7 id6 1
The --out option supports several alternative formats. The “flat”
variant of the centroid format writes one output line for each
fingerprint, with the centroid id in the first column, the fingerprint
id in the second, the cluster element label in the third, and the
score in the fourth:
% chemfp butina distinct.fps -t 0.41 --out flat --seed 19573998
#Centroid-flat/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.2
#fingerprints=distinct.fps
centroid id type score
5 id1 CENTER 1.0000
6 id2 CENTER 1.0000
3 id3 CENTER 1.0000
2 id4 MEMBER 0.4103
1 id5 MEMBER 0.3256
7 id6 CENTER 1.0000
4 id7 CENTER 1.0000
1 id8 CENTER 1.0000
2 id9 CENTER 1.0000
1 id10 MEMBER 0.4651
1 id11 MEMBER 0.4130
1 id12 MEMBER 0.4400
The -no-rename option tells the “flat” output to use the full
internal labels for each fingerprint, which in this case shows that
“id5” was a false singleton moved to cluster 1, and “id1”, “id2”,
“id3”, and “id4” were moved by the pruning step:
% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
--seed 3601395311 --no-metadata --out flat --no-rename
centroid id type score
2 id1 MERGED 0.2353
3 id2 MERGED 0.2973
1 id3 MERGED 0.2381
1 id4 MERGED 0.1702
1 id5 MOVED_FALSE_SINGLETON 0.3256
2 id6 CENTER 1.0000
3 id7 CENTER 1.0000
1 id8 CENTER 1.0000
1 id9 MERGED 0.2917
1 id10 MEMBER 0.4651
1 id11 MEMBER 0.4130
1 id12 MEMBER 0.4400
The “csv” and “tsv” formats are comma- and tab-separated formats which are a hybrid between the “centroid” and “flat” formats. They have one output line for each fingerprint, ordered by cluster number (largest first) and further ordered by score, with the cluster center always first.
The output fields are the cluster id, the fingerprint id, the number
of elements in the cluster, the fingerprint type (“CENTER”, “MEMBER”,
and more if using --no-rename), and the score.
% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
--seed 3601395311 --out csv
cluster,id,count,type,score
1,id8,8,CENTER,1.0000
1,id10,8,MEMBER,0.4651
1,id12,8,MEMBER,0.4400
1,id11,8,MEMBER,0.4130
1,id5,8,MEMBER,0.3256
1,id3,8,MEMBER,0.2381
1,id4,8,MEMBER,0.1702
1,id9,8,MEMBER,0.2917
2,id6,2,CENTER,1.0000
2,id1,2,MEMBER,0.2353
3,id7,2,CENTER,1.0000
3,id2,2,MEMBER,0.2973
The output was formatted using Python’s csv module , with the “excel” and “excel-tab” dialects, respectively, which means the results should be easy to import into Excel. The writer may use special quoting rules should the identifier contain a comma, newline, or other special character.
Butina parameter tuning with npz files
In this section you’ll learn how to use an npz file to store an NxN similarity array so does not need to be re-computed as you tune the Butina parameters.
The Butina parameters almost certainly require some tuning to find the appropriate threshold, and to decide if cluster pruning is appropriate.
The Butina algorithm is quite fast. It takes just a few seconds, in the default case, to cluster the NxN similarity matrix from the 2+ million fingerprints in ChEMBL. Most of the time is spent formatting the output correctly!
On the other hand, even with chemfp it can take 30 minutes or even a couple of hours to generate the matrix in the first place - making tuning a practice in patience!
This is where simarray’s “npz” format comes in handy. You can generate the NxN matrix once, using the lower-bound similarity threshold for your search space, and save the result to an npz file. This pre-computed matrix can then be used as input to Butina clustering. All chemfp need to do then is count the number of entries at or above the Butina threshold, which is must faster then computing milions of Tanimoto calculations.
(Note: currently only a “no-diagonal” NxN matrix is supported. A future version may allow a full NxN matrix. Also, the entries should be ordered by decreasing score. If they are not in that order, chemfp will reorder them.)
You can use the simsearch command to
generate the npz file but it’s easier to use the --save-matrix
option of the Butina command. In the following I’ll write the output
to /dev/null and I’ll show timing details for the individual steps:
% chemfp butina --threshold 0.4 benzodiazepine.fps.gz \
--save-matrix benzodiazepine.npz --times -o /dev/null
load_fingerprints 0.06 NxN 0.50 save_matrix 2.79 cluster 0.00
rescore 0.00 output 0.01 total 3.43
What took the longest was the time to write the matrix to an npz file. The relatively low threshold of 0.4, combined with the high similarity of the benzodiazepines, mean there are 9,541,836 non-zero entries out of 153,400,610 possible non-zero values (the diagonals are not included), giving a density of 6.2%.
I can now cluster with the saved NxN matrix instead of using fingerprints, first, using all of the entries in the matrix:
% chemfp butina benzodiazepine.npz --times -o /dev/null
load_matrix 0.25 cluster 0.00 output 0.01 total 0.34
and then using --butina-threshold to change the minimum similarity
threshold for the Butina algorithm:
% chemfp butina benzodiazepine.npz --times -o /dev/null \
--butina-threshold 0.5
load_matrix 0.25 cluster 0.00 output 0.02 total 0.35
Number of Butina clusters in ChEMBL
In this subsection I’ll show an example of using the same simsearch npz result as input to Butina, to show how the number of clusters depends on the Butina threshold.
Those timing numbers probably don’t look that impressive for the small benzodiazepine data set. I’ll switch to “chembl_34_60.npz”, which contains the NxN similarity matrix for CheMBL 30 at threshold of 0.60 using RDKit’s default Morgan2 fingerprints, and see how the number of clusters changes for for 0.60, 0.61, 0.62, … 0.70.
This uses the licensed FPB file chembl_34.fpb, which includes an embedded chemfp license key so it can be used even under the terms of chemfp’s base license agrement.
The first step is to compute the matrix for the lowest desired Butina
threshold of 0.6. This can be done as part of the first Butina
calculation (using --save-matrix), but I’ll do it with the
simarray command-line program:
% simsearch --NxN chembl_34.fpb --threshold 0.6 \
--ordering decreasing-score -o chembl_34_60.npz
queries: 100%|██████████████████| 2409270/2409270 [5:04:10<00:00, 132.01 fps/s]
5 hours on my laptop, while doing other things, ain’t too shabby.
I use the --ordering option to improve the Butina clustering
performance by sorting the hits so those with high similarity scores
come before those with lower similarity scores. Otherwise, the hits in
a simsearch threshold do not have a specific order (it depends on the
fingerprint ordering and internal chemfp implementation details.)
The Butina algorithm needs the count of all neighbors within a specified similarity. This can be done in O(N) time if the hits are unsorted, but in O(log(N)) time if in decreasing-score order by using a binary search. You probably realize that a comparison sort requires O(N log(N)) time to sort the hits, which is worse then O(N). It’s a good thing then that chemfp uses a radix sort, with O(N) performance for similarity scores.
With the --ordering option, the simsearch output is appropriately
sorted, so the Butina analysis does not need to re-sort it each time.
Now that the simsearch array is (finally) generated, I’ll run Butina clustering for a threshold of 0.6 and get the last line. I’ll also have it report some internal timings:
% chemfp butina chembl_34_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 2.55 cluster 0.22 output 2.36 total 5.50
428230 CHEMBL4992339 1 CHEMBL4992339 1.0000000
The output order is a bit odd because the times line written to stderr occurs before the stdout output lines piped through tail, but easy enough to figure out.
This says it took about 2.5 seconds to read the file, 0.2 seconds to cluster, and 2.4 seconds to write the 428,230 lines of output.
The clustering time is quite quick - it takes 10 times longer to read the array and to format and print the output! If you really wanted to get the cluster count you could write a program using the chemfp API, but it’s quicker to run the commands and wait the extra few seconds:
% chemfp butina chembl_34_60.npz --butina-threshold 0.60 --no-progress | tail -1
428398 CHEMBL2228156 1 CHEMBL2228156 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.61 --no-progress | tail -1
462162 CHEMBL300581 1 CHEMBL300581 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.62 --no-progress | tail -1
491495 CHEMBL5016448 1 CHEMBL5016448 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.63 --no-progress | tail -1
524751 CHEMBL3601495 1 CHEMBL3601495 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.64 --no-progress | tail -1
557533 CHEMBL549663 1 CHEMBL549663 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.65 --no-progress | tail -1
593861 CHEMBL4648644 1 CHEMBL4648644 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.66 --no-progress | tail -1
633683 CHEMBL3197889 1 CHEMBL3197889 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.67 --no-progress | tail -1
675880 CHEMBL4908000 1 CHEMBL4908000 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.68 --no-progress | tail -1
718190 CHEMBL5009625 1 CHEMBL5009625 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.69 --no-progress | tail -1
766726 CHEMBL4553024 1 CHEMBL4553024 1.0000000
% chemfp butina chembl_34_60.npz --butina-threshold 0.70 --no-progress | tail -1
811376 CHEMBL4562579 1 CHEMBL4562579 1.0000000
I used --no-progress to disable the progress bars, which otherwise
appear after 0.5 seconds. You can use --progress or
--no-progress to always enable or diable them, or set the
environment variable “CHEMFP_PROGRESS” to “1” or “0”.
I also have the comparison matrix in unordered form (technically it’s an implementation-dependent ordering), which lets us see the Butina warning message that the arbitrary array needs resorting, and see how sorting adds an additional 1.2 seconds to the analysis:
% chemfp butina chembl_34_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 2.42 cluster 0.20 output 2.29 total 5.33
428488 CHEMBL4584568 1 CHEMBL4584568 1.0000000
% chemfp butina chembl_34_60_arbitrary.npz --butina-threshold 0.6 --times | tail -1
Warning: Matrix not in 'decreasing-score' order. Re-sorting.
load_matrix 2.57 sort 1.19 cluster 0.20 output 2.33 total 6.77
428398 CHEMBL4943518 1 CHEMBL4943518 1.0000000
The output differences are due to how chemfp’s Butina implementation
handles ties. For example, if two different fingerprints have the same
number of neighbors, which should be picked as the next centroid?
Chemfp by default randomizes the order, so the results will likely be
different for each run. Use --tiebreaker to specify that the first
or last centroid instead.
Sphere exclusion
In this section you’ll learn how to use sphere exclusion to sample
diverse fingerprints from a data set. You will need a copy of
chembl_34.fpb. (See Sphere exclusion
for an example of sphere exclusion using the chemfp.spherex()
API function.)
Sphere exclusion (see Hudson et al. https://doi.org/10.1002/qsar.19960150402) is a form of diversity sampling. It picks an initial fingerprint at random then remove it and all fingerprints within some threshold of similarity form it (that is, all fingerprints within the sphere set by the threshold).
The sphere exclusion prevents future picks will not be similar to previous picks, which is why this is a form of diversity selection.
Sphere exclusion is available on the command-line using the chemfp spherex subcommand. By default it uses a threshold of 1.0 (so only exact matches are excluded) and it keeps picking until all of the fingerprints have been picked.
The following example picks 10 fingerprints using a similarity threshold of 0.4:
% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4
#Spherex/1
#type=spherex threshold=0.4 num-picks=10 randomize=1 seed=616321679
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T12:06:11+00:00
i pick_id count
1 CHEMBL465089 57
2 CHEMBL249957 131
3 CHEMBL4560417 526
4 CHEMBL549096 172
5 CHEMBL427558 1494
6 CHEMBL360197 46
7 CHEMBL47199 111
8 CHEMBL3655944 197
9 CHEMBL89751 79
10 CHEMBL2087345 82
This format follows chemfp’s usual form, with a metadata section (the lines starting with a #) followed by a data section. The first line describes the format type , followed by metadata lines containing key/value pairs about the search. For example, the “type” line describes the applicable sphere exclusion parameters, which can be used to reproduce the results.
In particular, spherex initially randomizes the fingerprint order. If
the seed isn’t specified then it will use Python’s random module to get the
initial seed, which is reported in the type. The following specifies
the --seed, to show that the result is reproducible:
% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 --seed 616321679 \
--no-metadata
i pick_id count
1 CHEMBL465089 57
2 CHEMBL249957 131
3 CHEMBL4560417 526
4 CHEMBL549096 172
5 CHEMBL427558 1494
6 CHEMBL360197 46
7 CHEMBL47199 111
8 CHEMBL3655944 197
9 CHEMBL89751 79
10 CHEMBL2087345 82
I used --no-metadata to omit the header section. There’s also
--no-randomize, which will start with the first fingerprint in the
data set. Since the fingerprints are sorted by popcount, this means
fingerprints with the fewest set bits will be picked first, which
hasn’t
The data comes after the header, in a tab-separated column format. The first line after the metadata section contains the column titles, which in this case are “i” (the pick number), “pick_id” (the id for the picked fingerprint) and “count” (the number of fingerprints in the exclusion sphere, including the pick itself).
Sphere exclusion output options
In this section you will learn about some of the spherex options which affect output, like writing output in CSV format and changing what values are reported.
Chemfp uses its own output format so it can includes metadata in the
output. You can use --no-metadata, as in the previous section, to
omit that section, which leaves simple tab-separated format.
However, if you need an Excel-compatible tab-separated or format then you must specify the “tsv” output format, because the identifer may contain a special character, like a double quote (or comma for the “csv” format), which has a special meaning for Excel:
% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 \
--seed 616321679 --out tsv
pick_id count
CHEMBL465089 57
CHEMBL249957 131
CHEMBL4560417 526
CHEMBL549096 172
CHEMBL427558 1494
CHEMBL360197 46
CHEMBL47199 111
CHEMBL3655944 197
CHEMBL89751 79
CHEMBL2087345 82
Use the “centroid” output format so the output uses the same format as the Butina output:
% chemfp spherex chembl_34.fpb -n 10 --threshold 0.4 \
--seed 616321679 --out centroid
#Centroid/1 include-members=0
#type=spherex threshold=0.4 num-picks=10 randomize=1 seed=616321679
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T13:58:23+00:00
i center_id count
1 CHEMBL465089 57
2 CHEMBL249957 131
3 CHEMBL4560417 526
4 CHEMBL549096 172
5 CHEMBL427558 1494
6 CHEMBL360197 46
7 CHEMBL47199 111
8 CHEMBL3655944 197
9 CHEMBL89751 79
10 CHEMBL2087345 82
The --include-members option includes all of the sphere members in
the output. For the “spherex” and “centroid” output these are all on
one line, after the other columns, alternating between hit id and hit
score:
% chemfp spherex chembl_34.fpb -n 4 --threshold 0.75 --seed 4 \
--include-members --no-metadata
i pick_id count hits
1 CHEMBL1643627 1 CHEMBL1643627 1.0000000
2 CHEMBL1703947 2 CHEMBL1703947 1.0000000 CHEMBL1304130 0.7800000
3 CHEMBL400879 1 CHEMBL400879 1.0000000
4 CHEMBL3954540 3 CHEMBL3954540 1.0000000 CHEMBL3936358 0.7500000 CHEMBL3964946 0.7500000
For the “tsv” and “csv” output each match is written on its own line, with the pick id, the hit id, and the hit score:
% chemfp spherex chembl_34.fpb -n 4 --threshold 0.75 --seed 4 \
--include-members --out csv
pick_id,hit_id,hit_score
CHEMBL1643627,CHEMBL1643627,1.0000000
CHEMBL1703947,CHEMBL1703947,1.0000000
CHEMBL1703947,CHEMBL1304130,0.7800000
CHEMBL400879,CHEMBL400879,1.0000000
CHEMBL3954540,CHEMBL3954540,1.0000000
CHEMBL3954540,CHEMBL3936358,0.7500000
CHEMBL3954540,CHEMBL3964946,0.7500000
Sphere exclusion with initial picks, and saving candidates
In this section you’ll learn how to specify the initial sphere exclusion picks, and how to save the remaining unpicked fingerprints. You will need a copy of chembl_34.fpb.
Sometimes you know which fingerprints to use a cluster centers,
perhaps because of some other selection method, or because there are
specific regions you want to exclude before processing. One option is
to specify a --pick-id for each id to process:
% chemfp spherex chembl_34.fpb -n 5 -t 0.4 --seed 4 \
--pick-id CHEMBL113 --pick-id CHEMBL3977198
#Spherex/1
#type=spherex num-initial-picks=2 threshold=0.4 num-picks=5 randomize=1 seed=4
#num_bits=2048
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T13:32:33+00:00
i pick_id count
1 CHEMBL113 420
2 CHEMBL3977198 1019
3 CHEMBL1715446 36
4 CHEMBL1741912 320
5 CHEMBL5193521 73
The other is to list the initial pick ids in a file, with one id per
line, and specify the filename with --pick-id-file:
% cat centers.txt
CHEMBL113
CHEMBL3977198
% chemfp spherex chembl_34.fpb -n 5 -t 0.4 --seed 4 \
--pick-id-file centers.txt --out csv
pick_id,count
CHEMBL113,420
CHEMBL3977198,1019
CHEMBL1715446,36
CHEMBL1741912,320
CHEMBL5193521,73
After sphere picking finishes, the remaining candidate fingerprints
can be saved to the --save-candidates file, which lets you use
sphere exclusion to carve out the specified ids, and all fingerprints
similar to those ids, though make sure that -n matches the number
of initial picks.
Directed sphere exclusion
In this section you’ll learn how to do directed sphere exclusion. You
will need a copy of chembl_34.fpb and,
one of the examples requires the RDKit. (See Directed sphere exclusion
for an example of directed sphere exclusion using the
chemfp.spherex() API function.)
Regular (undirected) sphere exclusion selects each pick at random from the remaining fingerprints. Directed sphere exclusion uses some other method to select the next pick.
The DISE paper (see Gobbi and Lee, https://doi.org/10.1021/ci025554v) suggests directing the picks based on their similarity to three reference drug structures, that is, pick the fingerprint which is closest to the first structure (sildenafil citrate), with ties broken by similarity to the second (isradipine), and then third (benazepril hydrochloride), with any remaining ties broken at random.
This biases selection towards those drug-like structures, while using sphere exclusion to improve diversity over similarity ranking.
Use the --dise flag for directed sphere exclusion following the
DISE paper (note: this requires RDKit, to generate the appropriate
fingerprints from the reference structure SMILES):
% chemfp spherex --dise chembl_34.fpb -n 5 -t 0.4 --seed 4 --out csv
pick_id,count
CHEMBL1737,318
CHEMBL195236,44
CHEMBL4539785,34
CHEMBL2017055,17
CHEMBL460085,43
Use --dise-references to specify your own reference fingerprints
or reference structures.
Alternatively use a ranking file (with --ranks) to specify the
rank order directly. This may in one of three styles, based on the
filename extension or value of --rank-format.
The “tsv” format contains two tab-separated columns, where the first is the identifier and the second is the rank score (which must be a non-negative integer or float):
% cat ranks.tsv
CHEMBL113 123
CHEMBL2017055 888
CHEMBL1737 999
CHEMBL4539785 888
% chemfp spherex --ranks ranks.tsv chembl_34.fpb -n 5 -t 0.4 \
--seed 4 --out csv
CHEMBL113,420
CHEMBL2017055,147
CHEMBL4539785,87
CHEMBL1737,135
CHEMBL17564,1
(Use --ranks-has-header to skip the first line of the ranks file, in
case it contains a header.)
If no rank is specified, the value of --ranks-default is used,
which in turn defaults to 4294967295.
The “txt” ranking format contains one identifier per line, which will
be processed in order. This is subtly different from the
--pick-id-file because the pick file will always pick a
fingerprint, even if it is in the excluded sphere of another
fingerprint, and because of processing differences between unpicked
and unranked fingerprints.
Finally, the ranks may be a fingerprint file (eg, in “fps” or “fpb”) format, in which the fingerprints are ranked by the order in the file, so the first fingerprint come first, etc.
MaxMin diversity selection
In this section you’ll learn how to do MaxMin diversity selection
using the chemfp maxmin command-line tool. You will need a
copy of chembl_34.fpb. (See
Select diverse fingerprints with MaxMin for an example of MaxMin using the
chemfp.maxmin() API function.)
MaxMin (see Ashton, et al., doi:10.1002/qsar.200290002) is a method for selecting diverse fingerprints. It starts with an initial fingerprint selected from a set of candidate fingerprints. (See below for more about the initial pick.) It then finds a candidate fingerprint which is the least similar to the initial pick, that is, a fingerprint from the set of fingerprints with the lowest Tanimoto similarity to the initial pick. This candidate is added to the picked fingerprints.
The third pick is a fingerprint from the set of fingerprints which have the lowest similarity to either the initial pick and the second pick. (This algorithm is called “MaxMin” because it’s looking for a candidate fingerprint whose maximum similarity to any of the picked fingerprints is the minimum.)
This process repeats until no candidates are left, or enough fingerprints have been picked, or the minimum similarity threshold is too high, or the user gives up and stops the program.
Here are 5 of the most diverse compounds in ChEMBL 34:
% chemfp maxmin chembl_34.fpb -n 5 --times
#Diversity/1
#num_bits=2048
#type=maxmin threshold=1.0 num-picks=5 all-equal=0 randomize=1 seed=4185699608
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-20T14:45:26+00:00
i pick_id score
1 CHEMBL4297424 0.0000000
2 CHEMBL4541723 0.0000000
3 CHEMBL2006665 0.0000000
4 CHEMBL1201084 0.0000000
5 CHEMBL1200727 0.0000000
T_init: 0.04 T_pick: 14.93 #picks: 5 picks/s: 0.33 T_total: 15.00
They all have a similarity score of 0.0 with each other, and the total calculation took 15 seconds, nearly all of which was spent in the MaxMin pick algorithm. (As we’ll see in a bit, nearly all of that was spent making the first pick.)
The output format, like most chemfp formats, is line-oriented, with a header section containing lines starting with a “#” followed by a data section containing lines which do not start with a “#”.
The key/value metadata in the header describes the MaxMin processing. For example, the “type” line describes the applicable MaxMin parameters, which can be used to reproduce the results, and the “candidates” line gives a hint about the source of the candidate fingerprints.
The data lines are tab-separated columns. In this case there are three columns, whose column titles are in the first data line. The first column (“i”) contains the pick index, the second (“pick_id”) the identifier for the picked fingerprint, and the third (“score”) the maximum similarity score between the pick and all previous picks.
By definition the first pick always has a score of 0.0, and the score column will never decrease.
The times in this output (the last line, which was send to stderr), report the time to initialize the internal MaxMin data structure, the total time for picking, the average time for picking, and the total MaxMin time (minus the Python startup time).
Note
You should filter out all unreasonable fingerprints before using MaxMin, because the most diverse compounds in datasets like ChEMBL are often least drug-like compounds.
The initial MaxMin pick
In this section you’ll learn how MaxMin does the initial pick, how to specify the initial, and learn about a handy rule-of-thumb. You will need a copy of chembl_34.fpb.
The --pick-time option adds a column to the MaxMin output with the
total elapsed time for each pick:
% chemfp maxmin chembl_34.fpb -n 5 --pick-time --out tsv --times
pick_id score pick_time
CHEMBL1796997 0.0000000 4.15
CHEMBL3181975 0.0000000 4.15
CHEMBL1909275 0.0000000 4.16
CHEMBL4298415 0.0000000 4.16
CHEMBL4302507 0.0000000 4.16
T_init: 0.02 T_pick: 4.16 #picks: 5 picks/s: 1.20 T_total: 4.20
This says MaxMin took 4.2 seconds, of which 4.15 seconds were needed for the first pick.
Why did the first one take so long?
The MaxMin algorithm starts with an initial seed. Many algorithms pick
this seed at random, which is indeed faster. To see what I mean, the
following uses the --pick-index option to select the initial pick
by index, where the special value -1 means to select the index at
random:
% chemfp maxmin chembl_34.fpb -n 5 --pick-index -1 \
--pick-time --out tsv --times
pick_id score pick_time
CHEMBL5219064 0.0000000 1.46
CHEMBL3988898 0.0000000 1.46
CHEMBL1200731 0.0000000 1.46
CHEMBL4298415 0.0000000 1.46
CHEMBL1098659 0.0000000 1.46
T_init: 0.06 T_pick: 1.46 #picks: 5 picks/s: 3.42 T_total: 1.56
That’s 2.7 seconds faster. So what is chemfp doing in those 2.7 seconds?
It’s very unlikely that a randomly picked seed will be a most diverse fingerprint (a fingerprint whose maximum similarity to every other fingerprint is the lowest). Instead, chemfp uses a novel algorithm, called “heapsweep”, to find a most diverse fingerprint, which takes time.
As an interesting observation, I’ve found that the most diverse
fingerprint is often the one with the fewest bits set, so if the
fingerprints are sorted by popcount (as they are by default in an FPB
file) then using --pick-index 0 is a good first pick - better than
a random pick, even if not guaranteed to be a best pick.
Take CHEMBL1796997 as an example, which took 4.15 seconds to identify as an initial pick. That’s the 4th fingerprint in chembl_34.fpb, while the first one is CHEMBL17564:
% fpcat chembl_34.fpb --no-metadata | \
awk '/CHEMBL1796997$/ {print NR; exit}'
4
% fpcat chembl_34.fpb --no-metadata | awk '{print $2; exit}'
CHEMBL17564
It’s hard to tell if one initial pick is better than the other, because the MaxMin algorithm is an approximate method. At each step the next pick is arbitrarily selected from possibly many candidates with an equal MaxMin score to the existing picks (eg, at random, or based on the fingerprint index), because using a local selection method like this is faster than a method like the full heapsweep algorithm, which globally optimizing the next pick.
To get a sense for the difference, I’ll get the score for the 1,000th pick, using each of the ids, using the following:
% chemfp maxmin -n 1000 --pick-id CHEMBL17564 chembl_34.fpb | tail -1
1000 CHEMBL2074904 0.1458333
After 5 runs the sorted scores for CHEMBL17564 (the first fingerprint) are: 0.1454545, 0.1454545, 0.1456311, 0.1456311, and 0.1458333.
For CHEMBL1796997 they are 0.1456311, 0.1458333, 0.1458333, 0.1460674, and 0.1461538, which are slightly higher.
For a --pick-index -1 (a randomly selected initial seed), the
sorted scores after 5 runs are 0.1454545, 0.1458333, 0.1458333,
0.1459854, and 0.1463415, which is about the same as using heapsweep
for the initial pick.
On the other hand, if we want a sort of “null space”, that is, a set of fingerprints which are all 0.0 to each other, using something like:
% chemfp maxmin -t 0.0 --pick-id CHEMBL17564 chembl_34.fpb | tail -1
50 CHEMBL1233550 0.0000000
then after 10 runs we find a maximum of 56 for CHEMBL17564, a maximum of 58 for CHEMBL1796997, and a maximum of 52 for a randomly selected initial seed.
It can be quite hard to make sense of these results, and I’m not convinced I fully understand it myself.
MaxMin diversity selection including references
In this section you’ll learn how to use MaxMin
to pick diverse fingerprints from a set of candidates which are also
diverse from a set of references. You will need a copy of
chembl_34.fpb, and a copy of
chembl_33.fpb from the same download page. (See Use MaxMin with references
for an example of MaxMin with references using the
chemfp.maxmin() API function.)
The MaxMin algorithm can be applied to select diverse compounds from a vendor catalog (“the candidates”) which are also diverse from a local collection (“the references”).
For example, what fingerprint in ChEMBL 34 is the most diverse from any other fingerprint in ChEMBL 34 and is also diverse from any fingerprint in ChEMBL 33?
% chemfp maxmin -n 5 --references chembl_33.fpb chembl_34.fpb \
--out tsv --pick-time --times
pick_id score pick_time
CHEMBL5308483 0.1666667 4.00
CHEMBL1235452 0.2000000 7.58
CHEMBL5282109 0.2222222 15.50
CHEMBL5267320 0.2307692 20.25
CHEMBL5271563 0.2380952 25.16
T_init: 0.03 T_pick: 25.16 #picks: 5 picks/s: 0.20 T_total: 25.20
To verify the first one, the following gets CHEMBL5308483 from ChEMBL 34 then does a k=1 similarity search with ChEMBL 33 to find that, yes, the maximum similarity score is only 0.1666667:
% fpcat chembl_34.fpb | awk '/CHEMBL5308483$/ {print; exit}' | \
simsearch -k 1 chembl_33.fpb --no-metadata
1 CHEMBL5308483 CHEMBL3185131 0.1666667
Heapsweep diversity selection
In this section you’ll learn how to use the chemfp heapsweap command-line tool for global diversity
picking. You will need a copy of chembl_34.fpb. (See Select diverse fingerprints with Heapsweep for an example of
heapsweep using the chemfp.heapsweep() API function.)
The heapsweep algorithm is a method for diversity selection which successively picks a fingerprint with the smallest maximum similarity to all other fingerprints. Unlike the MaxMin algorithm, which is a local method which compares the candidates to the previously picked candidates, heapsweep which compares the candidates to all other fingerprints.
According to heapsweep, the 10 globally most diverse compounds in ChEMBL 34 are:
% chemfp heapsweep -n 10 chembl_34.fpb
#Diversity/1
#num_bits=2048
#type=heapsweep threshold=1.0 num-picks=10 all-equal=0 randomize=1 seed=281934854
#software=chemfp/4.2
#candidates=chembl_34.fpb
#date=2024-06-25T11:10:47+00:00
i pick_id score
1 CHEMBL1796997 0.0769231
2 CHEMBL4297424 0.0769231
3 CHEMBL4300465 0.0833333
4 CHEMBL1201290 0.0833333
5 CHEMBL1200673 0.0909091
6 CHEMBL4298559 0.0909091
7 CHEMBL1201326 0.1000000
8 CHEMBL1233877 0.1000000
9 CHEMBL4297653 0.1000000
10 CHEMBL2146126 0.1111111
The order might differ between runs because of ties, like how CHEMBL1796997 and CHEMBL4297424 have the same global score of 0.0769231. The heapsweep implementation starts with a few random sweeps, where each sweep selects a fingerprint at random and finds the score to every other fingerprint. The maximum of these scores are used to initialize the heapsweep search queue, which means the queue order depends on the randomly selected sweeps, which affects the output order.
The output format follows the usual chemfp convention of a header, where the first line identifies the format and the remaining header lines contain key/value pairs describing the search parameters and other metadata. The data lines are tab separated columns. The first data line contains the column headers, which in this case are “i” for the pick index, “pick_id” for the picked fingerprint id, and “score” for the global score.
We can verify that the nearest neighbor to CHEMBL1796997 has a similarity score of 0.0769231 using the following:
% fpcat chembl_34.fpb | awk '/CHEMBL1796997$/ {print; exit}' | \
simsearch -k 2 chembl_34.fpb --no-metadata
2 CHEMBL1796997 CHEMBL1796997 1.0000000 CHEMBL1992013 0.0769231
It’s much more difficult to verify that it’s indeed the lowest score, without generating all other scores.
The heapsweep scores are global scores, and cannot be directly compared to the MaxMin scores, which are only computed relative to the previous picks (or 0.0 by definition for the first MaxMin pick).
The heapsweep algorithm is significantly slower than MaxMin. For example, it took nearly 13 minutes for my laptop to find the first 200 most diverse fingerprints in ChEMBL 34, and nearly 34 minutes for the first 1,000:
% chemfp heapsweep -n 1000 chembl_34.fpb --pick-time
...
i pick_id score pick_time
... only showing every 100th line ...
100 CHEMBL4298435 0.2500000 502.10
200 CHEMBL3188899 0.2702703 768.25
300 CHEMBL1522189 0.2857143 992.69
400 CHEMBL1163792 0.2957746 1185.38
500 CHEMBL4747791 0.3037975 1340.15
600 CHEMBL2136537 0.3125000 1523.69
700 CHEMBL4463753 0.3188406 1666.25
800 CHEMBL4062050 0.3246753 1802.26
900 CHEMBL2164954 0.3333333 1960.64
1000 CHEMBL2007107 0.3333333 2019.47
while the equiavlent for maxmin takes 13 seconds for the entire run - that’s more than 150x faster!:
% chemfp maxmin -n 1000 chembl_34.fpb --pick-time
...
i pick_id score pick_time
... only showing every 100th line ...
100 CHEMBL1878565 0.0684932 6.10
200 CHEMBL1096422 0.0957447 7.62
300 CHEMBL1927792 0.1111111 8.78
400 CHEMBL543312 0.1186441 9.51
500 CHEMBL3337642 0.1250000 10.20
600 CHEMBL80560 0.1309524 10.85
700 CHEMBL1321270 0.1363636 11.54
800 CHEMBL575927 0.1397849 12.08
900 CHEMBL1214299 0.1428571 12.58
1000 CHEMBL4749940 0.1458333 13.07
I don’t think you should use heapsweep as a general-purpose diversity selection algorithm.
In practice, most people want picks which are highly diverse from each other. My experiments show the intra-group diversity of the MaxMin picks (because it uses a heuristic optimized for intra-group diversity) is higher than the heapsweep picks (which includes global information).
Furthermore, heapsweep is a single-threaded implementation with a lot of random memory lookups, which are a lot slower than the sequential lookups of simsearch or simarry, both of which are parallelized. It takes simarray less than a day to compute the entire NxN ChEMBL matrix, so if you really want the ordered ranking for a large number of fingerprints then don’t hesitate to go ahead and compute everything.
On the other hand, heapsweep is very useful for picking a globally most diverse fingerprint as the initial seed to MaxMin, because that only takes a few seconds. That’s why chemfp maxmin uses it to pick the initial seed, if none is specified.
If there are global heapsweep ties then MaxMin will choose the initial
seed arbitrarily from them. If you want to explore how MaxMin changes
based on the seed, you can use heapsweep to report all ties by asking
for one pick (-n 1), as well as all other picks with the same
score (--all-equal):
% chemfp heapsweep -n 1 --all-equal chembl_34.fpb --out tsv
pick_id score
CHEMBL1796997 0.0769231
CHEMBL4297424 0.0769231
Now that you know the ids, use chemfp maxmin --pick-id $ID .. to
specify each one as the initial seed.
Structure file format translation
In this section you’ll learn how to use chemfp translate to convert from one structure file format to another.
Chemfp has a cross-toolkit API for reading a stream of toolkit-specific molecules from a structure file. These are used to generate fingerprint from a structure file. It also has a cross-toolkit API for writing molecule to a structure file. It was originally developed to add a fingerprint to a tag data field of an SD record, and later developed as a more general-purpose writer.
The chemfp translate command-line tool puts the reader and writer together to make a structure file converter. The following converts a SMILES file record to an InChI record, and then to an SDF record:
% echo "C methane" | chemfp translate --out inchi
InChI=1S/CH4/h1H4 methane
methane
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
By default it uses the “rdkit” toolkit if available, otherwise
“openbabel”, otherwise “openeye”, otherwise “cdk”. Use the -T
option to specify one or more toolkit names, as a comma-separated
list, to change the order. The following uses the CDK to convert the
methane input into SDF V3000 output:
% echo "C methane" | chemfp translate -T cdk --out sdf3k
methane
CDK 0625241606
0 0 0 0 0 999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1 0 0 0 0
M V30 BEGIN ATOM
M V30 1 C 0 0 0 0
M V30 END ATOM
M V30 END CTAB
M END
$$$$
The default is to read a SMILES file from stdin and write a SMILES
file to stdout. Input and output filenames can be specified on the
command-line, in which case the extensions are used to identify the
format, with SMILES as the default. Use --in and --out to
specify the formats directly.
For example, the following converts the ChEBI SDF to a SMILES. The SDF
title line does not contain a record identifier so it uses the
--id-tag to tell chemfp that the “ChEBI ID” data tag value
contains the title:
% chemfp translate ChEBI_lite_3star.sdf.gz --id-tag "ChEBI ID" -o chebi_lite.smi
... lots of RDKit processing messages ...
% head -2 chebi_lite.smi
Oc1cc(O)c2c(c1)O[C@H](c1ccc(O)c(O)c1)[C@H](O)C2 CHEBI:90
CC1(C)C(=O)[C@@]2(C)CC[C@@H]1C2 CHEBI:165
Structure translation reader and writer args
In this section you’ll learn how to use the -R and -W options
to set the reader and writer arguments to chemfp translate.
Many of the structure readers and writers support configuration options. For example, the SMILES readers have a “delimiter” option which affects how to parse the fields in a line. The default is “to-eol”, which use the first terms for the SMILES or CXSMILES, and the rest of the line as the title, which in the following is “water molecule”:
% echo "O water molecule" | chemfp translate
O water molecule
Use the -R option to set reader arguments, as “key=value”
pairs. The “whitespace” delimiter argument says to split the remaining
terms on the line by whitespace, and use the first term as the id,
which in the following is only “water”:
% echo "O water molecule" | chemfp translate -R delimiter=whitespace
O water
Many of the options are toolkit and format specific. For example, the CDK SMILES reader has a “hydrogens” reader argument which, if “make-explicit”, will make all implicit hydrogens become explicit hydrogens:
% echo "O water" | chemfp translate -T cdk
O water
% echo "O water" | chemfp translate -T cdk -R hydrogens=make-explicit
O([H])[H] water
Writers may have their own arguments, which are specified with
-W. As a simple one, the InChI writers by default include the
record id after the InChI string:
% echo "C methane" | chemfp translate -T openbabel --out inchi
InChI=1S/CH4/h1H4 methane
Use the “include_id” writer argument to configure that option, with “1” for on (the default) and “0” for off:
% echo "C methane" | chemfp translate -T openbabel --out inchi -W include_id=0
InChI=1S/CH4/h1H4
As a fun one, Open Babel’s “ascii” output format writes a structure diagram using ASCII characters. The default width is 79 characters, but I can specify an alternative using a Open Babel “options” string (formatted for use by OBConversion::SetOptions), which sets the width to 20:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | \
chemfp translate -T openbabel --out ascii -W 'options=h"20"'
O
|| /
|| |
\_ |
_/ \_ N
_/ \_ __ \
HN / \__/ \
|| \
| || / /
| || / /
| _/_ / /
_\_ _/ \___ /
__// \_ / N
O / \ N
|
|
|
Note that the double quoting (with single quotes containing double quotes) is used to prevent the shell from parsing the double quote that Open Babel needs when parsing the options string.
Structure translation with two toolkits
In this section you’ll learn how to have chemfp translate use one toolkit to read molecules from source one format then convert them via an intermediate format so they can be written by another toolkit.
This section is mostly for the fun of being able to do it. I would like to know if you actually find it useful.
Suppose you wanted to convert a Maestro record into CML. The RDKit supports Maestro files, but not CML, which is only supported by Open Babel and CDK.
One way, which isn’t so useful, is to use chemfp translate to convert the file (in this case a test file from the RDKit test suite) into SMILES:
% chemfp translate -T rdkit NCI_aids_few.mae | head -5
CCC1=[O+][Cu@]2([O+]=C(CC)CC(CC)=[O+]2)[O+]=C(CC)C1 48
C(=C\c1ccccc1)\C1=[O+][Cu@@]2([O+]=C(/C=C/c3ccccc3)CC(c3ccccc3)=[O+]2)[O+]=C(c2ccccc2)C1 78
CC(=O)N1c2ccccc2Sc2c1ccc1ccccc21 128
Nc1ccc(/C=C/c2ccc(N)cc2[S@](=O)(=O)O)c([S@@](=O)(=O)O)c1 163
O=[S@@](=O)(O)CC[S@@](=O)(=O)O 164
then pipe the SMILES output so CDK can translate the result to CML:
% chemfp translate -T rdkit NCI_aids_few.mae | \
chemfp translate -T cdk -o out.cml
% head -5 out.cml
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
<atomArray>
<atom id="a1" elementType="C" formalCharge="0" hydrogenCount="3"/>
<atom id="a2" elementType="C" formalCharge="0" hydrogenCount="2"/>
I say isn’t so useful because the input contains coordinates which
were lost through the conversion via SMILES. Better would be to use
SDF, preferably as V3000, to preserve as much as possible. I’ll do
this by changing the --out of the first program and --in of
the second to “sdf3k”:
% chemfp translate -T rdkit NCI_aids_few.mae --out sdf3k | \
chemfp translate -T cdk --in sdf3k -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
<atomArray>
<atom id="1" elementType="C" x3="1.397061" y3="-2.611162" z3="-1.31402"
formalCharge="0" hydrogenCount="2"/>
<atom id="2" elementType="C" x3="0.617928" y3="-3.334828" z3="-0.196596"
formalCharge="0" hydrogenCount="3"/>
You likely also noticed the atom order has changed, as a consequence of SMILES canonicalization and/or the way RDKit generated the SMILES spanning tree.
Another issue is error reporting. The first toolkit could fail to parse an input record, or generate the intermediate record, and the second toolkit fail to parse that intermediate record or generate the final output record. If the first toolkit skip records it can’t process, then the record numbers for the first and second toolkit will differ, because the second toolkit knows nothing about the skipped records. This results in potentially confusing error messages.
Take for example this SMILES file, which I’ll convert to InChI. Note that “Q” is not a valid SMILES and “*” cannot be represented in InChI:
% cat test.smi
C first
Q second
* third
O fourth
% chemfp translate -T rdkit test.smi --out sdf3k | chemfp translate -T cdk --in sdf3k --out inchi
[21:08:23] SMILES Parse Error: syntax error while parsing: Q
[21:08:23] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Error: RDKit cannot parse the SMILES 'Q', file 'test.smi', line 2, record #2: first line is 'Q second'. Skipping.
Error: CDK cannot create the InChI string (input title='third'), file '<stdin>', record #2. Skipping.
InChI=1S/CH4/h1H4 first
InChI=1S/H2O/h1H2 fourth
The two “SMILES Parse Error” lines come from the RDKit, from the C++ layer, when parsing the second record.
The “Error: RDKit cannot parse” line come from chemfp, because by default it gives a report about any processing errors, including location information, then goes on to the next reocrd. In this case it says there was an error in record #2 (on line 2) because the SMILES “Q” could not be parsed.
The “Error: CDK cannot create” line also comes from chemfp, again because it reports any processing errors, in this case because it can’t convert the “*” into an InChI. But see how it says the error is again in record #2? That’s because the second input record was skipped, so the CDK stage is actually only processing input records 1, 3, and 4.
The last two lines are the InChIs for the methane and water molecules. Thes were sent to stdout while the other messages were sent to stderr.
Instead, the translate command has built-in support for this two-stage processing, like the following:
% chemfp translate -T rdkit -U cdk test.smi --out inchi
[21:11:50] SMILES Parse Error: syntax error while parsing: Q
[21:11:50] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Error: RDKit cannot parse the SMILES 'Q', file 'test.smi', line 2, record #2: first line is 'Q second'. Skipping.
Error: CDK cannot create the InChI string (input title='third'), file 'test.smi', line 3, record #3: first line is '* third'. Skipping.
InChI=1S/CH4/h1H4 first
InChI=1S/H2O/h1H2 fourth
The -U option specifies the second toolkit to use, so this uses
the RDKit toolkit for input and the CDK toolkit for output. By default
uses “sdf3k” as the intermediate file format, so the MAE to CML
conversion is simply:
% chemfp translate -T rdkit -U cdk NCI_aids_few.mae -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
<atomArray>
<atom id="a1" elementType="C" x3="1.3971" y3="-2.6112" z3="-1.314"
formalCharge="0" hydrogenCount="2"/>
<atom id="a2" elementType="C" x3="0.6179" y3="-3.3348" z3="-0.1966"
formalCharge="0" hydrogenCount="3"/>
Use --via option to change the intermediate format; the following
uses the canonical isomeric SMILES, so the coordinates are no longer
present:
% chemfp translate -T rdkit -U cdk --via smi NCI_aids_few.mae -o out.cml
% head -5 out.cml | fold -s
<?xml version="1.0" encoding="ISO-8859-1"?>
<molecule id="m1" title="48" xmlns="http://www.xml-cml.org/schema">
<atomArray>
<atom id="a1" elementType="C" formalCharge="0" hydrogenCount="3"/>
<atom id="a2" elementType="C" formalCharge="0" hydrogenCount="2"/>
Generating LINGO byte fingerprints
In this section you’ll learn how to generate superimposed LINGO byte fingerprints using chemfp lingo2fps. For more about generating LINGO count fingerprints see Generating LINGO count fingerprints.
Note
A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.
The LINGO fingerprints are based on the work of Vidal, Thormann, and Pons, “LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities”, J. Chem. Inf. Model. (2005) 45, 386-494, doi: 10.1021/ci0496797.
The expected input is a SMILES string, in canonical form. This is used to generate nmer substrings called a k-LINGO, where k is the size. If the input is “CC=O” and k=1 then there are 2 “C”, 1 “=”, and 1 “O” 1-LINGO terms. With k=2 the 2-LINGO terms are 1 “CC”, 1 “C=”, and 1 “=O”.
Chemfp represents the LINGO nmers using 64-bit values using a direct encoding where a k=1 nmer like “C” sets the numeric value 67, because 67 is the ASCII and Unicode position of “C”, and a k=2 nmer like “C=” sets the two-byte value 67*256 + 61 = 17213 where the 67 is again from the “C” and the 61 is the ASCII position of “=”. If you can read hex, this is 0x433d because 0x43 is 67 in hex and 0x3d is 61 in hex. Or, using Python:
>>> int.from_bytes(b"C=")
17213
>>> hex(int.from_bytes(b"C="))
'0x433d'
>>> (17213).to_bytes(2)
b'C='
The Vidal paper used the integral count Tanimoto to compare two LINGO holograms. Grant, Haigh, Pickup, Nicholls, and Sayle, “Lingos, Finite State Machines, and Fast Similarity Searching “, J. Chem. Inf. Model. (2005) 46, pp 1912-1918, doi:10.1021/ci6002152 suggested using the multiset/count Tanimoto. Chemfp uses superimposed coding to convert count fingerprints to binary fingerprint such that the Tanimoto of the binary fingerprints is a good approximation to the count Tanimoto of the original count fingerprints.
The chemfp lingo2fps command-line tool reads a SMILES file as input, identifies the LINGOs (the encoded nmer and its total count), encodes them as superimposed byte fingerprints, and writes the output in FPS or FPB format. It does not require a chemistry toolkit, though one may be needed to create the SMILES input in properly canonicalized form.
The following example processes the SMILES “CC=O” to generate the 1-LINGO terms “=” (occurs 1 time), “C” (occurs 2 times), and “O” (occurs 1 time), which in FPC format is “61,67:2,79”. This information is then superimposed count encoded into a 256-bit/32-byte fingerprint in FPS format.
% echo "CC=O example" | chemfp lingo2fps --num-bits 256 -n 1 --no-date
#FPS1
#num_bits=256
#type=LingoSuperimposed/1 num_bits=256 nmer_size=1 normalize=Closures max_count=1000
#software=chemfp/5.1
0000000000000000001001000000080000000000000000020000000000000000 example
The non-zero fingerprint terms are “1”, “1”, “8”, and “2”, which each correspond to a single bit set for that 4-bit nibble, showing that 4 distinct bits in total – one for each feature and distinct count - have been set.
Due to the random nature of superimposed coding, it is possible for two features to collide. With only 128 bits, two of the bits are mapped to the same output position, resulting in only three bits being set.
% echo "CC=O example" | chemfp lingo2fps --num-bits 128 -n 1 --no-date
#FPS1
#num_bits=128
#type=LingoSuperimposed/1 num_bits=128 nmer_size=1 normalize=Closures max_count=1000
#software=chemfp/5.1
00000000020000000010000000000800 example
By default lingo2fps generates LINGOs using 4-mer substrings (that is, 4-LINGOs), superimposed onto 2048-bit:
% echo "Oc1ccccc1 phenol" | chemfp lingo2fps --no-date | fold -85
#FPS1
#num_bits=2048
#type=LingoSuperimposed/1 num_bits=2048 nmer_size=4 normalize=Closures max_count=1000
#software=chemfp/5.1
0000000000000000000000000000000000000000000000000000000000000000000000000040000000000
0000000010004000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000004000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000010000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000020000000000000000000000000000000000
00 phenol
How many bits are enough enough? The bit density (the percentage of bits which are 1-bits) should be low, in the 5% range or less.
If the SMILES is N bytes long then the total number of features for a k-LINGO is N-k+1. For most datasets, the average SMILES length is about 50-60. For ChEMBL about 6% of structures have N>100, and fewer than 1% of structures have N>250, so 2048 bits seems to be a good default.
Careful observers will correctly notice the LINGO nmer to integer encoding limits chemfp’s LINGO implementation to 8-mers. Grant et al. showed the mean retrieval rate is highest with 4-mers, with no evidence that 6-mers or higher are more appropriate than 4-mers, and likely worse.
LINGO Normalization
The input SMILES strings are not necessarily mapped directly to substring n-mers. Vidal et al. proposed two normalizations: 1) convert all ring closure terms to “0”, and 2) convert “Br” to “R” and “Cl” to “L”. For example, the SMILES “c1ccccc1Br” is normalized to “c0ccccc0R”.
Chemfp supports both normalizations, referred to as “Closures” and “BrCl”, respectively. Grant et al.’s followup work normalized closures but not BrCl. Chemfp follows their lead.
In chemfp, normalization terms can be combined with a “,” or “|” character. The normalization “Closures|Br” or “Closures,Br” gives the Vidal et al. normalization. The default normalization of “Closures” is also available as “Default”. Use “0” for no normalization.
Closure normalization includes the “%nn” syntax for closure values in the range 10-99, as well as the “%(n)” SMILES extension which supports closure values above 99. For example,s “c%12%13” is normalized to “c00” and “C%(103)” is normalized to “C0”.
Superimposed LINGO similarity search
In this section you’ll learn to generate superimposed LINGO byte fingerprints from ChEMBL 36 and search them with simsearch. You’ll also learn the importance of canonicalization. You will need a copy of chembl_36.sdf.gz from the ChEMBL 36 release.
The LINGO tools expect canonical SMILES as input, not SDF or other structure format. One way to generate SMILES from the ChEMBL SD file is to use chemfp translate, as in the following:
% chemfp translate chembl_36.sdf.gz -o chembl_36.smi.gz
Use the -T/--toolkit option to specify the toolkit to use. By
default it tries the toolkits “rdkit”, “openeye”, “openbabel” and
“cdk”, in that order, and uses the first one found. On my machine, the
above translation used RDKit.
Next, generate the LINGO fingerprints from the SMILES file:
% chemfp lingo2fps chembl_36.smi.gz -o chembl_36_lingo.fpb
chembl_36.smi.gz: 100%|████████████████████| 52.9M/52.9M [00:15<00:00, 3.49Mbytes/s]
While the output of translate can be piped directly to lingo2fps instead of storing them in a SMILES file, I’ll be using that SMILES file soon for structure lookup. I will use a Unix pipeline to process a query SMILES through the same LINGO fingerprint generation steps (SMILES canonicalization and conversion to superimposed LINGO byte fingerprints), then use simsearch to read the queries from stdin and search the targets in the specified FPB file:
% echo "c1ccccc1O phenol" | \
chemfp translate | \
chemfp lingo2fps | \
simsearch chembl_36_lingo.fpb -k 8 --out csv
query_id,target_id,score
phenol,CHEMBL14060,1.0000000
phenol,CHEMBL225564,0.8571429
phenol,CHEMBL278024,0.8571429
phenol,CHEMBL1213940,0.8571429
phenol,CHEMBL1233444,0.8571429
phenol,CHEMBL224144,0.8571429
phenol,CHEMBL280998,0.8571429
phenol,CHEMBL277500,0.8333333
I’ll take a look at the first three and last of these matches to show the results seem meaningful:
% zgrep 'CHEMBL14060$' chembl_36.smi.gz
Oc1ccccc1 CHEMBL14060
% zgrep 'CHEMBL225564$' chembl_36.smi.gz
Oc1ccccc1I CHEMBL225564
% zgrep 'CHEMBL278024$' chembl_36.smi.gz
COc1ccccc1 CHEMBL278024
% zgrep 'CHEMBL277500$' chembl_36.smi.gz
c1ccccc1 CHEMBL277500
As expected, phenol matches itself (CHEMBL14060), and the other records are indeed similar.
Canonicalization
The LINGO SMILES input must be consistently canonicalized. In the above pipeline, the translate step converted “c1ccccc1O” to “Oc1ccccc1”, which is the RDKit canonical form. If that step is omitted then the similarity search no longer finds CHEMBL14060 in the top 8 hits.
% echo "c1ccccc1O phenol" | \
chemfp lingo2fps | \
simsearch chembl_36_lingo.fpb -k 8 --out csv
query_id,target_id,score
phenol,CHEMBL28319,0.8571429
phenol,CHEMBL46931,0.8571429
phenol,CHEMBL280998,0.8571429
phenol,CHEMBL277500,0.8333333
phenol,CHEMBL155572,0.7500000
phenol,CHEMBL225343,0.7500000
phenol,CHEMBL280802,0.7500000
phenol,CHEMBL281223,0.7500000
The top three hits all end with “c1ccccc1O”:
% zgrep 'CHEMBL28319$' chembl_36.smi.gz
Nc1ccccc1O CHEMBL28319
% zgrep 'CHEMBL46931$' chembl_36.smi.gz
Cc1ccccc1O CHEMBL46931
% zgrep 'CHEMBL280998$' chembl_36.smi.gz
Oc1ccccc1O CHEMBL280998
lingo2fps --using
The pipeline in the earlier example requires knowing the appropriate lingo2fps parameters to use. What happens if there is a mismatch? The following creates a LINGO dataset with both “Closures” and “BrCl” normalization, then searches for “Clc1ccccc1”.
% chemfp lingo2fps --normalize Closures,BrCl chembl_36.smi.gz -o chembl_36_lingo_both.fpb
chembl_36.smi.gz: 100%|███████████████████████| 52.9M/52.9M [00:15<00:00, 3.46Mbytes/s]
% echo "Clc1ccccc1 ABC" | chemfp lingo2fps | simsearch chembl_36_lingo_both.fpb -k 5 --out csv
WARNING: queries has fingerprints of type 'LingoSuperimposed/1 num_bits=2048 nmer_size=4
normalize=Closures max_count=1000' but targets has fingerprints of type 'LingoSuperimposed/1
num_bits=2048 nmer_size=4 normalize=Closures|BrCl max_count=1000'
query_id,target_id,score
ABC,CHEMBL277500,0.7142857
ABC,CHEMBL499585,0.6666667
ABC,CHEMBL16068,0.6250000
ABC,CHEMBL16070,0.6250000
ABC,CHEMBL116296,0.6250000
%
% zgrep 'CHEMBL277500$' chembl_36.smi.gz
c1ccccc1 CHEMBL277500
The “WARNING” is a simsearch message that the query fingerprint type does not exactly match the target fingerprint type. The top match is CHEMBL277500, which is benzene.
I can normalize the SMILES manually, as “Lc1ccccc1”, which results in the same warning but at least has chlorobenzene as a perfect match:
% echo "Lc1ccccc1 ABC" | chemfp lingo2fps | simsearch chembl_36_lingo_both.fpb -k 5 --out csv
WARNING: queries has fingerprints of type 'LingoSuperimposed/1 num_bits=2048 nmer_size=4
normalize=Closures max_count=1000' but targets has fingerprints of type 'LingoSuperimposed/1
num_bits=2048 nmer_size=4 normalize=Closures|BrCl max_count=1000'
query_id,target_id,score
ABC,CHEMBL16200,1.0000000
ABC,CHEMBL298461,0.8571429
ABC,CHEMBL277500,0.8333333
ABC,CHEMBL16068,0.7142857
ABC,CHEMBL16070,0.7142857
%
% zgrep 'CHEMBL16200$' chembl_36.smi.gz
Clc1ccccc1 CHEMBL16200
% zgrep 'CHEMBL298461$' chembl_36.smi.gz
Clc1ccccc1Cl CHEMBL298461
The lingo2fps option --using will have it read the metadata in the
FPS or FPB file and parse the “type” field to get the parameters to
use:
% echo "c1ccccc1Cl ABC" | \
chemfp translate | \
chemfp lingo2fps --using chembl_36_lingo_both.fpb | \
simsearch chembl_36_lingo_both.fpb -k 5 --out csv
query_id,target_id,score
ABC,CHEMBL16200,1.0000000
ABC,CHEMBL298461,0.8571429
ABC,CHEMBL277500,0.8333333
ABC,CHEMBL16068,0.7142857
ABC,CHEMBL16070,0.7142857
Superimposed EState byte fingerprints
In this section you’ll learn to generate superimposed EState byte fingerprints using RDKit and OpenEye’s oechem. To generate EState count fingerprints with rdkit2fpc see EState count fingerprints.
The EState count fingerprints section describes the EState count fingerprints and how to generate them with chemfp rdkit2fpc. If the count fingerprints converted to byte fingerprint using superimposed coding then a Tanimoto nearest-neighbor search of the binary fingerprint with simsearch is a good approximation to the nearest-neighbort count Tanimoto search of the original fingerprints.
Note
A valid chemfp license key is required to generate more than 50,000 superimposed fingerprints in a single process.
This is conversion step is built into rdkit2fps and
oe2fps, which both support the --estate flag to
generate superimposed EState fingerprints using RDKit and OpenEye’s
OEChem, respectively, as in the following example with theobromine:
% echo "Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine" | rdkit2fps --estate --no-date | fold -70
#FPS1
#num_bits=2048
#type=EStateSuperimposed-RDKit/1 fpSize=2048
#software=RDKit/2025.03.3 chemfp/5.1
0000000400000000000000000000000000000008000000000000000000000000000000
0000000000000000000020000000000000000001000000000000000000000000000000
0000000000000000000000000000000000000004000000000000000000000000000000
0000000000000020000000000000000000000000000000000000000000000000000000
0000000000000800000000000000000000000000020000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0008000000000000000000000000000000000000002000000000000000000000000000
0000000000008000000000 theobromine
%
% echo "Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine" | oe2fps --estate --no-date | fold -70
#FPS1
#num_bits=2048
#type=EStateSuperimposed-OpenEye/1 numbits=2048
#software=OEChem/4.3.0.1 (20251120) chemfp/5.1
0000000400000000000000000000000000000008000000000000000000000000000000
0000000000000000000020000000000000000001000000000000000000000000000000
0000000000000000000000000000000000000004000000000000000000000000000000
0000000000000020000000000000000000000000000000000000000000000000000000
0000000000000800000000000000000000000000020000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0008000000000000000000000000000000000000002000000000000000000000000000
0000000000008000000000 theobromine