chemfp lingo2fps
The chemfp lingo2fps command creates superimposed byte
fingerprints from SMILES LINGO fingerprints. A k-LINGO is a substring
of length k from the input SMILES, after possible SMILES
normalization. Each LINGO, and the number of times it was found, is
used as a count fingerprint feature. The count fingerprint is
converted to byte fingerprints using superimposed coding. The binary
Tanimoto of these superimposed fingerprints is a good approximation
to the multiset/count Tanimoto of the original fingerprints for
nearest-neighbor searches.
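The steps in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not chemfp's implementation: the function names are hypothetical, SMILES normalization is omitted, and the hashing scheme (one hashed bit per occurrence of each LINGO, with `max_count` treated as a cap) is only one simple way to superimpose counts.

```python
from collections import Counter
import hashlib


def lingo_counts(smiles, k=4):
    """Count every length-k substring (k-LINGO) of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def count_tanimoto(a, b):
    """Multiset Tanimoto: sum of min counts over sum of max counts."""
    keys = a.keys() | b.keys()
    shared = sum(min(a[key], b[key]) for key in keys)
    total = sum(max(a[key], b[key]) for key in keys)
    return shared / total if total else 0.0


def superimpose(counts, num_bits=2048, max_count=1000):
    """Assumed scheme: the i-th occurrence of a LINGO sets one hashed bit."""
    bits = set()
    for lingo, n in counts.items():
        for i in range(min(n, max_count)):
            digest = hashlib.sha256(f"{lingo}\t{i}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return bits


def binary_tanimoto(x, y):
    """Tanimoto of two sets of bit positions."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


theobromine = lingo_counts("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
caffeine = lingo_counts("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
print(count_tanimoto(theobromine, caffeine))
print(binary_tanimoto(superimpose(theobromine), superimpose(caffeine)))
```

In this sketch, when no two (LINGO, occurrence) pairs hash to the same bit, the intersection of the bit sets has one bit per shared occurrence and the union has one bit per occurrence in either molecule, so the binary Tanimoto equals the count Tanimoto. Hash collisions are what make it an approximation.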
Note
A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.
The following generates byte fingerprints from a SMILES file read from stdin:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps | fold -70
#FPS1
#num_bits=2048
#type=LingoSuperimposed/1 num_bits=2048 nmer_size=4 normalize=Closures
max_count=1000
#software=chemfp/5.1
#date=2026-03-30T14:51:10+00:00
0000000000080400040000000000000000000204000020000000000000001000000000
0000000000000000000000000001000000000000000000000000000000000000000000
0000000000080020000000000000000000000000000000400000000000000000000200
0000000000000000020000000001000000000000000000000000000000000000000080
0000000000000000000000008000000040000000010000000000000000000000000000
0000000000000000000000400000000000000000000000000000000800000000000000
0000000000000000000000000000000000000000000000000800000000000000000001
0000000100000000000000 theobromine
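Each data line of FPS output is a hex-encoded fingerprint, a tab, and the record id; the 512 hex digits of the 2048-bit fingerprint above are wrapped across several lines for display. The following is a minimal sketch, with hypothetical helper names, of decoding one unwrapped record and comparing fingerprints in plain Python. chemfp has its own readers and search code, so this only shows what a record contains.

```python
def parse_fps_record(line):
    """Split one unwrapped FPS data line into (id, fingerprint bytes)."""
    hex_fp, _, record_id = line.rstrip("\n").partition("\t")
    return record_id, bytes.fromhex(hex_fp)


def popcount(fp):
    """Number of bits set in a byte fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tanimoto(fp1, fp2):
    """Binary Tanimoto of two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


# A made-up 16-bit record, used only to exercise the helpers.
record_id, fp = parse_fps_record("0f80\texample\n")
print(record_id, len(fp) * 8, popcount(fp))
```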
The following reads a SMILES file, applies the closure normalization rule plus the “Br”->“R” and “Cl”->“L” normalization rule, generates 3-mers, and saves the result to the FPS file “lingo.fps”.
% chemfp lingo2fps chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fps
chembl_36.smi.gz: 100%|█████████████████████| 52.9M/52.9M [00:18<00:00, 2.94Mbytes/s]
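To make the two normalization rules concrete, here is a rough Python sketch of what they do to a SMILES string before the n-mers are extracted. It is not chemfp's code. The `normalize` function and its tokenizer are hypothetical, and several details are assumptions: closures are replaced by '0', a two-digit `%nn` closure becomes a single '0', and text inside square brackets (isotopes, charges, hydrogen counts) is left untouched.

```python
import re

# Bracket atoms, %nn closures, single-digit closures, Br, Cl, then anything else.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|\d|Br|Cl|.")


def normalize(smiles, closures=True, brcl=False):
    """Apply the closure and/or Br/Cl rewriting rules to a SMILES string."""
    out = []
    for token in _TOKEN.findall(smiles):
        if token[0] == "[":
            out.append(token)      # assumption: bracket atoms are not rewritten
        elif closures and (token.isdigit() or token[0] == "%"):
            out.append("0")        # every ring closure becomes '0'
        elif brcl and token == "Br":
            out.append("R")
        elif brcl and token == "Cl":
            out.append("L")
        else:
            out.append(token)
    return "".join(out)


print(normalize("Cn1cnc2c1c(=O)[nH]c(=O)n2C"))
print(normalize("ClCCBr", brcl=True))
```

Replacing closure digits makes LINGOs from equivalent rings match even when the ring-closure numbers differ, and mapping the two-letter halogens to one letter keeps each atom to a single character in the n-mer.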
The following uses the LINGO parameters stored in the metadata of “lingo.fps” to process the SMILES input:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps --using lingo.fps | fold -70
#FPS1
#num_bits=2048
#type=LingoSuperimposed/1 num_bits=2048 nmer_size=3 normalize=Closures
|BrCl max_count=1000
#software=chemfp/5.1
#date=2026-03-30T14:56:41+00:00
0000000000000000000000000000000000000000000000000000000000000441000000
0000080000000000000000080000000000000010000000000020080000000000000000
0000000000800000000000000000020400000000000000000000000000000000000000
0000000000000000100000010001000000010000000000000000000000000000000000
0000000000000000000000000000000000000000000000000080000000000000008000
0200810000000000000000000800000000000000000000000000000000000000000010
0000000000000000000000000000000000000000000000000000000800000000000002
0000000000000000000000 theobromine
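The `#type` header line shown above carries every parameter needed to reproduce the fingerprints, which is what `--using` relies on. As an illustration only, the line can be split into its name and parameters with a few lines of Python. The `parse_type_line` helper is hypothetical and is not part of chemfp's API.

```python
def parse_type_line(line):
    """Split a '#type=' FPS header line into (type name, parameter dict)."""
    prefix = "#type="
    if not line.startswith(prefix):
        raise ValueError("not a #type header line")
    name, *fields = line[len(prefix):].split()
    params = dict(field.split("=", 1) for field in fields)
    return name, params


# The header from the example above, rejoined onto one line.
header = ("#type=LingoSuperimposed/1 num_bits=2048 nmer_size=3 "
          "normalize=Closures|BrCl max_count=1000")
name, params = parse_type_line(header)
print(name)
print(params["nmer_size"], params["normalize"].split("|"))
```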
For additional examples see Generating LINGO byte fingerprints.
chemfp lingo2fps command-line options
The following comes from chemfp lingo2fps --help:
Usage: chemfp lingo2fps [OPTIONS] [FILENAMES]...
Generate approximate LINGO SMILES holograms as byte fingerprints
Options:
-n, --nmer-size [1|2|3|4|5|6|7|8]
Size of the nmers to extract (default: 4)
--normalize STR A comma or '|'-separated list of
normalizations.
--type TYPE_STR Specify the LingoSuperimposed type string to
use
--using FILENAME Use the fingerprint type string from the
named file
--in FORMAT Input structure format (default guesses from
filename)
-o, --output FILENAME Save the fingerprints to FILENAME
(default=stdout)
--out FORMAT Output structure format (default guesses
from output filename, or is 'fps')
--include-metadata / --no-metadata
With --no-metadata, do not include the
header metadata for FPS output.
--no-date Do not include the 'date' metadata in the
output header
--date STR An ISO 8601 date (like
'2025-02-07T11:10:15') to use for the 'date'
metadata in the output header
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--help Show this message and exit.
Converter options:
--num-bits INT Number of bits in the output fingerprint Default:
2048.
--max-count INT|none Maximum count to use. Default: 1000.
SMILES parsing:
--has-header Skip the first line of the SMILES file.
--delimiter VALUE Delimiter style for the SMILES file.
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the heuristic
which identifies and ignores any CXSMILES
extension.
--stop-at-whitespace / --no-stop-at-whitespace
After the SMILES is extracted using
--delimiter and --{no-}cxsmiles, use --no-
stop-at-whitespace to process the entire
SMILES instead of stopping at the first
whitespace character.
The `chemfp lingo2fps` command generates superimposed byte fingerprints
which can be used with `chemfp simsearch` to give a fast approximate search
for the count Tanimoto similarity of the original LINGO hologram for the
input SMILES records.
The program reads one or more SMILES files to get the SMILES and id of each
record. For each SMILES, it generates nmers of the specified size, counts
the number of occurrences of each nmer, and uses superimposed coding to
convert the (nmer, count) information into a binary fingerprint of a given
size. Use `-n` or `--nmer-size` to set the nmer size to a value from 1 to 8
inclusive (the default is 4) and `--num-bits` to set the number of bits in
the byte fingerprint (the default is 2048).
Each nmer is called a LINGO, as described in Vidal et al. (2005), JCIM
(https://doi.org/10.1021/ci0496797). The set of LINGO counts is called a
hologram. The similarity between two molecules can be approximated by
comparing the similarity between two holograms.
The Vidal et al. paper compared holograms with the integral Tanimoto. Grant
et al. (2006), JCIM (https://doi.org/10.1021/ci6002152) proposed using the
count/multiset Tanimoto, and demonstrated a fast search implementation based
on finite state machines.
Dalke (2026) J. Cheminf. (in press) proposed using superimposed coding to
convert count fingerprints to byte fingerprints, such that a binary Tanimoto
search of the byte fingerprints is an effective approximation for the count
Tanimoto of the original count fingerprint.
= Normalization =
The original Vidal et al. paper normalized the SMILES strings by converting
all closures to '0', 'Br' to 'R', and 'Cl' to 'L'.
The Grant et al. paper normalized the closures but not Br or Cl. This is the
default for chemfp's lingo implementation.
Use the `--normalize` option to specify which normalization to use. The two
available normalizations are "closures" and "BrCl". These can be combined
with ',' or '|' to use both terms.
For example, the Vidal normalization is "closures|BrCl" and the Grant
normalization is "closures".
The normalization term "Default" is the same as "closures", and "0" means no
normalization.
= SMILES and CXSMILES =
The lingo2fps command only supports SMILES files. Use `--has-header` to
ignore the first line.
If the record includes a CXSMILES extension and the `--delimiter` is
"to-eol" (the default) or "space" then chemfp uses heuristics to determine
if the second term is a CXSMILES extension or an id. (With the "tab"
delimiter there is no ambiguity.)
By default the LINGO generation processes the SMILES field up until the
first whitespace (which is either a space or tab). Use `--no-stop-at-
whitespace` to process the entire SMILES field. When combined with a tab
delimiter, this can be used to handle non-SMILES input. Remember to use
`--normalize 0` in that case!
= Canonicalization and non-SMILES input =
The input is assumed to be properly canonicalized. If it is not, you might
use `chemfp translate` to convert other formats into an appropriately
canonicalized SMILES file.
= Examples =
To process a SMILES file from stdin:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps
To specify the input SMILES file and output FPB file, normalization, and
nmer size:
% chemfp lingo2fps chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpb
To use the LINGO parameters from an FPB file:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps --using lingo.fpb