chemfp lingo2fpc
The chemfp lingo2fpc command creates count fingerprints from
SMILES LINGOs. A k-LINGO is a substring of length k from the input
SMILES, after possible SMILES normalization. Each LINGO, and the
number of times it was found, is used as a count fingerprint feature.
Note
A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.
The following generates count fingerprints from a SMILES file read from stdin:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc | fold -70
#FPC1
#num_bits=4294967296
#type=LingoCount/1 nmer_size=4 normalize=Closures
#software=chemfp/5.1
#date=2026-03-30T14:31:15+00:00
675106601:2,693857864,695087171,811804733,811806819,811822691,10285981
07,1028598126,1131294819,1214079784,1328110446,1328115248,1533954141,1
566779453,1663581519:2,1664115496,1664115504,1668178736,1848664942,185
0236259,1851994211 theobromine
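The feature ids in this record can be reproduced with a few lines of Python. The sketch below is not chemfp's implementation: `normalize_closures` and `lingo_counts` are hypothetical helper names, and the closure rule used here (replace every digit outside square brackets with '0') is an assumption that happens to match this example. Cases such as '%NN' closures are not handled. Each n-mer is packed big-endian with 8 bits per character.

```python
from collections import Counter


def normalize_closures(smiles):
    """Replace ring-closure digits with '0' (assumed rule: digits outside [...] only)."""
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        if ch in "0123456789" and not in_bracket:
            ch = "0"
        out.append(ch)
    return "".join(out)


def lingo_counts(smiles, nmer_size=4):
    """Count each n-mer, keyed by its big-endian integer encoding (8 bits per character)."""
    text = normalize_closures(smiles).encode("utf-8")
    counts = Counter()
    for i in range(len(text) - nmer_size + 1):
        counts[int.from_bytes(text[i:i + nmer_size], "big")] += 1
    return counts


counts = lingo_counts("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
# Format the counts the way the sample record shows them:
# ids in ascending order, with ":count" only when the count is above 1.
record = ",".join(
    f"{feature}:{n}" if n > 1 else str(feature)
    for feature, n in sorted(counts.items())
)
print(record)
```

The 26-character normalized SMILES yields 23 overlapping 4-mers, two of which occur twice, which is why the record lists 21 features.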
The following reads a SMILES file, applies the closure normalization rule plus the “Br”->“R” and “Cl”->“L” normalization rules, generates 3-mers, and saves the result to the FPC file “lingo.fpc”.
% chemfp lingo2fpc chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpc
chembl_36.smi.gz: 100%|█████████████████████| 52.9M/52.9M [00:14<00:00, 3.72Mbytes/s]
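The effect of the “BrCl” rule on the n-mers can be illustrated with a small sketch. It assumes the rule is a plain text replacement, which the description suggests but does not spell out; `normalize_brcl` and `three_mers` are hypothetical helpers, not chemfp functions.

```python
def normalize_brcl(smiles):
    """Collapse the two-letter halogens to one character (assumed plain text replacement)."""
    return smiles.replace("Br", "R").replace("Cl", "L")


def three_mers(text):
    """All overlapping 3-character substrings, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(three_mers("CCBr"))                  # ['CCB', 'CBr']
print(three_mers(normalize_brcl("CCBr")))  # ['CCR']
```

With the rule applied, the halogen occupies a single position in each n-mer instead of being split across two.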
The following uses the LINGO parameters stored in the metadata of “lingo.fpc” to process the SMILES input:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc --using lingo.fpc | fold -70
#FPC1
#num_bits=16777216
#type=LingoCount/1 nmer_size=3 normalize=Closures|BrCl
#software=chemfp/5.1
#date=2026-03-30T14:55:30+00:00
2637135:2,2710382,2715184,3171112,3171120,3171182,4017961:2,4419120,47
42499,5187931,5187950,5992008,6120232,6498365:2,6500451:2,6516323,7221
315,7221347,7227485,7234352 theobromine
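To see which substrings the feature ids stand for, the integers can be unpacked back into text. The sketch below parses one record in the layout inferred from the sample output (comma-separated `feature` or `feature:count` terms, whitespace, then the id) and decodes each feature as a big-endian byte string of length `nmer_size`. The helper names are hypothetical, and the record layout is an assumption drawn from the examples here rather than from the FPC format specification.

```python
def parse_fpc_record(line):
    """Split one record into ({feature: count}, id). Layout assumed from the sample output."""
    terms, record_id = line.split(None, 1)
    counts = {}
    for term in terms.split(","):
        feature, _, count = term.partition(":")
        counts[int(feature)] = int(count) if count else 1
    return counts, record_id.strip()


def decode_lingo(feature, nmer_size):
    """Unpack a feature id into its n-mer text (8 bits per character, big-endian)."""
    return feature.to_bytes(nmer_size, "big").decode("utf-8")


# A shortened copy of the 3-mer record above (first four features only).
line = "2637135:2,2710382,2715184,3171112 theobromine"
counts, record_id = parse_fpc_record(line)
for feature, count in counts.items():
    print(decode_lingo(feature, 3), count)
```

Header lines starting with `#` are metadata and would need to be skipped before calling the parser.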
For additional examples see Generating LINGO count fingerprints.
chemfp lingo2fpc command-line options
The following comes from chemfp lingo2fpc --help:
Usage: chemfp lingo2fpc [OPTIONS] [FILENAMES]...
Generate LINGO SMILES holograms as count fingerprints
Options:
-n, --nmer-size [1|2|3|4|5|6|7|8]
Size of the nmers to extract (default: 4)
--normalize STR A comma or '|'-separated list of
normalizations.
--type TYPE_STR Specify the LingoCount type string to use
--using FILENAME Use the fingerprint type string from the
named file
--in FORMAT Input structure format (default guesses from
filename)
-o, --output FILENAME Save the fingerprints to FILENAME
(default=stdout)
--out FORMAT Output structure format (default guesses
from output filename, or is 'fpc')
--include-metadata / --no-metadata
With --no-metadata, do not include the
header metadata for FPC output.
--no-date Do not include the 'date' metadata in the
output header
--date STR An ISO 8601 date (like
'2025-02-07T11:10:15') to use for the 'date'
metadata in the output header
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--help Show this message and exit.
SMILES parsing:
--has-header Skip the first line of the SMILES file.
--delimiter VALUE Delimiter style for the SMILES file.
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the heuristic
which identifies and ignores any CXSMILES
extension.
--stop-at-whitespace / --no-stop-at-whitespace
After the SMILES is extracted using
--delimiter and --{no-}cxsmiles, use
--no-stop-at-whitespace to process the entire
SMILES instead of stopping at the first
whitespace character.
The `chemfp lingo2fpc` command generates count fingerprints corresponding to
the LINGO hologram of the input SMILES.
The program reads one or more SMILES files to get the SMILES and id of each
record. For each SMILES, it generates nmers of the specified size, counts
the number of occurrences of each nmer, then uses the (nmer, count)
information to generate a count fingerprint in FPC format. Use `-n` or
`--nmer-size` to set the nmer size to a value from 1 to 8 inclusive (the
default is 4).
Each nmer is called a LINGO, as described in Vidal et al. (2005), JCIM
(https://doi.org/10.1021/ci0496797). The set of LINGO counts is called a
hologram. The similarity between two molecules can be approximated by
comparing the similarity between two holograms.
The Vidal et al. paper compared holograms with the integral Tanimoto. Grant
et al. (2006), JCIM (https://doi.org/10.1021/ci6002152) proposed using the
count/multiset Tanimoto, and demonstrated a fast search implementation based
on finite state machines.
Chemfp does not implement a direct similarity search for count fingerprints.
You should consider using `chemfp lingo2fps` to generate the LINGO hologram
as a superimposed byte fingerprint, then use `chemfp simsearch` as a fast
approximate count Tanimoto search.
Alternatively, you can use the FPC output as input to your own tools.
Each LINGO is encoded to a 64-bit value, with 8 bits per token, after UTF-8
conversion and normalization. The LINGO "CC=O" corresponds to a count
fingerprint with feature 1128480079, which is the decimal version of the hex
value "43433d4f" where "43" is "C", "3d" is "=" and "4f" is "O".
= Normalization =
The original Vidal et al. paper normalized the SMILES strings by converting
all closures to '0', 'Br' to 'R', and 'Cl' to 'L'.
The Grant et al. paper normalized the closures but not Br or Cl. This is the
default for chemfp's lingo implementation.
Use the `--normalize` option to specify which normalization to use. The two
available normalizations are "Closures" and "BrCl". These can be combined
with ',' or '|' to use both terms.
For example, the Vidal normalization is "Closures|BrCl" and the Grant
normalization is "Closures".
The normalization term "Default" is the same as "Closures", and "0" means no
normalization.
= SMILES and CXSMILES =
The lingo2fpc command only supports SMILES files. Use `--has-header` to
ignore the first line.
If the record includes a CXSMILES extension and the `--delimiter` is
"to-eol" (the default) or "space" then chemfp uses heuristics to determine
if the second term is a CXSMILES extension or an id. (With the "tab"
delimiter there is no ambiguity.)
By default the LINGO generation processes the SMILES field up until the
first whitespace (which is either a space or tab). Use
`--no-stop-at-whitespace` to process the entire SMILES field. When combined
with a tab delimiter, this can be used to handle non-SMILES input. Remember
to use `--normalize 0` in that case!
= Canonicalization and non-SMILES input =
The input is assumed to be properly canonicalized. If it is not, you might
use `chemfp translate` to convert other formats into an appropriately
canonicalized SMILES file.
= Examples =
To process a SMILES file from stdin:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc
To specify the input SMILES file and output FPC file, normalization, and
nmer size:
% chemfp lingo2fpc chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpc
To use the LINGO parameters from an FPC file:
% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc --using lingo.fpc
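Since chemfp does not search count fingerprints directly, as the help text above notes, one option for small data sets is to compare FPC records with your own code. The following is a minimal sketch of the count (multiset) Tanimoto, assuming the common definition of summed minimum counts over summed maximum counts. `count_tanimoto` is a hypothetical helper, not a chemfp function, and this is a direct comparison rather than the finite-state-machine search of Grant et al.

```python
def count_tanimoto(a, b):
    """Multiset Tanimoto of two {feature: count} dicts: sum(min) / sum(max)."""
    features = set(a) | set(b)
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(max(a.get(f, 0), b.get(f, 0)) for f in features)
    # Two empty holograms have no features to compare; report 0.0 by convention here.
    return shared / total if total else 0.0


# Toy holograms built from 3-mer feature ids in the theobromine record.
query = {2637135: 2, 4017961: 2, 2710382: 1}
target = {2637135: 1, 4017961: 2, 2715184: 1}
print(count_tanimoto(query, target))
```

The dictionaries have the same shape as the (feature, count) pairs in an FPC record, so parsed records can be passed in directly.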