chemfp lingo2fpc

The chemfp lingo2fpc command creates count fingerprints from SMILES LINGOs. A k-LINGO is a substring of length k from the input SMILES, after possible SMILES normalization. Each LINGO, and the number of times it was found, is used as a count fingerprint feature.
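
As a rough illustration of that definition, the sketch below counts the k-LINGOs of a SMILES string with a sliding window. It is not chemfp code: the function name `lingo_counts` is hypothetical, and it assumes the input string has already been normalized.

```python
from collections import Counter


def lingo_counts(smiles: str, k: int = 4) -> Counter:
    """Count every length-k substring (k-LINGO) of an already-normalized SMILES."""
    # A string of length n has n - k + 1 overlapping substrings of length k;
    # a string shorter than k has none.
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


# Theobromine, with the ring-closure digits already rewritten to '0'.
counts = lingo_counts("Cn0cnc0c0c(=O)[nH]c(=O)n0C", k=4)
print(counts["(=O)"], counts["c(=O"], counts["Cn0c"])  # 2 2 1
print(len(counts), sum(counts.values()))               # 21 23
```

The 26-character string yields 23 overlapping 4-mers, of which 21 are distinct because "(=O)" and "c(=O" each occur twice.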

Note

A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.

The following generates count fingerprints from a SMILES file read from stdin:

% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc | fold -70
#FPC1
#num_bits=4294967296
#type=LingoCount/1 nmer_size=4 normalize=Closures
#software=chemfp/5.1
#date=2026-03-30T14:31:15+00:00
675106601:2,693857864,695087171,811804733,811806819,811822691,10285981
07,1028598126,1131294819,1214079784,1328110446,1328115248,1533954141,1
566779453,1663581519:2,1664115496,1664115504,1668178736,1848664942,185
0236259,1851994211    theobromine
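
The feature numbers in this output can be related back to LINGO strings. The help text further down states that each LINGO is encoded with 8 bits per character and gives "CC=O" as 1128480079 (hex 43433d4f). The sketch below assumes that this big-endian byte packing holds for every LINGO; `encode_lingo` and `decode_feature` are hypothetical helper names, not part of chemfp.

```python
def encode_lingo(lingo: str) -> int:
    """Pack the UTF-8 bytes of a LINGO into one integer, first character highest."""
    return int.from_bytes(lingo.encode("utf-8"), "big")


def decode_feature(feature: int, k: int) -> str:
    """Reverse encode_lingo for a LINGO of k single-byte (ASCII) characters."""
    return feature.to_bytes(k, "big").decode("utf-8")


print(encode_lingo("CC=O"))           # 1128480079
print(decode_feature(675106601, 4))   # (=O)
print(decode_feature(1131294819, 4))  # Cn0c
```

Under this reading, `675106601:2` in the output says that the LINGO "(=O)" occurs twice, and "Cn0c" shows the ring closure "1" rewritten to "0" by the default Closures normalization.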

The following reads a SMILES file, applies the closure normalization rule plus the “Br”->“R” and “Cl”->“L” normalization rule, generates 3-mers, and saves the result to the FPC file “lingo.fpc”.

% chemfp lingo2fpc chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpc
chembl_36.smi.gz: 100%|█████████████████████| 52.9M/52.9M [00:14<00:00, 3.72Mbytes/s]
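
To make the two normalization rules in that command concrete, here is a minimal sketch of what they do to a SMILES string. It is an approximation written for illustration, not chemfp's implementation: the name `normalize_smiles` is hypothetical, and the sketch assumes that only digits outside square brackets are ring closures and makes no special case for `%nn` closures.

```python
def normalize_smiles(smiles: str, closures: bool = True, brcl: bool = False) -> str:
    """Apply the "Closures" and/or "BrCl" rewrites to a SMILES string."""
    if brcl:
        # Single-letter stand-ins so each halogen occupies one character.
        smiles = smiles.replace("Br", "R").replace("Cl", "L")
    if not closures:
        return smiles
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        # Assumption: digits inside [...] are isotopes, H counts, or charges,
        # so only digits outside brackets are treated as ring closures.
        out.append("0" if ch.isdigit() and not in_bracket else ch)
    return "".join(out)


print(normalize_smiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C"))  # Cn0cnc0c0c(=O)[nH]c(=O)n0C
print(normalize_smiles("Clc1ccc(Br)cc1", brcl=True))    # Lc0ccc(R)cc0
```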

The following uses the LINGO parameters stored in the metadata of “lingo.fpc” to process the SMILES input:

% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc --using lingo.fpc | fold -70
#FPC1
#num_bits=16777216
#type=LingoCount/1 nmer_size=3 normalize=Closures|BrCl
#software=chemfp/5.1
#date=2026-03-30T14:55:30+00:00
2637135:2,2710382,2715184,3171112,3171120,3171182,4017961:2,4419120,47
42499,5187931,5187950,5992008,6120232,6498365:2,6500451:2,6516323,7221
315,7221347,7227485,7234352   theobromine
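
The `--using` option relies on the `#type=` header line carrying the LINGO parameters. As a sketch of how a downstream script might read the same information, the following parses the header lines shown above. `parse_fpc_header` is a hypothetical helper, and the key=value layout is inferred only from the two example outputs on this page.

```python
def parse_fpc_header(lines):
    """Collect '#key=value' header lines and split the type string into parameters."""
    metadata = {}
    for line in lines:
        if not line.startswith("#") or "=" not in line:
            continue  # skips the '#FPC1' signature line and any record lines
        key, _, value = line[1:].rstrip("\n").partition("=")
        metadata[key] = value
    # The type string is a name followed by space-separated key=value parameters.
    name, *params = metadata["type"].split()
    type_params = dict(param.split("=", 1) for param in params)
    return metadata, name, type_params


header = [
    "#FPC1",
    "#num_bits=16777216",
    "#type=LingoCount/1 nmer_size=3 normalize=Closures|BrCl",
    "#software=chemfp/5.1",
]
metadata, name, type_params = parse_fpc_header(header)
print(name, type_params)
# LingoCount/1 {'nmer_size': '3', 'normalize': 'Closures|BrCl'}

# In both examples on this page, num_bits equals 256 ** nmer_size.
print(int(metadata["num_bits"]) == 256 ** int(type_params["nmer_size"]))  # True
```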

For additional examples see Generating LINGO count fingerprints.

chemfp lingo2fpc command-line options

The following comes from chemfp lingo2fpc --help:

Usage: chemfp lingo2fpc [OPTIONS] [FILENAMES]...

  Generate LINGO SMILES holograms as count fingerprints

Options:
  -n, --nmer-size [1|2|3|4|5|6|7|8]
                                  Size of the nmers to extract (default: 4)
  --normalize STR                 A comma or '|'-separated list of
                                  normalizations.
  --type TYPE_STR                 Specify the LingoCount type string to use
  --using FILENAME                Use the fingerprint type string from the
                                  named file
  --in FORMAT                     Input structure format (default guesses from
                                  filename)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output structure format (default guesses
                                  from output filename, or is 'fpc')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPC output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help                          Show this message and exit.

SMILES parsing:
  --has-header                    Skip the first line of the SMILES file.
  --delimiter VALUE               Delimiter style for the SMILES file.
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the heuristic
                                  which identifies and ignores any CXSMILES
                                  extension.
  --stop-at-whitespace / --no-stop-at-whitespace
                                  After the SMILES is extracted using
                                  --delimiter and --{no-}cxsmiles, use --no-
                                  stop-at-whitespace to process the entire
                                  SMILES instead of stopping at the first
                                  whitespace character.

  The `chemfp lingo2fpc` command generates count fingerprints corresponding to
  the LINGO hologram of the input SMILES.

  The program reads one or more SMILES files to get the SMILES and id of each
  record. For each SMILES, it generates nmers of the specified size, counts up
  the number of each nmer, then uses the (nmer, count) information to
  generate a count fingerprint in FPC format. Use `-n` or `--nmer-size` to set
  the nmer size to a value from 1 to 8 inclusive (the default is 4).

  Each nmer is called a LINGO, as described in Vidal et al. (2005), JCIM
  (https://doi.org/10.1021/ci0496797). The set of LINGO counts is called a
  hologram. The similarity between two molecules can be approximated by
  comparing the similarity between two holograms.

  The Vidal et al. paper compared holograms with the integral Tanimoto. Grant
  et al. (2006), JCIM (https://doi.org/10.1021/ci6002152) proposed using the
  count/multiset Tanimoto, and demonstrated a fast search implementation based
  on finite state machines.

  Chemfp does not implement a direct similarity search for count fingerprints.
  You should consider using `chemfp lingo2fps` to generate the LINGO hologram
  as a superimposed byte fingerprint, then use `chemfp simsearch` as a fast
  approximate count Tanimoto search.

  Alternatively, you can use the FPC output as input to your own tools.

  Each LINGO is encoded to a 64-bit value, with 8 bits per token, after UTF-8
  conversion and normalization. The LINGO "CC=O" corresponds to a count
  fingerprint with feature 1128480079, which is the decimal version of the hex
  value "43433d4f" where "43" is "C", "3d" is "=" and "4f" is "O".

  = Normalization =

  The original Vidal et al. paper normalized the SMILES strings by converting
  all closures to '0', 'Br' to 'R', and 'Cl' to 'L'.

  The Grant et al. paper normalized the closures but not Br or Cl. This is the
  default for chemfp's lingo implementation.

  Use the `--normalize` option to specify which normalization to use. The two
  available normalizations are "Closures" and "BrCl". These can be combined
  with ',' or '|' to use both terms.

  For example, the Vidal normalization is "Closures|BrCl" and the Grant
  normalization is "Closures".

  The normalization term "Default" is the same as "Closures", and "0" means no
  normalization.

  = SMILES and CXSMILES =

  The lingo2fpc command only supports SMILES files. Use `--has-header` to
  ignore the first line.

  If the record includes a CXSMILES extension and the `--delimiter` is
  "to-eol" (the default) or "space" then chemfp uses heuristics to determine if
  the second term is a CXSMILES extension or an id. (With the "tab" delimiter
  there is no ambiguity.)

  By default the LINGO generation processes the SMILES field up until the
  first whitespace (which is either a space or tab). Use `--no-stop-at-
  whitespace` to process the entire SMILES field. When combined with a tab
  delimiter, this can be used to handle non-SMILES input. Remember to use
  `--normalize 0` in that case!

  = Canonicalization and non-SMILES input =

  The input is assumed to be properly canonicalized. If it is not, you might
  use `chemfp translate` to convert other formats into an appropriately
  canonicalized SMILES file.

  = Examples =

  To process a SMILES file from stdin:

    % echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc

  To specify the input SMILES file and output FPC file, normalization, and
  nmer size:

    % chemfp lingo2fpc chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpc

  To use the LINGO parameters from an FPC file:

    % echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fpc --using lingo.fpc
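
Since the help text notes that chemfp does not search count fingerprints directly, one way to use the FPC output in your own tools is to parse the feature lists and compare them yourself. The sketch below assumes each record is a single line holding a comma-separated list of `feature` or `feature:count` terms, then whitespace, then the id, as in the unfolded form of the example outputs above. It computes the count (multiset) Tanimoto as the sum of per-feature minimum counts divided by the sum of maximum counts. The helper names are hypothetical.

```python
def parse_fpc_record(line: str):
    """Split one FPC record line into ({feature: count}, record id)."""
    # Assumption: the feature list contains no whitespace and comes first.
    field, record_id = line.split(None, 1)
    counts = {}
    for term in field.split(","):
        feature, _, count = term.partition(":")
        counts[int(feature)] = int(count) if count else 1  # bare feature means count 1
    return counts, record_id.strip()


def count_tanimoto(a: dict, b: dict) -> float:
    """Multiset Tanimoto: sum of minimum counts over sum of maximum counts."""
    keys = a.keys() | b.keys()
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


# Two toy records (not real molecules), laid out like the examples above.
fp1, id1 = parse_fpc_record("2637135:2,2710382,2715184\tfirst")
fp2, id2 = parse_fpc_record("2637135,2715184,3171112\tsecond")
print(id1, id2, count_tanimoto(fp1, fp2))  # first second 0.4
```

The minimum counts sum to 2 and the maximum counts to 5, giving 0.4. A linear scan like this is only practical for small files; for large data sets the `chemfp lingo2fps` and `chemfp simsearch` route described in the help text is the suggested alternative.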