chemfp lingo2fps

The chemfp lingo2fps command creates superimposed byte fingerprints from SMILES LINGO fingerprints. A k-LINGO is a substring of length k from the input SMILES, after optional SMILES normalization. Each LINGO, together with the number of times it occurs, is used as a count fingerprint feature. The count fingerprint is then converted to a byte fingerprint using superimposed coding. For nearest-neighbor searches, the binary Tanimoto of these superimposed fingerprints is a good approximation to the multiset/count Tanimoto of the original count fingerprints.
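As a rough illustration of the k-LINGO counting step, the following Python sketch extracts the overlapping substrings of a SMILES string and tallies them. The function names `normalize_smiles` and `lingo_counts` are hypothetical, and the normalization is a simplified assumption (every digit outside a bracket atom becomes '0', "Br" becomes "R", "Cl" becomes "L"). It is not chemfp's implementation, which may treat cases such as `%nn` closures differently.

```python
from collections import Counter

def normalize_smiles(smiles, closures=True, brcl=False):
    """Simplified normalization; an assumption, not chemfp's actual rules."""
    if brcl:
        # Vidal-style single-character replacements for Br and Cl
        smiles = smiles.replace("Br", "R").replace("Cl", "L")
    if not closures:
        return smiles
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        # Replace ring-closure digits with '0', but leave digits inside
        # bracket atoms (isotopes, charges, hydrogen counts) untouched.
        out.append("0" if (ch.isdigit() and not in_bracket) else ch)
    return "".join(out)

def lingo_counts(smiles, k=4):
    """Count every overlapping substring of length k (the k-LINGOs)."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

# Theobromine with closure normalization and 4-mers (the documented defaults)
counts = lingo_counts(normalize_smiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C"), k=4)
```

The resulting `Counter` maps each LINGO to its count, which is the count fingerprint ("hologram") that the command then converts to a byte fingerprint.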

Note

A valid chemfp license key is required to generate more than 50,000 LINGO fingerprints in a single process.

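The following generates byte fingerprints from a SMILES record read from stdin, using the default parameters, with the output folded to 70 columns:

% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps | fold -70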
#FPS1
#num_bits=2048
#type=LingoSuperimposed/1 num_bits=2048 nmer_size=4 normalize=Closures
 max_count=1000
#software=chemfp/5.1
#date=2026-03-30T14:51:10+00:00
0000000000080400040000000000000000000204000020000000000000001000000000
0000000000000000000000000001000000000000000000000000000000000000000000
0000000000080020000000000000000000000000000000400000000000000000000200
0000000000000000020000000001000000000000000000000000000000000000000080
0000000000000000000000008000000040000000010000000000000000000000000000
0000000000000000000000400000000000000000000000000000000800000000000000
0000000000000000000000000000000000000000000000000800000000000000000001
0000000100000000000000        theobromine

The following reads a SMILES file, applies the closure normalization rule plus the "Br" -> "R" and "Cl" -> "L" normalization rules, generates 3-mers, and saves the result to the FPS file "lingo.fps".

% chemfp lingo2fps chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fps
chembl_36.smi.gz: 100%|█████████████████████| 52.9M/52.9M [00:18<00:00, 2.94Mbytes/s]

The following uses the LINGO parameters stored in the metadata of “lingo.fps” to process the SMILES input:

% echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps --using lingo.fps | fold -70
#FPS1
#num_bits=2048
#type=LingoSuperimposed/1 num_bits=2048 nmer_size=3 normalize=Closures
|BrCl max_count=1000
#software=chemfp/5.1
#date=2026-03-30T14:56:41+00:00
0000000000000000000000000000000000000000000000000000000000000441000000
0000080000000000000000080000000000000010000000000020080000000000000000
0000000000800000000000000000020400000000000000000000000000000000000000
0000000000000000100000010001000000010000000000000000000000000000000000
0000000000000000000000000000000000000000000000000080000000000000008000
0200810000000000000000000800000000000000000000000000000000000000000010
0000000000000000000000000000000000000000000000000000000800000000000002
0000000000000000000000        theobromine
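To make the role of the metadata concrete, here is a minimal Python sketch that splits a `#type=` header line into its type name and parameters. The helper name `parse_type_line` is hypothetical and this is not chemfp's API. It assumes the type string is a name followed by space-separated `key=value` terms, as in the header shown above. Note that the header above is wrapped by `fold -70`; in the FPS file itself the `#type=` entry is a single line.

```python
def parse_type_line(line):
    """Split an FPS '#type=' header line into (type name, parameter dict)."""
    prefix = "#type="
    if not line.startswith(prefix):
        raise ValueError("not a #type= header line")
    # The first term is the type name; the rest are key=value parameters.
    name, *terms = line[len(prefix):].split()
    params = dict(term.split("=", 1) for term in terms)
    return name, params

header = ("#type=LingoSuperimposed/1 num_bits=2048 nmer_size=3 "
          "normalize=Closures|BrCl max_count=1000")
name, params = parse_type_line(header)
```

The parameter values are left as strings; a real reader would also validate them, so prefer `--using` or chemfp's own readers over hand-parsing.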

For additional examples see Generating LINGO byte fingerprints.

chemfp lingo2fps command-line options

The following comes from chemfp lingo2fps --help:

Usage: chemfp lingo2fps [OPTIONS] [FILENAMES]...

  Generate approximate LINGO SMILES holograms as byte fingerprints

Options:
  -n, --nmer-size [1|2|3|4|5|6|7|8]
                                  Size of the nmers to extract (default: 4)
  --normalize STR                 A comma or '|'-separated list of
                                  normalizations.
  --type TYPE_STR                 Specify the LingoSuperimposed type string to
                                  use
  --using FILENAME                Use the fingerprint type string from the
                                  named file
  --in FORMAT                     Input structure format (default guesses from
                                  filename)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output structure format (default guesses
                                  from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help                          Show this message and exit.

Converter options:
  --num-bits INT        Number of bits in the output fingerprint. Default:
                        2048.
  --max-count INT|none  Maximum count to use. Default: 1000.

SMILES parsing:
  --has-header                    Skip the first line of the SMILES file.
  --delimiter VALUE               Delimiter style for the SMILES file.
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the heuristic
                                  which identifies and ignores any CXSMILES
                                  extension.
  --stop-at-whitespace / --no-stop-at-whitespace
                                  After the SMILES is extracted using
                                  --delimiter and --{no-}cxsmiles, use --no-
                                  stop-at-whitespace to process the entire
                                  SMILES instead of stopping at the first
                                  whitespace character.

  The `chemfp lingo2fps` command generates superimposed byte fingerprints
  which can be used with `chemfp simsearch` to give a fast approximate search
  for the count Tanimoto similarity of the original LINGO hologram for the
  input SMILES records.

  The program reads one or more SMILES files to get the SMILES and id of each
  record. For each SMILES, it generates nmers of the specified size, counts
  up the number of each nmer, and uses superimposed coding to convert the
  (nmer, count) information into a binary fingerprint of a given size. Use
  `-n` or `--nmer-size` to set the nmer size to a value from 1 to 8 inclusive
  (the default is 4) and `--num-bits` to set the number of bits in the byte
  fingerprint (the default is 2048).

  Each nmer is called a LINGO, as described in Vidal et al. (2005), JCIM
  (https://doi.org/10.1021/ci0496797). The set of LINGO counts is called a
  hologram. The similarity between two molecules can be approximated by
  comparing the similarity between two holograms.

  The Vidal et al. paper compared holograms with the integral Tanimoto. Grant
  et al. (2006), JCIM (https://doi.org/10.1021/ci6002152) proposed using the
  count/multiset Tanimoto, and demonstrated a fast search implementation based
  on finite state machines.

  Dalke (2026) J. Cheminf. (in press) proposed using superimposed coding to
  convert count fingerprints to byte fingerprints, such that a binary Tanimoto
  search of the byte fingerprints is an effective approximation for the count
  Tanimoto of the original count fingerprint.

  = Normalization =

  The original Vidal et al. paper normalized the SMILES strings by converting
  all closures to '0', 'Br' to 'R', and 'Cl' to 'L'.

  The Grant et al. paper normalized the closures but not Br or Cl. This is the
  default for chemfp's lingo implementation.

  Use the `--normalize` option to specify which normalization to use. The two
  available normalizations are "closures" and "BrCl". These can be combined
  with ',' or '|' to use both terms.

  For example, the Vidal normalization is "closures|BrCl" and the Grant
  normalization is "closures".

  The normalization term "Default" is the same as "closures", and "0" means no
  normalization.

  = SMILES and CXSMILES =

  The lingo2fps command only supports SMILES files. Use `--has-header` to
  ignore the first line.

  If the record includes a CXSMILES extension and the `--delimiter` is
  "to-eol" (the default) or "space" then chemfp uses heuristics to determine
  if the second term is a CXSMILES extension or an id. (With the "tab"
  delimiter there is no ambiguity.)

  By default the LINGO generation processes the SMILES field up until the
  first whitespace (which is either a space or tab). Use `--no-stop-at-
  whitespace` to process the entire SMILES field. When combined with a tab
  delimiter, this can be used to handle non-SMILES input. Remember to use
  `--normalize 0` in that case!

  = Canonicalization and non-SMILES input =

  The input is assumed to be properly canonicalized. If it is not, you might
  use `chemfp translate` to convert other formats into an appropriately
  canonicalized SMILES file.

  = Examples =

  To process a SMILES file from stdin:

    % echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps

  To specify the input SMILES file and output FPB file, normalization, and
  nmer size:

    % chemfp lingo2fps chembl_36.smi.gz --normalize Closures,BrCl --nmer-size 3 -o lingo.fpb

  To use the LINGO parameters from an FPB file:

    % echo 'Cn1cnc2c1c(=O)[nH]c(=O)n2C theobromine' | chemfp lingo2fps --using lingo.fpb
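
The help text above contrasts the count (multiset) Tanimoto of the original holograms with the binary Tanimoto of the byte fingerprints. The following Python sketch shows the two measures as they are commonly defined: the count Tanimoto is the sum of per-feature minimum counts divided by the sum of per-feature maximum counts, and the binary Tanimoto is the number of shared on-bits divided by the number of on-bits in either fingerprint. The function names are hypothetical, returning 0.0 for two empty inputs is an assumed convention, and the sketch does not reproduce chemfp's superimposed coding step.

```python
from collections import Counter

def count_tanimoto(a, b):
    """Multiset Tanimoto of two {feature: count} mappings."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0  # assumed convention for empty input

def binary_tanimoto(fp1, fp2):
    """Binary Tanimoto of two equal-length byte fingerprints."""
    x = int.from_bytes(fp1, "big")
    y = int.from_bytes(fp2, "big")
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0

# Two small made-up holograms: min sums to 1, max sums to 6
a = Counter({"CC": 3, "CO": 1})
b = Counter({"CC": 1, "CN": 2})
score = count_tanimoto(a, b)
```

With real data, `count_tanimoto` would be applied to the LINGO holograms and `binary_tanimoto` to the decoded hex fingerprints from the FPS output, for example via `bytes.fromhex`.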