rdkit2fps

The “rdkit2fps” command (also available as the “chemfp rdkit2fps” subcommand) uses the RDKit toolkit to generate RDKit fingerprints from structure files.

This functionality is also available from Python using the high-level chemfp.rdkit2fps() function, following chemfp’s “*2fps” API.

The rest of this chapter contains the output from rdkit2fps --help and rdkit2fps --help-formats.

rdkit2fps command-line options

The following comes from rdkit2fps --help:

Usage: rdkit2fps [OPTIONS] [FILENAMES]...

  Generate fingerprints from a structure file using RDKit.

  If specified, process the filenames, otherwise read from stdin.

Fingerprint types:
  --RDK, --RDK/3                  Generate RDK/3 fingerprints (default).
  --RDK/2                         Generate RDK/2 fingerprints.
  --morgan1, --morgan1/2          Generate Morgan/2 fingerprints (radius=1).
  --morgan2, --morgan2/2          Generate Morgan/2 fingerprints (radius=2).
  --morgan, --morgan/2, --morgan3, --morgan3/2
                                  Generate Morgan/2 fingerprints (radius=3).
  --morgan4, --morgan4/2          Generate Morgan/2 fingerprints (radius=4).
  --morgan1/1                     Generate Morgan/1 fingerprints (radius=1).
  --morgan/1, --morgan2/1         Generate Morgan/1 fingerprints (radius=2).
  --morgan3/1                     Generate Morgan/1 fingerprints (radius=3).
  --morgan4/1                     Generate Morgan/1 fingerprints (radius=4).
  --torsion, --torsions, --torsion/2
                                  Generate Topological Torsion/2 fingerprints.
  --pair, --pairs, --pair/3       Generate AtomPair/3 fingerprints.
  --pair/2                        Generate AtomPair/2 fingerprints.
  --maccs166, --maccs             Generate MACCS fingerprints.
  --avalon                        Generate Avalon fingerprints.
  --pattern                       Generate (substructure) pattern
                                  fingerprints.
  --secfp                         Generate SECFP fingerprints, a circular
                                  fingerprint based on fragment SMILES instead
                                  of hashing.
  --substruct                     Generate chemfp's PubChem-like substructure
                                  fingerprints.
  --rdmaccs, --rdmaccs/2          Generate chemfp's MACCS fingerprints,
                                  version 2.
  --rdmaccs/1                     Generate chemfp's MACCS fingerprints,
                                  version 1.
  --type TYPE_STR                 Specify a chemfp type string
  --using FILENAME                Get the fingerprint type from the metadata
                                  of a fingerprint file

Fingerprint options:
  --fpSize INT                    number of bits in the fingerprint
                                  [AtomPair/2, AtomPair/3, Avalon, Morgan/1,
                                  Morgan/2, Pattern, RDKit/2, RDKit/3, SECFP,
                                  Torsion]
  --minPath INT                   Minimum number of bonds to include in the
                                  subgraph (default=1) [RDKit/2, RDKit/3]
  --maxPath INT                   Maximum number of bonds to include in the
                                  subgraph (default=7) [RDKit/2, RDKit/3]
  --nBitsPerHash INT              Number of bits to set per path (default=2)
                                  [RDKit/2]
  --useHs 0|1                     Include information about the number of
                                  hydrogens on each atom (default=1) [RDKit/2,
                                  RDKit/3]
  --branchedPaths 0|1             If 1, both branched and unbranched paths
                                  will be used in the fingerprint (default=1)
                                  [RDKit/2, RDKit/3]
  --useBondOrder 0|1              If 1, both bond orders will be used in the
                                  path hashes (default=1) [RDKit/2, RDKit/3]
  --fromAtoms, --from-atoms INT,INT,...
                                  List of atom indices to use (default=None)
                                  [AtomPair/2, AtomPair/3, Morgan/1, Morgan/2,
                                  RDKit/2, RDKit/3, Torsion]
  --countSimulation 0|1           if 1, simulate count fingerprints by setting
                                  more bits for higher counts. (default=1 for
                                  AtomPair/3, otherwise 0) [AtomPair/3,
                                  Morgan/2, RDKit/3]
  --countBounds INT,INT,...       list of minimum counts needed to set the
                                  corresponding bit during count simulation,
                                  eg, '1,2,4,8' (default=None) [AtomPair/3,
                                  Morgan/2, RDKit/3]
  --numBitsPerFeature INT         Number of bits to set per path (default=2)
                                  [RDKit/3]
  --radius INT                    radius for the Morgan or SECFP fingerprints
                                  [Morgan/1, Morgan/2, SECFP]
  --useFeatures 0|1               if 1, use chemical-feature invariants
                                  (default=0) [Morgan/1, Morgan/2]
  --useChirality 0|1              if 1, include chirality information
                                  (default=0) [Morgan/1]
  --useBondTypes 0|1              if 1, include bond type information
                                  (default=1) [Morgan/1, Morgan/2]
  --includeRedundantEnvironments 0|1
                                  if 1, include redundant environments in the
                                  fingerprint (default=0) [Morgan/1, Morgan/2]
  --includeChirality 0|1          include chirality information [AtomPair/2,
                                  AtomPair/3, Morgan/2, Torsion]
  --includeRingMembership 0|1     if 1, include ring membership in the atom
                                  invariants (default=1) [Morgan/2]
  --minLength INT                 Minimum bond count for a pair (default=1)
                                  [AtomPair/2]
  --maxLength INT                 Maximum bond count for a pair (default=30)
                                  [AtomPair/2]
  --nBitsPerEntry INT             Number of bits per entry (default=4)
                                  [AtomPair/2, Torsion]
  --use2D 0|1                     If 1, use 2D instead of 3D distance matrix
                                  (default=1) [AtomPair/2, AtomPair/3]
  --minDistance INT               minimum bond distance for two atoms to be
                                  considered a pair (default=1) [AtomPair/3]
  --maxDistance INT               maximum bond distance for two atoms to be
                                  considered a pair (default=30) [AtomPair/3]
  --targetSize INT                Number of bonds per torsion (default=4)
                                  [Torsion]
  --isQuery 0|1                   Is the fingerprint for a query structure? (1
                                  if yes, 0 if no) (default=0) [Avalon]
  --bitFlags INT                  Bit flags, SSSBits are 32767 and similarity
                                  bits are 15761407 (default=15761407)
                                  [Avalon]
  --rings 0|1                     If 1, add SSSR ring to the fingerprint
                                  (default=1) [SECFP]
  --isomeric 0|1                  If 1, use isomeric SMILES instead of non-
                                  isomeric SMILES (default=0) [SECFP]
  --kekulize 0|1                  If 1, use Kekule SMILES instead of aromatic
                                  SMILES (default=0) [SECFP]
  --min_radius, --min-radius INT  Minimum radius used to extract n-grams
                                  (default=1) [SECFP]

Options:
  --id-tag TAG                    Tag name containing the record id (SD files
                                  only)
  --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                  Forces '-R delimiter=VALUE'.
  --has-header                    Skip the first line of a SMILES or InChI
                                  file. Forces '-R has_header=1'.
  -R NAME=VALUE                   Specify a reader argument
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                  support for CXSMILES extensions. Forces '-R
                                  cxsmiles=1' or '-R cxsmiles=0'.
  --in FORMAT                     Input structure format (default guesses from
                                  filename)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output fingerprint format (default guesses
                                  from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --errors [strict|report|ignore]
                                  How should structure parse errors be
                                  handled? (default=ignore)
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-formats                  List the available formats and reader
                                  arguments
  --version                       Show the version and exit.
  --license-check                 Check the license and report results to
                                  stdout.
  --help                          Show this message and exit.

  This program guesses the input structure format and the compression based on
  the filename extension. If the guess fails then it assumes the input is an
  uncompressed SMILES file.

  If the data comes from stdin, or the guess based on extension name is wrong,
  then use "--in" to change the default input format.

  Use the '-R' reader arguments option to pass in format-specific structure
  reader arguments. The details depend on the specific format.

  Use the command-line option `--help-formats` to display a list of available
  formats and reader arguments.

  NOTE: The --RDK/2, --morgan/1 and --pair/2 fingerprint types use RDKit's
  older function API to generate fingerprints while --RDK/3, --morgan/2, and
  --pair/3 use the newer generator API. While the core approaches are the
  same, parameter names have changed, as well as some of the generation
  details, so the resulting fingerprints may have changed.

In particular, the default --morgan radius is now 3 instead of 2!
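
The --countBounds description above ("minimum counts needed to set the corresponding bit") can be illustrated with a small Python sketch. It uses the '1,2,4,8' example from the help text and only shows which simulated slots a feature count would switch on. The function name is hypothetical, and the way RDKit places those slots inside the final fingerprint is not modeled here.

```python
def simulated_slots(count, count_bounds=(1, 2, 4, 8)):
    """Return one 0/1 flag per bound; a flag is 1 when count >= that bound.

    Hypothetical helper for illustration only. The bounds are the
    '1,2,4,8' example from the --countBounds help text.
    """
    return [1 if count >= bound else 0 for bound in count_bounds]


if __name__ == "__main__":
    # A feature seen more often switches on more of its slots.
    for count in (1, 2, 3, 5, 8):
        print(count, simulated_slots(count))
```

With these bounds a feature that occurs once sets one slot, while a feature that occurs five times sets three, which is how a binary fingerprint can carry approximate count information.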

Supported rdkit2fps formats

The following comes from rdkit2fps --help-formats:

These are the structure file formats that chemfp can read when using the RDKit
toolkit.

By default, chemfp uses the filename extension to determine the format type.
If the filename ends with ".gz" or ".zst" then it is interpreted as a gzip or
Zstandard compressed file, and the second-to-last extension is used to
determine the format type. Unknown or unsupported extensions are interpreted
as a SMILES file.

You may instead specify the file format by name (see below), which is
especially important when reading from stdin, which has no associated filename
extension.

The supported filename extensions are:

   File Type    Extension(s)
   ==========   =============
     SMILES     can, ism, isosmi, smi, usm
      SDF       mdl, sd, sdf
     InChI      inchi
  Tripos Mol2   mol2
      PDB       ent, pdb
    Maestro     mae, maegz
     FASTA      faa, fasta
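
The extension rules described above can be sketched in Python as follows. This is an illustration of the stated rules, not chemfp's implementation: the function and table names are hypothetical, the comparison is case-sensitive as an assumption, and "maegz" is simply treated as a Maestro extension because the table lists it that way.

```python
import os

# Extension table copied from the listing above.
EXTENSIONS = {
    "SMILES": ("can", "ism", "isosmi", "smi", "usm"),
    "SDF": ("mdl", "sd", "sdf"),
    "InChI": ("inchi",),
    "Tripos Mol2": ("mol2",),
    "PDB": ("ent", "pdb"),
    "Maestro": ("mae", "maegz"),
    "FASTA": ("faa", "fasta"),
}
EXT_TO_TYPE = {ext: ftype for ftype, exts in EXTENSIONS.items() for ext in exts}
COMPRESSION = {"gz": "gzip", "zst": "Zstandard"}


def guess_file_type(filename):
    """Return (file_type, compression) following the documented rules."""
    extensions = os.path.basename(filename).split(".")[1:]
    compression = None
    # A trailing .gz or .zst marks compression; look at the extension before it.
    if extensions and extensions[-1] in COMPRESSION:
        compression = COMPRESSION[extensions.pop()]
    ext = extensions[-1] if extensions else ""
    # Unknown or unsupported extensions fall back to SMILES.
    return EXT_TO_TYPE.get(ext, "SMILES"), compression


if __name__ == "__main__":
    for name in ("ligands.sdf.gz", "cmpds.smi", "protein.pdb.zst", "notes.txt"):
        print(name, guess_file_type(name))
```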

The format can also be specified by name using the '--in' option:

   File Type    Format name (append .gz or .zst if compressed)
   ==========   ==============================================
     SMILES     smi, can, usm
      SDF       sdf
     InChI      inchi
  Tripos Mol2   mol2
      PDB       pdb
    Maestro     mae
     FASTA      fasta
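
As a hedged sketch of how the format names above combine with a compression suffix, the following Python helper assembles an argument list for reading from stdin, where there is no filename extension to guess from. The helper name and its parameters are hypothetical. It only builds the argument list and does not run rdkit2fps. The example passes --morgan2 explicitly so the radius does not depend on which default --morgan selects.

```python
import shlex

# Format names accepted by --in, copied from the table above.
FORMAT_NAMES = {"smi", "can", "usm", "sdf", "inchi", "mol2", "pdb", "mae", "fasta"}
# Suffix to append to the format name when the input is compressed.
COMPRESSION_SUFFIXES = {None: "", "gzip": ".gz", "zstd": ".zst"}


def stdin_command(format_name, compression=None, fingerprint_flag="--morgan2"):
    """Build an rdkit2fps argument list for input arriving on stdin."""
    if format_name not in FORMAT_NAMES:
        raise ValueError(f"unknown format name: {format_name!r}")
    if compression not in COMPRESSION_SUFFIXES:
        raise ValueError(f"unknown compression: {compression!r}")
    in_format = format_name + COMPRESSION_SUFFIXES[compression]
    return ["rdkit2fps", fingerprint_flag, "--in", in_format]


if __name__ == "__main__":
    # prints: rdkit2fps --morgan2 --in sdf.gz
    print(shlex.join(stdin_command("sdf", "gzip")))
```

The resulting list could be handed to something like subprocess.run() together with the stdin stream, on a machine where rdkit2fps is installed.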

The input format parsers can be configured with the "-R" option. For example,
the following reader arguments tell the SMILES readers that the fields are
whitespace delimited and the first line is a header.

   -R delimiter=whitespace -R has_header=true

All of the input formats implement the 'sanitize' option, which is enabled by
default. Use "-R sanitize=false" to disable sanitization.
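
When rdkit2fps is driven from a script, the repeated "-R NAME=VALUE" options can be generated from a dictionary. The sketch below is one way to do that in Python. The helper name is hypothetical, and it spells booleans as "true" and "false" only because those are the spellings used in the examples above.

```python
def reader_options(reader_args):
    """Turn {"name": value} into a flat ["-R", "name=value", ...] list."""
    options = []
    for name, value in reader_args.items():
        if isinstance(value, bool):
            # Match the true/false spelling used in the examples above.
            value = "true" if value else "false"
        options += ["-R", f"{name}={value}"]
    return options


if __name__ == "__main__":
    args = reader_options(
        {"delimiter": "whitespace", "has_header": True, "sanitize": False}
    )
    print(" ".join(args))
```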

The SMILES format parsers use three additional reader arguments:

  * 'delimiter' specifies the delimiter type. The default is
    'to-eol'. The other values are 'tab', 'whitespace', 'space'
    and 'native'. Use "-R delimiter=native" to match RDKit's
    native delimiter style, which is 'whitespace'.
  * 'has_header', if true will skip the first line of the
    SMILES file (because it is a header line).
  * 'cxsmiles' describes how to handle CXSMILES extensions. The
    default (true) will have RDKit process the extension. If
    false any extension will be treated as part of the identifier.
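
The delimiter names are not defined in detail here, so the following Python sketch is only one assumed reading of them: 'to-eol' takes the first whitespace-separated token as the SMILES and the rest of the line as the identifier, while 'whitespace', 'tab' and 'space' split the line into fields and use the second field as the identifier. The function is hypothetical, does not handle 'native' or CXSMILES extensions, and is not chemfp's parser.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES-file line into (smiles, record_id).

    Assumed semantics, for illustration only:
      to-eol     - id is everything after the first run of whitespace
      whitespace - fields separated by any run of whitespace
      tab        - fields separated by a single tab
      space      - fields separated by a single space
    """
    line = line.rstrip("\n")
    if delimiter == "to-eol":
        fields = line.split(None, 1)
    elif delimiter == "whitespace":
        fields = line.split()
    elif delimiter == "tab":
        fields = line.split("\t")
    elif delimiter == "space":
        fields = line.split(" ")
    else:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    if not fields or not fields[0]:
        raise ValueError("missing SMILES field")
    record_id = fields[1] if len(fields) > 1 else None
    return fields[0], record_id


if __name__ == "__main__":
    line = "c1ccccc1O phenol (hydroxybenzene)\n"
    print(split_smiles_line(line))                 # id keeps the spaces
    print(split_smiles_line(line, "whitespace"))   # id is only "phenol"
```

Under this reading, the choice of delimiter matters mostly for identifiers that contain spaces, which survive intact only with 'to-eol' or 'tab'.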


The SDF format parser supports two additional reader arguments:

   * 'strictParsing', if false will disable strict parsing
   * 'removeHs', if false will keep all of the hydrogens

The InChI format parser supports four additional reader arguments:

   * 'delimiter' works the same as it does for the SMILES formats
   * 'removeHs' works the same as it does for the SDF format
   * 'treatWarningAsError', if true treats all warnings as errors
   * 'logLevel' specifies the RDKit/InChI library log level,
     as an integer

The Tripos Mol2 format parser supports two additional reader arguments:

   * 'removeHs' works the same as it does for the SDF format
   * 'cleanupSubstructures', if false will disable standardizing
      some substructures found in Mol2 files

The PDB format parser supports three additional reader arguments:

   * 'removeHs' works the same as it does for the SDF format
   * 'flavor', an input parameter with no documented meaning
   * 'proximityBonding', if false will disable automatic
       proximity bonding

The Maestro format parser supports one additional reader argument:

   * 'removeHs' works the same as it does for the SDF format

The FASTA format parser supports one additional reader argument:

   * 'flavor', an integer from 0 to 9. The values mean:
       0 - the sequence contains L-amino acids
       1 - allow lowercase for D-amino acids
       2 - RNA with no cap        6 - DNA with no cap
       3 - RNA with 5' cap        7 - DNA with 5' cap
       4 - RNA with 3' cap        8 - DNA with 3' cap
       5 - RNA with both caps     9 - DNA with both caps