rdkit2fps
The “rdkit2fps” command (also available as the “chemfp rdkit2fps” subcommand) uses the RDKit toolkit to generate RDKit fingerprints from structure files.
This functionality is also available from Python using the high-level chemfp.rdkit2fps() function, following chemfp’s “*2fps” API.
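As a rough illustration of how the command-line options combine, the following Python sketch assembles an rdkit2fps argument list without running it. The helper name build_rdkit2fps_command and the filenames are hypothetical and not part of chemfp; only the flags themselves (--morgan2, --fpSize, --in, --errors, -o) are taken from the help output in this chapter.

```python
import shlex


def build_rdkit2fps_command(input_path, output_path, fp_flag="--morgan2",
                            fp_size=None, input_format=None, errors=None):
    """Return an argv list for rdkit2fps (hypothetical helper, not a chemfp API)."""
    argv = ["rdkit2fps", fp_flag]
    if fp_size is not None:
        # --fpSize sets the number of bits in the fingerprint.
        argv += ["--fpSize", str(fp_size)]
    if input_format is not None:
        # --in overrides the format guessed from the filename extension.
        argv += ["--in", input_format]
    if errors is not None:
        # --errors takes one of: strict, report, ignore.
        argv += ["--errors", errors]
    # -o names the output file; the input filename comes last.
    argv += ["-o", output_path, input_path]
    return argv


command = build_rdkit2fps_command(
    "compounds.smi.gz", "compounds.fps", fp_size=1024, errors="report"
)
print(shlex.join(command))
# prints: rdkit2fps --morgan2 --fpSize 1024 --errors report -o compounds.fps compounds.smi.gz
```

If rdkit2fps is installed, the resulting list could be passed to subprocess.run(command, check=True). Building a list rather than a single string avoids shell-quoting problems with unusual filenames.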
The rest of this chapter contains the output from rdkit2fps --help and rdkit2fps --help-formats.
rdkit2fps command-line options
The following comes from rdkit2fps --help:
Usage: rdkit2fps [OPTIONS] [FILENAMES]...
Generate fingerprints from a structure file using RDKit.
If specified, process the filenames, otherwise read from stdin.
Fingerprint types:
--RDK, --RDK/3 Generate RDK/3 fingerprints (default).
--RDK/2 Generate RDK/2 fingerprints.
--morgan1, --morgan1/2 Generate Morgan/2 fingerprints (radius=1).
--morgan2, --morgan2/2 Generate Morgan/2 fingerprints (radius=2).
--morgan, --morgan/2, --morgan3, --morgan3/2
Generate Morgan/2 fingerprints (radius=3).
--morgan4, --morgan4/2 Generate Morgan/2 fingerprints (radius=4).
--morgan1/1 Generate Morgan/1 fingerprints (radius=1).
--morgan/1, --morgan2/1 Generate Morgan/1 fingerprints (radius=2).
--morgan3/1 Generate Morgan/1 fingerprints (radius=3).
--morgan4/1 Generate Morgan/1 fingerprints (radius=4).
--torsion, --torsions, --torsion/2
Generate Topological Torsion/2 fingerprints.
--pair, --pairs, --pair/3 Generate AtomPair/3 fingerprints.
--pair/2 Generate AtomPair/2 fingerprints.
--maccs166, --maccs Generate MACCS fingerprints.
--avalon Generate Avalon fingerprints.
--pattern Generate (substructure) pattern
fingerprints.
--secfp Generate SECFP fingerprints, a circular
fingerprint based on fragment SMILES instead
of hashing.
--substruct Generate chemfp's PubChem-like substructure
fingerprints.
--rdmaccs, --rdmaccs/2 Generate chemfp's MACCS fingerprints,
version 2.
--rdmaccs/1 Generate chemfp's MACCS fingerprints,
version 1.
--type TYPE_STR Specify a chemfp type string
--using FILENAME Get the fingerprint type from the metadata
of a fingerprint file
Fingerprint options:
--fpSize INT number of bits in the fingerprint
[AtomPair/2, AtomPair/3, Avalon, Morgan/1,
Morgan/2, Pattern, RDKit/2, RDKit/3, SECFP,
Torsion]
--minPath INT Minimum number of bonds to include in the
subgraph (default=1) [RDKit/2, RDKit/3]
--maxPath INT Maximum number of bonds to include in the
subgraph (default=7) [RDKit/2, RDKit/3]
--nBitsPerHash INT Number of bits to set per path (default=2)
[RDKit/2]
--useHs 0|1 Include information about the number of
hydrogens on each atom (default=1) [RDKit/2,
RDKit/3]
--branchedPaths 0|1 If 1, both branched and unbranched paths
will be used in the fingerprint (default=1)
[RDKit/2, RDKit/3]
--useBondOrder 0|1 If 1, both bond orders will be used in the
path hashes (default=1) [RDKit/2, RDKit/3]
--fromAtoms, --from-atoms INT,INT,...
List of atom indices to use (default=None)
[AtomPair/2, AtomPair/3, Morgan/1, Morgan/2,
RDKit/2, RDKit/3, Torsion]
--countSimulation 0|1 if 1, simulate count fingerprints by setting
more bits for higher counts. (default=1 for
AtomPair/3, otherwise 0) [AtomPair/3,
Morgan/2, RDKit/3]
--countBounds INT,INT,... list of minimum counts needed to set the
corresponding bit during count simulation,
eg, '1,2,4,8' (default=None) [AtomPair/3,
Morgan/2, RDKit/3]
--numBitsPerFeature INT Number of bits to set per path (default=2)
[RDKit/3]
--radius INT radius for the Morgan or SECFP fingerprints
[Morgan/1, Morgan/2, SECFP]
--useFeatures 0|1 if 1, use chemical-feature invariants
(default=0) [Morgan/1, Morgan/2]
--useChirality 0|1 if 1, include chirality information
(default=0) [Morgan/1]
--useBondTypes 0|1 if 1, include bond type information
(default=1) [Morgan/1, Morgan/2]
--includeRedundantEnvironments 0|1
if 1, include redundant environments in the
fingerprint (default=0) [Morgan/1, Morgan/2]
--includeChirality 0|1 include chirality information [AtomPair/2,
AtomPair/3, Morgan/2, Torsion]
--includeRingMembership 0|1 if 1, include ring membership in the atom
invariants (default=1) [Morgan/2]
--minLength INT Minimum bond count for a pair (default=1)
[AtomPair/2]
--maxLength INT Maximum bond count for a pair (default=30)
[AtomPair/2]
--nBitsPerEntry INT Number of bits per entry (default=4)
[AtomPair/2, Torsion]
--use2D 0|1 If 1, use 2D instead of 3D distance matrix
(default=1) [AtomPair/2, AtomPair/3]
--minDistance INT minimum bond distance for two atoms to be
considered a pair (default=1) [AtomPair/3]
--maxDistance INT maximum bond distance for two atoms to be
considered a pair (default=30) [AtomPair/3]
--targetSize INT Number of bonds per torsion (default=4)
[Torsion]
--isQuery 0|1 Is the fingerprint for a query structure? (1
if yes, 0 if no) (default=0) [Avalon]
--bitFlags INT Bit flags, SSSBits are 32767 and similarity
bits are 15761407 (default=15761407)
[Avalon]
--rings 0|1 If 1, add SSSR ring to the fingerprint
(default=1) [SECFP]
--isomeric 0|1 If 1, use isomeric SMILES instead of non-
isomeric SMILES (default=0) [SECFP]
--kekulize 0|1 If 1, use Kekule SMILES instead of aromatic
SMILES (default=0) [SECFP]
--min_radius, --min-radius INT Minimum radius used to extract n-grams
(default=1) [SECFP]
Options:
--id-tag TAG Tag name containing the record id (SD files
only)
--delimiter VALUE Delimiter style for SMILES and InChI files.
Forces '-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI
file. Forces '-R has_header=1'.
-R NAME=VALUE Specify a reader argument
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the default
support for CXSMILES extensions. Forces '-R
cxsmiles=1' or '-R cxsmiles=0'.
--in FORMAT Input structure format (default guesses from
filename)
-o, --output FILENAME Save the fingerprints to FILENAME
(default=stdout)
--out FORMAT Output fingerprint format (default guesses
from output filename, or is 'fps')
--include-metadata / --no-metadata
With --no-metadata, do not include the
header metadata for FPS output.
--no-date Do not include the 'date' metadata in the
output header
--date STR An ISO 8601 date (like
'2025-02-07T11:10:15') to use for the 'date'
metadata in the output header
--errors [strict|report|ignore]
How should structure parse errors be
handled? (default=ignore)
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--help-formats List the available formats and reader
arguments
--version Show the version and exit.
--license-check Check the license and report results to
stdout.
--help Show this message and exit.
This program guesses the input structure format and the compression based on
the filename extension. If the guess fails then it assumes the input is an
uncompressed SMILES file.
If the data comes from stdin, or the guess based on extension name is wrong,
then use "--in" to change the default input format.
Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format.
Use the command-line option `--help-formats` to display a list of available
formats and reader arguments.
NOTE: The --RDK/2, --morgan/1 and --pair/2 fingerprint types use RDKit's
older function API to generate fingerprints while --RDK/3, --morgan/2, and
--pair/3 use the newer generator API. While the core approaches are the
same, parameter names have changed, as well as some of the generation
details, so the resulting fingerprints may have changed.
In particular, the default --morgan radius is now 3 instead of 2!
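The --countSimulation and --countBounds options in the help text above can be pictured with a small toy model. The sketch below assumes the scheme works roughly as follows: the fingerprint is divided into slots of len(countBounds) bits, each feature is folded into one slot, and bit i of that slot is set when the feature count reaches countBounds[i]. This is an illustration of the general idea, not RDKit's actual implementation. The function name simulate_count_bits is hypothetical, features are plain integer ids rather than real hashes, and the bounds 1,2,4,8 are simply the example from the help text.

```python
def simulate_count_bits(feature_counts, fp_size, count_bounds=(1, 2, 4, 8)):
    """Toy model of count simulation (assumed scheme, not RDKit code).

    feature_counts maps an integer feature id to how often it occurs.
    Returns the sorted list of bit positions that would be set.
    """
    # Each feature gets a slot of len(count_bounds) consecutive bits.
    num_slots = fp_size // len(count_bounds)
    bits = set()
    for feature_id, count in feature_counts.items():
        slot = feature_id % num_slots  # fold the feature id into a slot
        for offset, bound in enumerate(count_bounds):
            # Higher counts satisfy more bounds, so they set more bits.
            if count >= bound:
                bits.add(slot * len(count_bounds) + offset)
    return sorted(bits)


# Feature 5 occurs 3 times (meets bounds 1 and 2), feature 9 occurs once.
print(simulate_count_bits({5: 3, 9: 1}, fp_size=64))
# prints: [20, 21, 36]
```

In this model a feature seen three times sets two bits while a feature seen once sets one, so a bitwise comparison between fingerprints becomes partly sensitive to counts.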
Supported rdkit2fps formats
The following comes from rdkit2fps --help-formats:
These are the structure file formats that chemfp can read when using the RDKit
toolkit.
By default, chemfp uses the filename extension to determine the format type.
If the filename ends with ".gz" or ".zst" then it is interpreted as a gzip or
Zstandard compressed file, and the second-to-last extension is used to
determine the format type. Unknown or unsupported extensions are interpreted
as a SMILES file.
You may instead specify the file format by name (see below), which is
especially important when reading from stdin, which has no associated filename
extension.
The supported filename extensions are:
File Type      Extension(s)
==========     =============
SMILES         can, ism, isosmi, smi, usm
SDF            mdl, sd, sdf
InChI          inchi
Tripos Mol2    mol2
PDB            ent, pdb
Maestro        mae, maegz
FASTA          faa, fasta
The format can also be specified by name using the '--in' option:
File Type      Format name (append .gz or .zst if compressed)
==========     ==============================================
SMILES         smi, can, usm
SDF            sdf
InChI          inchi
Tripos Mol2    mol2
PDB            pdb
Maestro        mae
FASTA          fasta
The input format parsers can be configured with the "-R" option. For example,
the following reader arguments tell the SMILES readers that the fields are
whitespace delimited and the first line is a header.
-R delimiter=whitespace -R has_header=true
All of the input formats implement the 'sanitize' option, which is enabled by
default. Use "-R sanitize=false" to disable sanitization.
The SMILES format parsers use three additional reader arguments:
* 'delimiter' specifies the delimiter type. The default is
'to-eol'. The other values are 'tab', 'whitespace', 'space'
and 'native'. Use "-R delimiter=native" to match RDKit's
native delimiter style, which is 'whitespace'.
* 'has_header', if true will skip the first line of the
SMILES file (because it is a header line).
* 'cxsmiles' describes how to handle CXSMILES extensions. The
default (true) will have RDKit process the extension. If
false any extension will be treated as part of the identifier.
The SDF format parser supports two additional reader arguments:
* 'strictParsing', if false will disable strict parsing
* 'removeHs', if false will keep all of the hydrogens
The InChI format parser supports four additional reader arguments:
* 'delimiter' works the same as it does for the SMILES formats
* 'removeHs' works the same as it does for the SDF format
* 'treatWarningAsError', if true treats all warnings as errors
* 'logLevel' specifies the RDKit/InChI library log level,
as an integer
The Tripos Mol2 format parser supports two additional reader arguments:
* 'removeHs' works the same as it does for the SDF format
* 'cleanupSubstructures' if false disables standardizing
some substructures found in Mol2 files
The PDB format parser supports three additional reader arguments:
* 'removeHs' works the same as it does for the SDF format
* 'flavor', an input parameter with no documented meaning
* 'proximityBonding', if false will disable
automatic proximity bonding
The Maestro format parser supports one additional reader argument:
* 'removeHs' works the same as it does for the SDF format
The FASTA format parser supports one additional reader argument:
* 'flavor', an integer from 0 to 9. The values mean:
    0 - the sequence contains L-amino acids
    1 - allow lowercase for D-amino acids
    2 - RNA with no cap        6 - DNA with no cap
    3 - RNA with 5' cap        7 - DNA with 5' cap
    4 - RNA with 3' cap        8 - DNA with 3' cap
    5 - RNA with both caps     9 - DNA with both caps
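The extension-based format detection described in the help text above can be sketched in Python. This is a minimal model built only from the extension table, not chemfp's own code. The function name guess_file_type is hypothetical, and several details are assumptions of the sketch: the comparison is case-sensitive, "maegz" is treated as gzip-compressed Maestro, and a compressed file with an unknown extension is reported as compressed SMILES.

```python
import os

# Extension table copied from the --help-formats output.
EXTENSION_TO_FILE_TYPE = {
    "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
    "smi": "SMILES", "usm": "SMILES",
    "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
    "inchi": "InChI",
    "mol2": "Tripos Mol2",
    "ent": "PDB", "pdb": "PDB",
    "mae": "Maestro", "maegz": "Maestro",
    "faa": "FASTA", "fasta": "FASTA",
}
COMPRESSION_SUFFIXES = {"gz": "gzip", "zst": "zstd"}


def guess_file_type(filename):
    """Return (file_type, compression) guessed from the filename extension."""
    parts = os.path.basename(filename).split(".")
    compression = None
    # A trailing .gz or .zst marks compression; the format is then taken
    # from the second-to-last extension.
    if len(parts) > 1 and parts[-1] in COMPRESSION_SUFFIXES:
        compression = COMPRESSION_SUFFIXES[parts.pop()]
    extension = parts[-1] if len(parts) > 1 else ""
    if extension == "maegz" and compression is None:
        compression = "gzip"  # assumption: maegz means gzip-compressed Maestro
    # Unknown or missing extensions fall back to SMILES.
    return EXTENSION_TO_FILE_TYPE.get(extension, "SMILES"), compression


for name in ["drugs.sdf.gz", "ligands.smi", "protein.ent", "notes.txt"]:
    print(name, guess_file_type(name))
```

When the guess would be wrong, or when reading from stdin, the "--in" option gives the format explicitly, for example "--in sdf.gz".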