cdk2fps

The “cdk2fps” command (also available as the “chemfp cdk2fps” subcommand) uses the CDK toolkit to generate CDK and jCompoundMapper fingerprints from structure files.

This functionality is also available from Python using the high-level chemfp.cdk2fps() function, following chemfp’s “*2fps” API.

The rest of this chapter contains the output from chemfp cdk2fps --help, chemfp cdk2fps --help-formats, and chemfp cdk2fps --help-jcmapper.

cdk2fps command-line options

The following comes from cdk2fps --help:

Usage: cdk2fps [OPTIONS] [FILENAMES]...

  Generate fingerprints from a structure file using CDK.

  If specified, process the filenames, otherwise read from stdin.

Fingerprint types:
  --Daylight              Make Daylight-like fingerprints using
                          cdk.fingerprinter.Fingerprinter (default). Requires
                          explicit hydrogens.
  --GraphOnly             Make Daylight-like fingerprints (ignoring bond
                          types). Requires explicit hydrogens.
  --MACCS                 Make 166-bit MACCS keys using MACCSFingerprinter.
                          Requires explicit hydrogens.
  --EState                Make 79-bit EState fingerprints using
                          EStateFingerprinter. Requires fully implicit
                          hydrogens.
  --Extended              Make Daylight-like fingerprints extended with ring
                          feature bits, using ExtendedFingerprinter. Requires
                          explicit hydrogens.
  --Hybridization         Make Daylight-like fingerprints based on SP2
                          hybridization instead of aromaticity. Requires
                          explicit hydrogens.
  --KlekotaRoth           Make 4860-bit Klekota-Roth fingerprint.
  --Pubchem               Make 881-bit PubChem fingerprint. CDK 2.8 and
                          earlier required explicit hydrogens.
  --Substructure          Make 307-bit substructure fingerprint. Requires
                          fully implicit hydrogens.
  --ShortestPath          Make fingerprints based on the shortest path between
                          atoms, ring systems, and more, using
                          ShortestPathFingerprinter
  --ECFP0                 Make ECFP0-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP0)
  --ECFP2                 Make ECFP2-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP2)
  --ECFP4                 Make ECFP4-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP4)
  --ECFP6                 Make ECFP6-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP6)
  --FCFP0                 Make FCFP0-like circular fingerprints, using
                          CircularFingerprinter(CLASS_FCFP0)
  --FCFP2                 Make FCFP2-like circular fingerprints, using
                          CircularFingerprinter(CLASS_FCFP2)
  --FCFP4                 Make FCFP4-like circular fingerprints, using
                          CircularFingerprinter(CLASS_FCFP4)
  --FCFP6                 Make FCFP6-like circular fingerprints, using
                          CircularFingerprinter(CLASS_FCFP6)
  --AtomPairs2D           Make 780-bit atom-pair fingerprints adapted from Yap
                          Chun Wei's PaDEL.
  --substruct             Generate ChemFP substructure fingerprints (you
                          likely want to use --Pubchem instead)
  --rdmaccs, --rdmaccs/2  Generate chemfp's MACCS fingerprints, version 2.
  --rdmaccs/1             Generate chemfp's MACCS fingerprints, version 1.
  --type TYPE_STR         Specify a chemfp type string
  --using FILENAME        Get the fingerprint type from the metadata of a
                          fingerprint file

Fingerprint options:
  --size INT                     Fingerprint size (default=1024) [Daylight,
                                 ECFP, Extended, FCFP, GraphOnly,
                                 Hybridization, ShortestPath]
  --searchDepth INT              Search depth (default=7) [Daylight, Extended,
                                 GraphOnly, Hybridization]
  --pathLimit INT                Path limit (default=42000) [Daylight,
                                 Extended, GraphOnly, Hybridization]
  --hashPseudoAtoms 0|1          Include pseudo-atoms in path enumeration
                                 (default=0) [Daylight, Extended, GraphOnly,
                                 Hybridization]
  --perceiveStereochemistry 0|1  Re-perceive stereochemistry from 2D/3D
                                 coordinates (default=0) [ECFP, FCFP]

Options:
  --id-tag TAG                    Get the record id from the tag TAG instead
                                  of the first line of the record.
  --in FORMAT                     Input structure format (default guesses from
                                  filename)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output fingerprint format (default guesses
                                  from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                  Forces '-R delimiter=VALUE'.
  --has-header                    Skip the first line of a SMILES or InChI
                                  file. Forces '-R has_header=1'.
  -R NAME=VALUE                   Specify a reader argument
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                  support for CXSMILES extensions. Forces '-R
                                  cxsmiles=1' or '-R cxsmiles=0'.
  --errors [strict|report|ignore]
                                  How should structure parse errors be
                                  handled? (default=ignore)
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-formats                  List the available formats and reader
                                  arguments
  --help-jcmapper                 Describe how to use the jCompoundMapper
                                  fingerprint types.
  --version                       Show the version and exit.
  --license-check                 Check the license and report results to
                                  stdout.
  --help                          Show this message and exit.

  By default the CDK structure reader determines the file format and
  compression type based on the filename extension. Unknown filename
  extensions are treated as uncompressed SMILES files.

  If the data comes from stdin, or the guess based on extension name is wrong,
  then use the "--in FORMAT" option to change the default input format. For
  example:

     --in smi
     --in sdf.gz

  Use `-R` to specify format-specific reader arguments. In particular, use

     -R hydrogens=make-explicit

  to ensure that all implicit hydrogens are made explicit, and use

     -R hydrogens=make-implicit

  to ensure that all explicit hydrogens (including chiral hydrogens) are made
  implicit.

  Use `--help-formats` for a list of available formats and reader arguments.

  Use `--help-jcmapper` for information about chemfp's support for the
  jCompoundMapper fingerprint types.

  Examples:

  1) Generate Daylight fingerprints from a SMILES file and ensure that all
  implicit hydrogens are made explicit (CDK's Daylight fingerprinter requires
  explicit hydrogens):

    cdk2fps dataset.smi -R hydrogens=make-explicit

  -or-

    cdk2fps --Daylight dataset.smi -R hydrogens=make-explicit

  2) Generate EState fingerprints from a gzip-compressed SDF and ensure that
  all explicit hydrogens are made implicit (CDK's EState pattern definitions
  require implicit hydrogens). Save the results to "test.fps".

    cdk2fps --EState dataset.sdf.gz -R hydrogens=make-implicit -o test.fps

Supported cdk2fps formats

The following comes from cdk2fps --help-formats:

These are the structure file formats that chemfp can read when using the CDK
toolkit.

By default, chemfp uses the filename extension to determine the format type.
If the filename ends with ".gz" or ".zst" then it is interpreted as a gzip or
Zstandard compressed file, and the second-to-last extension is used to
determine the format type. Unknown or unsupported extensions are interpreted
as a SMILES file.

Note: Zstandard support may depend on the "zstandard" Python package and/or
the "zstd-jni" Java package. To install the Python package see
https://pypi.org/project/zstandard/ . To get the Java jar file, see
https://github.com/luben/zstd-jni and place it in your CLASSPATH.

You may instead specify the file format by name (see below), which is
especially important when reading from stdin, which has no associated filename
extension.

The supported filename extensions are:

   File Type    Extension(s)
   ==========   =============
     SMILES     can, ism, isosmi, smi, usm
      SDF       mdl, sd, sdf
     InChI      inchi

The format can also be specified by name using the '--in' option:

   File Type    Format name (append .gz or .zst if compressed)
   ==========   ==============================================
     SMILES     smi, can, usm
      SDF       sdf
     InChI      inchi
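
The extension rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not chemfp's own code: the names `EXTENSION_TO_FILE_TYPE` and `guess_format` are hypothetical, and case-sensitive matching is an assumption.

```python
import os

# Extension table copied from the "supported filename extensions" list above.
# The helper names here are hypothetical, not part of chemfp.
EXTENSION_TO_FILE_TYPE = {
    "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
    "smi": "SMILES", "usm": "SMILES",
    "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
    "inchi": "InChI",
}
COMPRESSION_EXTENSIONS = ("gz", "zst")


def guess_format(filename):
    """Return (file_type, compression) following the documented rules."""
    extensions = os.path.basename(filename).split(".")[1:]
    compression = None
    # A trailing ".gz" or ".zst" marks compression; the format then
    # comes from the second-to-last extension.
    if extensions and extensions[-1] in COMPRESSION_EXTENSIONS:
        compression = extensions.pop()
    # Unknown or missing extensions are treated as SMILES.
    if extensions:
        file_type = EXTENSION_TO_FILE_TYPE.get(extensions[-1], "SMILES")
    else:
        file_type = "SMILES"
    return file_type, compression
```

For example, `guess_format("dataset.sdf.gz")` gives `("SDF", "gz")`, while a file with an unrecognized extension falls back to SMILES. When that fallback is wrong, or when reading from stdin, pass the format explicitly with `--in`.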

The input format parsers can be configured with the "-R" option. For example,
the following reader arguments tell the SMILES readers that the fields are
whitespace delimited and the first line is a header.

   -R delimiter=whitespace -R has_header=true

The SMILES format parsers use six additional reader arguments:

   * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
     The other values are 'tab', 'whitespace', 'space' and 'native'.
     Use "-R delimiter=native" to match RDKit's native delimiter
     style, which is 'whitespace'.
   * 'has_header', if true will skip the first line of the SMILES
     file (because it is a header line).
   * 'cxsmiles' describes how to handle CXSMILES extensions. The
     default (true) will have CDK process the extension. If false
     any extension will be treated as part of the identifier.
   * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0'
     to skip this step.
   * 'hydrogens': The default of 'as-is' will keep the implicit and
     explicit hydrogens unchanged. See below for the other options.
   * 'implementation': The default 'cdk' uses CDK's IteratingSMILESReader()
     to parse the SMILES file. The 'chemfp' implementation uses chemfp's
     Python-based SMILES file parser and CDK's SmilesParser() to parse
     each SMILES string. The chemfp implementation is slower
     but may have better error-handling and/or reporting.

The SDF format parser supports six reader arguments:

   * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
     mode supports some records with recoverable errors. The strict
     mode fails to parse those records.
   * 'ForceReadAs3DCoordinates', with the default of '0' interprets
     V2000 records where all z-coordinates == 0.0 as 2D records. The
     value '1' tells CDK to interpret all records as 3D.
   * 'AddStereoElements' with the default of '1' adds 0D stereochemistry
     to V2000 records. The value of '0' skips that step.
   * 'InterpretHydrogenIsotopes' with the default of '1' interprets the
     atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
     '0' to disable this interpretation.
   * 'hydrogens': The default of 'as-is' will keep the implicit and
     explicit hydrogens unchanged. See below for the other options.
   * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
     to parse the SD file. The 'chemfp' implementation uses chemfp's
     SD file parser to parse records, and CDK's MDLReader(),
     MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
     chemfp implementation is about 50% slower than the cdk parser but
     may have better error-handling and/or reporting.

Use the 'hydrogens' reader argument to convert implicit hydrogens to explicit,
or to convert some or all explicit hydrogens to implicit. The four supported
values are:

  * "as-is": leave them unchanged
  * "make-explicit": convert all implicit hydrogens to explicit
  * "make-implicit": convert all explicit hydrogens to implicit
  * "make-nonchiral-implicit": convert non-chiral hydrogens to implicit

Many of the fingerprint types require explicit hydrogens, and some require
fully implicit hydrogens. These are noted in the `--help`.
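
As a rough sketch, the hydrogen requirements noted in `--help` can be turned into a lookup that picks the matching `-R hydrogens=...` argument when assembling a command line. The helper names `hydrogens_setting` and `build_command` are hypothetical. The sketch also assumes a CDK newer than 2.8, so it leaves `--Pubchem` input as-is.

```python
# Requirements transcribed from the `cdk2fps --help` text.
# Helper names are hypothetical; this is not chemfp's API.
REQUIRES_EXPLICIT = {"Daylight", "GraphOnly", "MACCS", "Extended", "Hybridization"}
REQUIRES_IMPLICIT = {"EState", "Substructure"}


def hydrogens_setting(fp_type):
    """Return the 'hydrogens' reader argument value for a fingerprint flag."""
    if fp_type in REQUIRES_EXPLICIT:
        return "make-explicit"
    if fp_type in REQUIRES_IMPLICIT:
        return "make-implicit"
    # Assumption: other types (including Pubchem with CDK > 2.8) need no change.
    return "as-is"


def build_command(fp_type, filename, output=None):
    """Build an argv list for cdk2fps without running it."""
    args = ["cdk2fps", "--" + fp_type, filename]
    setting = hydrogens_setting(fp_type)
    if setting != "as-is":
        args += ["-R", "hydrogens=" + setting]
    if output is not None:
        args += ["-o", output]
    return args
```

`build_command("EState", "dataset.sdf.gz", "test.fps")` reproduces the second example from the `--help` output. If cdk2fps is installed, the resulting list can be handed to `subprocess.run(args, check=True)`.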

The InChI format parser supports one reader argument:

   * 'delimiter' works the same as it does for the SMILES formats

Help using jCompoundMapper

The following comes from cdk2fps --help-jcmapper:

Chemfp supports several jCompoundMapper fingerprint types, but getting it to
work is a bit tricky.

How to get started with jCompoundMapper:

1) Get either `jCMapperCLI.jar` or `jCMapperLibOnly.jar` from
https://jcompoundmapper.sourceforge.net/ . Alternatively a copy of
jCMapperCLI.jar is available by unzipping jCMapperCLI.zip from
https://github.com/dahvida/NP_Fingerprints/tree/main/Scripts/FP_calc .

2) Download a recent version of the CDK from https://cdk.github.io/ .

3) Specify both JAR locations on your CLASSPATH, with the CDK jar *before* the
jCMapperCLI or jCMapperLibOnly. This is important as the jCMapperCLI includes
an old version of the CDK which chemfp does not support. By placing the CDK jar
first, the newer CDK is used instead of the older one. For example, I use the
CLASSPATH:

  $HOME/jars/cdk-2.9.jar:$HOME/jars/jCMapperCLI.jar

jCompoundMapper depends on CDK's old AtomContainer implementation, which is no
longer the default, but can be enabled by starting the JVM with the
"-DCdkUseLegacyAtomContainer=t" flag before loading the CDK. Unfortunately,
CDK's own fingerprint types do not work with the old AtomContainer
implementation, making it impossible to use both the CDK and jCompoundMapper
fingerprint types at the same time.

4) Install JPype, which is a Python bridge to the JVM. See
https://jpype.readthedocs.io/en/latest/ for details. It's what chemfp uses to
be able to work with both the CDK and jCompoundMapper. If you use pip you can
install it with:

  pip install JPype1

On the chemfp side, when it needs the CDK, and if the JVM isn't already
running, it first checks the CLASSPATH. If either 'jCMapperCLI.jar' or
'jCMapperLibOnly.jar' is present, it sets the backwards-compatibility flag
before starting the JVM. This will cause the CDK to print the following
warning to stderr:

  [WARN] Using the old AtomContainer implementation.
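
The CLASSPATH ordering and the jar check described above can be illustrated with a short Python sketch. It is not chemfp's implementation: `build_classpath` and `jvm_flags_for` are hypothetical helpers, and matching on the jar's base name is an assumption about how the check works.

```python
import os

# Jar names and JVM flag taken from the text above; helpers are hypothetical.
JCMAPPER_JARS = ("jCMapperCLI.jar", "jCMapperLibOnly.jar")
LEGACY_FLAG = "-DCdkUseLegacyAtomContainer=t"


def build_classpath(cdk_jar, jcmapper_jar):
    """Join the two jars with the CDK jar first, so the newer CDK wins."""
    return os.pathsep.join([cdk_jar, jcmapper_jar])


def jvm_flags_for(classpath):
    """Return the legacy-AtomContainer flag if a jCompoundMapper jar is listed."""
    names = [os.path.basename(entry)
             for entry in classpath.split(os.pathsep) if entry]
    if any(name in JCMAPPER_JARS for name in names):
        return [LEGACY_FLAG]
    return []
```

With a CLASSPATH built from `cdk-2.9.jar` and `jCMapperCLI.jar`, `jvm_flags_for()` returns the legacy flag. With only the CDK jar it returns an empty list, which corresponds to the case where CDK's own fingerprint types remain usable.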

The jCompoundMapper fingerprint types:

The jCompoundMapper fingerprint types are available in cdk2fps using the
`--type` option or, indirectly, with the `--using` flag, which gets the type
string from the "#type=" header of the fingerprint file. They are also
available from the Python API through `chemfp.get_fingerprint_type()`.

The supported fingerprint types and default type strings are:

 * Depth-First Search (DFS)
   - "jCMapper-DFS hashsize=4096 searchDepth=7 atomLabel=ELEMENT_NEIGHBOR"
 * All Shortest Paths (ASP):
   - "jCMapper-ASP hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR"
 * Local Path Environments (LSTAR):
   - "jCMapper-LSTAR hashsize=4096 searchDepth=6 atomLabel=ELEMENT_NEIGHBOR"
 * Topological Molprint-like fingerprints (RAD2D)
   - "jCMapper-RAD2D hashsize=4096 searchDepth=3 atomLabel=ELEMENT_SYMBOL"
 * 2-point topological pharmacophore pairs (PH2)
   - "jCMapper-PH2 hashsize=4096 searchDepth=8"
 * 3-point topological pharmacophore triples (PH3)
   - "jCMapper-PH3 hashsize=4096 searchDepth=5"
 * 2-point topological atom type pairs (AP2D)
   - "jCMapper-AP2D hashsize=4096 searchDepth=8 atomLabel=ELEMENT_NEIGHBOR"
 * 3-point topological atom type triplets (AT2D)
   - "jCMapper-AT2D hashsize=4096 searchDepth=5 atomLabel=ELEMENT_NEIGHBOR"

The generated fingerprint is `hashsize` bits long, where `hashsize` must be a
positive integer. Most fingerprint types take a `searchDepth`, which must be a
non-negative integer. It specifies the maximum path length, circular
environment radius, or shell radius to consider.

Most of the fingerprint types support alternative ways to assign a label to a
given atom type, based on different atom and extended atom properties, which
in turn affects fingerprint generation. The supported `atomLabel` methods are:

  * "CDK_ATOM_TYPES": CDK atom types (eg, 'C.sp2', 'O.minus')
  * "ELEMENT_SYMBOL": element symbol (eg, 'C', 'O')
  * "ELEMENT_NEIGHBOR": element and number of heavy atom neighbors (eg, 'C.2')
  * "ELEMENT_NEIGHBOR_RING": element, ring type, and number of heavy atom
     neighbors (eg, 'C.a.2')
  * "DAYLIGHT_INVARIANT": "Atomic number, number of heavy atom neighbors,
     valency minus the number of connected hydrogens, atomic mass,
     atomic charge, number of connected hydrogens" (eg, '6.2.3.12.0.1'
     for a carbon in a benzene ring)
  * "DAYLIGHT_INVARIANT_RING": DAYLIGHT_INVARIANT followed by a flag
     if the atom is in a ring (eg, '6.2.3.12.0.1.1')
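
To make the parameter rules concrete, here is a minimal Python sketch that fills in the defaults for a partial type string and validates the values. It only mirrors the tables above. `expand_type_string` is a hypothetical helper, not a chemfp function, and rejecting `atomLabel` for PH2 and PH3 is an assumption based on their default type strings.

```python
# Defaults and labels transcribed from the lists above.
ATOM_LABELS = {
    "CDK_ATOM_TYPES", "ELEMENT_SYMBOL", "ELEMENT_NEIGHBOR",
    "ELEMENT_NEIGHBOR_RING", "DAYLIGHT_INVARIANT", "DAYLIGHT_INVARIANT_RING",
}
DEFAULTS = {
    "jCMapper-DFS": {"hashsize": 4096, "searchDepth": 7, "atomLabel": "ELEMENT_NEIGHBOR"},
    "jCMapper-ASP": {"hashsize": 4096, "searchDepth": 8, "atomLabel": "ELEMENT_NEIGHBOR"},
    "jCMapper-LSTAR": {"hashsize": 4096, "searchDepth": 6, "atomLabel": "ELEMENT_NEIGHBOR"},
    "jCMapper-RAD2D": {"hashsize": 4096, "searchDepth": 3, "atomLabel": "ELEMENT_SYMBOL"},
    "jCMapper-PH2": {"hashsize": 4096, "searchDepth": 8},
    "jCMapper-PH3": {"hashsize": 4096, "searchDepth": 5},
    "jCMapper-AP2D": {"hashsize": 4096, "searchDepth": 8, "atomLabel": "ELEMENT_NEIGHBOR"},
    "jCMapper-AT2D": {"hashsize": 4096, "searchDepth": 5, "atomLabel": "ELEMENT_NEIGHBOR"},
}


def expand_type_string(type_str):
    """Return the full type string with defaults filled in (hypothetical helper)."""
    name, *terms = type_str.split()
    if name not in DEFAULTS:
        raise ValueError(f"unknown fingerprint type {name!r}")
    params = dict(DEFAULTS[name])
    for term in terms:
        key, sep, value = term.partition("=")
        # Assumption: a type only accepts the parameters in its default string.
        if not sep or key not in params:
            raise ValueError(f"unsupported parameter {term!r} for {name}")
        if key == "atomLabel":
            if value not in ATOM_LABELS:
                raise ValueError(f"unknown atomLabel {value!r}")
            params[key] = value
        else:
            number = int(value)
            # hashsize must be positive; searchDepth must be non-negative.
            minimum = 1 if key == "hashsize" else 0
            if number < minimum:
                raise ValueError(f"{key} must be at least {minimum}")
            params[key] = number
    return " ".join([name] + [f"{key}={value}" for key, value in params.items()])
```

For instance, `expand_type_string("jCMapper-DFS hashsize=128")` keeps the default `searchDepth` and `atomLabel` while overriding only the fingerprint size. A short type string like this is what gets passed to the `--type` option.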

For more information see "jCompoundMapper: An open source Java library and
command-line tool for chemical fingerprints" by Hinselmann, Rosenbaum, Jahn,
Fechner, and Zell, J. Cheminform. 3, 3 (2011)
https://doi.org/10.1186/1758-2946-3-3

Examples:

1) Generate depth-first fingerprints using the default parameters (4096-bit
fingerprints, up to 7 bonds, using the element symbol and number of heavy atom
neighbors as the atom types):

  cdk2fps --type jCMapper-DFS dataset.smi -o dataset.fps

2) Generate 128-bit depth-first fingerprints:

  cdk2fps --type "jCMapper-DFS hashsize=128" dataset.sdf -o dataset.fpb

3) Generate 1024-bit topological atom pair fingerprints using the Daylight
invariants:

  cdk2fps --type "jCMapper-AP2D hashsize=1024 atomLabel=DAYLIGHT_INVARIANT"