sdf2fps

The “sdf2fps” command-line tool (also available as the “chemfp sdf2fps” subcommand) extracts the id and fingerprint from the title line and/or data items of each record in an SDF and outputs them in a fingerprint file format.
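
To make the extraction step concrete, here is a minimal sketch in plain Python of pulling the title line and one data item out of a single SD record. It is not chemfp's implementation. The function name extract_id_and_tag, the sample record, and its tag values are hypothetical, and the parser assumes simple "> <TAG>" data headers followed by value lines that end at a blank line.

```python
import re

# Matches a data item header such as "> <CIRCULAR>" or ">  <SMILES> (1)".
_TAG_HEADER = re.compile(r"^>.*?<([^>]+)>")


def extract_id_and_tag(record, fp_tag, id_tag=None):
    """Return (id, fingerprint_text) for one SD record given as a string.

    Hypothetical helper for illustration only. The id comes from the
    title line unless id_tag is given.
    """
    lines = record.splitlines()
    data = {}
    i = 0
    while i < len(lines):
        match = _TAG_HEADER.match(lines[i])
        i += 1
        if match is None:
            continue
        # Collect value lines up to the next blank line or record terminator.
        values = []
        while i < len(lines) and lines[i].strip() and lines[i].strip() != "$$$$":
            values.append(lines[i])
            i += 1
        data[match.group(1)] = "\n".join(values)

    if id_tag is None:
        record_id = lines[0].strip() if lines else None
    else:
        record_id = data.get(id_tag)
    return record_id, data.get(fp_tag)


# A made-up record with a title line and two data items.
record = """\
example-1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
C

> <CIRCULAR>
01f2

$$$$
"""

print(extract_id_and_tag(record, fp_tag="CIRCULAR"))
print(extract_id_and_tag(record, fp_tag="CIRCULAR", id_tag="SMILES"))
```

The first call takes the id from the title line, which corresponds to the default behavior described above. The second takes it from a data item, which corresponds to using --id-tag.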

The chemfp.sdf2fps() function implements similar functionality in the Python API.

The rest of this chapter contains the output from sdf2fps --help.

sdf2fps command-line options

The following comes from sdf2fps --help:

Usage: sdf2fps [OPTIONS] [FILENAMES]...

  Extract fingerprints from an SDF tag.

Options:
  --id-tag TAG                    Get the record id from the tag TAG instead
                                  of the first line of the record.
  --fp-tag TAG                    Get the fingerprint from tag TAG (required)
  --in FORMAT                     The input format (one of "sdf", "sdf.gz", or
                                  "sdf.zst")
  --num-bits INT                  Use the first INT bits of the input. Use
                                  only when the last 1-7 bits of the last byte
                                  are not part of the fingerprint. Unexpected
                                  errors will occur if these bits are not all
                                  zero.  [x>=1]
  --errors [strict|report|ignore]
                                  How should structure parse errors be
                                  handled? (default=strict)
  --software TEXT                 Use TEXT as the software description
  --type TEXT                     Use TEXT as the fingerprint type description
  --binary                        Encoded with the characters '0' and '1'. Bit
                                  #0 comes first. Example: 00100000 encodes
                                  the value 4
  --binary-msb                    Encoded with the characters '0' and '1'. Bit
                                  #0 comes last. Example: 00000100 encodes the
                                  value 4
  --hex                           Hex encoded. Bit #0 is the first bit (1<<0)
                                  of the first byte. Example: 01f2 encodes the
                                  value \x01\xf2 = 498
  --hex-lsb                       Hex encoded. Bit #0 is the eighth bit (1<<7)
                                  of the first byte. Example: 804f encodes the
                                  value \x01\xf2 = 498
  --hex-msb                       Hex encoded. Bit #0 is the first bit (1<<0)
                                  of the last byte. Example: f201 encodes the
                                  value \x01\xf2 = 498
  --base64                        Base-64 encoded. Bit #0 is first bit (1<<0)
                                  of first byte. Example: AfI= encodes value
                                  \x01\xf2 = 498
  --cactvs                        CACTVS encoding, based on base64 and
                                  includes a version and bit length
  --daylight                      Daylight encoding, which is a base64 variant
  --decoder DECODER               Import and use the DECODER function to
                                  decode the fingerprint
  --pubchem                       decode CACTVS substructure keys used in
                                  PubChem. Same as --software=CACTVS/unknown
                                  --type 'CACTVS-E_SCREEN/1.0 extended=2'
                                  --fp-tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output format, one of 'fps', 'fps.gz',
                                  'fps.zst', 'fpb', or 'flush' (default
                                  guesses from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --license-check                 Check the license and report results to
                                  stdout.
  --version                       Show the version and exit.
  --license-file FILENAME         Specify a chemfp license file
  --traceback                     Print the traceback on KeyboardInterrupt
  --help                          Show this message and exit.

  Examples:

  1) Process the PubChem file Compound_016000001_016500000.sdf.gz to extract
  the PubChem/CACTVS fingerprints, with the title as the id:

    sdf2fps --pubchem Compound_016000001_016500000.sdf.gz

  2) Process stdin to extract a hex-encoded fingerprint in the "CIRCULAR" tag
  and get the id from the "SMILES" tag. Save the results to
  "circular.fpb":

    sdf2fps --hex --id-tag SMILES --fp-tag CIRCULAR -o circular.fpb
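
The encoding options in the help text above differ only in bit and byte order, which is easier to see when the documented examples are decoded side by side. The following is a minimal sketch using only the Python standard library. The decode_* function names are hypothetical and are not chemfp's decoders. Each function is written to be consistent with the example given for its option, assuming that bit #0 of a fingerprint is stored as 1<<0 of the first byte.

```python
import base64


def reverse_bits_in_byte(value):
    """Reverse the order of the 8 bits in one byte value (0-255)."""
    return int(f"{value:08b}"[::-1], 2)


def decode_hex(text):
    # --hex: bytes are already in fingerprint order.
    return bytes.fromhex(text)


def decode_hex_lsb(text):
    # --hex-lsb: bit #0 is 1<<7 of the first byte, so reverse the bits
    # inside each byte.
    return bytes(reverse_bits_in_byte(b) for b in bytes.fromhex(text))


def decode_hex_msb(text):
    # --hex-msb: bit #0 is 1<<0 of the last byte, so reverse the byte order.
    return bytes.fromhex(text)[::-1]


def decode_base64(text):
    # --base64: standard base64 of the fingerprint bytes.
    return base64.b64decode(text)


def decode_binary(text):
    # --binary: the first character is bit #0.
    out = bytearray((len(text) + 7) // 8)
    for bitno, ch in enumerate(text):
        if ch == "1":
            out[bitno // 8] |= 1 << (bitno % 8)
    return bytes(out)


def decode_binary_msb(text):
    # --binary-msb: the last character is bit #0.
    return decode_binary(text[::-1])


# The examples from the help text all describe the same two fingerprints.
print(decode_hex("01f2"), decode_hex_lsb("804f"),
      decode_hex_msb("f201"), decode_base64("AfI="))
print(decode_binary("00100000"), decode_binary_msb("00000100"))
```

Under these assumptions the four hex and base64 examples decode to the same two bytes, \x01\xf2, and both binary examples decode to the single byte \x04.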