chemfp csv2fps

The “chemfp csv2fps” command-line tool processes a CSV file to generate fingerprints. It can extract ids and fingerprints directly from CSV columns, or it can read ids and structure records and use a cheminformatics toolkit to convert each structure into the specified fingerprint type.

This functionality is also available from Python using the csv_readers module; see the chemfp.csv_readers module documentation for an example.

The rest of this chapter contains the output from chemfp csv2fps --help.

chemfp csv2fps command-line options

The following comes from chemfp csv2fps --help:

Usage: chemfp csv2fps [OPTIONS] [FILENAMES]...

  Generate fingerprints from fields of a CSV file

Options:
  --id-column, --id-col COL       Column containing the record identifier, as
                                  column number or title (default: 1)
  --molecule-column, --mol-column, --mol-col COL
                                  Column containing the molecular structure,
                                  as column number or title (default: 2)
  --fingerprint-column, --fp-column, --fp-col COL
                                  Column containing the fingerprint, as column
                                  number or title (default: 2)
  --id-from-molecule, --id-from-mol
                                  If specified, get the identifier from the
                                  parsed molecule column instead of --id-col
  --type TYPE_STR                 The chemfp type string used to generate
                                  fingerprints from a molecule (default:
                                  'RDKit-Morgan')
  --using FILENAME                Get the fingerprint type from the metadata
                                  of a fingerprint file
  -d, --dialect STR               'auto' to use the filename (or default to
                                  'csv'), 'csv' for comma-separated, 'tsv' for
                                  tab-separated, or one of the dialects from
                                  Python's csv module (default: 'auto')
  --has-header / --no-header      With --has-header (the default), the first
                                  line contains column titles. Use --no-header
                                  if there is no title line.
  --errors [strict|report|ignore]
                                  Describe how to handle errors when parsing a
                                  molecule or fingerprint. If 'strict', write
                                  an error message (to stderr) and stop
                                  processing. If 'report', print an error
                                  message and continue processing. If
                                  'ignore', skip and continue processing.
                                  Default is 'report' for molecules and
                                  'strict' for fingerprints.
  --encoding CODEC                Specify the character encoding type. Default
                                  is 'utf8'. Other common options include
                                  'latin1', 'utf16', and 'cp1252'.
  --encoding-errors [strict|ignore|replace|backslashreplace]
                                  Specify how to handle character encoding
                                  errors. Use 'strict' to stop, 'ignore' to
                                  ignore, 'replace' to substitute with '�',
                                  and 'backslashreplace' for backslashed
                                  escape sequences. (default: strict)
  --in-compression [auto|none|gz|zst]
                                  Input compression format. The default of
                                  'auto' uses the filename extension. Specify
                                  'none' for uncompressed, 'gz' for gzip, and
                                  'zst' for ZStandard.
  --csv-errors [strict|report|ignore]
                                  If a required column is missing from a row,
                                  the default 'strict' treats the failure as
                                  an error. Use 'ignore' to silently skip all
                                  errors, and 'report' to print information to
                                  stderr and continue processing.
  --describe                      Show details about the files to process
                                  (dialect, titles, first row) but do not
                                  process the files.
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --version                       Show the version and exit.
  --help                          Show this message and exit.

Low-level dialect configuration:
  --separator, --sep CHAR         The character used to separate CSV fields
  --doublequote / --no-doublequote
                                  If true, unescape doubled --quotechar to a
                                  single quotechar
  --escapechar CHAR               If specified, this character removes any
                                  special meaning from the following character
  --quotechar CHAR                The character used to quote fields
                                  containing special characters, including
                                  newline
  --quoting [minimal|none]        If 'none' then do not process quote
                                  characters.
  --skipinitialspace / --no-skipinitialspace
                                  If True, ignore spaces immediately following
                                  the separator

Structure parsing options:
  --format NAME               Molecule column format name (default: 'smi')
  --id-tag TAG                Tag name containing the record id (SDF columns
                              only). Using this also enables
                              --id-from-molecule.
  --delimiter VALUE           Delimiter style for SMILES and InChI records.
                              Forces '-R delimiter=VALUE'.
  -R NAME=VALUE               Specify a reader argument used to configure the
                              structure parser
  --cxsmiles / --no-cxsmiles  Use --no-cxsmiles to disable the default support
                              for CXSMILES extensions. Forces '-R cxsmiles=1'
                              or '-R cxsmiles=0'.

Fingerprint decoders:
  --binary           Encoded with the characters '0' and '1'. Bit #0 comes
                     first. Example: 00100000 encodes the value 4
  --binary-msb       Encoded with the characters '0' and '1'. Bit #0 comes
                     last. Example: 00000100 encodes the value 4
  --hex              Hex encoded. Bit #0 is the first bit (1<<0) of the first
                     byte. Example: 01f2 encodes the value \x01\xf2 = 498
  --hex-lsb          Hex encoded. Bit #0 is the eighth bit (1<<7) of the first
                     byte. Example: 804f encodes the value \x01\xf2 = 498
  --hex-msb          Hex encoded. Bit #0 is the first bit (1<<0) of the last
                     byte. Example: f201 encodes the value \x01\xf2 = 498
  --base64           Base-64 encoded. Bit #0 is first bit (1<<0) of first
                     byte. Example: AfI= encodes value \x01\xf2 = 498
  --cactvs           CACTVS encoding, based on base64 and includes a version
                     and bit length
  --daylight         Daylight encoding, which is a base64 variant
  --decoder DECODER  Import and use the DECODER function to decode the
                     fingerprint

Output:
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output structure format (default guesses
                                  from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --type-str STR                  When extracting fingerprints, the string to
                                  use for the output '#type' header

  This program processes a CSV file to create a fingerprint file. The
  fingerprints may come from a column containing a structure record like
  SMILES or InChI, or a column containing pre-computed fingerprints in one of
  several encodings. If it is a molecule, this program will use the
  appropriate toolkit to generate fingerprints from the specified fingerprint
  type. If it is a pre-computed fingerprint, this program will decode it as
  specified.

  # Dialects

  There are many dialects of the CSV format. Use `--dialect` to specify one of
  the registered dialects. These are "csv" (or "excel") for an Excel-style
  comma-separated file, and "tsv" (or "excel-tab") for an Excel-style tab-
  separated file. These are equivalent to the "excel" and "excel-tab" dialects
  from Python's "csv" module, at
  https://docs.python.org/3/library/csv.html . The registered Python dialect
  'unix' is also supported.

  Alternatively, the low-level CSV options can be changed using `--separator`,
  `--doublequote` / `--no-doublequote`, `--escapechar`, `--quotechar`,
  `--quoting`, and `--skipinitialspace` / `--no-skipinitialspace`. These
  options start with the specified `--dialect` then modify the appropriate
  settings.
  See the Python csv module documentation for details. Note: the options which
  take a CHAR expect either a single character or one of the special names
  "tab", "backslash", "space", "quote", "doublequote", "singlequote", or
  "bang".
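
The "start from a dialect, then override" idea can be tried directly with Python's csv module, whose reader accepts a registered dialect name plus individual format parameters. The sketch below is an illustration only: the helper names are hypothetical, and the mapping from special names to characters is an assumption based on the names themselves ("quote" is omitted because this help text does not say which character it stands for).

```python
import csv
import io

# Assumed meaning of the special CHAR names; not taken from chemfp itself.
CHAR_NAMES = {
    "tab": "\t",
    "backslash": "\\",
    "space": " ",
    "doublequote": '"',
    "singlequote": "'",
    "bang": "!",
}


def resolve_char(value):
    """Return the single character for a CHAR option value."""
    if value in CHAR_NAMES:
        return CHAR_NAMES[value]
    if len(value) != 1:
        raise ValueError(f"expected one character or a special name: {value!r}")
    return value


def read_rows(text, dialect="excel", **overrides):
    """Parse text with a registered dialect, overriding selected settings."""
    return list(csv.reader(io.StringIO(text), dialect=dialect, **overrides))


# Excel-style rules, but with ';' as the separator.
rows = read_rows("id;smiles\nethane;CC\n", delimiter=resolve_char(";"))
print(rows)

# Excel-style rules, but with single quotes around fields containing commas.
quoted = read_rows("id,name\n1,'a,b'\n", quotechar=resolve_char("singlequote"))
print(quoted)
```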

  # Molecule processing

  By default the program expects the identifier in the first column and the
  molecule in the second, and it expects the first line to contain column
  titles.
  The default is to process the molecules as SMILES using RDKit to generate
  "RDKit-Morgan" fingerprints, and write the fingerprints to stdout in FPS
  format.

  Use `--type` to specify the fingerprint type as a chemfp fingerprint string,
  or use `--using` to get the fingerprint type from the metadata of an
  existing fingerprint file. There is no need to specify which toolkit to use
  as chemfp can determine that from the fingerprint type.

  Use `--no-header` if the first line does not contain column titles. (The
  default is `--has-header`.)

  If the input structures are not in "smi" format use `--format` to specify
  the correct one. For most cases this will be "smi", "smistring", "inchi", or
  "inchistring", though "molfile", "sdf", and other formats are also possible,
  depending on the toolkit. The default `--cxsmiles` will also parse (or
  ignore) CXSMILES extensions. Use `--no-cxsmiles` to disable that option.

  If the record id is stored in the structure record, rather than as one of
  the columns in the input file, then use `--id-from-molecule` to have csv2fps
  extract the id from the structure record. The `--delimiter` option affects
  how to parse the title from a SMILES file, and the `--id-tag` specifies
  which SDF record tag contains the id.

  If the id and molecule are not in the first and second columns,
  respectively, then use `--id-column` and `--molecule-column` to specify a
  different location. If the value is an integer, or "#" followed by an
  integer, then the integer is treated as a column number; the first column
  is column #1. If the value starts with '@' followed by a string, or the
  value is any other string, then the string is treated as a column
  title. Column titles cannot be specified with `--no-header`.

  For example:

    --id-column 3  -- id comes from the third column

    --mol-column 4  -- molecule comes from the fourth column

    --id-col name  -- id comes from the column with title 'name'

    --mol-col @9  -- molecule comes from the column with title '9'

    --mol-col #9  -- molecule comes from the ninth column
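
The column rules above can be summarized in a few lines of Python. This is a sketch of one reading of those rules, not chemfp's implementation; `parse_column_spec` and `get_field` are hypothetical names, and treating a value such as "#abc" as a title is an assumption.

```python
def parse_column_spec(value):
    """Classify a column value as ("number", n) or ("title", text)."""
    if value.startswith("@") and len(value) > 1:
        return ("title", value[1:])      # '@9' means the title '9'
    digits = value[1:] if value.startswith("#") else value
    if digits.isascii() and digits.isdigit():
        number = int(digits)
        if number < 1:
            raise ValueError("column numbers start at 1")
        return ("number", number)
    return ("title", value)              # any other string is a title


def get_field(row, spec, titles=None):
    """Fetch one field from a row, by 1-based number or by title."""
    kind, key = parse_column_spec(spec)
    if kind == "number":
        return row[key - 1]
    if titles is None:
        raise ValueError("column titles require a header line")
    return row[titles.index(key)]


for spec in ["3", "#9", "@9", "name"]:
    print(spec, parse_column_spec(spec))
```

In shells that treat an unquoted `#` at the start of a word as the beginning of a comment (bash, for example), quote the value, as in `--mol-col '#9'`.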

  # Fingerprint processing

  If `--fingerprint-column` is specified (in which case `--molecule-column`
  must not be specified) then it is the column containing pre-computed
  fingerprints. By default csv2fps will parse them as hex-encoded
  fingerprints. See the "Fingerprint decoders" section for alternative
  decoders.

  # Processing errors

  The `--errors` option describes how to handle structure processing errors.
  The default of "report" prints an error message to stderr and skips to the
  next record. "ignore" does not print an error message, and "strict" prints
  an error message and exits.

  The `--csv-errors` option describes how to handle CSV processing errors,
  like when the specified column does not exist for a given row. The options
  are the same as `--errors` but the default is "strict".

  If each of the first 100 records contains errors then csv2fps will give an
  error message and stop processing, even with "ignore".

  # Encodings

  If the input file is not UTF-8 encoded then use `--encoding` to specify the
  encoding type, like "utf16" or "cp1252". See the full list at
  https://docs.python.org/3/library/codecs.html#standard-encodings

  Use `--encoding-errors` to describe how to handle input which could not be
  decoded. For a description of the different options see
  https://docs.python.org/3/library/codecs.html#error-handlers
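
The error handlers named by `--encoding-errors` are the standard Python codec error handlers, so their effect can be previewed in Python. The sketch below uses an illustrative byte string and a hypothetical helper name; it shows the Python behavior only and makes no claim about how csv2fps reports the failure.

```python
raw = b"caf\xe9"  # "café" encoded as latin1; not valid UTF-8


def try_decode(data, encoding="utf8", errors="strict"):
    """Decode bytes, returning None where 'strict' would stop with an error."""
    try:
        return data.decode(encoding, errors=errors)
    except UnicodeDecodeError:
        return None


for handler in ["strict", "ignore", "replace", "backslashreplace"]:
    print(handler, repr(try_decode(raw, errors=handler)))

# Naming the right encoding avoids the problem entirely.
print(repr(try_decode(raw, encoding="latin1")))
```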

  # "sniff" dialect

  The special "sniff" dialect inspects the start of the input file to attempt
  to guess the format. It is not accurate enough to trust for all of the
  input, but it may be useful as an initial attempt, especially when combined
  with the `--describe` option.
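
Python's csv module has a comparable guessing tool, `csv.Sniffer`. Whether csv2fps uses it is not stated in this help text, so treat the following only as a sketch of the general idea: inspect a sample from the start of the input, guess the delimiter, then parse with the guessed dialect. The sample data is made up for illustration.

```python
import csv
import io

# A made-up sample from the start of a tab-separated file.
sample = "id\tsmiles\nethane\tCC\npropane\tCCC\n"

# Restrict the candidates to tab and comma to make the guess more robust.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,")
print(repr(dialect.delimiter))

# Parse the sample with the guessed dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows)
```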

  # --describe

  The `--describe` option prints an overview of each file instead of
  generating fingerprints. It prints the dialect details, the column titles
  (unless `--no-header` is used), and the contents of the first data line, if
  present. When combined with `--dialect sniff` this gives insight into how to
  process a previously unseen CSV file.

  # Examples:

  1) See the description of a MolPort file:

    % chemfp csv2fps --dialect sniff --describe \
    fulldb_smiles-000-000-000--000-499-999.txt.gz

  2) Process the MolPort file to generate OECircular fingerprints.  Use the
  'MOLPORTID' column for the identifiers and the 'SMILES' column for the
  structures. Save the result to 'molport.fps':

    % chemfp csv2fps --dialect tsv --id-col MOLPORTID --mol-col SMILES \
    --type OpenEye-Circular -o molport.fps \
    fulldb_smiles-000-000-000--000-499-999.txt.gz

  3) Process the MolPort file to generate RDKit Morgan fingerprints from the
  InChI column, use column #3 (MOLPORTID) for the ids, and send the results to
  stdout:

    % chemfp csv2fps --dialect tsv --id-col #3 --mol-col STANDARD_INCHI \
    --format inchistring fulldb_smiles-000-000-000--000-499-999.txt.gz