chemfp csv2fps
The “chemfp csv2fps” command-line tool processes a CSV file to generate fingerprints. It can extract ids and fingerprints directly from CSV columns, or it can read ids and structure records from the columns and use a cheminformatics toolkit to convert each structure into the specified fingerprint type.
This functionality is also available from Python using the csv_readers module, with an example at chemfp.csv_readers module.
The rest of this chapter contains the output from chemfp csv2fps --help.
chemfp csv2fps command-line options
The following comes from chemfp csv2fps --help:
Usage: chemfp csv2fps [OPTIONS] [FILENAMES]...
Generate fingerprints from fields of a CSV file
Options:
--id-column, --id-col COL Column containing the record identifier, as
column number or title (default: 1)
--molecule-column, --mol-column, --mol-col COL
Column containing the molecular structure,
as column number or title (default: 2)
--fingerprint-column, --fp-column, --fp-col COL
Column containing the fingerprint, as column
number or title (default: 2)
--id-from-molecule, --id-from-mol
If specified, get the identifier from the
parsed molecule column instead of --id-col
--type TYPE_STR The chemfp type string used to generate
fingerprints from a molecule (default:
'RDKit-Morgan')
--using FILENAME Get the fingerprint type from the metadata
of a fingerprint file
-d, --dialect STR 'auto' to use the filename (or default to
'csv'), 'csv' for comma-separated, 'tsv' for
tab-separated, or one of the dialects from
Python's csv module (default: 'auto')
--has-header / --no-header With --has-header (the default), the first
line contains column titles. Use --no-header
if there is no title line.
--errors [strict|report|ignore]
Describe how to handle errors when parsing a
molecule or fingerprint. If 'strict', write
an error message (to stderr) and stop
processing. If 'report', print an error
message and continue processing. If
'ignore', skip and continue processing.
Default is 'report' for molecules and
'strict' for fingerprints.
--encoding CODEC Specify the character encoding type. Default
is 'utf8'. Other common options include
'latin1', 'utf16', and 'cp1252'.
--encoding-errors [strict|ignore|replace|backslashreplace]
Specify how to handle character encoding
errors. Use 'strict' to stop, 'ignore' to
ignore, 'replace' to substitute with '�',
and 'backslashreplace' for backslashed
escape sequences. (default: strict)
--in-compression [auto|none|gz|zst]
Input compression format. The default of
'auto' uses the filename extension. Specify
'none' for uncompressed, 'gz' for gzip, and
'zst' for ZStandard.
--csv-errors [strict|report|ignore]
If a required column is missing from a row,
the default 'strict' treats the failure as
an error. Use 'ignore' to silently skip all
errors, and 'report' to print information to
stderr and continue processing.
--describe Show details about the files to process
(dialect, titles, first row) but do not
process the files.
--progress / --no-progress Show a progress bar (default: show unless
the output is a terminal)
--version Show the version and exit.
--help Show this message and exit.
Low-level dialect configuration:
--separator, --sep CHAR The character used to separate CSV fields
--doublequote / --no-doublequote
If true, unescape doubled --quotechar to a
single quotechar
--escapechar CHAR If specified, this character removes any
special meaning from the following character
--quotechar CHAR The character used to quote fields
containing special characters, including
newline
--quoting [minimal|none] If 'none' then do not process quote
characters.
--skipinitialspace / --no-skipinitialspace
If True, ignore spaces immediately following
the separator
Structure parsing options:
--format NAME Molecule column format name (default: 'smi')
--id-tag TAG Tag name containing the record id (SDF columns
only). Using this also enables --id-from-
molecule.
--delimiter VALUE Delimiter style for SMILES and InChI records.
Forces '-R delimiter=VALUE'.
-R NAME=VALUE Specify a reader argument used to configure the
structure parser
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the default support
for CXSMILES extensions. Forces '-R cxsmiles=1'
or '-R cxsmiles=0'.
Fingerprint decoders:
--binary Encoded with the characters '0' and '1'. Bit #0 comes
first. Example: 00100000 encodes the value 4
--binary-msb Encoded with the characters '0' and '1'. Bit #0 comes
last. Example: 00000100 encodes the value 4
--hex Hex encoded. Bit #0 is the first bit (1<<0) of the first
byte. Example: 01f2 encodes the value \x01\xf2 = 498
--hex-lsb Hex encoded. Bit #0 is the eighth bit (1<<7) of the first
byte. Example: 804f encodes the value \x01\xf2 = 498
--hex-msb Hex encoded. Bit #0 is the first bit (1<<0) of the last
byte. Example: f201 encodes the value \x01\xf2 = 498
--base64 Base-64 encoded. Bit #0 is first bit (1<<0) of first
byte. Example: AfI= encodes value \x01\xf2 = 498
--cactvs CACTVS encoding, based on base64 and includes a version
and bit length
--daylight Daylight encoding, which is a base64 variant
--decoder DECODER Import and use the DECODER function to decode the
fingerprint
Output:
-o, --output FILENAME Save the fingerprints to FILENAME
(default=stdout)
--out FORMAT Output structure format (default guesses
from output filename, or is 'fps')
--include-metadata / --no-metadata
With --no-metadata, do not include the
header metadata for FPS output.
--no-date Do not include the 'date' metadata in the
output header
--date STR An ISO 8601 date (like
'2025-02-07T11:10:15') to use for the 'date'
metadata in the output header
--type-str STR When extracting fingerprints, the string to
use for the output '#type' header
This program processes a CSV file to create a fingerprint file. The
fingerprints may come from a column containing a structure record like
SMILES or InChI, or a column containing pre-computed fingerprints in one of
several encodings. If it is a molecule, this program will use the
appropriate toolkit to generate fingerprints from the specified fingerprint
type. If it is a pre-computed fingerprint, this program will decode it as
specified.
# Dialects
There are many dialects of the CSV format. Use `--dialect` to specify one of
the registered dialects. These are "csv" (or "excel") for an Excel-style
comma-separated file, and "tsv" (or "excel-tab") for an Excel-style
tab-separated file. These are equivalent to the "excel" and "excel-tab"
dialects from Python's "csv" module, at
https://docs.python.org/3/library/csv.html . The registered Python dialect
'unix' is also supported.
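
The three registered dialects named above can be inspected directly with Python's csv module. The sketch below is only an illustration of the standard library, not of csv2fps internals, and the sample MolPort-style row is invented for the example.

```python
import csv
import io

# Show the settings behind each registered Python dialect.
for name in ("excel", "excel-tab", "unix"):
    d = csv.get_dialect(name)
    print(f"{name}: delimiter={d.delimiter!r} quotechar={d.quotechar!r} "
          f"doublequote={d.doublequote} skipinitialspace={d.skipinitialspace}")

# Parse a small tab-separated sample with the "excel-tab" dialect,
# which the text above describes as the equivalent of "tsv".
text = "MOLPORTID\tSMILES\nMolPort-1\tCCO\n"  # invented sample data
rows = list(csv.reader(io.StringIO(text), dialect="excel-tab"))
print(rows)
```

The "excel" and "excel-tab" dialects differ only in the separator, while "unix" also quotes every field when writing.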
Alternatively, the low-level CSV options can be changed using `--separator`,
`--doublequote` / `--no-doublequote`, `--escapechar`, `--quotechar`,
`--quoting`, and `--skipinitialspace` / `--no-skipinitialspace`. These options
start with the specified `--dialect` then modify the appropriate settings.
See the Python csv module documentation for details. Note: the options which
take a CHAR expect either a single character or one of the special names
"tab", "backslash", "space", "quote", "doublequote", "singlequote", or
"bang".
# Molecule processing
By default the program expects the identifier in the first column and the
molecule in the second, and it expects the first line to contain column titles.
The default is to process the molecules as SMILES using RDKit to generate
"RDKit-Morgan" fingerprints, and write the fingerprints to stdout in FPS
format.
Use `--type` to specify the fingerprint type as a chemfp fingerprint string,
or use `--using` to get the fingerprint type from the metadata of an
existing fingerprint file. There is no need to specify which toolkit to use
as chemfp can determine that from the fingerprint type.
Use `--no-header` if the first line does not contain column titles. (The
default is `--has-header`.)
If the input structures are not in "smi" format use `--format` to specify
the correct one. For most cases this will be "smi", "smistring", "inchi", or
"inchistring", though "molfile", "sdf", and other formats are also possible,
depending on the toolkit. The default `--cxsmiles` will also parse (or
ignore) CXSMILES extensions. Use `--no-cxsmiles` to disable that option.
If the record id is stored in the structure record, rather than as one of
the columns in the input file, then use `--id-from-molecule` to have csv2fps
extract the id from the structure record. The `--delimiter` option affects
how to parse the title from a SMILES file, and the `--id-tag` specifies
which SDF record tag contains the id.
If the id and molecule are not in the first and second columns,
respectively, then use `--id-column` and `--molecule-column` to specify a
different location. If the value is an integer, or "#" followed by an
integer, then the integer is treated as a column number; the first column
is column #1. If the value starts with '@' followed by a string, or the
value is any other string, then the string is treated as a column
title. Column titles cannot be specified with `--no-header`.
For example:
--id-column 3 -- id comes from the third column
--mol-column 4 -- molecule comes from the fourth column
--id-col name -- id comes from the column with title 'name'
--mol-col @9 -- molecule comes from the column with title '9'
--mol-col #9 -- molecule comes from the ninth column
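
The column-specifier rules above can be sketched as a small resolver. This is a minimal illustration of the rules as described here, not the csv2fps implementation; `resolve_column` is a hypothetical helper, and rejecting column number 0 is an assumption.

```python
def resolve_column(spec, titles=None):
    """Return the 0-based index for a column specifier (hypothetical helper).

    An integer, or '#' followed by an integer, is a 1-based column number.
    '@' followed by a string, or any other string, is a column title.
    """
    number = spec[1:] if spec.startswith("#") else spec
    if number.isascii() and number.isdigit():
        index = int(number) - 1
        if index < 0:
            # Assumption: column numbering starts at 1, so 0 is rejected.
            raise ValueError("column numbers start at 1")
        return index
    title = spec[1:] if spec.startswith("@") else spec
    if titles is None:
        # Mirrors the rule that titles cannot be used with --no-header.
        raise ValueError("column titles require a header line")
    return titles.index(title)  # raises ValueError if the title is missing


titles = ["id", "smiles", "name", "9"]
print(resolve_column("3"))             # third column
print(resolve_column("name", titles))  # column titled 'name'
print(resolve_column("@9", titles))    # column titled '9'
print(resolve_column("#9"))            # ninth column
```

The '@' prefix exists so that a title which looks like a number, such as '9', can still be selected by name.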
# Fingerprint processing
If `--fingerprint-column` is specified (in which case `--molecule-column`
must not be specified) then that is the column containing pre-computed
fingerprints. By default csv2fps will parse them as hex-encoded
fingerprints. See the "Fingerprint decoders" section for alternative
decoders.
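
The bit orderings listed in the "Fingerprint decoders" section can be checked with a few lines of plain Python. The functions below are a sketch based only on the examples in that section, not chemfp's own decoders; the function names are invented, and the layout used for binary strings longer than one byte is an assumption.

```python
import base64


def reverse_bits(byte):
    """Reverse the bit order within one byte (0x80 <-> 0x01)."""
    return int(f"{byte:08b}"[::-1], 2)


def decode_hex(text):
    return bytes.fromhex(text)


def decode_hex_lsb(text):
    # Bit #0 is the eighth bit of the first byte: reverse bits in each byte.
    return bytes(reverse_bits(b) for b in bytes.fromhex(text))


def decode_hex_msb(text):
    # Bit #0 is the first bit of the last byte: reverse the byte order.
    return bytes.fromhex(text)[::-1]


def decode_binary(text):
    # Bit #0 comes first. Assumption: bit i lives in byte i // 8.
    return bytes(int(text[i:i + 8][::-1], 2) for i in range(0, len(text), 8))


def decode_binary_msb(text):
    # Bit #0 comes last, so reverse the string and decode as above.
    return decode_binary(text[::-1])


def decode_base64(text):
    return base64.b64decode(text)


# The examples from the help text all describe the same two values.
for decoder, encoded in [(decode_hex, "01f2"), (decode_hex_lsb, "804f"),
                         (decode_hex_msb, "f201"), (decode_base64, "AfI=")]:
    print(decoder.__name__, encoded, decoder(encoded))
for decoder, encoded in [(decode_binary, "00100000"),
                         (decode_binary_msb, "00000100")]:
    print(decoder.__name__, encoded, decoder(encoded))
```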
# Processing errors
The `--errors` option describes how to handle structure processing errors.
The default of "report" prints an error message to stderr and skips to the
next record. "ignore" does not print an error message, and "strict" prints
an error message and exits.
The `--csv-errors` option describes how to handle CSV processing errors,
like when the specified column does not exist for a given row. The options
are the same as `--errors` but the default is "strict".
If each of the first 100 records contains errors then csv2fps will give an
error message and stop processing, even with "ignore".
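
The three error policies, plus the rule about the first 100 records, can be sketched as a generic processing loop. This is an illustration of the behaviour described above under stated assumptions, not the csv2fps source; `process`, `ProcessingStopped`, and `initial_limit` are hypothetical names, and `int` stands in for a real molecule or fingerprint parser.

```python
import sys


class ProcessingStopped(Exception):
    """Raised when processing must stop (hypothetical name)."""


def process(records, parse, errors="report", initial_limit=100):
    results = []
    any_success = False
    for recno, record in enumerate(records, 1):
        try:
            value = parse(record)
        except ValueError as err:
            if errors == "strict":
                # "strict": report the problem and stop.
                raise ProcessingStopped(f"record {recno}: {err}") from err
            if errors == "report":
                # "report": print to stderr, then skip the record.
                print(f"record {recno}: {err}; skipping", file=sys.stderr)
            # Even with "ignore", give up if nothing has parsed so far.
            if not any_success and recno >= initial_limit:
                raise ProcessingStopped(
                    f"the first {initial_limit} records all failed") from err
            continue
        any_success = True
        results.append(value)
    return results


print(process(["1", "x", "3"], int, errors="ignore"))
```

Stopping after an unbroken run of initial failures catches the common case where the wrong column or format was specified.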
# Encodings
If the input file is not UTF-8 encoded then use `--encoding` to specify the
encoding type, like "utf16" or "cp1252". See the full list at
https://docs.python.org/3/library/codecs.html#standard-encodings
Use `--encoding-errors` to describe how to handle input which could not be
decoded. For a description of the different options see
https://docs.python.org/3/library/codecs.html#error-handlers
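
The effect of each `--encoding-errors` choice can be seen with Python's own codec error handlers, which use the same names. The sketch below uses an invented byte string that is valid Latin-1 but not valid UTF-8; it shows standard Python behaviour and assumes csv2fps passes these names through to the same handlers.

```python
# "café" encoded as Latin-1; the final byte is not valid UTF-8.
raw = b"caf\xe9"

decoded = {}
for handler in ("ignore", "replace", "backslashreplace"):
    decoded[handler] = raw.decode("utf8", errors=handler)
    print(handler, repr(decoded[handler]))

# "strict" refuses to decode the bad byte.
try:
    raw.decode("utf8", errors="strict")
except UnicodeDecodeError:
    print("strict: decoding failed")

# Choosing the right encoding avoids the problem entirely.
print("latin1", repr(raw.decode("latin1")))
```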
# "sniff" dialect
The special "sniff" dialect inspects the start of the input file to attempt
to guess the format. It is not accurate enough to trust for all of the
input, but it may be useful as an initial attempt, especially when combined
with the `--describe` option.
# --describe
The `--describe` option prints an overview of each file instead of
generating fingerprints. It prints the dialect details, the column titles
(unless `--no-header` is used), and the contents of the first data line, if
present. When combined with `--dialect sniff` this gives insight into how to
process a previously unseen CSV file.
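
A rough Python analogue of combining `--dialect sniff` with `--describe` is to run the standard library's `csv.Sniffer` on the start of a file and print what it found. Whether csv2fps uses `csv.Sniffer` internally is an assumption, and the sample rows below are invented for the example.

```python
import csv

# Invented sample resembling the start of a tab-separated file.
sample = (
    "MOLPORTID\tSMILES\n"
    "MolPort-1\tCCO\n"
    "MolPort-2\tc1ccccc1\n"
)

# Restrict the candidate separators to make the guess more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,")
print("separator:", repr(dialect.delimiter))
print("quotechar:", repr(dialect.quotechar))

# Show the titles and the first data row, as --describe does.
rows = list(csv.reader(sample.splitlines(), dialect))
print("titles:", rows[0])
print("first row:", rows[1])
```

As with the "sniff" dialect, the guess is only a starting point and should be confirmed before processing a whole file.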
# Examples:
1) See the description of a MolPort file:
% chemfp csv2fps --dialect sniff --describe \
fulldb_smiles-000-000-000--000-499-999.txt.gz
2) Process the MolPort file to generate OECircular fingerprints. Use the
'MOLPORTID' column for the identifiers and the 'SMILES' column for the
structures. Save the result to 'molport.fps':
% chemfp csv2fps --dialect tsv --id-col MOLPORTID --mol-col SMILES \
--type OpenEye-Circular -o molport.fps \
fulldb_smiles-000-000-000--000-499-999.txt.gz
3) Process the MolPort file to generate RDKit Morgan fingerprints from the
InChI column, use column #3 (MOLPORTID) for the ids, and send the results to
stdout:
% chemfp csv2fps --dialect tsv --id-col '#3' --mol-col STANDARD_INCHI \
--format inchistring fulldb_smiles-000-000-000--000-499-999.txt.gz