chemfp translate

The “chemfp translate” command-line tool uses a supported cheminformatics toolkit to translate between structure file formats.

It can use one of the supported third-party cheminformatics toolkits to convert from one format to another, or coordinate the processing between two toolkits: the first reads the input records, which are passed via an intermediate format to the second, which writes them in the output format.

There is no equivalent high-level function in the Python API. Instead, use the chemfp.toolkit API to read and write molecule records (see some examples).

For example:

# Pick one
from chemfp import rdkit_toolkit as T
#from chemfp import openeye_toolkit as T
#from chemfp import openbabel_toolkit as T
#from chemfp import cdk_toolkit as T
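# Read each molecule from the input file and write it in the output format.
# The formats are guessed from the filename extensions.
with T.read_molecules("input.smi") as reader:
    with T.open_molecule_writer("output.sdf") as writer:
        writer.write_molecules(reader)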

See also chemfp.toolkit.translate_record() for a way to translate a single record using the toolkit API.
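
If you would rather drive the command-line tool itself from Python, one option is to assemble the argument list and pass it to the subprocess module. The following is a minimal sketch, not part of chemfp: build_translate_command() is a hypothetical helper, and it only uses options listed in the help text in the next section.

```python
import shlex


def build_translate_command(input_path, output_path=None, *, in_toolkit=None,
                            out_toolkit=None, via=None, in_format=None,
                            out_format=None, errors=None):
    """Return an argv list for 'chemfp translate'.

    Hypothetical helper for illustration; it is not part of chemfp.
    Options left as None are omitted so the tool uses its own defaults.
    """
    cmd = ["chemfp", "translate"]
    if in_toolkit:
        cmd += ["--in-toolkit", in_toolkit]
    if in_format:
        cmd += ["--in", in_format]
    if out_toolkit:
        cmd += ["--out-toolkit", out_toolkit]
    if via:
        cmd += ["--via", via]
    if out_format:
        cmd += ["--out", out_format]
    if errors:
        cmd += ["--errors", errors]
    if output_path:
        cmd += ["--output", output_path]
    cmd.append(input_path)
    return cmd


cmd = build_translate_command("input.smi", "output.sdf", in_toolkit="rdkit")
print(shlex.join(cmd))

# To actually run the command (assumes 'chemfp' is installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems with filenames that contain spaces.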

chemfp translate command-line options

The following comes from chemfp translate --help:

Usage: chemfp translate [OPTIONS] [FILENAMES]...

  Translate between two structure file formats

Options:
  -T, --in-toolkit, --in-tk NAME[,NAME...]
                                  One or more comma-separated toolkit names
                                  (default: 'rdkit,openbabel,openeye,cdk')
                                  used to process the input structures and, by
                                  default, generate output structures. The
                                  first available toolkit is used.
  --in FORMAT                     Input structure format (default guesses from
                                  filename, or 'smi')
  --id-tag TAG                    Get the record id from the tag TAG instead
                                  of the first line of the record.
  --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                  Forces '-R delimiter=VALUE'.
  --has-header                    Skip the first line of a SMILES or InChI
                                  file. Forces '-R has_header=1'.
  -R NAME=VALUE                   Specify a reader argument
  --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                  support for CXSMILES extensions. Forces '-R
                                  cxsmiles=1' or '-R cxsmiles=0'.
  --in-encoding CODEC             Text encoding of the input file (default:
                                  'utf8'). Unsupported by most toolkits.
  --in-encoding-errors [strict|ignore|replace|backslashreplace]
                                  Specify error handling behavior when
                                  decoding the input file
  -o, --output FILENAME           Save the structures to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output structure format (default guesses
                                  from filename, or 'smi')
  -W NAME=VALUE                   Specify a writer argument
  --out-encoding CODEC            Text encoding of the output file (default:
                                  'utf8'). Unsupported by most toolkits.
  --out-encoding-errors [strict|ignore|replace|backslashreplace|xmlcharrefreplace|namereplace]
                                  Specify error handling behavior when
                                  encoding the output file
  -U, --out-toolkit, --out-tk NAME[,NAME...]
                                  Uses a two-step process to translate the
                                  input records. Each input record is
                                  processed with `--in-toolkit`, converted to
                                  `--via` format, then reparsed by `--out-
                                  toolkit` and converted to `--out` format.
                                  The `--out-toolkit` accepts a comma-
                                  separated list of toolkit names. The first
                                  available toolkit is used.
  --via FORMAT                    Intermediate structure format used to
                                  transfer the structure data from `--in-
                                  toolkit` to `--out-toolkit`. (Default:
                                  'sdf'.)
  --errors [strict|report|ignore]
                                  How should structure parse errors be
                                  handled? (default=report)
  --help                          Show this message and exit.

  This program converts between two different molecular structure file
  formats, either using a single cheminformatics toolkit, or using one
  toolkit to read and another to write, with the structure information passed
  between them using an intermediate structure format.

  The program reads the input file using the first available toolkit in the
  list specified by `--in-toolkit`. The default is
  "rdkit,openbabel,openeye,cdk", which first tries RDKit, then Open Babel,
  then OEChem, and lastly CDK. Alternatively, this option can be specified
  using the CHEMFP_TOOLKIT environment variable.

  It will attempt to infer the input structure file format based on the
  filename extension, or assume "smi" input. Use `--in` to specify the format.
  Use `--id-tag` to have the SDF reader get the id from the named SDF record
  tag instead of the title line.

  If the input is a SMILES file then `--cxsmiles` (which is set by default)
  will attempt to parse it as CXSMILES. Use `--no-cxsmiles` to parse it as a
  regular SMILES file. The SMILES readers assume the first line of the input
  contains a structure. Use `--has-header` to have them skip the first line.
  Finally, the default `--delimiter` style of 'to-eol' parses the SMILES,
  optionally followed by a space character and the CXSMILES extension, and
  treats the rest of the line as the record id. If the rest of the line is
  space-, tab- or whitespace-separated then specify 'space', 'tab', or
  'whitespace' as the `--delimiter`, respectively, or 'native' to use the
  toolkit's native style.

  The program will also attempt to infer the output structure file format
  based on the filename extension (if specified by `--output` / `-o`), or
  assume "smi" output. Use `--out` to specify the format. The default SMILES
  writers do not include CXSMILES extensions.

  Use the `-R` and `-W` options to configure reader and writer settings. For
  example, `-W cxsmiles=1` will enable generating CXSMILES output.

  Some of the toolkits, most notably chemfp's own "text" toolkit, support
  alternative character encodings. The default encoding, supported by all of
  the underlying chemistry toolkits, is "utf8". Use `--in-encoding` to specify
  the input encoding and `--out-encoding` for the output encoding. The `--in-
  encoding-errors` and `--out-encoding-errors` specify how to handle errors
  during input decoding and output encoding. For a description of what they
  do, see https://docs.python.org/3/library/codecs.html#error-handlers .

  NOTE: only extended ASCII encodings are supported.

  Alternatively the processing can be done in two parts, with one toolkit to
  read the input file and convert each record to an intermediate `--via`
  format, which is then parsed by a second toolkit to generate the output
  records.

  The `-R` and `-W` options affect the readers and writers for both toolkits.
  Use the setting namespacing to be more specific about which toolkit and/or
  format gets which settings, like '-R smi.delimiter=whitespace' to affect
  only the delimiter setting of the 'smi' reader format, or '-W
  "openeye.*.cxsmiles=1"' to set 'cxsmiles' to 1 for all of the relevant
  OEChem writers.

  The `--errors` option specifies how to handle errors during processing. The
  default 'report' prints error information to stderr and keeps processing.
  Use 'ignore' to not generate error information, or 'strict' to stop
  processing at the first error. Note: translation will exit if the first 100
  records all have errors, even with 'ignore'.

  Examples:

  1) Use RDKit to read SMILES from stdin and generate SDF to stdout:

    % echo "C methane" | chemfp translate -T rdkit --out sdf

  2) Canonicalize a SMILES file using Open Babel:

    % chemfp translate -T openbabel input.smi -o output.smi

  3) Use OEChem to read an XYZ file and generate SMILES to stdout:

    % chemfp translate -T openeye chembl607364.xyz --out smi

  4) Use the text toolkit to remove CXSMILES annotations from a SMILES file:

    % chemfp translate -T text contains_cx_extensions.smi

  5) Read a Latin-1 encoded SDF and convert to UTF-8:

    % chemfp translate -T text --in-encoding latin1 latin1.sdf -o utf8.sdf

  6) Read a Latin-1 encoded SDF with the id stored in the tag 'name' and
  convert to SMILES using RDKit. (RDKit does not support Latin-1 directly, so
  use the text toolkit to read the records and convert them internally to
  UTF-8 encoded SDF records, then use RDKit to convert those records to
  SMILES):

    % chemfp translate -T text --in-encoding latin1 --id-tag name \
        latin1.sdf -U rdkit
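
The `--delimiter` styles described in the help text above can be easier to follow with a small illustration. The sketch below is a simplified approximation in plain Python, not chemfp's implementation: split_smiles_line() is a hypothetical function, it ignores CXSMILES extensions, and it does not cover the toolkit-specific 'native' style.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES file line into (smiles, record_id).

    Simplified illustration of the delimiter styles; record_id is None
    when the line contains only a SMILES string.
    """
    line = line.rstrip("\r\n")
    if delimiter == "to-eol":
        # The SMILES ends at the first whitespace; the rest of the line,
        # including any embedded spaces, is the record id.
        parts = line.split(None, 1)
    elif delimiter == "space":
        # Fields are separated by single space characters.
        parts = line.split(" ")
    elif delimiter == "tab":
        # Fields are separated by tab characters, so ids may contain spaces.
        parts = line.split("\t")
    elif delimiter == "whitespace":
        # Fields are separated by any run of spaces and/or tabs.
        parts = line.split()
    else:
        raise ValueError(f"unsupported delimiter style: {delimiter!r}")
    smiles = parts[0] if parts else ""
    record_id = parts[1] if len(parts) > 1 else None
    return smiles, record_id


print(split_smiles_line("C methane gas"))           # id is "methane gas"
print(split_smiles_line("C methane gas", "space"))  # id is "methane"
```

With 'to-eol' an id such as "methane gas" survives intact, while 'space' and 'whitespace' keep only the second field, which is why the choice matters for files with multi-word identifiers.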