chemfp translate
The “chemfp translate” command-line tool translates between structure file formats.
It can use one of the supported third-party cheminformatics toolkits to convert from one format to another, or coordinate two toolkits so that the first reads the input records, which are then passed in an intermediate format to the second for output in another format.
There is no equivalent high-level function in the Python API. Instead, use the chemfp.toolkit API to read and write molecule records.
For example:
# Pick one
from chemfp import rdkit_toolkit as T
#from chemfp import openeye_toolkit as T
#from chemfp import openbabel_toolkit as T
#from chemfp import cdk_toolkit as T
with T.read_molecules("input.smi") as reader:
    with T.open_molecule_writer("output.sdf") as writer:
        writer.write_molecules(reader)
See also chemfp.toolkit.translate_record() for a way to translate a single record using the toolkit API.
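The two-toolkit mode of the command-line tool can be approximated with the same toolkit API. The following is a minimal sketch, not chemfp's own implementation. The helper name `translate_via` is hypothetical, and it assumes that both toolkit modules provide `read_molecules()`, `create_string()`, `parse_molecule()` and `open_molecule_writer()`, and that the writer has a `write_molecule()` method.

```python
def translate_via(in_toolkit, out_toolkit, source, destination, via="sdf"):
    """Read records with one toolkit and write them with another.

    Hypothetical helper. Both toolkit arguments are assumed to be chemfp
    toolkit modules (for example chemfp.rdkit_toolkit and
    chemfp.openbabel_toolkit). Each molecule is passed between them as a
    text record in the `via` format.
    """
    count = 0
    with in_toolkit.read_molecules(source) as reader:
        with out_toolkit.open_molecule_writer(destination) as writer:
            for mol in reader:
                # Serialize the molecule with the input toolkit ...
                record = in_toolkit.create_string(mol, via)
                # ... then re-parse it with the output toolkit and write it.
                writer.write_molecule(out_toolkit.parse_molecule(record, via))
                count += 1
    return count
```

With chemfp and both toolkits installed, a call might look like `translate_via(chemfp.rdkit_toolkit, chemfp.openbabel_toolkit, "input.smi", "output.sdf")`. Passing the toolkits as arguments keeps the helper independent of which toolkits are available.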
chemfp translate command-line options
The following comes from chemfp translate --help:
Usage: chemfp translate [OPTIONS] [FILENAMES]...
Translate between two structure file formats
Options:
-T, --in-toolkit, --in-tk NAME[,NAME...]
One or more comma-separated toolkit names
(default: 'rdkit,openbabel,openeye,cdk')
used to process the input structures and, by
default, generate output structures. The
first available toolkit is used.
--in FORMAT Input structure format (default guesses from
filename, or 'smi')
--id-tag TAG Get the record id from the tag TAG instead
of the first line of the record.
--delimiter VALUE Delimiter style for SMILES and InChI files.
Forces '-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI
file. Forces '-R has_header=1'.
-R NAME=VALUE Specify a reader argument
--cxsmiles / --no-cxsmiles Use --no-cxsmiles to disable the default
support for CXSMILES extensions. Forces '-R
cxsmiles=1' or '-R cxsmiles=0'.
--in-encoding CODEC Text encoding of the input file (default:
'utf8'). Unsupported by most toolkits.
--in-encoding-errors [strict|ignore|replace|backslashreplace]
Specify error handling behavior when
decoding the input file
-o, --output FILENAME Save the structures to FILENAME
(default=stdout)
--out FORMAT Output structure format (default guesses
from filename, or 'smi')
-W NAME=VALUE Specify a writer argument
--out-encoding CODEC Text encoding of the output file (default:
'utf8'). Unsupported by most toolkits.
--out-encoding-errors [strict|ignore|replace|backslashreplace|xmlcharrefreplace|namereplace]
Specify error handling behavior when
encoding the output file
-U, --out-toolkit, --out-tk NAME[,NAME...]
Uses a two-step process to translate the
input records. Each input record is
processed with `--in-toolkit`, converted to
`--via` format, then reparsed by
`--out-toolkit` and converted to `--out`
format. The `--out-toolkit` accepts a
comma-separated list of toolkit names. The
first available toolkit is used.
--via FORMAT Intermediate structure format used to
transfer the structure data from
`--in-toolkit` to `--out-toolkit`.
(Default: 'sdf'.)
--errors [strict|report|ignore]
How should structure parse errors be
handled? (default=report)
--help Show this message and exit.
This program converts between two different molecular structure file
formats, either using a single cheminformatics toolkit, or using one
toolkit to read and another to write, with the structure information passed
between them using an intermediate structure format.
The program reads the input file using the first available toolkit in the
list specified by `--in-toolkit`. The default is
"rdkit,openbabel,openeye,cdk", which first tries RDKit, then Open Babel,
then OEChem, and lastly CDK. Alternatively, this option can be specified
using the CHEMFP_TOOLKIT environment variable.
It will attempt to infer the input structure file format based on the
filename extension, or assume "smi" input. Use `--in` to specify the format.
Use `--id-tag` to have the SDF reader get the id from the named SDF record
tag instead of the title line.
If the input is a SMILES file then `--cxsmiles` (which is set by default)
will attempt to parse it as CXSMILES. Use `--no-cxsmiles` to parse it as a
regular SMILES file. The SMILES readers assume the first line of the input
contains a structure. Use `--has-header` to have them skip the first line.
Finally, the default `--delimiter` style of 'to-eol' parses the SMILES,
optionally followed by a space character and the CXSMILES extension, and
treats the rest of the line as the record id. If the rest of the line is
space-, tab- or whitespace-separated then specify 'space', 'tab', or
'whitespace' as the `--delimiter`, respectively, or 'native' to use the
toolkit's native style.
The program will also attempt to infer the output structure file format
based on the filename extension (if specified by `--output` / `-o`), or
assume "smi" output. Use `--out` to specify the format. The default SMILES
writers do not include CXSMILES extensions.
Use the `-R` and `-W` options to configure reader and writer settings. For
example, `-W cxsmiles=1` will enable generating CXSMILES output.
Some of the toolkits, most notably chemfp's own "text" toolkit, support
alternative character encodings. The default encoding, supported by all of
the underlying chemistry toolkits, is "utf8". Use `--in-encoding` to specify
the input encoding and `--out-encoding` for the output encoding. The
`--in-encoding-errors` and `--out-encoding-errors` options specify how to handle errors
during input decoding and output encoding. For a description of what they
do, see https://docs.python.org/3/library/codecs.html#error-handlers .
NOTE: only extended ASCII encodings are supported.
Alternatively the processing can be done in two parts, with one toolkit to
read the input file and convert each record to an intermediate `--via`
format, which is then parsed by a second toolkit to generate the output
records.
The `-R` and `-W` options affect the readers and writers for both toolkits.
Use the setting namespacing to be more specific about which toolkit and/or
format gets which settings, like '-R smi.delimiter=whitespace' to affect
only the delimiter setting of the 'smi' reader format, or '-W
"openeye.*.cxsmiles=1"' to set 'cxsmiles' to 1 for all of the relevant
OEChem writers.
The `--errors` option specifies how to handle errors during processing. The
default 'report' prints error information to stderr and keeps processing.
Use 'ignore' to not generate error information, or 'strict' to stop
processing at that point. Note: translation will exit if the first 100
records all had errors, even with 'ignore'.
Examples:
1) Use RDKit to read SMILES from stdin and generate SDF to stdout:
% echo "C methane" | chemfp translate -T rdkit --out sdf
2) Canonicalize a SMILES file using Open Babel:
% chemfp translate -T openbabel input.smi -o output.smi
3) Use OEChem to read an XYZ file and generate SMILES to stdout:
% chemfp translate -T openeye chembl607364.xyz --out smi
4) Use the text toolkit to remove CXSMILES annotations from a SMILES file:
% chemfp translate -T text contains_cx_extensions.smi
5) Read a Latin-1 encoded SDF and convert to UTF-8:
% chemfp translate -T text --in-encoding latin1 latin1.sdf -o utf8.sdf
6) Read a Latin-1 encoded SDF with the id stored in the tag 'name' and
convert to SMILES using RDKit. (RDKit does not support Latin-1 directly, so
use the text toolkit to convert the records internally to UTF-8 encoded SDF
records, then use RDKit to convert those records to SMILES):
% chemfp translate -T text --in-encoding latin1 --id-tag name \
latin1.sdf -U rdkit
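The encoding examples above depend on how decoding and encoding errors are handled. The help text refers to Python's codec error handlers for the meaning of the `--in-encoding-errors` and `--out-encoding-errors` choices, so the following sketch shows what those handlers do in plain Python. It does not call chemfp. The sample text and the use of ASCII as the output codec are assumptions chosen only to trigger errors.

```python
# Sample text with one non-ASCII character, stored as Latin-1 bytes.
text = "caf\u00e9"
latin1_bytes = text.encode("latin1")  # b'caf\xe9', which is not valid UTF-8


def try_decode(data, handler):
    """Decode bytes as UTF-8; return None if the handler raises an error."""
    try:
        return data.decode("utf8", errors=handler)
    except UnicodeDecodeError:
        return None


def try_encode(value, handler):
    """Encode text as ASCII; return None if the handler raises an error."""
    try:
        return value.encode("ascii", errors=handler)
    except UnicodeEncodeError:
        return None


# The handler names accepted by --in-encoding-errors.
decoded = {h: try_decode(latin1_bytes, h)
           for h in ("strict", "ignore", "replace", "backslashreplace")}

# The handler names accepted by --out-encoding-errors.
encoded = {h: try_encode(text, h)
           for h in ("strict", "ignore", "replace", "backslashreplace",
                     "xmlcharrefreplace", "namereplace")}

for name, value in decoded.items():
    print("decode", name, repr(value))
for name, value in encoded.items():
    print("encode", name, repr(value))
```

Here 'strict' fails on the first invalid character, 'ignore' silently drops it, and the remaining handlers substitute a placeholder or an escape sequence.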