Toolkit API examples¶
This chapter gives many examples of how to use the toolkit API. For an
overview of the toolkit API functions, see chemfp.toolkit
. For
details about actual toolkit implementations, see
chemfp.cdk_toolkit
,
chemfp.openeye_toolkit
, chemfp.openbabel_toolkit
,
chemfp.rdkit_toolkit
, and chemfp.text_toolkit
.
Fingerprint search usually starts with a structure record, and not a
fingerprint. The functions
chemfp.read_molecule_fingerprints()
and
chemfp.read_molecule_fingerprints_from_string()
give a
quick way to read a file or string containing structure records as the
corresponding fingerprints.
Sometimes you want more control over the process. You might want to generate multiple fingerprints for the same structure and not want to reparse the structure record multiple times. Or you might want to return the search results as extra fields to the query SDF record instead of a simple list of values.
Chemfp uses a third-party chemistry toolkit to parse the records into a molecule, or compute the fingerprint for a given molecule. It’s not hard to write your own Open Babel, OEChem/OEGraphSim, or RDKit code to handle any of these or similar tasks. The problem comes in when you want to mix multiple fingerprint types, like to compare the default RDKit fingerprint to Open Babel’s FP2 fingerprint. You end up writing very different code for essentially the same fingerprint task.
There’s an old saying in computer science; all problems can be solved with another layer of indirection. The chemfp toolkit API follows that tradition. It’s a common API for molecule I/O which works across the four supported toolkits. It’s also my best effort at implementing a next generation API.
Bear in mind that it is only an I/O API. Chemfp is a fingerprint toolkit and it does not have a common molecule API. For that, look toward Cinfony.
Get a chemfp toolkit¶
In this section you’ll learn how to load a “toolkit” – a portable API layer above the actual chemistry toolkit – and how to check if a toolkit is available and has a valid license.
Each toolkit I/O adapter is implemented as a chemfp submodule. If you know the underlying chemistry toolkit is installed you can import the adapter directly:
>>> from chemfp import openbabel_toolkit
>>> from chemfp import openeye_toolkit
>>> from chemfp import rdkit_toolkit
>>> from chemfp import cdk_toolkit
Use chemfp.get_toolkit_names()
to get the available toolkit
names:
>>> chemfp.get_toolkit_names()
set(['cdk', 'openbabel', 'openeye', 'rdkit'])
This will try to import each module, which means it may take a second or more depending on the shared library load time for your system. (This overhead only occurs once.) The function returns a list of the modules that could be loaded and have a valid license.
You can use chemfp.get_toolkit()
to get the correct toolkit
module given a name; it raises a ValueError if the underlying toolkit
isn’t installed or the toolkit name is unknown:
>>> chemfp.get_toolkit("rdkit")
<module 'chemfp.rdkit_toolkit' from 'chemfp/rdkit_toolkit.py'>
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("openbabel")
<module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.py'>
>>> chemfp.get_toolkit("cdk")
<module 'chemfp.cdk_toolkit' from 'chemfp/cdk_toolkit.py'>
Existence isn’t enough to know if you can use a toolkit. OEChem
requires a license. Each I/O adapter implements
chemfp.toolkit.is_licensed()
. It returns True for Open Babel and
RDKit and the value of OEChemIsLicensed() for OEChem:
>>> for name in chemfp.get_toolkit_names():
... T = chemfp.get_toolkit(name)
... print(f"Toolkit {T.name!r} ({T.software}) is licensed? {T.is_licensed()}")
...
Toolkit 'cdk' (CDK/2.8) is licensed? True
Toolkit 'openbabel' (OpenBabel/3.1.0) is licensed? True
Toolkit 'openeye' (OEChem/20220607) is licensed? True
Toolkit 'rdkit' (RDKit/2021.09.4) is licensed? True
(Thanks OpenEye for an no-cost developer license to their toolkit! Thanks to the Open Babel, RDKit, and CDK developers for an open source license to their toolkits!)
There is currently no way to check if OEGraphSim is licensed; you’ll need to use native OpenEye code instead.
For fun I also showed the software
attribute, which gives more
detailed information about the toolkit version in a standardized
format.
There is still another way to get access to the toolkit modules. The
special objects chemfp.rdkit
, chemfp.cdk
,
chemfp.openeye
, chemfp.openbabel
will automatically
import and redirect to the underlying module, without need for an
explicit import:
>>> chemfp.rdkit.morgan2.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> chemfp.cdk.pubchem.from_smiles("CCO")[:8]
b'\x00\x02\x04\x00\x00\x00\x00\x00'
>>> chemfp.openeye.tree.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> chemfp.openbabel.fp2.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'
These objects should not be imported into your module because it’s all
too easy to confuse them with the actual toolki module name. Instead,
either use chemfp.<toolkit>
or import the actual chemfp wrapper
module.
Finally, use chemfp.has_toolkit()
to check if a toolkit is
available. In the following, I used one of my local testing
environments which has OEChem installed but not the other toolkits. (I
use venv to create
and manage these virtual environments; it’s a very useful tool.):
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False
>>> chemfp.has_toolkit("rdkit")
False
>>> chemfp.has_toolkit("cdk")
False
The other option is to catch the ValueError raised when trying to get an unavailable toolkit:
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("cdk")
Traceback (most recent call last):
File "chemfp/__init__.py", line 2188, in get_toolkit
from . import cdk_toolkit as tk
File "chemfp/cdk_toolkit.py", line 37, in <module>
import jpype # Must install JPype so chemfp can interface to the CDK jar
^^^^^^^^^^^^
ModuleNotFoundError: No module named 'jpype'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/__init__.py", line 2191, in get_toolkit
raise ValueError("Unable to get toolkit %r: %s" % (toolkit_name, err))
ValueError: Unable to get toolkit 'cdk': No module named 'jpype'
>>>
>>> chemfp.get_toolkit("schrodinger")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/__init__.py", line 2199, in get_toolkit
raise ValueError("Toolkit %r is not supported" % (toolkit_name,))
ValueError: Toolkit 'schrodinger' is not supported
This is a bit more complicated to do, but does have the advantage of giving access to an error message.
Parse and create SMILES¶
In this section you’ll learn how to parse a SMILES into a molecule, convert a molecule into a SMILES, and the difference between a SMILES record and a SMILES string. You will need a chemistry toolkit for this and most of the examples in this chapter.
The chemfp toolkit I/O API is the same across the different toolkits, and examples with one will generally work with the other, except for essential differences like the specific formats available, the chemistry differences in how to interpret a record, the error messages, and reader and writer arguments.
For most examples I’ll use T
as the toolkit module name, rather
than specify a specific toolkit. My examples will be based on RDKit,
but you can use any of the following, if available on your system:
# Choose one of these
from chemfp import openeye_toolkit as T
from chemfp import openbabel_toolkit as T
from chemfp import rdkit_toolkit as T
from chemfp import cdk_toolkit as T
I’ll parse the SMILES string for phenol as a toolkit molecule, then convert the toolkit molecule into its canonical isomeric SMILES string.
>>> from chemfp import rdkit_toolkit as T
>>> T.translate_record("c1ccccc1O", in_format="smistring", out_format="smistring")
'Oc1ccccc1'
The “smistring” format name means that the input is a SMILES string and there is no identifier.
The default input and output format is “smi”, which processes a “record” of a SMILES file, which contains both a SMILES and an optional identifier:
>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'
The translate_record
function is a helper based on lower-level
components to parse a record into a molecule object, and to create a
record from a molecule object. Here’s what those look like, using
chemfp.toolkit.parse_molecule()
and
chemfp.toolkit.create_string()
:
>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x103559980>
>>> T.create_string(mol, "smistring")
'Oc1ccccc1'
Chemfp follows the rule from the original SMILES paper that the SMILES string ends at the first whitespace. The following is valid across the chemfp toolkits API even if the underlying toolkit doesn’t accept the “junk” as part of a SMILES:
>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")
On the other hand, if you have a SMILES record, which is a SMILES string followed by an id and possibly other fields, then use the “smi” format name. That will parse the first characters as a SMILES string and parse the rest of the input, up to the end of the line, as the record id:
>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")
>>> T.get_id(mol) is None
True
>>> mol = T.parse_molecule("c1ccccc1O junk", "smi")
>>> T.get_id(mol)
'junk'
>>> mol = T.parse_molecule("c1ccccc1O flotsam and jetsam\nand more\n", "smi")
>>> T.get_id(mol)
'flotsam and jetsam'
I used the chemfp.toolkit.get_id()
helper function. While chemfp
doesn’t have a common molecule object, I found I do need a few
standard functions to manipulate toolkit molecules. Also,
toolkit.parse_molecule()
will only read the first record and ignore
trailing data, which is why the “and more” didn’t affect anything.
Now that the molecule has an id, it’s easy to see the difference between the “smistring” and “smi” in the output string:
>>> T.create_string(mol, "smistring")
'Oc1ccccc1'
>>> T.create_string(mol, "smi")
'Oc1ccccc1 flotsam and jetsam\n'
SMILES, SDF and InChI are so common that there are special toolkit functions to parse and create records in those formats, and to read and write files in those formats. Here is another way to parse and create the phenol record as “smi” format:
>>> mol = T.parse_smi("c1ccccc1O phenol")
>>> T.create_smi(mol)
'Oc1ccccc1 phenol\n'
Finally, you can pass an alternate id to the toolkit.create_string()
function. One example of when this is useful is when your identifier
comes from one field of a database and the SMILES string from another,
and you want to combine the results to get an SDF record:
>>> T.create_string(mol, "smi", id="nothing to see here")
'Oc1ccccc1 nothing to see here\n'
WARNING: Chemfp’s toolkit wrapper implementation may temporarily change then restore the toolkit molecule’s own identifier in order to get the correct output. This is not thread-safe.
Canonical, non-isomeric, and arbitrary SMILES¶
In this section you’ll learn the difference between the “smistring”,
“canstring”, and “usmstring” SMILES string formats and the “smi”,
“can”, and “usm” SMILES record formats. As with all examples which use
the generic T
toolkit name, you’ll need one of the supported
chemistry toolkits, and I’ll use RDKit as my underlying toolkit.
The SMILES format supports many different ways to represent the same molecule. “CO”, “OC”, “[OH][CH3]”, and “C3.O3” are four different SMILES strings for methanol. A canonicalization algorithm uses additional rules to create a unique SMILES representation for a given molecular graph. The different chemistry toolkit have different canonicalization algorithms, so each toolkit will likely generate a different canonical SMILES string for the same molecular graph.
There are multiple classes of canonical SMILES strings even in the same toolkit. The original SMILES format did not handle isotopes, chirality, or stereochemistry. The later extension to support these was called “isomeric SMILES”, to distinguish it from the original SMILES.
Because of the history, when people asked a toolkit for “SMILES” output they got non-isomeric non-canonical SMILES, while “canonical SMILES” gave them “non-isomeric canonical”. This caused subtle usability errors. Many people, including people like me who should have the experience to know better, expect canonical isomeric SMILES by default. But for over 20 years nearly all of the toolkits followed Daylight’s lead in how they did things.
I learned about the problem when OEChem 2.0 broke with tradition and fixed the mistake. It defined the default SMILES as canonical isomeric SMILES. If you make the effort to ask for a canonical SMILES you get canonical non-isomeric SMILES, and if you really want non-canonical, non-isomeric SMILES you can ask for the “usm” format.
Year later I learned that that Open Babel did the right thing well before OpenEye. Open Babel’s “canonical” is isomeric SMILES, you must specify the “i” option to not include isotopic or chiral markings, and they don’t even refer to “isomeric SMILES”.
Chemfp follows OpenEye’s naming convention. The “smistring” format generates a canonical isomeric SMILES string, the “canstring” format generates a canonical non-isomeric SMILES string, and the “usmstring” format generates a non-canonical non-isomeric SMILES string:
>>> from chemfp import openeye_toolkit as T # use your toolkit of choice
>>>
>>> mol = T.parse_smistring("[235P].[238U]")
>>> T.create_string(mol, "smistring")
'[235P].[238U]'
>>> T.create_string(mol, "canstring")
'[P].[U]'
>>> T.create_string(mol, "usmstring")
'[P].[U]'
Here’s evidence that the “usmstring” format is non-canonical:
>>> mol = T.parse_smistring("[238U].[235P]")
>>> T.create_string(mol, "usmstring")
'[U].[P]'
>>> T.create_string(mol, "smistring")
'[235P].[238U]'
These conventions also apply when creating “smi”, “can”, and “usm” strings:
>>> T.set_id(mol, "radioactive")
>>> T.create_string(mol, "smi")
'[235P].[238U] radioactive\n'
>>> T.create_string(mol, "can")
'[P].[U] radioactive\n'
>>> T.create_string(mol, "usm")
'[U].[P] radioactive\n'
By the way, chemfp.toolkit.parse_molecule()
doesn’t distinguish
between “smi”, “can” and “usm” as input SMILES records, nor between
“smistring”, “canstring” and “usmstring”. The format only makes a
difference for output. Later on you’ll see how to specify writer_args
to have more fine-grained control over the output SMILES format. (See
RDKit-specific SMILES reader_args and writer_args,
OpenEye-specific SMILES reader_args and writer_args,
Open Babel-specific SMILES reader_args and writer_args, and
CDK-specific SMILES reader_args and writer_args for toolkit-specific
examples.)
Use format to create a record in SDF format¶
In this section you’ll learn how to convert a toolkit molecule into an SDF record. This example will use the RDKit toolkit but the results will be substantially the same for any of the four supported chemistry toolkits.
To create an SDF record as a Unicode string, pass “sdf” as the
format to chemfp.toolkit.create_string()
or use the
chemfp.toolkit.create_sdf()
function:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print(T.create_string(mol, "sdf"))
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
$$$$
>>> print(T.create_sdf(mol))
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
$$$$
If you want it in V3000 format, specify the “sdf3k” format or use the
chemfp.toolkit.create_sdf3k()
function:
>>> print(T.create_sdf3k(mol))
RDKit
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 2 1 0 0 0
M V30 BEGIN ATOM
M V30 1 C 0.000000 0.000000 0.000000 0
M V30 2 O 0.000000 0.000000 0.000000 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 1 1 2
M V30 END BOND
M V30 END CTAB
M END
$$$$
There is also a chemfp.toolkit.create_bytes()
function which
returns the record as UTF-8 encoded Unicode, which all of the
supported cheminformatics toolkits use internally.
I’ll set the molecule’s name to the lower-case Greek letter ‘alpha’, and show you the interactive output from Python:
>>> T.set_id(mol, u"\N{GREEK SMALL LETTER ALPHA}")
>>> T.create_string(mol, "sdf")[:13]
'α\n RDKit '
>>> T.create_bytes(mol, "sdf")[:13]
b'\xce\xb1\n RDKit'
>>> print(T.create_sdf(mol)[:13])
α
RDKit
Use zlib record compression¶
In this section you’ll learn about the “zlib” compression option for single record parsers and byte string creation.
A record in SDF format can be large, but most of the content is repetetive. Often it’s better to store a zlib compressed record in a database instead of the full record. When I use zlib to compress each record of Compound_099000001_099500000.sdf.gz I get a 4.5-fold compression. That is, the uncompressed records take 73,024,092 bytes, the individually compressed records take 16,262,567 bytes, and the gzip compressed file takes 6,847,342 bytes. (Gzip is twice as good as individually compressed records because it can collect compression statistics across multiple records and build a better prediction model.)
Chemfp supports a zlib compression option for the record-oriented
functions, though not the file-oriented functions. To enable it, add
“.zlib” to the format string for
chemfp.toolkit.create_bytes()
. Here you can see how adding that
suffix reduces the record size:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print("uncompressed:", len(T.create_bytes(mol, "sdf")))
uncompressed: 228
>>> print("compressed:", len(T.create_bytes(mol, "sdf.zlib")))
compressed: 77
I’ll complete a round-trip conversion by parsing the compressed SD record to a molecule then converting it to a SMILES string:
>>> compressed = T.create_bytes(mol, "sdf.zlib")
>>> new_mol = T.parse_molecule(compressed, "sdf.zlib")
>>> T.create_string(new_mol, "smistring")
'CO'
The zlib option only works with chemfp.toolkit.create_bytes()
;
it does not work with chemfp.toolkit.create_string()
because the
latter only returns Unicode strings, and it’s possible for zlib to
return something which isn’t valid Unicode. Here’s what happens if you
try to use it anyway:
>>> T.create_string(mol, "sdf.zlib")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/rdkit_toolkit.py", line 419, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
File "chemfp/base_toolkit.py", line 1382, in create_string
raise ValueError("create_string() does not support compression. Use create_bytes()")
ValueError: create_string() does not support compression. Use create_bytes()
On the other hand, chemfp.toolkit.parse_molecule()
takes both
Unicode strings and byte strings as input. It treats byte strings
as being UTF-8 encoded.
Use zst record compression¶
In this section you’ll learn about the “zst” compression option for single record parsers and byte string creation.
Chemfp 3.4 added support for ZStandard compression in most places, including in the record-oriented functions, via the suffix “.zst” in the format name or filename. The following compares zlib and zst compression to the uncompressed size:
import chemfp
for toolkit_name in ("text", "rdkit", "openbabel", "openeye"):
T = chemfp.get_toolkit(toolkit_name)
with T.read_molecules("Compound_099000001_099500000.sdf.gz",
reader_args={"rdkit.sdf.removeHs": False}) as reader:
uncompressed_size = zlib_size = zst_size = 0
for mol in reader:
uncompressed_size += len(T.create_bytes(mol, "sdf"))
zlib_size += len(T.create_bytes(mol, "sdf.zlib"))
zst_size += len(T.create_bytes(mol, "sdf.zst"))
print("%r toolkit: uncompressed: %d zlib: %d (%.2f) zstd: %d (%.2f)" % (
toolkit_name, uncompressed_size, zlib_size, uncompressed_size/zlib_size,
zst_size, uncompressed_size/zst_size))
The output of the above is:
'text' toolkit: uncompressed: 73024092 zlib: 16262567 (4.49) zstd: 16976598 (4.30)
'rdkit' toolkit: uncompressed: 68180103 zlib: 15843096 (4.30) zstd: 16714426 (4.08)
'openbabel' toolkit: uncompressed: 73392094 zlib: 16295123 (4.50) zstd: 16985140 (4.32)
'openeye' toolkit: uncompressed: 73024092 zlib: 16269883 (4.49) zstd: 16977089 (4.30)
By default OEChem and Open Babel will keep hydrogens while RDKit
removes them, which makes the output SD files considerably
smaller. The reader_args
specifies rdkit.sdf.removeHs
so RDKit
will keep the hydrogens, which makes the size comparisons more
direct. The total RDKit size is still smaller than the other toolkits
because RDKit only writes 4 columns for each bond, while the others
use 7 columns.
Remember, compression effectiveness is a balance between compression time, compressed size, and decompression time. The zlib, gzip, and zst compression methods all support different compression levels. For zlib and gzip, 1 results in faster compression time but generally larger compressed sizes, and 9 gives the best compression at the cost of decreased performance. Zstandard also uses 1 for faster compression but uses 19 to get the maximum compression.
The compression level can be specified using the level
argument
of the chemfp functions which support compressed output, like
chemfp.toolkit.create_bytes()
, chemfp.toolkit.open_molecule_writer()
,
and save()
. It can be the numeric compression
level, or the words “min” for minimum compression, “default” for
default (for zlib and gzip, 3 for zstd), and “max” for maximum
compression at the expense of time.
Get a list of available formats and distinguish between input and output formats¶
In this section you’ll learn how to get the list of available formats for each object, and determine if a format can be used to get a toolkit molecule from a string record, or convert a toolkit molecule into a string record.
The toolkit’s chemfp.toolkit.get_formats()
function returns a
list of the available formats. On my computer RDKit supports 24
formats, OEChem 34, Open Babel (showing off its heritage) supports
a whopping 200, and CDK also has 23:
>>> from chemfp import rdkit_toolkit
>>> len(rdkit_toolkit.get_formats())
24
>>> rdkit_toolkit.get_formats()
[Format('rdkit/smi'), Format('rdkit/can'), Format('rdkit/usm'),
Format('rdkit/cxsmi'), Format('rdkit/sdf'), Format('rdkit/sdf3k'),
Format('rdkit/smistring'), Format('rdkit/canstring'),
Format('rdkit/usmstring'), Format('rdkit/cxsmistring'),
Format('rdkit/molfile'), Format('rdkit/rdbinmol'),
Format('rdkit/fasta'), Format('rdkit/sequence'),
Format('rdkit/helm'), Format('rdkit/mol2'), Format('rdkit/pdb'),
Format('rdkit/xyz'), Format('rdkit/mae'), Format('rdkit/inchi'),
Format('rdkit/inchikey'), Format('rdkit/inchistring'),
Format('rdkit/inchikeystring')]
>>>
>>> from chemfp import openeye_toolkit
>>> len(openeye_toolkit.get_formats())
34
>>> openeye_toolkit.get_formats()
[Format('openeye/smi'), Format('openeye/usm'),
Format('openeye/can'), Format('openeye/cxsmi'),
Format('openeye/sdf'), Format('openeye/sdf3k'),
Format('openeye/molfile'), Format('openeye/skc'),
Format('openeye/mol2'), Format('openeye/mol2h'),
Format('openeye/sln'), Format('openeye/mmod'),
Format('openeye/pdb'), Format('openeye/xyz'), Format('openeye/cdx'),
Format('openeye/mopac'), Format('openeye/mf'),
Format('openeye/oeb'), Format('openeye/inchi'),
Format('openeye/inchikey'), Format('openeye/oez'),
Format('openeye/cif'), Format('openeye/mmcif'),
Format('openeye/fasta'), Format('openeye/sequence'),
Format('openeye/csv'), Format('openeye/json'),
Format('openeye/smistring'), Format('openeye/canstring'),
Format('openeye/usmstring'), Format('openeye/cxsmistring'),
Format('openeye/slnstring'), Format('openeye/inchistring'),
Format('openeye/inchikeystring')]
>>>
>>> from chemfp import openbabel_toolkit
>>> len(openbabel_toolkit.get_formats())
200
>>> openbabel_toolkit.get_formats()
[Format('openbabel/smi'), Format('openbabel/can'),
Format('openbabel/usm'), Format('openbabel/cxsmi'),
Format('openbabel/smistring'), Format('openbabel/canstring'),
Format('openbabel/usmstring'), Format('openbabel/cxsmistring'),
Format('openbabel/sdf'), Format('openbabel/sdf3k'),
Format('openbabel/inchi'), Format('openbabel/inchikey'),
Format('openbabel/inchistring'),
Format('openbabel/inchikeystring'), Format('openbabel/CONTFF'),
... many formats omitted ...
Format('openbabel/molreport'), Format('openbabel/orca')]
>>>
>>> from chemfp import cdk_toolkit
>>> len(cdk_toolkit.get_formats())
23
>>> cdk_toolkit.get_formats()
[Format('cdk/smi'), Format('cdk/can'), Format('cdk/usm'),
Format('cdk/cxsmi'), Format('cdk/sdf'), Format('cdk/sdf3k'),
Format('cdk/smistring'), Format('cdk/canstring'),
Format('cdk/usmstring'), Format('cdk/cxsmistring'),
Format('cdk/molfile'), Format('cdk/inchi'), Format('cdk/inchikey'),
Format('cdk/inchistring'), Format('cdk/inchikeystring'),
Format('cdk/pdb'), Format('cdk/cdkjava'), Format('cdk/cml'),
Format('cdk/ctx'), Format('cdk/crystclust'), Format('cdk/gamess'),
Format('cdk/mol2'), Format('cdk/xyz')]
Some of these formats, like “can”, “inchistring”, and “sdf3k” are chemfp names for special implementations or non-standard format configurations, usually meant to have a portable way to describe a given configuration.
I’ll use chemfp.openeye_toolkit.get_format()
, which returns a
chemfp.base_toolkit.Format
, to get the “sdf” format for
OpenEye (if you don’t have access to OEChem, use one of the other
toolkits instead):
>>> import chemfp
>>> sdf_format = chemfp.openeye.get_format("sdf")
>>> sdf_format.name
'sdf'
>>> sdf_format.toolkit_name
'openeye'
The “sdf” format can be used for both input and output in all toolkits:
>>> sdf_format.is_input_format, sdf_format.is_output_format
(True, True)
However, some formats are output only, like the InChIKey format (assuming it’s available for your toolkit):
>>> inchi_fmt = chemfp.openeye.get_format("inchikey")
>>> inchi_fmt.is_input_format, inchi_fmt.is_output_format
(False, True)
On the other hand, some formats are input only, like Open Babel’s support for MOPAC’s output format:
>>> mopout_fmt = chemfp.openbabel.get_format("mopout")
>>> mopout_fmt.is_input_format, mopout_fmt.is_output_format
(True, False)
Instead of asking for all available formats, you can ask for only the
input formats, or only the output formats, using
chemfp.toolkit.get_input_formats
or
chemfp.toolkit.get_output_formats
:
>>> import chemfp
>>> for toolkit_name in ("openbabel", "openeye", "rdkit", "cdk"):
... T = chemfp.get_toolkit(toolkit_name)
... print(f"{toolkit_name} has {len(T.get_input_formats())} input formats")
... print(f"{toolkit_name} has {len(T.get_output_formats())} output formats")
...
openbabel has 157 input formats
openbabel has 145 output formats
openeye has 28 input formats
openeye has 33 output formats
rdkit has 20 input formats
rdkit has 21 output formats
cdk has 20 input formats
cdk has 21 output formats
Determine the format for a given filename¶
It’s sometimes useful to know what format will be used for a given filename. A filename can be used as a source for a reader or destination for a writer, and a toolkit might understand a given format when used as input but not as ouput, or vice-versa.
The function chemfp.toolkit.get_input_format_from_source()
returns a chemfp.base_toolkit.Format
for the given filename:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> T.get_input_format_from_source("abc.smi.gz")
Format('rdkit/smi.gz')
This is the same Format object you saw in the previous section. I
didn’t mention the compression
attribute in that discussion. It’s
“gz” for gzip-ed files, “zst” for zstandard compressed files, and the
empty string “” for uncompressed files.
>>> fmt = T.get_input_format_from_source("abc.smi.gz")
>>> fmt.name
'smi'
>>> fmt.compression
'gz'
>>>
>>> fmt = T.get_input_format_from_source("abc.smi")
>>> fmt.name
'smi'
>>> fmt.compression
''
Asking for a supported format which isn’t an input format raises a ValueError exception:
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/openbabel_toolkit.py", line 186, in get_input_format_from_source
return _format_registry.get_input_format_from_source(source, format)
File "chemfp/base_toolkit.py", line 850, in get_input_format_from_source
format_config = self.get_input_format_config(register_name)
File "chemfp/base_toolkit.py", line 773, in get_input_format_config
raise ValueError("%s does not support %r as an input format"
ValueError: Open Babel does not support 'inchikey' as an input format
even though “inchikey” is supported as an output format:
>>> openbabel_toolkit.get_output_format_from_destination("example.inchikey")
Format('openbabel/inchikey')
Yes, there’s a different function to get the format name for a source
filename than for a destination filename. Maybe in the future I’ll
support a generic get_format_from_filename()
; let me know if that
would be useful.
If you ask for a format which doesn’t exist then the functions raise a different ValueError exception:
>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
.....
File "chemfp/base_toolkit.py", line 763, in get_format_config
raise ValueError("%s does not support the %r format"
ValueError: Open Babel does not support the 'does-not-exist' format
I’ve found it useful to have a way to override the default guess. It’s amazing how many people use “.dat” for SMILES or SDF files, and “.txt” files for SMILES. The format lookup functions support a second, optional parameter, which is the format name to use.
>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist", "smi.gz")
Format('openbabel/smi.gz')
This exists so that code like:
if format is not None:
fmt = T.get_format(format)
else:
fmt = T.get_format_from_source(filename)
can be replaced with:
fmt = T.get_format_from_source(filename, format)
Working with a format object is useful when combined with format’s reader_args and writer_arg functions discussed in Specify a SMILES delimiter through reader_args
>>> fmt = openbabel_toolkit.get_input_format_from_source("input.smi")
>>> fmt.get_default_writer_args()
{'options': None, 'isomeric': True, 'canonicalization': 'default',
'explicit_hydrogens': False, 'cxsmiles': False, 'delimiter': None}
>>> fmt.get_writer_args_from_text_settings({
... "explicit_hydrogens": "true",
... "isomeric": "false",
... "delimiter": "tab"})
{'isomeric': False, 'explicit_hydrogens': True, 'delimiter': 'tab'}
Parse the id and the molecule at the same time¶
In this section you’ll learn how to parse a structure record, as a string, to extract both the identifier and the native molecule object.
Usually you want both the molecule and its id. You could parse the
molecule then use T.get_id(mol)
to get
the id, but that’s extra work, it leads to awkward looking code, and
is slower than having chemfp do the work for you when it parses the
molecule.
Instead, use chemfp.toolkit.parse_id_and_molecule()
:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> T.parse_id_and_molecule("C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a", "smi")
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x1035f14b0>)
If there is no id/title field then the id will either be None or the empty string, depending on the toolkit and format:
>>> T.parse_id_and_molecule("C", "smi")
(None, <rdkit.Chem.rdchem.Mol object at 0x1035f14b0>)
Instead of testing for the empty string or None, your code you should use “if not id:” to test for a missing id:
>>> id, mol = T.parse_id_and_molecule("C", "smi")
>>> if not id:
... print("Missing id!")
...
Missing id!
Specify alternate error behavior¶
In this section you’ll learn how to use the errors parameter to have
chemfp.toolkit.parse_molecule()
return None rather than raise an
exception, and to have it print a report about the failing molecule.
The string “Q” is not a valid SMILES string. All of the toolkits will fail to parse it, and the chemfp toolkit I/O adapter by default raises an exception when that happens:
>>> import chemfp
>>> chemfp.openbabel.parse_smistring("Q")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q'
>>>
>>> chemfp.rdkit.parse_molecule("Q", "smistring")
[16:22:55] SMILES Parse Error: syntax error while parsing: Q
[16:22:55] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'Q'
>>>
>>> chemfp.openeye.parse_molecule("Q", "smistring")
Warning: Problem parsing SMILES:
Warning: Q
Warning: ^
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "chemfp/io.py", line 136, in error
raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'Q'
>>>
>>> chemfp.cdk.parse_molecule("Q", "smistring")
Traceback (most recent call last):
File "SmilesParser.java", line 333, in org.openscience.cdk.smiles.SmilesParser.parseSmiles
Exception: Java Exception
...
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: CDK cannot parse the SMILES 'Q': unexpected character:
Q
^
On the other hand, “[NH8]” is a valid SMILES, but RDKit by default will reject it as chemically unreasonable, while OEChem and Open Babel are less strict and treat it as a molecular graph rather than a chemical molecule.
I’ll write a program which checks which toolkits will parse “[NH8]”
# I call this "check_NH8.py"
import chemfp
allowed = []; rejected = []
for name in chemfp.get_toolkit_names():
T = chemfp.get_toolkit(name)
try:
T.parse_molecule("[NH8]", "smistring")
except ValueError:
rejected.append(name)
else:
allowed.append(name)
print("Allowed:", allowed, "Rejected:", rejected)
% python check_NH8.py
[16:26:53] Explicit valence for atom # 0 N, 8, is greater than permitted
Allowed: ['openeye', 'cdk', 'openbabel'] Rejected: ['rdkit']
I think the try/except/else is sometimes harder to understand than
returning an error value, because it’s harder to see the control
flow. I can ask chemfp.toolkit.parse_molecule()
to ignore
errors, which causes it to return a None object rather than raise an
exception. turns the above loop into the following:
for name in chemfp.get_toolkit_names():
T = chemfp.get_toolkit(name)
mol = T.parse_molecule("[NH8]", "smistring", errors="ignore")
if mol is None:
rejected.append(name)
else:
allowed.append(name)
print("Allowed:", allowed, "Rejected:", rejected)
The errors option is more useful in later sections, when parsing multiple records.
The errors parameter can also take the value “report”. Like “ignore”, this will return a None when there is an error rather than raise an exception. It will also write a consistent, cross-toolkit error message to stderr, including the SMILES string that failed if the input is a SMILES:
>>> for name in chemfp.get_toolkit_names():
... print(f"Using toolkit {name!r}")
... T = chemfp.get_toolkit(name)
... mol = T.parse_molecule("Q", "smistring", errors="report")
... mol = T.parse_molecule("[NH8]", "smistring", errors="report")
...
The chemfp.toolkit.parse_id_and_molecule()
function also takes
the errors
parameter. If the structure could not be parsed then
the second component of the tuple (the molecule) will be None. The
first component (the id) may or or may not be None, depending on the
underlying implementation:
>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
[16:28:34] SMILES Parse Error: syntax error while parsing: Q
[16:28:34] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
(None, None)
>>>
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
Warning: Problem parsing SMILES:
Warning: Q q-ane
Warning: ^
(None, None)
>>>
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
==============================
*** Open Babel Error in ParseSimple
SMILES string contains a character 'Q' which is invalid
('q-ane', None)
>>>
>>> from chemfp import cdk_toolkit
>>> cdk_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
(None, None)
Future versions of chemfp may work to normalize this behavior, or let the caller choose a specific behavior.
Some of the output shown in the previous section are log messages from
the underlying toolkit sent to stderr from the C++ level. Each toolkit
has a way to change how logging works. Chemfp 4.2 added the
suppress_log_output
context manager to temporarily disable logging during the context.
Here’s an example which first shows the Open Babel log message, and then uses the context-manager to suppress the log message:
>>> from chemfp import openbabel_toolkit as T
>>> mol = T.parse_molecule("Q q-ane", "smi", errors="ignore")
==============================
*** Open Babel Error in ParseSimple
SMILES string contains a character 'Q' which is invalid
>>> if mol is None:
... print("Could not parse.")
...
Could not parse.
>>>
>>> # disable log messages
>>> with T.suppress_log_output():
... mol = T.parse_molecule("Q q-ane", "smi", errors="ignore")
... if mol is None:
... print("Could not parse.")
...
Could not parse.
Specify a SMILES delimiter through reader_args¶
In this section you’ll learn how to parse a SMILES record as a set of delimited fields instead of the default of a SMILES string followed by a title, and some of the limitations of chemfp’s attempt at a consistent cross-toolkit SMILES record parser.
You might think that the SMILES file format is well defined, but it sadly isn’t. Different toolkits have slightly different interpretations for a SMILES record format. Consider the SMILES record:
C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
The original Daylight definition is that a SMILES record is single line, which starts with the SMILES string. The SMILES string ends with the first whitespace character or the end of the line, and if there was a whitespace character than the rest of the line is the title. OpenEye follows this definition, as does chemfp. That’s why the previous example extracted “vitamin A” as the record id.
However, RDKit treats a SMILES file record as a space or tab separated set of fields, where the first field is the SMILES, the second field is the id/title and additional columns may store other properties. RDKit would use “vitamin” as the record id for this record. (RDKit can also be configured to interpret the first line as column names. Chemfp does not currently support this option, though I plan to have a cross-platform implementation in a future release.)
Chemfp normalizes the SMILES record parser API so that all toolkits by default expect the Daylight format.
Warning
Future versions of chemfp may change the default to “tab” instead of “to-eol” because CXSMILES is becoming more common.
Use the optional reader_args dictionary to specify an alternate interpretation:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
>>> T.parse_id_and_molecule(smiles, "smi", reader_args={"delimiter": "whitespace"})
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10f5ccfa0>)
In this case I asked it to parse the record as a set of whitespace delimited fields. If you have tab-separated fields, where a space inside of a field is not part of the delimiter, then use the “tab” delimiter:
>>> T.parse_id_and_molecule("O=O\tmolecular oxygen\t31.9988\n", "smi",
... reader_args={"delimiter": "tab"})
('molecular oxygen', <rdkit.Chem.rdchem.Mol object at 0x10fbe9590>)
The supported delimiters are:
to-eol - (default) everything past the first whitespace is interpreted as the id/title;
tab or “\t” - the fields are tab-separated; the first field is the SMILES and the second the id;
space or “ “ - the fields are space-separated;
whitespace - the fields are whitespace-separated;
native - use the native interpretation for the given toolkit;
While chemfp strives for cross-toolkit portability, it is not perfect. Leading and trailing whitespace might not be supported, so the first character of the SMILES record must also be the first character of the SMILES string. Also, the toolkit is free to interpret the first whitespace as the delimiter despite the reader_args setting. In practice, as of early 2020, Open Babel, RDKit, and OEChem will stop at the first whitespace, though I suspect they will increasingly support the CXSMILES extensions.
Neither the SMILES parser nor the other parsers validate the full contents of the reader_args dictionary. Extra items are ignored. This is deliberate because it lets you combine, say, SMILES and SDF parameters in the same dictionary without needing to check the specific format first.
To a lesser extent, it also makes it easier to specify parameters which work across multiple toolkit versions. For example, the most recent version of OEChem’s SMILES parsers added a quiet option, which chemfp will support in the future. Your code can have a {“quiet”: True} without first checking to see if this version of chemfp is new enough to support the parameter.
Warning
As a result, it’s very easy to specify a key with a typo, which is ignored, and not notice that it nothing happens.
Important
Really, I’ve been bitten by this a few times. Be extra cautious to check that you are using the right keys.
Specify an output SMILES delimiter through writer_args¶
In this section you’ll learn how to create a SMILES record with a tab
character separating the SMILES from the title using the writer_args
parameter of chemfp.toolkit.create_string()
.
By default create_string
uses a
space character to separate the SMILES from the rest of the id:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> mol = T.parse_molecule("O=O molecular oxygen\n", "smi")
>>> T.create_string(mol, "smi")
'O=O molecular oxygen\n'
To use a tab character instead, pass in a writer_args dictionary with a “delimiter” of “tab”:
>>> T.create_string(mol, "smi", writer_args={"delimiter": "tab"})
'O=O\tmolecular oxygen\n'
The writer_args delimiter also accepts “whitespace”, “space”, “to-eol” and the other values from reader_args. Only “tab” and “\t” will use a tab character as the delimiter; all of the the others will use a space character.
Warning
Future versions of chemfp may change the default to “tab” to better support the use of CXSMILES extensions.
Neither the SMILES writer nor the other writers validate the full contents of the writer_args dictionary. Extra items are ignored. This is deliberate because it lets you combine, say, SMILES and SDF parameters in the same dictionary without needing to check the specific format first. It also makes it easier to specify parameters which work across multiple toolkit versions.
Warning
As a result, it’s very easy to specify a key with a typo, which is ignored, and not notice that it nothing happens.
Important
Really, I’ve been bitten by this a few times. Be extra cautious to check that you are using the right keys.
RDKit-specific SMILES reader_args and writer_args¶
In this section you’ll learn how to pass toolkit-specific parameters to the RDKit toolkit functions to parse and create a SMILES string. You will need the RDKit toolkit.
Earlier I showed that RDKit by default does a sanitization check to verify that the input is correct.
>>> from chemfp import rdkit_toolkit
>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", errors="ignore")
[16:31:55] Explicit valence for atom # 0 N, 8, is greater than permitted
>>> mol is None
True
The underlying RDKit code to parse a SMILES string, MolFromSmiles, takes a sanitize parameter. The default, True, tells it to do the sanitization step, while False disables it.
Use the reader_args dictionary to pass the sanitize parameter to the underlying toolkit function:
>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", reader_args={"sanitize": False})
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x107590a60>
>>> from rdkit import Chem
>>> Chem.MolToSmiles(mol)
'[NH8]'
Use the writer_args dictionary to pass toolkit-specific parameters to RDKit’s MolToSmiles:
>>> mol = rdkit_toolkit.parse_molecule("c1ccccc1[16OH]", "smistring")
>>> rdkit_toolkit.create_smistring(mol, "smistring")
'[16OH]c1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
... writer_args={"isomericSmiles": False})
'Oc1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
... writer_args={"kekuleSmiles": True, "allBondsExplicit": True})
'[16OH]-C1:C:C:C:C:C:1'
See Get the default reader_args or writer_args for a format for a description of how
to get the default reader and writer arguments for a given format, and
use help(rdkit_toolkit.read_molecules
)
and help(rdkit_toolkit.open_molecule_writer
) to get a more human-readable description.
The helper functions like parse_smistring()
and
parse_sdf()
take keyword arguments instead of
reader_args, which is easier if know you will target that exact
toolkit and format:
>>> rdkit_toolkit.parse_smistring("[NH8]", sanitize=False)
<rdkit.Chem.rdchem.Mol object at 0x15b2b1d20>
OpenEye-specific SMILES reader_args and writer_args¶
In this section you’ll learn how to pass toolkit-specific parameters to the OEChem toolkit functions to parse and create a SMILES string. You will need the OEChem toolkit. See the next section for specific details about aromaticity.
By default the OEChem SMILES parser is tolerant of bad SMILES. I
believe it’s too tolerant, because will gladly parse what I think are
invalid SMILES, like “C-=C
”:
>>> from chemfp import openeye_toolkit as T
>>> mol = T.parse_smistring("C-=C")
>>> T.create_smistring(mol)
'C=C'
The developers at OpenEye recognize that pedantic folk like me exist. The OEChem SMILES parser has a “strict” mode, which I can enable in chemfp through the “flavor” parameter of the reader_args dictionary:
>>> mol = T.parse_molecule("C-=C", "smistring",
... reader_args={"flavor": "Strict"})
Warning: Unknown flavor (16385) for format (27) in set iflavor.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "chemfp/io.py", line 139, in error
raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'C-=C'
The “Warning: Unknown flavor” is an unfortunate difference between how OEChem parses a SMILES vs. a CXSMILES. Starting with chemfp 4.1, chemfp will interpret a SMILES string as CXSMILES unless the “cxsmiles” reader_arg is False. Doing that here gives a more informative error message:
>>> mol = T.parse_molecule("C-=C", "smistring",
... reader_args={"flavor": "Strict", "cxsmiles": False})
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
....
File "chemfp/io.py", line 139, in error
raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'C-=C'
Going back to error handling, the underlying OEParseSmiles() function takes the optional strict and canon parameters. Why does chemfp use the term “flavor”? Why the capitalization for “Strict”?
Historically the low-level OEChem functions took individual parameters, like the positional arguments canon and strict:
>>> mol = OEGraphMol()
>>> OEParseSmiles(mol, "C-=C", False, True)
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning: ^
False
(I wrote “historically” because more recent versions have format-specific options classes, like OEParseSmilesOptions for SMILES. These collect all of the configuration options into a single parameter, which is easier to pass around.)
On the other hand, the high-level molecule parsers take a single “flavor” integer value to specify the options for a given format. This flavor is usually expressed as the union of a set of bitmasks. I’ll show how OEChem’s Python API uses the flavor parameter.
The following OEChem code reads a SMILES file in the default non-strict mode (with no specified flavor):
% cat example.smi
C=-C bad
CCC good
% python
...
>>> from openeye.oechem import *
>>> ifs = oemolistream("example.smi")
>>> for mol in ifs.GetOEGraphMols():
... print(mol.GetTitle(), mol.NumAtoms())
...
bad 2
good 3
while the following sets the SMILES flavor to use “strict” mode:
>>> ifs = oemolistream("example.smi")
>>> ifs.SetFlavor(OEFormat_SMI, OEIFlavor_SMI_Strict)
True
>>> for mol in ifs.GetOEGraphMols():
... print(mol.GetTitle(), mol.NumAtoms())
...
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C=-C bad
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
good 3
(You can see some terminology differences between me and OpenEye in the warning message. The “Canonical” and “stereo” are only meaningful as a description of the output format, not the input format, and I use the traditional term “isomeric” while they highlight the more important stereochemistry aspect. I also got confused because I thought at first the “Canonical” had something to do with OEIFlavor_SMI_Canon.)
I decided to base chemfp’s openeye_toolkit
API on the high-level “flavor” API of
OEChem, which is better documented and requires less work on my part
to implement than low-level functions. But I also decided to extend it
to support a string value, and not just an integer.
To explain how that works, I’ll switch from describing reader_args to writer_args, because raising an exception with the “Strict” option gets boring, fast.
The OEChem SMILES output flavors are: OEOFlavor_SMI_AtomMaps
,
OEOFlavor_SMI_AtomStereo
, …. and you know what? The
OEOFlavor_SMI_
prefix is part of what makes the flavors hard to
use in Python, so I’ll omit the prefix in chemfp. The OEChem SMILES
output flavors are: AllBonds
, AtomMaps
, AtomStero
,
BondStereo
, Canonical
, ExtBonds
, Hydrogens
,
ImpHCount
, Isotopes
, Kekule
, RGroups
, SmiMask
, and
SuperAtoms
. There are also Default
and DEFAULT
which are
the bitwise union
RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical
.
In chemfp you can specify the fields as a “|” or “,” separated list of flavor flags, without the prefix. Here are several different ways to specify the default settings for isomeric canonical SMILES string output:
>>> mol = openeye_toolkit.parse_molecule("[16O][*:1]", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": ""})
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default"})
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical"})
'[R1][16O]'
These settings override any options which might be implied by the format name. Thus, even though “smistring” is supposed to generate an isomeric canonical SMILES, I can use the writer_args to remove the isomeric component from the flavor:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "RGroups|AtomStereo|BondStereo|AtomMaps|Canonical"})
'[R1][O]'
While I used “|” as the separator, I can equally use “,”, as in:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Isotopes,Canonical"})
'*[16O]'
OEChem uses the bar as a bitwise-or operator which merges the different flags. I added the comma as an alternative to the vertical bar because chemfp has additional syntax for removing options. The following removes the “RGroups” option from the isomeric and non-isomerical formats defaults, but otherwise leaves the defaults alone:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default,-RGroups"})
'[*:1][16O]'
>>>
>>> openeye_toolkit.create_string(mol, "canstring",
... writer_args={"flavor": "Default,-RGroups"})
'[*:1][O]'
(The terms are evaluated from left to right, so you can delete a term then add it back if you want.)
I added a comma because writing this as Default|-RGroups
caused
the C programmer mind in me to gasp in bewilderment. (“The bitwise-or
with the negative of the RGroups bitflags?!!”)
You don’t need to specify the OEChem flavor using a flavor string. You can also specify it as an integer:
>>> from openeye.oechem import *
>>> (OEOFlavor_SMI_Isotopes|OEOFlavor_SMI_AtomStereo|OEOFlavor_SMI_BondStereo|
... OEOFlavor_SMI_AtomMaps|OEOFlavor_SMI_Canonical)
121
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": 121})
'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": 0})
'[O]*'
or (and this might be a bit excessive) as a string-encoded integer:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "121"})
'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "0"})
'[O]*'
Chemfp tries to be helpful. It will include the list of available flavor names in the exception if it doesn’t understand what you gave it:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "chocolate"})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/openeye_toolkit.py", line 446, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
... lines removed ...,
File "chemfp/_openeye_toolkit.py", line 1174, in parse_flavor
raise err
ValueError: OEChem smi format does not support the 'chocolate'
flavor option. Available flavors are: AllBonds, AtomMaps,
AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens,
ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms
See Get the default reader_args or writer_args for a format for a description of how
to get the default reader and writer arguments for a given format, and
use help(openeye_toolkit.read_molecules
)
and help(openeye_toolkit.open_molecule_writer
) to get a more human-readable description.
OpenEye-specific aromaticity¶
In this section you’ll learn how chemfp handles OpenEye’s aromaticity parameter. You will need the OEChem toolkit, and you should read the previous section to understand some of the terminology.
Note: the OEGraphSim fingerprints are not affected by the aromaticity of the reader because the fingerprint generators ensure that the molecules are always perceived using “openeye” aromaticity before generating the fingerprint.
The OpenEye toolkit supports the “openeye”, “daylight”, “tripos”, “mdl”, and “mmff” aromaticity models. In the high-level API, which is meant for reading and writing files or file-like objects, the aromaticity is an aspect of the flavor integer. If unspecified, OEChem uses the appropriate default aromaticity model for that format. As a result, aromaticity perception is required for both reading and writing files.
The low-level API handles file processing and aromaticity perception as distinct steps. This API can also process a single record directly, while the high-level API requires wrapping the record in a file-like object and then reading the first molecule from it.
The chemfp toolkit API is a high-level API for both files and records, which means I had to implement record conversion routines on top of OEChem’s low-level API. Consequently, some of the details are different between the file I/O and record I/O APIs; the most significant being that the record I/O routines also support a “none” aromaticity.
The following shows the default aromaticity proceessing in action:
>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring")
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[True, True, True, True, True, True]
Automatic aromaticity perception is normally the right thing to do, because different toolkits and even different versions of the same toolkit may have different ideas of what is aromatic, and it’s best to ensure that they are consistently interpreted.
Aromaticity perception isn’t needed when you know that the input aromaticity is correct and unambiguous. My timings show that aromaticity perception takes about half of the time needed to parse a SMILES string. If the string comes from a good data source, like a database record where OEChem created the SMILES, then you can nearly double the performance by omitting the perception step.
What does “ambiguous” mean? Consider azulene, which can be described by the SMILES “c1ccc2cccc2cc1”. The fusion bond is not aromatic, while the peripheral bonds form a 10 pi electron system. In SMILES, an unspecified bond means “single or aromatic”. If one of the terminal atoms is aliphatic then the bond must be a single bond. But as the fusion bond in azulene shows, it’s possible for an unspecified bond with terminal aromatic atoms to still be non-aromatic. The above SMILES is ambiguous, and OEChem needs to do a full aromaticity analysis to determine that the fusion bond is not aromatic.
An unambiguous SMILES for azulene is “c1ccc-2cccc2cc1”, where the fusion bond is marked explicitly as a single bond. The SMILES parser can use the simpler rule that an unspecified ring bond is aromatic whenever both terminal atoms are aromatic, and not require the lengthy aromatic perception step to determine that. OEChem generates unambiguous SMILES, so if you know OEChem generated the SMILES then you can recover the original aromaticity directly.
(As a side note, Daylight first introduced this in 4.71, and used fluorene (“C1c2ccccc2-c3ccccc13”) as the prototypical case. Daylight’s rule is to include the “-” for a single bond between two aromatic atoms, while OEChem’s rule is to include the “-” for a single bond between two aromatic atoms and which is in a ring. Ring identification is much easier than aromaticity perception.)
So where was I … ah, right, specifing the aromaticity model. I decided to separate aromaticity from the rest of the flavor flags, and specify it with its own reader_args and writer_args field. It’s easiest to see using beneze in Kekule form:
>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring",
... reader_args={"aromaticity": "none"})
>>>
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[False, False, False, False, False, False]
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"aromaticity": "none"})
'C1=CC=CC=C1'
NOTE: the aromaticity flags are volatile. If you don’t specify the
“none” aromaticity model then chemfp.toolkit.create_string()
will reperceive aromaticity using the “openeye” aromaticity model and
possibly reassign the aromaticity flags.
>>> openeye_toolkit.create_string(mol, "smistring")
'c1ccccc1'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"aromaticity": "none"})
'c1ccccc1'
This is consistent with how OEChem’s high-level operations also modify the input molecule when creating output. I’m not fully happy with it. OEChem also has a “ConstMolecule” version, so this detail may change in the future.
Open Babel-specific SMILES reader_args and writer_args¶
In this section you’ll learn how to pass toolkit-specific parameters to the Open Babel toolkit functions to create a SMILES string. You will need the Open Babel toolkit.
As far as I can tell, Open Babel does not have configuration options to change the default SMILES parser, so chemfp has no toolkit-specific reader_args for that toolkit. Open Babel does have configuration options to change the default SMILES output routines. These can be set in chemfp with the writer_args dictionary.
Open Babel uses an options string to change the configuration. The
string “i U smilesonly
” generates non-isomeric SMILES output, where
the atom ordering is determined by the InChI’s canonicalization
algorithm (“Universal SMILES”), and where the identifier is excluded
from the SMILES output.
Did you know all of that? I didn’t. Some of these options are only documented in the code. It’s also difficult for chemfp to handle since some of the options conflict with how chemfp thinks of things. For example, chemfp is in charge of including the identifier, so it will always enable “smilesonly”, and it’s difficult for the “cansmiles” output, which is non-isomeric, to know if an options string wants to override the default”i” option that it requires.
I ended up making my own writer_args API to have more explicit control over the individual parameters:
explicit_hydrogens - boolean
isomeric - boolean
canonicalization - a string like “default”, “none”, “universal”, “anticanonical”, or “inchified”
options - the Open Babel options string (if you must use it; using it may break things if you are not very careful.)
Here’s an example of how to disable isomeric support for the “smistring” output, which would normally generate an isomeric SMILES:
>>> from chemfp import openbabel_toolkit
>>> mol = openbabel_toolkit.parse_smistring("[16O]=O")
>>> openbabel_toolkit.create_string(mol, "smistring")
'[16O]=O'
>>> openbabel_toolkit.create_string(mol, "smistring",
... writer_args={"isomeric": False})
'O=O'
I can also enable isomeric SMILES for the “canstring” format, which is normally non-isomeric:
>>> openbabel_toolkit.create_string(mol, "canstring")
'O=O'
>>> openbabel_toolkit.create_string(mol, "canstring",
... writer_args={"isomeric": True})
'[16O]=O'
Open Babel supports several different canonicalization algorithms. Perhaps the most unusual one is “anticanonical”, which uses random numbers for the atom ordering algorithm. The same molecule can generate different SMILES strings across multiple calls, so it’s the antithesis of “canonical”:
>>> for i in range(5):
... print(openbabel_toolkit.create_string(mol, "smistring",
... writer_args={"canonicalization": "anticanonical"}))
...
[16O]=O
[16O]=O
O=[16O]
[16O]=O
[16O]=O
As with the other toolkits, Open Babel’s format-specific helper
functions like parse_sdf
and create_smistring
use keyword arguments
instead of reader_args and writer_args. Here’s another way to do the
same anti-canonicalization example:
>>> for i in range(5):
... print(openbabel_toolkit.create_smistring(mol, canonicalization="anticanonical"))
...
O=[16O]
O=[16O]
[16O]=O
[16O]=O
O=[16O]
See Get the default reader_args or writer_args for a format for a description of how
to get the default reader and writer arguments for a given format, and
use help(openbabel_toolkit.read_molecules
)
and help(openbabel_toolkit.open_molecule_writer
) to get a more human-readable description.
CDK-specific SMILES reader_args and writer_args¶
In this section you’ll learn how to pass toolkit-specific parameters to the CDK toolkit functions to parse and create a SMILES string. You will need the CDK JAR file on your CLASSPATH and the JPype Java/Python adapter installed. (See the [installation guide for help]).
By default CDK will find a Kekule assignment of the input SMILES:
>>> from chemfp import cdk_toolkit
>>> cdk_toolkit.parse_molecule("oC", "smistring")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: CDK cannot parse the SMILES 'oC': a valid kekulé structure could not be assigned
This can be disabled with the kekulise=False
reader argument:
>>> mol = cdk_toolkit.parse_molecule("oC", "smistring", reader_args={"kekulise": False})
>>> [a.isAromatic() for a in mol.atoms()]
[True, False]
I’m not sure why you would need it, but it’s there. As with the other
toolkits, the format helper functions like parse_smistring
use keyword arguments instead of
taking a reader_args dictionary, so the following also works:
>>> mol = cdk_toolkit.parse_smistring("oC", kekulise=False)
The CDK writer_args are more interesting. The default SMILES writer generates the SMILES in Kekule form:
>>> mol = cdk_toolkit.parse_smistring("c1ccccc1[16OH]")
>>> cdk_toolkit.create_string(mol, "smistring")
'C1=CC=C(C=C1)[16OH]'
CDK uses a flavor
parameter similar to how OpenEye’s flavor system
works. I’ll use first specify the “Default” flavor, using both the
format-agnostic create_string
and
format-specific create_smistring
:
>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "Default"})
'C1=CC=C(C=C1)[16OH]'
>>> cdk_toolkit.create_smistring(mol, flavor="Default")
'C1=CC=C(C=C1)[16OH]'
then use “???
” to cause the flavor parser to fail and show a list
of possible options:
>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "???"})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
ValueError: CDK smistring format does not support the '???' flavor
option. Available flavors are: Absolute, AtomAtomMap,
AtomAtomMapRenumber, AtomicMass, AtomicMassStrict, Canonical,
Cx2dCoordinates, Cx3dCoordinates, CxAtomLabel, CxAtomValue,
CxCoordinates, CxFragmentGroup, CxMulticenter, CxPolymer, CxRadical,
CxSmiles, CxSmilesWithCoords, Default, InChILabelling, Isomeric,
Stereo, StereoCisTrans, StereoExCisTrans, StereoExTetrahedral,
StereoTetrahedral, Unique, UniversalSmiles, UseAromaticSymbols
CDK has an a lot of options! Let’s try a few. First, use aromatic symbols instead of Kekule form:
>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "Default,UseAromaticSymbols"})
'c1ccc(cc1)[16OH]'
>>> cdk_toolkit.create_smistring(mol, flavor="Default,UseAromaticSymbols")
'c1ccc(cc1)[16OH]'
Now, also disable isomeric SMILES:
>>> cdk_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default,UseAromaticSymbols,-Isomeric"})
'Oc1ccccc1'
>>> cdk_toolkit.create_smistring(mol, flavor="Default,UseAromaticSymbols,-Isomeric")
'Oc1ccccc1'
If you know the CDK toolkit then you should be able to figure out how to use these flavor flags.
Get the default reader_args or writer_args for a format¶
In this section you’ll learn how to get the default reader_args and writer_args for a given format.
As you’ve seen, each toolkit format can have its own reader_args and writer_args parameters, and chemfp layers its own format types (like “smistring”) on top of the native formats. It’s easy to forget the specific parameters for a given format, much less the default values.
The get_default_reader_args()
and
get_default_writer_args()
methods of the
Format
object return the respective default arguments:
>>> from chemfp import rdkit_toolkit
>>> fmt = rdkit_toolkit.get_format("smi")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'has_header': False, 'delimiter': None}
>>> fmt.get_default_writer_args()
{'isomericSmiles': True, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False, 'cxsmiles': False,
'delimiter': None}
You can sometimes use this information to see how chemfp maps its format types to the toolkit parameters. In RDKit, the difference between chemfp’s “smi” and “can” formats is that isomericSmiles is True for the first and False for the second:
>>> rdkit_toolkit.get_format("can").get_default_writer_args()
{'isomericSmiles': False, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False, 'cxsmiles': False,
'delimiter': None}
While writing this documentation I realized that the OEChem toolkit shows neither the default flavor nor the default aromaticity for a given format type. I will likely improve that in a future version of chemfp.
Convert text settings into reader and writer arguments¶
In this section you’ll learn how to convert text-based configuration settings into the appropriate reader_args or writer_args dictionary.
The reader_args and writer_args take native Python values,
including integers and booleans. In practice these will often be
defined in a configuration file, through command-line options, or as
CGI parameters. The Format
methods
get_reader_args_from_text_settings()
and
get_writer_args_from_text_settings()
convert a
text-based settings dictionary into the appropriate arguments
dictionary with native Python objects as values. (These are methods of
the Format object, because the parameter details are format-specific.)
The following shows an example using the RDKit toolkit’s “sdf” format to get reader_args from a dictionary of text settings:
>>> from chemfp import rdkit_toolkit
>>>
>>> sdf_format = rdkit_toolkit.get_format("sdf")
>>> sdf_format.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}
>>>
>>> sdf_format.get_reader_args_from_text_settings({
... "strictParsing": "true",
... "removeHs": "False",
... "sanitize": "0"})
{'sanitize': False, 'removeHs': False, 'strictParsing': True}
The boolean setting parser converts “true”, “True”, and “1” to Python’s True, and “false”, “False”, and “0” to Python’s False. Otherwise it raises a ValueError.
The following shows an equivalent example for RDKit’s SDF writer_args:
>>> sdf_format.get_default_writer_args()
{'includeStereo': False, 'kekulize': True, 'v3k': False}
>>> sdf_format.get_writer_args_from_text_settings({
... "kekulize": "false", "v3k": "true",
... "includeStereo": "True"})
{'includeStereo': True, 'kekulize': False, 'v3k': True}
Warning
These functions will ignore unknown keys. This was done to allow the text settings dictionary to contain settings for other toolkits and formats. As a result, typos are harder to detect, because they will be ignored.
See argparse text settings to reader and writer args for an example of converting text settings from the command-line into reader and writer arguments.
Multi-toolkit reader_args and writer_args¶
In this section you’ll learn how to configure reader_args and writer_args so the same dictionary can be used to configure multiple toolkits and formats.
Sometimes you don’t know which toolkit will be used for parsing, but you do know that you want Open Babel, OEChem, and RDKit to act in non-standard ways. For example, the choice of toolkit may depend on the user-defined fingerprint type, or simply (as in the following example) depend on user input.
The reader_args and writer_args will ignore unknown parameters, which lets you combine arguments for different toolkits into a single dictionary. As the toolkits use completely different parameter names (except a couple, like “delimiter”, which are supposed to act the same for all toolkits), there’s no conflict in the names for a given format.
The following defines a reader_args dictionary and a writer_args dictionary with parameters for each supported toolkit, then enters a loop. The loop asks the user for a SMILES string, or the name of the toolkit to use, or “q” to quit the loop. It will parse each SMILES into a molecule, then generate a SMILES output, although with decidedly strange parameters:
# Call this "strange_smi.py"
import chemfp
from chemfp import rdkit_toolkit as T # use your default toolkit of choice
#from chemfp import openeye_toolkit as T
#from chemfp import openbabel_toolkit as T
#from chemfp import cdk_toolkit as T
reader_args = {
"sanitize": False, # RDKit,
"openeye.*.flavor": "Default,Strict", # OEChem
"aromaticity": "none", # OEChem
}
writer_args = {
"kekuleSmiles": True, # RDKit
"canonicalization": "anticanonical", # Open Babel
"aromaticity": "daylight", # OEChem
"cdk.*.flavor": "Default,UseAromaticSymbols", # CDK
}
print("Using", T.name, "toolkit")
while 1:
query = input("SMILES, toolkit name, or 'q' to quit? ")
if not query or query == "q":
break
if query in ("rdkit", "openeye" ,"openbabel", "cdk"):
try:
T = chemfp.get_toolkit(query)
except ValueError:
print("Toolkit %r not available" % (query,))
print("Using", T.name, "toolkit")
continue
mol = T.parse_molecule(query, "smistring", reader_args=reader_args, errors="ignore")
if mol is None:
print("Toolkit", T.name, "could not parse query as SMILES")
continue
smiles = T.create_string(mol, "smistring", writer_args=writer_args, errors="ignore")
if not smiles:
print("Toolkit", T.name, "could not convert the molecule to SMILES")
continue
print(" -->", smiles)
I saved the above to a script and then ran it. It starts using RDKit, where I’ve set the reader’s “sanitize” to False so RDKit won’t perceive aromaticity on input, and set the writer’s “kekuleSmiles” to show explicit aromatic bond types:
% python strange_smi.py
Using rdkit toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> OC1=CC=CC=C1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> OC1:C:C:C:C:C:1
I then switch to the OpenEye toolkit, show that it is operating with “strict” added to the default reader flavor, and convert a couple of SMILES to canonical SMILES to show the output uses the Daylight aromaticity model instead of the default:
SMILES, toolkit name, or 'q' to quit? openeye
SMILES, toolkit name, or 'q' to quit? C==C
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C==C
Warning: ^
Toolkit openeye could not parse query as SMILES
Using openeye toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> c1ccc(cc1)O
I then switch to the Open Babel toolkit and show that it generates “anti-canonical” SMILES, where the spanning tree priority order for SMILES output is randomly assigned:
SMILES, toolkit name, or 'q' to quit? openbabel
Using openbabel toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> c1c(O)cccc1
SMILES, toolkit name, or 'q' to quit? q
Finally I switch to CDK and show that it generate aromatic SMILES instead of Kekule:
SMILES, toolkit name, or 'q' to quit? cdk
Using cdk toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> C1=CC=C(C=C1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> c1ccc(cc1)O
See argparse text settings to reader and writer args for an example of using multi-toolkit reader_args and writer_args.
Qualified reader and writer parameters names¶
In this section you’ll learn how to use qualified parameter names. These give fine-grained control over the configuration options for each toolkit and format.
The previous section pointed out that the four toolkits use different parameter names, so for a given format you can combine the toolkit-specific reader_args into one unified dictionary and writer_args into another unified dictionary. However, within a toolkit the same parameter name can be reused for different formats, with different meanings.
This best example is for the chemfp.openeye_toolkit
, where the
reader_args and writer_args for all formats support the “flavor”
and “aromaticity” parameters. The following shows examples where I
might use a different flavor for the SMILES and InChI outputs, to get
something other than the default representation:
>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>>
>>> openeye_toolkit.create_string(mol, "smistring")
'CC(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default|ImpHCount"})
'[CH3]C(=O)[O-]'
>>>
>>> openeye_toolkit.create_string(mol, "inchistring")
'InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1'
>>> openeye_toolkit.create_string(mol, "inchistring",
... writer_args={"flavor": "Default|FixedHLayer"})
'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'
Chemfp uses “qualified” parameter names to handle this situation. For example, the qualified name “smistring.flavor” is the flavor parameter for the smistring format:
>>> writer_args = {
... "smistring.flavor": "Default|ImpHCount",
... "inchistring.flavor": "Default|FixedHLayer",
... }
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring", writer_args=writer_args)
'[CH3]C(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "inchistring", writer_args=writer_args)
'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'
Note
There are six SMILES-related formats (“smi”, “can”, “usm”, “smistring”, “canstring”, and “usmstring”) so to be complete you’ll need to specify values for all of them. There are also two InChI-related formats (“inchi” and “inchistring”).
A “fully qualified” name looks like “openeye.smistring.flavor”. The first term is the toolkit, the second the format name, and the last the parameter name. At present there little need for fully qualified names because most parameter names are either unique to a toolkit and format type, or (like ‘delimiter’) supposed to be identical across all toolkits. The major exception is ‘flavor’, used by all of the OpenEye formats as well as the RDKit “fasta”, “sequence”, and “pdb” formats.
The following demonstration, which is more a parlor trick than something useful, shows how to have each toolkit use a different SMILES delimiter:
>>> import chemfp
>>>
>>> reader_args = {
... "rdkit.smi.delimiter": "tab",
... "openbabel.smi.delimiter": "whitespace",
... "openeye.smi.delimiter": "to-eol",
... "cdk.smi.delimiter": "space",
... }
>>>
>>> for toolkit_name in ("rdkit", "openbabel", "openeye", "cdk"):
... T = chemfp.get_toolkit(toolkit_name)
... id, mol = T.parse_id_and_molecule("C\tabc def\tghi", "smi",
... reader_args=reader_args)
... print(toolkit_name, "sees the id", repr(id))
...
rdkit sees the id 'abc def'
openbabel sees the id 'abc'
openeye sees the id 'abc def\tghi'
cdk sees the id 'def\tghi'
(As a reminder, the ‘delimiter’ implementation is not perfect. A toolkit may accept the first whitespace after the SMILES term as a valid delimiter even if it doesn’t match the actual parameter, and a toolkit may decide to stop parsing the SMILES term at the first whitespace.)
The final type of qualified parameter looks like “openeye.*.aromaticity”, where the first term is the toolkit name, the second term is “*”, and the third term is the parameter name. This is most useful if you want OEChem to enforce the same aromaticity across all formats, or have the RDKit parsers ignore sanitization, with configuration entries like:
{"openeye.*.aromaticity": "daylight",
"rdkit.*.sanitize": False}
However, as only OEChem supports “aromaticity” and only RDKit supports “sanitize”, you could also write this as simply:
{"aromaticity": "daylight",
"sanitize": False}
Qualified parameter priorities¶
In this section you’ll learn the priority order when multiple terms try to specify the same parameter.
In the previous section you learned how “delimiter”, “smi.delimiter”, “rdkit.*.delimiter” and “rdkit.smi.delimiter” can all be used to set the delimiter style for RDKit’s “smi” format. If more then one term is specified, which one wins?
Chemfp checks for the parameters in the following order:
rdkit.smi.delimiter
rdkit.*.delimiter
smi.delimiter
delimiter
The parameter with the highest ranking determines the setting, as the following shows:
>>> from chemfp import rdkit_toolkit as T
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"delimiter": "to-eol",
... "smi.delimiter": "whitespace"})
>>> id
'methane'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"rdkit.*.delimiter": "to-eol",
... "smi.delimiter": "whitespace"})
>>> id
'methane 16.04246'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"rdkit.*.delimiter": "to-eol",
... "rdkit.smi.delimiter": "whitespace"})
>>> id
'methane'
One way to remember it is the longest name has priority.
It can be confusing to have a large dictionary with multiple format
and toolkit qualifiers. The get_unqualified_reader_args()
and get_unqualified_writer_args()
methods of
Format
object will return the fully unqualified
reader_args and writer_args for that format:
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_reader_args({
... "delimiter": "to-eol",
... "smi.delimiter": "whitespace",
... })
{'sanitize': True, 'has_header': False, 'delimiter': 'whitespace'}
>>> fmt.get_unqualified_writer_args({
... "delimiter": "space",
... "smi.delimiter": "tab",
... })
{'isomericSmiles': True, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False,
'cxsmiles': False, 'delimiter': 'tab'}
This can also be helpful if you think you made a typo; get the unqualified reader_args and see if the result has the arguments you think it should have.
Qualified names and text settings¶
In this section you’ll learn how the qualified names also apply to text settings.
Earlier you learned that text settings are string-based keys and values, which might come from the command-line, a configuration file, or some other text-based source. These need to be converted into Python values before they can be used as reader_args or writer_args.
A Format
object can convert a dictionary of text settings
into the correct argument dictionary. To get a Format object, ask the
toolkit for the format of the given name:
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("sdf")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}
The section Convert text settings into reader and writer arguments showed how to convert the text settings with unqualified names into a reader_args dictionary:
>>> fmt.get_reader_args_from_text_settings({
... "strictParsing": "false",
... "removeHs": "false",
... })
{'removeHs': False, 'strictParsing': False}
The text settings dictionary also supports qualified parameter names, including handling the priority resolution described in Qualified parameter priorities:
>>> fmt.get_reader_args_from_text_settings({
... "strictParsing": "false",
... "sdf.strictParsing": "true",
... "removeHs": "false",
... "rdkit.*.removeHs": "true",
... "rdkit.sdf.sanitize": "false",
... })
{'sanitize': False, 'removeHs': True, 'strictParsing': True}
If you stare at it for a bit you’ll see that “sdf.strictParsing” has a higher priority than “strictParsing” and “rdkit.*.removeHs” is higher than “removeHs”, which is how it’s supposed to work.
Read molecules from an SD file or stdin¶
In this section you’ll learn how to read an SD file and iterate through its records as toolkit molecules. You will need Compound_099000001_099500000.sdf.gz from PubChem.
Time to get back to molecules! The
chemfp.toolkit.read_molecules()
function reads molecules from a
structure file:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
for mol in T.read_molecules("Compound_099000001_099500000.sdf.gz"):
print(T.create_string(mol, "smistring"))
By default it uses the filename extension to figure out the format and compression type. You can specify it yourself, if you wish, using the format option:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
for mol in T.read_molecules("Compound_099000001_099500000.sdf.gz",
format="sdf.gz"):
print(T.create_string(mol, "smistring"))
Examples of valid format values are “smi”, “can”, and “usm” (but not the *string variants like “smistring”, because those aren’t record-based formats), and “sdf”, as well as gzip-compressed versions like “smi.gz” and “sdf.gz”.
(For Open Babel the “.gz” extension does nothing as Open Babel will auto-detect and handle gzip compressed input. Chemfp’s RDKit interface also support zstandard-compressed files with the extension “.zst” if the Python package “zstandard” is installed.)
If the first parameter (the source parameter) is the Python None
value then the toolkit will read from stdin. As there’s no filename,
chemfp can’t look at the extension to figure out the format, so it
assumes the input is in “smi” format, that is, an uncompressed SMILES
file.
Therefore, to read an SD file from stdin you must specify the format. The following program reads a gzip compressed SD file from stdin, convert it to SMILES, and find the 10 most common characters used in the SMILES strings:
# This file is named 'count_smiles_characters.py'
from collections import Counter
from chemfp import rdkit_toolkit as T # use your toolkit of choice
symbol_counts = Counter()
for mol in T.read_molecules(None, "sdf.gz"):
smiles = T.create_string(mol, "smistring")
symbol_counts.update(smiles)
for symbol, count in symbol_counts.most_common(10):
print(f"{count:7d}: {symbol!r}")
Now to try it on a data set (I’ll omit the toolkit warnings):
% python count_smiles_characters.py < Compound_099000001_099500000.sdf.gz
117039: 'c'
96083: 'C'
51098: '('
51098: ')'
33562: '1'
29375: 'O'
22538: '='
20158: '2'
19293: '@'
18757: 'N'
Read ids and molecules from an SD file at the same time¶
In this section you’ll learn how to read an SD file and iterate through its records as the two-element tuple of (id, molecule). You will need the Compound_099000001_099500000.sdf.gz from PubChem, which was used in the previous section.
In an earlier section, Parse the id and the molecule at the same time, you learned how
to parse a structure record to get both the identifier and the
molecule at the same time. The toolkit function
chemfp.toolkit.read_ids_and_molecules()
is the equivalent for
reading from a structure file.
In the following example I’ll use the RDKit toolkit to create a tab-separated file with the id in the first column, the number of carbon atoms in the second, and the SMILES in the third. For brevity, I’ll display only the first 10 records, which also gives a nice example of when to use itertools.islice:
from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with rdkit_toolkit.read_ids_and_molecules(filename) as reader:
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
smiles = rdkit_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
(See the next section for a description of how the line with the
sum()
works.)
Here’s the output, and a spot check shows the carbon counts are correct:
99000039 21 O=C(CC[C@H]1NC(=O)c2ccccc2NC1=O)Nc1cccc2ncccc12
99000230 21 COc1ccc(S(=O)(=O)N2CCC(C(=O)N[C@H](C)C(=O)NCc3ccco3)CC2)cc1
99001517 15 O=C(NCCOCCO)OCC1[C@H]2CCC#CCC[C@@H]12
99002251 19 Cc1ccc(N/C=C(/C#N)C(=O)NC(=O)Cc2ccccc2)c(O)c1
99003537 23 CC(C)C[C@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
99003538 23 CC(C)C[C@@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
99005028 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@@H]1CCCC[C@@H]1C
99005031 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@H]1CCCC[C@@H]1C
99006292 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@H](C(=O)NC3CCCCC3)C2)c1
99006293 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@@H](C(=O)NC3CCCCC3)C2)c1
What’s fun is that RDKit and OEChem both implement mol.GetAtoms()
and atom.GetAtomicNum()
so it’s trivial to port the above from
RDKit to OEChem; replace rdkit_toolkit
with
openeye_toolkit
!
The Open Babel port isn’t quite as easy because Open Babel has a different way to get the atoms in a molecule. To make it easy to copy and paste, here’s the equivalent code for Open Babel:
from itertools import islice
from chemfp import openbabel_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with openbabel_toolkit.read_ids_and_molecules(filename) as reader:
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom_idx in range(mol.NumAtoms())
if mol.GetAtom(atom_idx+1).GetAtomicNum() == 6)
smiles = openbabel_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
Finally, here’s the implementation for CDK:
from itertools import islice
from chemfp import cdk_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with cdk_toolkit.read_ids_and_molecules(filename) as reader:
for id, mol in islice(reader, 0, 10):
num_carbons = sum(a.getAtomicNumber() == 6 for a in mol.atoms())
smiles = cdk_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
If you run this you’ll notice that CDK’s SDF reader keeps the
hydrogens as explicit atoms because the above generates SMILES strings
like C1([H])=C([H])C2=C(C([H])=C1....
. Chemfp 4.2 introduced the
“hydrogens” reader_arg with one of the following three values:
“make-explicit” - convert implicit hydrogens to explicit
“make-implicit” - convert explicit hydrogens to implicit
“make-nonchiral-implicit” - convert explicit non-chiral hydrogens to implicit
The following uses “make-implicit” to convert explicit hydrogens to implicit.
from itertools import islice
from chemfp import cdk_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with cdk_toolkit.read_ids_and_molecules(filename,
reader_args={"hydrogens": "make-implicit"}) as reader:
for id, mol in islice(reader, 0, 10):
num_carbons = sum(a.getAtomicNumber() == 6 for a in mol.atoms())
smiles = cdk_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
Read ids and molecules using an SD tag for the id¶
In this section you’ll learn how to use the id_tag to get the id from one of the SD tags, rather than from the record’s title. You will need the Compound_099000001_099500000.sdf.gz from PubChem, which was used in the previous section. I’ll also explain an idiom for how to count the number of records in an iterator.
Sometimes you would rather use a tag value as the id rather than the title line of the SDF record. This is critical for ChEBI data set and older ChEMBL data sets, which leave the title line (mostly) blank. In this case, use the id_tag to specify the tag to use.
The following example modifies the RDKit code from previous code to use PUBCHEM_IUPAC_SYSTEMATIC_NAME as the id, rather than the title line:
from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
reader = rdkit_toolkit.read_ids_and_molecules(filename,
id_tag="PUBCHEM_IUPAC_SYSTEMATIC_NAME")
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
smiles = rdkit_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
The output is:
3-[(3R)-2,5-bis(oxidanylidene)-3,4-dihydro-1H-1,4-benzodiazepin-3-yl]-N-quinolin-5-yl-propanamide 21 O=C(CC[C@H]1NC(=O)c2ccccc2NC1=O)Nc1cccc2ncccc12
N-[(2R)-1-(furan-2-ylmethylamino)-1-oxidanylidene-propan-2-yl]-1-(4-methoxyphenyl)sulfonyl-piperidine-4-carboxamide 21 COc1ccc(S(=O)(=O)N2CCC(C(=O)N[C@H](C)C(=O)NCc3ccco3)CC2)cc1
[(1S,8R)-9-bicyclo[6.1.0]non-4-ynyl]methyl N-[2-(2-hydroxyethyloxy)ethyl]carbamate 15 O=C(NCCOCCO)OCC1[C@H]2CCC#CCC[C@@H]12
(Z)-2-cyano-3-[(4-methyl-2-oxidanyl-phenyl)amino]-N-(2-phenylethanoyl)prop-2-enamide 19 Cc1ccc(N/C=C(/C#N)C(=O)NC(=O)Cc2ccccc2)c(O)c1
N-[(1S)-1-(1H-benzimidazol-2-yl)-3-methyl-butyl]-2-(1-methylindol-3-yl)ethanamide 23 CC(C)C[C@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
N-[(1R)-1-(1H-benzimidazol-2-yl)-3-methyl-butyl]-2-(1-methylindol-3-yl)ethanamide 23 CC(C)C[C@@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
[(2S)-1-[[(1R,2S)-2-methylcyclohexyl]amino]-1-oxidanylidene-propan-2-yl] (E)-3-phenylprop-2-enoate 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@@H]1CCCC[C@@H]1C
[(2S)-1-[[(1S,2S)-2-methylcyclohexyl]amino]-1-oxidanylidene-propan-2-yl] (E)-3-phenylprop-2-enoate 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@H]1CCCC[C@@H]1C
(3S)-N-cyclohexyl-1-(2,5-dimethylphenyl)sulfonyl-piperidine-3-carboxamide 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@H](C(=O)NC3CCCCC3)C2)c1
(3R)-N-cyclohexyl-1-(2,5-dimethylphenyl)sulfonyl-piperidine-3-carboxamide 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@@H](C(=O)NC3CCCCC3)C2)c1
You might have found the sum(1 for atom in ....)
a bit odd. I
agree with you. It is, however, the standard way in Python to count
the number of elements in the iterator which match a given
condition. I’ll break it down so you can understand how it works.
A list comprehension iterates through each element in an iterator (in the following it iterates over the characters in a string) and returns a list:
>>> [c for c in "Hello"]
['H', 'e', 'l', 'l', 'o']
Add an “if” to it to operate on only a subset of the characters:
>>> [c for c in "Hello" if c != "l"]
['H', 'e', 'o']
I could use len() of this to get the number of non-“l” characters, but that would require making a list only to throw it away. There’s another route to the same answer. To get there, use the value 1 for each character rather than the character itself:
>>> [1 for c in "Hello" if c != "l"]
[1, 1, 1]
Then use sum() to sum the values, which in this case is also the number of elements in the list:
>>> sum([1 for c in "Hello" if c != "l"])
3
Unlike len(), sum() only needs an iterator, not a list. I can replace the list comprehension with a generator comprehension, to get:
>>> sum(1 for c in "Hello" if c != "l")
3
Going back to the RDKit/OEChem expression:
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
I hope you can see how this counts the number of atoms in the molecule whose atomic number is 6. Or, if you want another way to think of it, the expression is the same as:
num_carbons = 0
for atom in mol.GetAtoms():
if atom.GetAtomicNum() == 6:
num_carbons += 1
Read from a string instead of a file¶
In this section you’ll learn how to read molecules from a string containing multiple SMILES records.
In the section Read molecules from an SD file or stdin you learned how to read molecules from a structure file or stdin. Sometimes the input structures come from a string. For example, if a web page has a form with a text box, where users can paste in a set of SMILES or SDF records and submit the form, then the web application on the server will likely receive those records as a single string.
When the records are in a string instead of a file, use
read_molecules_from_string()
. It’s very similar
to read_molecules()
, except that the first
parameter, content, is the string instead of the source filename,
and the second parameter, format, is required. (chemfp doesn’t try
to auto-detect the format from the content.)
The following reads the records from a string containing two simple SMILES records and prints the number of non-implicit atoms for each one. I’ve included implementations for all four toolkits; use the one(s) that are available to you:
content = ("C methane 16.04246\n"
"O=O water 31.9988\n")
from chemfp import rdkit_toolkit
for mol in rdkit_toolkit.read_molecules_from_string(content, "smi"):
print("RDKit:", mol.GetNumAtoms())
from chemfp import openeye_toolkit
for mol in openeye_toolkit.read_molecules_from_string(content, "smi"):
print("OEChem:", mol.NumAtoms())
from chemfp import openbabel_toolkit
for mol in openbabel_toolkit.read_molecules_from_string(content, "smi"):
print("Open Babel:", mol.NumAtoms())
from chemfp import cdk_toolkit
for mol in cdk_toolkit.read_molecules_from_string(content, "smi"):
print("CDK:", mol.getAtomCount())
When I run the above (on a computer where all four supported toolkits are installed), the above reports:
RDKit: 1
RDKit: 2
OEChem: 1
OEChem: 2
Open Babel: 1
Open Babel: 2
CDK: 1
CDK: 2
I would like to improve the output a bit to also include the record id
in the output. The toolkit function
read_ids_and_molecules_from_string()
is similar
to read_molecules_from_string()
except that it
iterates through the (id, toolkit molecule) tuple rather than just the
molecule:
>>> from chemfp import rdkit_toolkit
>>> content = ("C methane 16.04246\n"
... "O=O water 31.9988\n")
>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi"):
... print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane 16.04246' 1
RDKit: 'water 31.9988' 2
You can see that the default SMILES reader assumes the rest of the
line is the id. The file and string record readers take a
reader_args parameter just like
parse_id_and_molecule()
. I’ll specify the
“whitespace” delimiter so the parser uses only the second word as the
id:
>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi",
... reader_args={"delimiter": "whitespace"}):
... print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane' 1
RDKit: 'water' 2
See Specify a SMILES delimiter through reader_args for more details about setting the “delimiter” reader_args.
The string readers, like the file readers, also support the id_tag option to get the id from an SD tag instead of the title line. See Read ids and molecules using an SD tag for the id for more details about using the id_tag.
The reader may reuse molecule objects!¶
In this section you’ll learn that the OEChem and Open Babel toolkits reuse the same molecule object, which means you can’t save a molecule for later.
Suppose you want to read all of the molecules from a file into a list. It’s very tempting to write it as:
>>> import chemfp
>>> from chemfp import openeye_toolkit as T
>>> mols = list(T.read_molecules_from_string("C methane\nO water\n", "smi"))
This does not work for the openeye_toolkit or the openbabel_toolkit:
>>> mols
[<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10326ba40> >,
<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10326ba40> >]
>>> T.create_string(mols[0], "smistring")
''
>>> [T.create_string(mol, "smistring") for mol in mols]
['', '']
This is because the underlying reader for those two toolkits reuse the same molecule object. You can see that in the above, which returns the same OEGraphMol object (with id 0x10326ba40) for each record. The reason why OpenEye decided to reuse the object is to get better performance. Clearing the molecule object is faster than deleting it and reallocating a new one.
In addition, the OEChem reader code does a “clear molecule” followed by “read next record or stop”. At the end of the file there is no record, so the reader ends with a clear molecule. That explains why the OEGraphMol produces an empty SMILES string for the last couple of lines in the above code.
The only portable way to load a list of molecules is to use
copy_molecule()
, as in:
>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in T.read_molecules_from_string("C methane\nO water\n", "smi")]
>>> mols
[<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10328a810> >,
<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x100c78320> >]
>>> T.create_string(mols[0], "smistring")
'C'
>>> T.create_string(mols[1], "smistring")
'O'
I don’t really like this solution because the RDKit reader doesn’t need a copy, so the extra copy is pure overhead.
Future versions of chemfp will likely have a reader_arg to specify if it’s okay to reuse a molecule object or if a new one must be used each time.
Write molecules to a SMILES file¶
In this section you will learn how to write toolkit molecules into a structure file. You will need Compound_099000001_099500000.sdf.gz from PubChem.
Chemfp can write toolkit molecules to a file in a given format. I’ll start by making an RDKit molecule, though the same API works with Open Babel and OEChem:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("c1ccccc1O phenol", "smi")
Use open_molecule_writer()
to create a writer
object. By default it will look at the output filename extension to
figure out the format and compression type, and if that doesn’t work
it defaults to SMILES output:
>>> writer = T.open_molecule_writer("example.smi")
The fingerprint writer has several methods to write a molecule to the file. If you write a molecule by itself it will use the molecule’s own id (in this case, “phenol”):
>>> writer.write_molecule(mol)
Or, use write_id_and_molecule()
if you want to
specify an alternate id:
>>> writer.write_id_and_molecule("something else", mol)
Warning
The toolkit implementation may temporarily change the toolkit molecule’s own identifier in order to get the correct output. You should not alter the molecule’s id in another thread while calling this function.
Let’s see if it worked, by closing the writer (otherwise some of the output may be in an internal buffer) and reading the file:
>>> writer.close()
>>> print(open("example.smi").read())
Oc1ccccc1 phenol
Oc1ccccc1 something else
The write_molecules()
method is optimized
for passing in a list or iterator of molecule objects, and
write_ids_and_molecules()
is the equivalent
if you have (id, molecule) pairs. For example, the following converts
an SD file into a compressed SMILES file:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
reader = T.read_molecules("Compound_099000001_099500000.sdf.gz")
writer = T.open_molecule_writer("example.smi.gz")
writer.write_molecules(reader)
# These are optional, but recommended. Even better would be
# to use the context manager described in the next section.
writer.close()
reader.close()
If you have a list (or iterator) of molecules, then use the
write_molecules()
method.
The open function also supports the format parameter, so you can specify “smi” or “sdf.gz” some other combination of structure format and compression type:
writer = T.open_molecule("wrong_extension.smi", format="sdf.gz")
If the zstandard
package is available then use the .zst
suffiz
for ZStandard compression.
Reader and writer context managers¶
In this section you’ll learn how to use chemfp’s readers and writers to close the file, rather than depend on Python’s garbage collector or manual “close()”. You will need Compound_099000001_099500000.sdf.gz from PubChem.
In the previous section, Write molecules to a SMILES file, you learned how to convert an SD file into a SMILES file. At the end was a small program with optional “close()” statements. These are optional because Python’s garbage collector and chemfp work together. When a chemfp reader or writer is no longer needed, the garbage collector asks chemfp to clean up, and chemfp closes the native toolkit’s file object.
This is fine for a simple script or function, but sometimes you want
more control over when the file is closed. You can call the writer’s
close()
method yourself, but it’s really easy to
forget to do that.
Python supports “context managers”, which carry out certain actions when a block of code finishes. See PEP 343 if you want the full details. For chemfp you only need to know that the reader and writer context managers will always close the file at the end of the block.
A normal Python file context manager works like this:
>>> with open("example.txt", "w") as outfile:
... outfile.write("I am here.\n")
...
>>> print(repr(open("example.txt").read()))
'I am here.\n'
If instead I use one file object to write the data and another to read the file, without a flush() or close() by the writer, then there’s a syncronization problem:
>>> outfile = open("example.txt", "w")
>>> outfile.write("I am here.\n")
>>> print(repr(open("example.txt").read()))
''
Why does this print the empty string? The output text is still in an internal buffer, which isn’t written to the disk until the close call:
>>> outfile.close()
>>> print(repr(open("example.txt").read())
'I am here.\n'
The same problem occurs with molecule output:
>>> from chemfp import rdkit_toolkit as T # can also use openbabel_toolkit
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> writer = T.open_molecule_writer("example.smi")
>>> writer.write_molecule(mol)
>>> open("example.smi").read()
''
>>> writer.close()
>>> open("example.smi").read()
'C=O carbon monoxide\n'
Note: this problem does not occur with the openeye_toolkit. Most likely that toolkit always flushes its output buffers after each molecule.
The chemfp readers and writers support a context manager, so you can use the same solution you would for regular files:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> with T.open_molecule_writer("example.smi") as writer:
... writer.write_molecule(mol)
...
>>> open("example.smi").read()
'C=O carbon monoxide\n'
With the context manager concept firmly in mind, the following is the way I prefer to write the conversion script from the previous section:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
with T.read_molecules("Compound_099000001_099500000.sdf.gz") as reader:
with T.open_molecule_writer("example.smi.gz") as writer:
writer.write_molecules(reader)
That said, if you really want to depend on the garbage collector, you can also write it with one (or two) fewer lines:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
T.open_molecule_writer("example.smi.gz").write_molecules(
T.read_molecules("Compound_099000001_099500000.sdf.gz"))
Write molecules to stdout in a specified format¶
In this section you’ll learn how to specify the structure writer’s output format, and to write to stdout instead of to a file.
The toolkit API function open_molecule_writer()
supports a format parameter, in case you don’t want chemfp to
determine the output format and compression based on the filename
extension.
For example, if the destination is None (instead of a filename) then chemfp will write the output to stdout. Since Python’s None object doesn’t have an extension, it will write the molecules as uncompressed SMILES. If you want to write to stdout in SDF format you will have to specify the output format, like the following:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("O=O molecular oxygen", "smi")
>>> with T.open_molecule_writer(None, "sdf") as writer:
... writer.write_molecule(mol)
...
molecular oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
$$$$
>>> with T.open_molecule_writer(None, "inchikey") as writer:
... writer.write_molecule(mol)
...
MYMOFIZGZYHOMD-UHFFFAOYSA-N molecular oxygen
Write molecules to a string (and a bit of InChI)¶
In this section you’ll learn how to write toolkit molecules into memory, and when finished to get the result as a string.
The previous sections showed examples of writing molecules to a file
or to stdout. Sometimes you want to save the records as a string;
perhaps to send a response for a web request or display the contents
in a text pane of a GUI. The toolkit function
open_molecule_writer_to_string()
creates a
MoleculeStringWriter
which stores the output records into
memory. Once the writer is closed, the memory contents can be
retrieved as a string with MoleculeStringWriter.getvalue()
.
For a bit of variation, the following example uses the “inchi” output
format, and the openbabel_toolkit
:
>>> from chemfp import openbabel_toolkit as T # use your toolkit of choice
>>> alanine = T.parse_molecule("O=C(O)[C@@H](N)C alanine", "smi")
>>> glycine = T.parse_molecule("C(C(=O)O)N glycine", "smi")
>>> writer = T.open_molecule_writer_to_string("inchi")
>>> writer.write_molecules([alanine, glycine])
>>> writer.close()
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 alanine
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5) glycine
You should know that there’s no well-defined “inchi” file format, only an InChI string. I decided to follow Open Babel’s lead and say that the “inchi” format has one record per line, where each line contains the InChI string followed by a delimiter, followed by the id (if available) on the rest of the line.
The InChI output writer_args supports an “include_id” parameter. The default, True, includes the id, while the following example sets it to False to have only the InChI string on the line:
>>> with T.open_molecule_writer_to_string("inchi",
... writer_args={"include_id": False}) as writer:
... writer.write_molecule(alanine)
... writer.write_molecule(glycine)
...
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)
I also used the context manager so the code would be a bit shorter
and, I think, clearer. It’s up to you to decide if
write_molecules()
with a
2-element list is clear than two write_molecule()
lines.
Handling errors when reading molecules from a string¶
In this section you’ll learn how to ignore errors and improve error reporting when reading from a string, rather then accept the default of raising an exception and stopping. The examples will use a string containing SMILES records, but the same principles apply to any format.
If you’ve used the chemfp readers on real-world data sets you might have noticed that the RDKit and Open Babel ones sometimes raise an exception, saying that a given record could not be parsed. I’ll demonstrate with a string containing four SMILES records:
>>> content = ("C methane\n" +
... "CN(C)(C)(C)C pentavalent nitrogen\n" +
... "Q Q-ane\n" +
... "[U] uranium\n")
>>>
RDKit doesn’t like the pentavalent nitrogen, and chemfp’s rdkit_toolkit stops processing at that record:
>>> from chemfp import rdkit_toolkit
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
[15:43:32] Explicit valence for atom # 1 N, 5, is greater than permitted
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/_rdkit_toolkit.py", line 324, in _iter_read_smiles_ids_and_molecules
error_handler.error(
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'CN(C)(C)(C)C',
file '<string>', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'
Open Babel doesn’t care about the too-high valence on the nitrogen, but doesn’t like the non-SMILES in the third record:
>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
==============================
*** Open Babel Error in ParseSimple
SMILES string contains a character 'Q' which is invalid
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/_openbabel_toolkit.py", line 879, in _iter_column_records
error_handler.error("Open Babel cannot parse the %s %s"
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q',
file '<string>', line 3, record #3: first line is 'Q Q-ane'
Neither does CDK:
>>> from chemfp import cdk_toolkit
>>> with cdk_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Error while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException: could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/_cdk_toolkit.py", line 617, in _iter_read_smiles_ids_and_molecules_cdk
error_handler.error("CDK cannot parse a SMILES record", location)
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: CDK cannot parse a SMILES record, file '<string>', line 3, record #3: first line is 'Q Q-ane'
To round things out, OEChem accepts pentavalent nitrogen and skips the bad SMILES at a lower level than what chemfp uses, so there’s no exception:
>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
Warning: Problem parsing SMILES:
Warning: Q Q-ane
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
uranium
I’ll emphasize that point. The openeye_toolkit uses OEChem’s high-level reader, which provides no information about if OEChem skipped a record with a failure. Chemfp therefore cannot provide more information about the failures, whether as an exception or an improved error message.
I’m certain that nearly everyone wants the reader to ignore the few records that can’t be parsed by the underlying toolkit. The readers and writers support the errors option. The default value of “strict” tells chemfp to raise an exception when it detects a parse failure, and “ignore” tells it to ignore the error and go on to the next record:
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="ignore") as reader:
... for id, mol in reader:
... print(id)
...
methane
[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted
[16:13:45] SMILES Parse Error: syntax error for input: 'Q'
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="ignore") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
uranium
>>> with cdk_toolkit.read_ids_and_molecules_from_string
... content, "smi", errors="ignore") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Error
while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException:
could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
uranium
The “strict” default comes from my long-held belief that it’s better to be strict first, and detect problems early, than to let them intrude. My resolve is weakening, because it’s been rare to find that I can make use of that information. The biggest counter-example is when I specify one format but the file is actually in another format, in which case the reader skips a lot of garbage. For example, a SMILES reader, pointed to a SD file or a compressed SMILES file, will try hard to make sense of the data and end up ignoring almost everything. I haven’t decided if I will change the default policy.
I’ve also found that the toolkits aren’t that helpful at identifying which record failed. Take a look at the RDKit warning:
[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted
It says that I did this in the late afternoon, and the reason for the failure, but says very little about the record with the problem.
To help improve this, and to send still more garbage, err, I mean helpful messages to stderr, chemfp supports a “report” errors value. It’s the same as “ignore” except that it also displays more details about the failure location:
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="report") as reader:
... for id, mol in reader:
... print(id)
...
methane
[16:14:52] Explicit valence for atom # 1 N, 5, is greater than permitted
ERROR: RDKit cannot parse the SMILES 'CN(C)(C)(C)C', file '<string>', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'. Skipping.
[16:14:52] SMILES Parse Error: syntax error for input: 'Q'
ERROR: RDKit cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="report") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
ERROR: Open Babel cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
>>> with cdk_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="report") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR:
Error while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException:
could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
ERROR: CDK cannot parse a SMILES record, file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
The quality of the error message depends on the toolkit and the format. The best messages are for the Open Babel and RDKit SMILES readers and InChI readers, because I decided to have chemfp identify the records for those formats itself, instead of using the underlying toolkits to read the file. Chemfp still uses the underlying toolkit to convert the individual record into a native toolkit molecule.
I did this because I found the the SMILES and InChI reader performance was the same, and by writing my own parsers I had the ability to report line numbers and improve the error messages.
The examples so far used the read_ids_and_molecules_from_string
function. The
read_molecules_from_string
function also supports the
errors option, with the same meaning.
>>> sizes = []
>>> with openbabel_toolkit.read_molecules_from_string(
... content, "smi", errors="report") as reader:
... for mol in reader:
... sizes.append(mol.NumAtoms())
...
ERROR: Open Babel cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
>>> sizes
[1, 6, 1]
Handling errors when reading molecules from a file¶
In this section you’ll learn how to how to ignore errors and improve error reporting when reading from SD file, rather then accept the default of raising an exception and stopping. The examples will use an SD file, but the same principles apply to any format.
In the previous section you learned that when the readers encounter a
error, the default behavior is to raise a Python exception and how to
use the error
parameter to ignore those errors or to provide a
more detailed error report.
The file-based readers, chemfp.toolkit.read_molecules()
and
chemfp.toolkit.read_ids_and_molecules()
, can be configured the
same way, that is:
# When there is an error, raise an exception and stop (this is the default)
T.read_molecules(filename)
T.read_molecules(filename, errors="strict")
T.read_ids_and_molecules(filename)
T.read_ids_and_molecules(filename, errors="strict")
# When there is an error, go on to the next record
T.read_molecules(filename, errors="ignore")
T.read_ids_and_molecules(filename, errors="ignore")
# When there is an error, print an error message to stderr then
# go on to the next record
T.read_molecules(filename, errors="report")
T.read_ids_and_molecules(filename, errors="report")
To show it in action, I’ll construct an SD file with three records. The first will contain a trivalent oxygen, the second a corrupt record, and the third will be atomic nitrogen. I’ll use OEChem to help me make the file.
from chemfp import openeye_toolkit as T
mol1 = T.parse_molecule("O#C trivalent", "smi") # RDKit won't like this
mol2 = T.parse_molecule("[U] Q-record", "smi") # I'll corrupt this record
mol3 = T.parse_molecule("[N] nitrogen", "smi") # This one is fine
with T.open_molecule_writer_to_string("sdf") as writer:
writer.write_molecules([mol1, mol2, mol3])
content = writer.getvalue()
# replace the "U" with the nonsense "Qq"
content = content.replace("U ", "Qq")
# Save
open("bad_data.sdf", "w").write(content)
Here’s what the output file bad_data.sdf
it looks like, so you can
copy&paste if you wish:
trivalent
-OEChem-04251716112D
2 1 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0 0 0 0
M END
$$$$
Q-record
-OEChem-04251716112D
1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 Qq 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
nitrogen
-OEChem-04251716112D
1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 N 0 0 0 0 0 15 0 0 0 0 0 0
M END
$$$$
I’ll try to read that file using the native RDKit reader, which skips records it can’t parse:
>>> from rdkit import Chem
>>> reader = Chem.ForwardSDMolSupplier("bad_data.sdf")
>>> ids = [mol.GetProp("_Name") for mol in reader if mol is not None]
[15:51:50] Explicit valence for atom # 0 O, 3, is greater than permitted
[15:51:50] ERROR: Could not sanitize molecule ending on line 8
[15:51:50] ERROR: Explicit valence for atom # 0 O, 3, is greater than permitted
[15:51:50]
****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 93 in file /Users/runner/work/rdkit-pypi/rdkit-pypi/build/temp.macosx-11.0-arm64-cpython-312/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[15:51:50] ERROR: Element 'Qq' not found
[15:51:50] ERROR: moving to the beginning of the next molecule
>>> ids
['nitrogen']
As expected, RDKit could only extract one record of the three. It helpfully points out the line number of a record it couldn’t parse (“ending on line 8”).
Now I’ll do the same using chemfp’s rdkit_toolkit
interface
and the default error handler, which is strict
:
>>> from chemfp import rdkit_toolkit
>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("bad_data.sdf"):
... ids.append(id)
...
[15:54:39] Explicit valence for atom # 0 O, 3, is greater than permitted
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/_rdkit_toolkit.py", line 1367, in _iter_read_sdf_structures
error_handler.error("Could not parse molecule block", location)
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'
It stops at the first error and raise an exception. The exception contains some information about the error location, including the filename, line number, record number, and the contents of the first line of the file.
How does chemfp get that information? Under the covers chemfp uses its
own parser, from the text_toolkit
to read each record, then
passes that record to RDKit to turn the record into a molecule. This
gives chemfp a bit more control over error reporting. Originally this
was also faster than using RDKit’s own ForwardSDMolSupplier
, but
now chemfp is slightly slower. A future implementation may offer a
run-time choice of which implementation to use, in case you want
better performance at the expense of less detailed error information.
Pass in either “ignore” or “report” as the errors
option if you
want chemfp to skip records with an error keep on processing. I’ll use
“report” to show what the error reporting looks like:
>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules(
... "bad_data.sdf", errors="report"):
... ids.append(id)
...
[15:56:49] Explicit valence for atom # 0 O, 3, is greater than permitted
Error: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'. Skipping.
[15:56:49]
****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 93 in file /Users/runner/work/rdkit-pypi/rdkit-pypi/build/temp.macosx-11.0-arm64-cpython-312/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[15:56:49] Element 'Qq' not found
Error: Could not parse molecule block, file 'bad_data.sdf', line 10, record #2: first line is 'Q-record'. Skipping.
>>> ids
['nitrogen']
RDKit’s own error messages from ForwardSDMolSupplier, like “moving to the begining of the next molecule”, have disappeared, because chemfp handles record extraction. The sanitization error message about explicit valence remains because RDKit still does that work.
That was RDKit. What about Open Babel?
>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules(
... "bad_data.sdf", "sdf", errors="strict") as reader:
... for id, mol in reader:
... print("Read", repr(id), "first atom:", mol.GetAtom(1).GetAtomicNum())
...
Read 'trivalent' first atom: 8
Read 'Q-record' first atom: 0
Read 'nitrogen' first atom: 7
Open Babel reads all three records even in strict mode. Interestingly, Open Babel turns the ‘Qq’ atom into a “*” atom, with atomic number 0. To double check, I’ll read the list of molecules, then write them all out as SMILES:
>>> with openbabel_toolkit.read_molecules("bad_data.sdf") as reader:
... mols = [openbabel_toolkit.copy_molecule(mol) for mol in reader]
...
>>> len(mols)
3
>>> with openbabel_toolkit.open_molecule_writer(None, "smi") as writer:
... writer.write_molecules(mols)
...
C#[O] trivalent
* Q-record
[N] nitrogen
OEChem also parses that “Qq” record as an atom with atomic number of 0, and it also doesn’t give me a warning message:
>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules(
... "bad_data.sdf", errors="strict") as reader:
... for id, mol in reader:
... print("Read", repr(id), [a.GetAtomicNum() for a in mol.GetAtoms()])
...
Read 'trivalent' [8, 6]
Read 'Q-record' [0]
Read 'nitrogen' [7]
I totally didn’t expect the toolkits to parse an unknown atom type like “Qq”!
In any case, OEChem will skip records which it could not parse, and there’s no easy way for chemfp to get that information, so in practice the “strict” and “report” options are meaningless.
Finally, what about CDK?
>>> from chemfp import cdk_toolkit
>>> with cdk_toolkit.read_ids_and_molecules(
... "bad_data.sdf", "sdf", errors="strict") as reader:
... for id, mol in reader:
... print("Read", repr(id), "first atom:", list(mol.atoms())[0].getAtomicNumber())
...
Read 'trivalent' first atom: 8
Read 'Q-record' first atom: 0
Read 'nitrogen' first atom: 7
It accepts the “Qq” atom type! That’s because CDK default to a
RELAXED
parser mode. I’ll change that reader_args to
STRICT
:
>>> with cdk_toolkit.read_ids_and_molecules(
... "bad_data.sdf", "sdf", errors="strict", reader_args={"mode": "STRICT"}) as reader:
... for id, mol in reader:
... print("Read", repr(id), "first atom:", list(mol.atoms())[0].getAtomicNumber())
...
Read 'trivalent' first atom: 8
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 5: 0.0000 0.0000 0.0000 Qq 0 0 0 0 0 0 0 0 0 0 0 0 -> invalid symbol: Qq
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading next molecule: invalid symbol: Qq
Read 'nitrogen' first atom: 7
Ignore errors in create_string() and create_bytes()¶
In this section you’ll learn how to ignore errors when converting a molecule into a string or byte record.
Some molecules cannot be represented in some formats. The easiest example is the molecule from the SMILES “*”, which contains a single atom with the atomic number 0 and cannot be represented in InChI:
>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> T.create_string(mol, "smistring")
'[*]'
>>> T.create_string(mol, "inchistring")
[16:04:21] ERROR: Unsupported in this mode element '*'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/rdkit_toolkit.py", line 551, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "chemfp/base_toolkit.py", line 1465, in create_string
return self._create_string_impl(format_config, mol, id, writer_args, error_handler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "chemfp/base_toolkit.py", line 1468, in _create_string_impl
return format_config.create_string(mol, id, writer_args, error_handler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "chemfp/_rdkit_toolkit.py", line 1765, in create_string
error_handler.error("RDKit cannot create the InChI string")
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot create the InChI string
By default the chemfp.toolkit.create_string()
and
chemfp.toolkit.create_bytes()
functions will raise an
exception if the molecule cannot be converted into the given record
format. Use the errors parameter to specify that behavior. Just like
with file reading, the default value is “strict”, “ignore” will return
None if there was an error, and “report” will return None and also
print some information about the failure to stderr.
The following uses “ignore”:
>>> import chemfp
>>> for toolkit in ("openbabel", "rdkit", "openeye", "cdk"):
... T = chemfp.get_toolkit(toolkit)
... mol = T.parse_molecule("*", "smistring")
... result = T.create_string(mol, "inchistring", errors="ignore")
... print(f"{toolkit} returned {result!r}")
...
==============================
*** Open Babel Warning in InChI code
#1 :Unknown element(s): *
==============================
*** Open Babel Error in InChI code
InChI generation failed
openbabel returned None
[16:07:59] ERROR: Unsupported in this mode element '*'
rdkit returned None
openeye returned None
cdk returned None
The following uses “report”. You can see the only addition is the new
line ‘Error: Open Babel cannot create the InChI string. Skipping.’
For a bit of variation, I also changed things to use
create_bytes
instead of
create_string
:
>>> from chemfp import openbabel_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> result = T.create_bytes(mol, "inchistring", errors="report")
==============================
*** Open Babel Warning in InChI code
#0 :Unknown element(s): *
==============================
*** Open Babel Error in InChI code
InChI generation failed
Error: Open Babel cannot create the InChI string. Skipping.
>>> result is None
True
I’ll do the same using the CDK toolkit:
>>> from chemfp import cdk_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> result = T.create_bytes(mol, "inchistring", errors="report")
Error: CDK cannot create the InChI string. Skipping.
Ignore errors when writing molecules¶
In this section you’ll learn how to ignore errors and improve error reporting when writing a file, rather than accept the default of raising an exception and stopping. You will need a copy of ChEBI_lite.sdf.gz.
It’s not unusal for there to be a few input records which cannot be
parsed into a molecule. It’s much less common to come across a
molecule which cannot be turned into a record. The SMILES and SD file
formats are able to handle a wide range of chemistry. Even R-groups,
which can’t directly be expressed as SMILES, can be represented in one
of several conventions, like [*:1]
for R1.
There are no such conventions for InChI. As you saw in the previous section, it’s easy to make a molecule to InChI converter fail if the structure contains a “*” atom.
The toolkit functions open_molecule_writer()
,
open_molecule_writer_to_string()
, and
open_molecule_writer_to_bytes()
return a molecule
writer. This can be used to write a single molecule at a time, or to
write molecule multiples from an iterator.
What happens if I try to convert the ChEBI file into an InChI file using OEChem?
>>> from chemfp import openeye_toolkit as T
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> writer = T.open_molecule_writer("chebi.inchi")
>>> writer.write_molecules(reader)
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/base_toolkit.py", line 295, in write_molecules
raise err[0].with_traceback(err[1])
File "chemfp/_openeye_toolkit.py", line 703, in _gen_write_inchi_structures
error_handler.error(errmsg, location)
File "chemfp/io.py", line 146, in error
raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot create the InChI string, file 'chebi.inchi', record #3
The third record could not be converted to an InChI string, and the warning message that OEChem printed to the termal shows that the molecule contained a wildcard atom, that is, the “*” atom. But, did it really?
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(3):
... mol = next(reader)
... print(T.create_string(mol, "smistring"))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[R1]
That shows “R1”, not an R-group. What’s going on? “R1” isn’t even a valid SMILES.
This is an OEChem extension to SMILES. The default output SMILES flavor includes the flag “RGroups”, which
[c]ontrols whether atoms with atomic number zero (as determined by the OEAtomBase::GetAtomicNum method), and a non-zero map index (as determined by the OEAtomBase::GetMapIdx method) should be displayed using the [R1] notation. In this notation, the integer value following the R corresponds to the atom’s map index. When this flag isn’t set, such atoms are written in the Daylight convention [*:1]. – OEChem documenation
I’ll redo the loop but this time disable the RGroup using the
writer_args option to set the flavor to “Default,-RGroups
”, that is,
the default value but without RGroups being set:
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(5):
... mol = next(reader)
... print(T.create_string(mol, "smistring",
... writer_args={"flavor": "Default,-RGroups"}))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[*:1]
C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1C[C@H](C2=O)O)O
c1cc(c(c(c1)Cl)C#N)Cl
That indeed gives [*:1]
which is the wildcard atom that InChI
complains about.
The molecule writers support the same errors
option as the
molecule readers. The default value is “strict
”, which means to
raise an exception. To ignore errors, use “ignore
”, and to ignore
errors except to report a message to standard out, use “report
”.
>>> from chemfp import openeye_toolkit as T # use your toolkit of choice
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID", errors="ignore")
>>> writer = T.open_molecule_writer("chebi.inchi", errors="report")
>>> writer.write_ids_and_molecules(reader)
The first few and last few lines of output are:
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #3. Skipping.
Warning: Unsupported Sgroup information ignored
Warning: Unsupported Sgroup information ignored
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #13. Skipping.
Warning: Stereochemistry corrected on atom number 2 of
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #133. Skipping.
...
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #94443. Skipping.
Warning: Stereochemistry corrected on atom number 8 of
Warning: Stereochemistry corrected on atom number 13 of
Warning: Stereochemistry corrected on atom number 36 of
where the lines starting “ERROR: OEChem” come from chemfp, and the others come from OEChem at a lower-level. (Alas, the “report” isn’t as helpful as it should be. I would like it to include the output id in the error message, but all it gives is the record number. Perhaps it will be in the next release?)
All told, there were 107205 of which 98631 could be written out. I got
these numbers from the writer’s location
property (see
Location information: filename, record_format, recno and output_recno, below). Its recno
is the number of records
sent to the writer, and output_recno
is the number of records
actually written:
>>> writer.location.recno
107205
>>> writer.location.output_recno
98631
Reader and writer format metadata¶
In this section you’ll learn about the format metadata
attribute
of the readers and writers. You will need
Compound_099000001_099500000.sdf.gz
from PubChem if you want to reproduce this for yourself.
Each reader and writer has a metadata
attribute, which stores some
information about the parameters used to open it:
>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_molecules("Compound_099000001_099500000.sdf.gz")
>>> reader.metadata
FormatMetadata(filename='Compound_099000001_099500000.sdf.gz',
record_format='sdf', args={'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True})
>>> writer = T.open_molecule_writer(None, "sdf")
>>> writer.metadata
FormatMetadata(filename='<stdout>', record_format='sdf',
args={'includeStereo': False, 'kekulize': True, 'v3k': False})
The metadata for a structure reader and writer is a
chemfp.base_toolkit.FormatMetadata
instances, and not the
chemfp.Metadata
for a fingerprint reader and writer.
The filename
attribute is best effort at a
string representation of the source or destination. It can either be
the original filename (if there is one), the strings “<stdin>” or
“<stdout>” for stdin/stout, the string “<string>” if reading or
writing to memory, the source or destination’s “name” attribute if a
file object, or None if all else fails.
The record_format
attribute is the format
name for the record, which is the same as the input file format except
without any compression. As you can see in the above example, the
“sdf.gz” reader has a record_format
of “sdf”. This parameter is
useful when you want use the text_toolkit
to extract records
because you pass the text reader’s record format as the format for
the chemistry toolkit’s parse_molecule()
.
The args
attribute is the processed
reader_args or writer_args, without any namespacing. For now it’s
mostly available for debugging purposes, so you can see how the
toolkit layer actually processed your arguments. In the future there
will be a way to turn this into a text settings dictionary.
Location information: filename, record_format, recno and output_recno¶
In this section you’ll learn the basics of the
chemfp.io.Location
API, you’ll learn how to get the location
object for each reader and writer, and you’ll learn about the
recno
and output_recno
location
attributes.
(See the next section for details about the lineno
,
offsets
, record
, and other
location properties which are not available for every toolkit format.)
The reader and writers track information about the current state of
the reader and writer. Some of this information is more generally
useful, and available through the location
attribute of each
reader and writer:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_molecules_from_string(content, "smi")
>>> reader.location
Location('<string>')
>>> loc = reader.location
>>> loc.filename
'<string>'
>>> loc.record_format
'smi'
If there is no actual filename then filename
is
“<string>” for string-based I/O, “<stdin>” when reading from stdin,
and “<stdout>” when writing to stdout. (The latter two occur when the
source or destination parameter, respectively, are None.) The
record_format
is the record format name, without
any compression suffix:
>>> writer = T.open_molecule_writer("example.sdf.gz")
>>> writer.location.filename
'example.sdf.gz'
>>> writer.location.record_format
'sdf'
>>> writer.close()
All of the toolkit readers and writers support the
recno
location property, which is the number of
records which have been read or written. A recno of 0 means that no
records have been read:
>>> reader = T.read_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> loc.recno
0
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x10fb06e50>
>>> loc.recno
1
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x10fb06ec0>
>>> loc.recno
2
While you could use the recno
property for simple
enumeration, as in the folllowing:
>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
... loc = reader.location
... for id, mol in reader:
... print("record number:", loc.recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen
I would prefer that you write it with the “enumerate()” function, as in:
>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
... for recno, (id, mol) in enumerate(reader, 1):
... print("record number:", recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen
The enumerate() function is both faster and more expected for this
sort of code. The recno
property exists more to
help with error reporting, and to report summary information, like:
>>> print("Read", reader.location.recno, "records")
Read 2 records
The output writers distinguish between recno
, which
is the number of molecules that chemfp tried to save, and
output_recno
, which is the number of molecules that
could actually be saved. This occurs because some molecules cannot be
written to a given format, like the SMILES “*” which has no InChI
representation:
>>> from chemfp import openbabel_toolkit
>>> writer = openbabel_toolkit.open_molecule_writer("example.inchi")
>>> parse_molecule = openbabel_toolkit.parse_molecule
>>> writer.write_molecule(parse_molecule("c1ccccc1O", "smistring"))
>>> writer.location.recno
1
>>> writer.location.output_recno
1
>>> writer.write_molecule(parse_molecule("*", "smistring"))
==============================
*** Open Babel Warning in InChI code
#0 :Unknown element(s): *
==============================
*** Open Babel Error in InChI code
InChI generation failed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
... traceback omitted ...
chemfp.ParseError: Open Babel cannot create the InChI string, file 'example.inchi', record #2
>>> writer.location.recno
2
>>> writer.location.output_recno
1
Location information: record position and content¶
In this section you’ll learn how to get position information for each record and information about the content of each record. You will need the RDKit toolkit or Open Babel toolkit. (Unfortunately for me, OEChem doesn’t have a way to get this information, and my hybrid parser with improved error reporting proved to be much slower than OEChem’s native performance.) You will also need Compound_099000001_099500000.sdf.gz from PubChem.
(See the previous section for details about the
filename
, record_format
,
recno
and output_recno
location
properties, which are available for every toolkit format.)
Sometimes you want to know where a record is located in a file. You might want to report that the unusable record started on line 12345 of a given file, or you might want to index a file to implement random access lookup.
The underlying toolkits do not implement this functionality. Instead,
chemfp includes its own SMILES and SDF file readers. These know enough
about the formats to extract a single record, then pass the record to
the toolkit to turn into a molecule. This lets chemfp track the line
number of the start of the record (lineno
), its
byte range (offsets
), the text of the current
record (record
), and other details.
Timings show that the hybrid parser for the SMILES formats are no slower than the native RDKit and Open Babel readers, and that the hybrid SDF parser a bit slower than RDKit’s native parser (about 10%) and slower than Open Babel’s native parser. In all cases, OEChem’s native parsers leave chemfp in the dust.
As a consequence, the rdkit_toolkit
and
openbabel_toolkit
SMILES readers track more detailed record
information, but the openeye_toolkit
one does not. (The
text_toolkit
of course always tracks that information.) Here
is an example which works for rdkit_toolkit and openbabel_toolkit:
>>> from chemfp import openbabel_toolkit as T # or rdkit_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_ids_and_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> for id, mol in reader:
... print(f"id: {id!r} lineno: {loc.lineno} byte range: {loc.offsets}")
... print(f" record content: {loc.record!r}")
... print(f" first line: {loc.first_line!r}")
...
id: 'methane' lineno: 1 byte range: (0, 10)
record content: b'C methane\n'
first line: 'C methane'
id: 'oxygen' lineno: 2 byte range: (10, 21)
record content: b'O=O oxygen\n'
first line: 'O=O oxygen'
>>> content[0:10]
'C methane\n'
>>> content[10:21]
'O=O oxygen\n'
(Note: if the input record is a Unicode string then it will be converted into a UTF-8 encoded byte string. The start and end positions are coordinates in the encoded byte string, not the text string.)
The location
instance of the rdkit_toolkit SDF
reader gives access to many details about the current parser state:
>>> from chemfp import rdkit_toolkit
>>> reader = rdkit_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz")
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x1104c9830>
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 6715)
>>> reader.location.first_line
'99000039'
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x1104c97c0>
>>> reader.location.lineno
223
>>> reader.location.offsets
(6715, 14570)
>>> reader.location.first_line
'99000230'
The openbabel_toolkit
and openeye_toolkit
implementations by default don’t track this level of detail, because
their native readers are faster than when I can manage in a hybrid
reader. Consequently, those values are None:
>>> from chemfp import openbabel_toolkit
>>> reader = openbabel_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz")
>>> next(reader)
<openbabel.openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x106eff4b0> >
>>> print(reader.location.lineno)
None
>>> print(reader.location.offsets)
None
>>> print(reader.location.first_line)
None
There is an option to use Open Babel in hybrid mode. The reader_args supports an “implementation” option. The default of None, or “openbabel”, tells chemfp to use Open Babel’s native parser, while specifying “chemfp” tells it to use chemfp’s own SDF record parser:
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': 'openbabel', 'perceive_stereo': False, 'perceive_0d_stereo': False, 'options': None}
>>> reader = openbabel_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz",
... reader_args={"implementation": "chemfp"})
>>> next(reader)
<openbabel.openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x106effd20> >
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 6715)
>>> reader.location.first_line
'99000039'
The CDK SDF reader also has an “implementation” to choose between “cdk” and “chemfp” as the low-level record reader.
In the future I may add similar support for the
openeye_toolkit
, for those who want the increased level of
location detail despite the large performance impact.
The RDKit SDF reader always uses the hybrid. This is for historical reasons. The hybrid solution was once always faster than the native ForwardSDMolSupplier. That has since changed, and ForwardSDMolSupplier is now slightly faster. At some point I may add an ‘implementation’ option so you can switch between performance and improved error reporting.
Writing your own error handler (Advanced)¶
In this section you’ll learn how to write your own error handler. This is an advanced topic. Bear in mind that this may change. I hope you can provide feedback about how to improve it.
In earlier sections you learned that when the errors parameter is “strict”, the parser will raise an exception if there’s a problem with a record. When it’s “ignore”, the record parsers return None as the molecule, while the file and string readers skip the failing record. When it’s “report”, the result is the same as “ignore” except that extra information about the failure is written to stderr.
The errors parameter can also take an object which implements the “error()” method as in the following:
import sys
class OopsHandler(object):
def error(self, msg, location=None):
if location is None:
sys.stderr.write(f"Oops! {msg}. Skipping.\n")
else:
sys.stderr.write(f"Oops! {msg}, {location.where()}. Skipping.\n")
The msg
is a string describing the error, and location
contains the chemfp.io.Location
for the given record. Here’s
what it looks like in action:
>>> import sys
>>> from chemfp import rdkit_toolkit as T
>>> T.parse_molecule("Q", "smistring", errors=OopsHandler())
[10:54:57] SMILES Parse Error: syntax error while parsing: Q
[10:54:57] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Oops! RDKit cannot parse the SMILES string 'Q'. Skipping.
>>> for mol in T.read_molecules_from_string("Q Q-ane\nC methane\n", "smi",
... errors=OopsHandler()):
... print("Processed", mol)
...
[10:55:21] SMILES Parse Error: syntax error while parsing: Q
[10:55:21] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Oops! RDKit cannot parse the SMILES 'Q', file '<string>', line 1, record #1: first line is 'Q Q-ane'. Skipping.
Processed <rdkit.Chem.rdchem.Mol object at 0x109486670>
The location’s where()
method tries to give useful
information based on the location’s filename, line number, record
number, and the first line of the record (up to the first 40
characters).
It’s easy to see how to modify this to send the errors to a logger, or save them up to display in a GUI.
For the hybrid parsers, which give access to the raw record, you can do more advanced processing, like extract the title lines of any SDF record which RDKit can’t handle. The following will make an SDF-formatted string containing three records, where the second record is a 5-valent nitrogren that RDKit can’t parse. It will then try to parse the string, and store the ids for records which couldn’t be parsed.
from rdkit import Chem
from chemfp import rdkit_toolkit
# Use RDKit to make an SD file which RDKit cannot parse.
methane = rdkit_toolkit.parse_molecule("C methane", "smi")
# Bypass normal sanitization so RDKit will read 5-valent nitrogens
pentavalent_n = rdkit_toolkit.parse_molecule("CN(C)(C)(C)C pentavalent N",
"smi", reader_args={"sanitize": False})
Chem.SanitizeMol(pentavalent_n, Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION)
oxygen = rdkit_toolkit.parse_molecule("O=O oxygen", "smi")
# Use the three molecules to make an SD file as a string
with rdkit_toolkit.open_molecule_writer_to_string("sdf") as writer:
writer.write_molecules([methane, pentavalent_n, oxygen])
sdf_content = writer.getvalue()
# User-defined error handler
class CaptureIds(object):
def __init__(self):
self.ids = []
def error(self, msg, location):
self.ids.append(location.first_line)
capture_ids = CaptureIds()
for mol in rdkit_toolkit.read_molecules_from_string(sdf_content, "sdf",
errors=capture_ids):
pass
print("Could not parse:", capture_ids.ids)
The content of sdf_content
is:
methane
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
pentavalent N
RDKit
6 5 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
2 4 1 0
2 5 1 0
2 6 1 0
M END
$$$$
oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
$$$$
and the output from the above is:
[10:59:59] Explicit valence for atom # 1 N, 5, is greater than permitted
Could not parse: ['pentavalent N']
The fingerprint type documentation includes another example of how to write an error handler.
A Babel-like structure format converter¶
In this section you’ll learn how to use the chemfp toolkit API to create a Babel-like structure file format converter. This section goes into more details of how to develop real-world software using chemfp.
If you really want this sort of converter, see the chemfp translate command-line tool.
Pat Walters and Matt Stahl started Babel in the 1990s as a command-line program to convert from one chemical structure format to another. This developed over the years, and after a major rewrite became the LGPL toolkit “OELib”, OpenEye’s first commercial chemistry toolkit. OpenEye’s next rewrite lead to OEChem, a proprietary chemistry toolkit. OELib was still available, and others continued to develop it. It became Open Babel, and structure file format conversion continues to be Open Babel’s forte.
A full Babel-like program includes features to add and remove hydrogens of different sorts, select or reject structures based on substructure or other features, add 2D or 3D coordinates, and more. You cannot use chemfp for that. All chemfp can do is read structure files into a given toolkit’s molecule object, and write molecule objects to a given format.
Even that basic ability is useful. I’ll explain how to write such a converter yourself. I’ll use as my example file the following, “example.smi”:
c1ccccc1O phenol
C methane
O=O molecular oxygen
Here’s a minimal conversion program to convert the above into “example.sdf”:
from chemfp import rdkit_toolkit as T # use your toolkit of choice
reader = T.read_molecules("example.smi")
writer = T.open_molecule_writer("example.sdf")
writer.write_molecules(reader)
That code depends on Python’s garbage collection to close the output file handle. This is fine for a script, but a longer running program may want to have more explicit control over closing the file handle and use a context manager (see Reader and writer context managers):
from chemfp import rdkit_toolkit as T # use your toolkit of choice
with T.read_molecules("example.smi") as reader:
with T.open_molecule_writer("example.sdf") as writer:
writer.write_molecules(reader)
With that we have enough to build our first Babel program, which takes the input and output filenames on the command-line. I’ll call this program “cbabel.py”, for “chemfp babel”, and have it implement the command-line
usage: cbabel.py [-h] input_filename output_filename
I’ll use argparse from Python’s
standard library to handle command-line argument processing. The
“nargs=1
” in the following says that the input_filename and
output_filename must exist, and only one filename is
allowed. Argparse will save those in a list of size 1, which is why I
use [0] to get the actual string I’m interested in:
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
with T.read_molecules(args.input_filename[0]) as reader:
with T.open_molecule_writer(args.output_filename[0]) as writer:
writer.write_molecules(reader)
I’ll convert the SMILES into canonical SMILES:
% python cbabel.py example.smi example.can
% cat example.can
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
The only change is that the phenol went from c1ccccc1O
to
Oc1ccccc1
.
I’ll add the ability to read from stdin and stdout. I’ll say that if the input filename is “-” then it will read from stdin, and if the output filename is “-” then it will write to stdout. (If you have a file named “-” then you’ll have to specify “./-” to read or write to it.):
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename) as reader:
with T.open_molecule_writer(output_filename) as writer:
writer.write_molecules(reader)
There’s a limitation with this! When the input or output format is None, chemfp can’t figure out the format based on the filename, so will assume that it’s a SMILES file. When I run the above I get SMILES output:
% python cbabel.py example.smi -
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
But what if I want SDF output? I need a way to specify the input and
output file formats on the command-line. I’ll use the -i
and
-o
options to specify those:
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
help="output format name", default=None)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename, args.input_format) as reader:
with T.open_molecule_writer(output_filename, args.output_format) as writer:
writer.write_molecules(reader)
Now I can specify that I want stdout to be in SDF format:
% python cbabel.py -o sdf example.smi - | head -5
phenol
RDKit
7 7 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
In practice, required command-line arguments make life more difficult. For a simple program like this, required arguments are not a problem, but what if I want to add a command to list the available formats? That option doesn’t need an input or output filename, but argparse will enforce that requirement anyway.
There are a couple of ways to solve the problem. The correct one is to use an argparse Action but that’s complicated. An easier one for this case is to let “-” be the default input and output filename. That’s easily done by changing:
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
to:
parser.add_argument("input_filename", nargs="?", default="-",
help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
help="output filename")
As a result I can add a new --help-formats
argument:
parser.add_argument("--help-formats", action="store_true",
help="list the available file formats")
along with the handler for it, which prints information about the toolkit (its name and version string) and each of the formats. Some of the formats, like “smistring”, don’t have an I/O format (for that, use “smi”), so I need to filter those out. Also, some of the formats, like “inchikey”, are output only, and some of the toolkit have formats that they read but don’t write, so I give more details about those:
args = parser.parse_args()
if args.help_formats:
print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
for format in T.get_formats():
if not format.supports_io: # skip formats like "smistring" and "inchistring"
continue
if not format.is_output_format:
msg = " (input only)"
elif not format.is_input_format:
msg = " (output only)"
else:
msg = ""
print(" %s%s" % (format.name, msg))
raise SystemExit(0)
For my version of RDKit I get:
% python cbabel.py --help-formats
Available I/O formats for toolkit rdkit (RDKit/2020.03.1)
smi
can
usm
sdf
fasta
mol2 (input only)
pdb
xyz (output only)
mae (input only)
inchi
inchikey (output only)
If I used openeye_toolkit instead of rdkit_toolkit I get:
Available I/O formats for toolkit openeye (OEChem/20191016)
smi
usm
can
sdf
skc (input only)
mol2
mol2h
sln (output only)
mmod
pdb
xyz
cdx
mopac (output only)
mf (output only)
oeb
inchi
inchikey (output only)
oez
cif
mmcif
fasta
csv
json
The code so far requires RDKit, but chemfp supports OEChem, Open
Babel, and CDK. Why not add the command-line argument
--toolkit
to specify an alternate toolkit?
I’ll tell argparse that there’s a new --toolkit
argument,
which defaults to “rdkit” and also allows “openeye” and “openbabel”:
parser.add_argument("--toolkit", metavar="NAME",
choices=("rdkit", "openeye", "openbabel", "cdk"),
help="toolkit name", default="rdkit")
I can no longer import the toolkit directly, which I did as:
from chemfp import rdkit_toolkit as T
because that line requires that RDKit be installed. Otherwise it will raise an ImportError exception. While that might be reasonable if the user wanted to use the rdkit toolkit, it’s not reasonable if the user wanted to use the Open Babel toolkit and didn’t care to know that RDKit isn’t available.
Instead of a direct import, I’ll use chemfp.get_toolkit()
to get
the named toolkit. It raises a ValueError with a useful error message
if the toolkit isn’t available or is unknown. If that happens, I’ll
exit, and use that message as the explanation:
import chemfp
# ... skipped many lines
try:
T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
raise SystemExit("Cannot use toolkit %s: %s" % (
args.toolkit, err))
After a bit of experimentation I found a small SMILES string which gives a different canonicalization for each of the supported toolkits, which I present as evidence that it really is using a different toolkit:
% echo "NCC(N)O example" | python cbabel.py --toolkit openbabel
NCC(O)N example
% echo "NCC(N)O example" | python cbabel.py --toolkit rdkit
NCC(N)O example
% echo "NCC(N)O example" | python cbabel.py --toolkit openeye
C(C(N)O)N example
% echo "NCC(N)O example" | python cbabel.py --toolkit cdk
C(C(N)O)N example
Here’s the final code, so you can see how everything works in context:
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
help="output format name", default=None)
## parser.add_argument("input_filename", nargs=1, help="input filename")
## parser.add_argument("output_filename", nargs=1, help="output filename")
parser.add_argument("--toolkit", metavar="NAME",
choices=("rdkit", "openeye", "openbabel", "cdk"),
help="toolkit name", default="rdkit")
parser.add_argument("--help-formats", action="store_true",
help="list the available file formats")
parser.add_argument("input_filename", nargs="?", default="-",
help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
help="output filename")
args = parser.parse_args()
try:
T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
raise SystemExit("Cannot use toolkit %s: %s" % (
args.toolkit, err))
if args.help_formats:
print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
for format in T.get_formats():
if not format.supports_io: # skip formats like "smistring" and "inchistring"
continue
if not format.is_output_format:
msg = " (input only)"
elif not format.is_input_format:
msg = " (output only)"
else:
msg = ""
print(" %s%s" % (format.name, msg))
raise SystemExit(0)
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename, args.input_format) as reader:
with T.open_molecule_writer(output_filename, args.output_format) as writer:
writer.write_molecules(reader)
Amazing how the original four lines of code expands to 55. It would be even more if I added full error reporting instead of letting Python throw an exception on errors.
Speaking of errors, you may want to use hard-coded values of
errors="ignore"
or errors="report"
to have the parser skip
records that the toolkit doesn’t understand, or perhaps pass in that
information as a command-line argument named --errors
, with
the possible choices of “strict”, “report”, or “ignore”.
You might also add the -R
and -W
options to set
the reader args and writer args for the formats, but that’s more
complicated than I wanted to describe in this context. See the next
section for a description of how to do it.
argparse text settings to reader and writer args¶
In this section you’ll learn how to use argparse to handle reader args and writer args in the same style that chemfp does.
The previous section showed how to create a Babel-like structure format conversion program and how to use Python’s argparse library for command-line processing. That section was getting too long to describe how to support command-line configuration of the reader args and writer args. In this section I’ll start with a smaller version of the same code. This one requires an input filename and an output filename on the command-line, and lets the user specify the toolkit:
# I put this into a file named "convert.py"
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "Experiment with -R and -W options"
)
parser.add_argument("--toolkit", metavar="NAME",
choices=("rdkit", "openeye", "openbabel", "cdk"),
help="toolkit name", default="rdkit")
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]
with T.read_molecules(source) as reader:
with T.open_molecule_writer(destination) as writer:
writer.write_molecules(reader)
I’ll walk through the process of how to add support for the
-R
and -W
options, to make it possible to say:
python convert.py example.smi example.can --toolkit rdkit -R delimiter=space -W allBondsExplicit=true
How to get from the command-line to reader and writer arguments¶
This requires a few conversions. I need to turn the command-line arguments into reader and writer text settings dictionaries, then convert the text settings into reader_args and and writer_args dictionaries, before finally passing the reader_args and writer_args to the molecule readers and writers. (See Convert text settings into reader and writer arguments for more details about converting text settings to reader and writer arguments.)
I’ll use argparse to place the -R
and -W
option
values into separate lists of KEY=VALUE
strings, and create a new
function which splits them apart on the “=
” to get a dictionary of
text settings. Then I’ll use the Format
object to convert the
text settings into the correct reader_args and writer_args. The steps
will look something like this:
>>> from chemfp import rdkit_toolkit as T
>>>
>>> format = T.get_format("smi") # Specify the format and user-defined settings
>>> reader_settings = ["delimiter=space"]
>>> writer_settings = ["allBondsExplicit=true"]
>>>
>>> # The 'parse_text_settings' function doesn't yet exist. It will convert
... # the list of reader_settings into a dictionary of string values.
...
>>> reader_text_settings = parse_text_settings("-R", reader_settings)
>>> reader_text_settings
{'delimiter': 'space'}
>>>
>>> # Ask the format to turn the string values into string objects
...
>>> format.get_reader_args_from_text_settings(reader_text_settings)
{'delimiter': 'space'}
>>>
>>> # Do the same for the writer arguments
...
>>> writer_text_settings = parse_text_settings("-W", writer_settings)
>>> writer_text_settings
{'allBondsExplicit': 'true'}
>>> format.get_writer_args_from_text_settings(writer_text_settings)
{'allBondsExplicit': True}
For the actual code the input format may be different than the output format. By the way, if you look closely you’ll see how “allBondsExplicit” in the text settings has a value of “true”, and the string was converted to the Python object True to be a writer_arg.
To start! First, I need a way to read the list of -R
and -W
options. I’ll ask argparse to save them into a list, for later
post-processing to get the right values:
parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
help="specify a writer argument", default=[])
This will parse all of the -R
terms, like “-R delimiter=space
”, into
the reader_settings list, and “-W allBondsExplicit=true
” into the
writer_settings list, such that:
args.reader_settings == ["delimiter=space"]
args.writer_settings == ["allBondsExplicit=true"]
For that matter, it will also support “-R abc
”, and put it into
the reader_settings list even though it doesn’t have a “=
” in
it. I also need to go through and figure out if any terms are
incorrect, and report the problem. I’ll make a function for this,
along with a parameter so any error message can report if a problem
comes from the -R
or -W
command-line flag:
def parse_text_settings(flag, terms):
text_settings = {}
for term in terms:
left, mid, right = term.partition("=")
if mid != "=":
parser.error("%s setting %r must be of the form KEY=VALUE" %
(flag, term))
text_settings[left] = right
return text_settings
reader_text_settings = parse_text_settings("-R", args.reader_settings)
writer_text_settings = parse_text_settings("-W", args.writer_settings)
This gives me two text settings dictionaries, where the keys and
values are both strings. I’ll use the respective Format
object
to convert a text setting dictionary into the correct reader and
writer arguments dictionary:
input_format = T.get_input_format_from_source(source)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)
output_format = T.get_output_format_from_destination(destination)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)
All that’s left is to pass the reader_args and writer_args to the reader and writer. Since I already have the input and output format objects, I’ll pass those in as well, rather than have them guess again based on the source and destination names:
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
Converter with -R and -W support¶
Here’s how it looks when I put it all together:
# I put this into a file named "convert.py"
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "Experiment with -R and -W options"
)
parser.add_argument("--toolkit", metavar="NAME",
choices=("rdkit", "openeye", "openbabel", "cdk"),
help="toolkit name", default="rdkit")
parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
help="specify a writer argument", default=[])
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
def parse_text_settings(flag, terms):
text_settings = {}
for term in terms:
left, mid, right = term.partition("=")
if mid != "=":
parser.error("%s setting %r must be of the form KEY=VALUE" %
(flag, term))
text_settings[left] = right
return text_settings
args = parser.parse_args()
T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]
input_format = T.get_input_format_from_source(source)
reader_text_settings = parse_text_settings("-R", args.reader_settings)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)
output_format = T.get_output_format_from_destination(destination)
writer_text_settings = parse_text_settings("-W", args.writer_settings)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
Let’s see it in action. I’ll ask RDKit to include all of the bonds in the output SMILES, including the aromatic bonds, and I’ll ask it to use the space character as the SMILES delimiter:
% python convert.py example.smi example_output.smi --toolkit rdkit -R delimiter=space -W allBondsExplicit=true
% cat example_output.smi
O-c1:c:c:c:c:c:1 phenol
C methane
O=O molecular
The lack of “oxygen” in “molecular oxygen” shows that the input SMILES reader used the “space” delimiter instead of the default “to-eol” delimiter, just as I requested.
The -R
and -W
settings can also be qualified. (See
Qualified reader and writer parameters names.) I’ll have Open Babel and OEChem use
different delimiter styles to get different results:
% python convert.py example.smi example_ob_output.smi --toolkit openbabel \
-R "openbabel.*.delimiter=to-eol" -R "openeye.*.delimiter=whitespace"
% cat example_ob_output.smi
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
%
% python convert.py example.smi example_oe_output.smi --toolkit openeye \
-R "openbabel.*.delimiter=to-eol" -R "openeye.*.delimiter=whitespace"
% cat example_oe_output.smi
c1ccc(cc1)O phenol
C methane
O=O molecular
List the reader and writer arguments for the given formats¶
Finally, it’s difficult to remember all of the available settings for
each input and output format. I’ll add a --list-args
command-line option which shows the available options, and for each
option show the current setting, along with an indicator if the
current setting is the default value for that format or if the setting
comes from the command-line option.
I need argparse to know about the new option:
parser.add_argument("--list-args", action="store_true",
help="list the available input and output options")
and for the rest I replace the last three lines of the earlier code with:
if args.help_args:
# Make a helper function to display the arguments
def report_args(format, msg, default_args, specified_args):
print("%s %s:" % (format.name, msg))
# Merge the arguments; command-line overrides defaults;
all_args = default_args.copy()
all_args.update(specified_args)
for name, value in sorted(all_args.items()):
# Was the name specified via -R/-W or is it a default?
where = "from command-line" if name in specified_args else "default value"
print(" %s: %r (%s)" % (name, value, where))
report_args(input_format, "reader arguments (-R)",
input_format.get_default_reader_args(), reader_args)
report_args(output_format, "writer arguments (-W)",
output_format.get_default_writer_args(), writer_args)
else:
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
(See Get the default reader_args or writer_args for a format for more details on the default reader and writer arguments.)
With those changes, the output using the new --list-args
is:
% python convert.py example.smi example_output.smi --toolkit rdkit \
? -R delimiter=space -W allBondsExplicit=true --list-args
smi reader arguments (-R):
delimiter: 'space' (from command-line)
has_header: False (default value)
sanitize: True (default value)
smi writer arguments (-W):
allBondsExplicit: True (from command-line)
allHsExplicit: False (default value)
canonical: True (default value)
cxsmiles: False (default value)
delimiter: None (default value)
isomericSmiles: True (default value)
kekuleSmiles: False (default value)
Creating a specialized record parser¶
In this section you’ll learn how to make a specialized function to
parse an record into a toolkit molecule. This function is somewhat
faster than calling the more general purpose toolkit function
parse_id_and_molecule()
and might be used when you
need to convert a lot of individual records into a molecule.
Sometimes you need to parse a lot of records which don’t come from a file. For example, substructure search is typically split into a screening stage based on substructure fingerprints, followed by the atom-by-atom substructure search. The screening stage returns identifiers and the substructure search takes molecules, so in between them is code to look up a record based on its id and convert the result to a molecule.
Assuming a database record API where “db[id]
” returns the record
for a given id, that lookup function might look like this:
def get_molecules(db, id_iter, toolkit, format, reader_args=None):
for id in id_iter:
record = db[id]
mol = toolkit.parse_molecule(record, format, reader_args=reader_args)
yield mol
(A more complex implementation should handle when the record id doesn’t exist, or can’t be converted into a molecule.)
This isn’t as fast as it could be. The
parse_molecule()
toolkit function validates that the
format and reader_args are correct and figures out the right
parameters for the underlying toolkit code. It’s a waste of time to
redo those checks for every single call.
The function also promises that the caller will get a new molecule each time. That promise isn’t needed for substructure screening. Timing tests with OEChem show that reusing the same molecule is faster than creating a new one. For example, this OEChem code:
mol = OEGraphMol()
for i in range(100000):
OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")
mol.Clear()
takes about 60% of the time as this code:
for i in range(100000):
mol = OEGraphMol()
OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")
(Bear in mind that this code isn’t doing aromaticity perception, which roughly halves the performance.)
The toolkit function make_id_and_molecule_parser()
returns a specialized function to parse records, based on the
specified parameters:
>>> from chemfp import rdkit_toolkit as T
>>> parser = T.make_id_and_molecule_parser("smi")
>>> parser("c1ccccc1O phenol")
('phenol', <rdkit.Chem.rdchem.Mol object at 0x11254b7b0>)
For RDKit it’s about 10-15% faster to use the specialized function
instead of the general purpose toolkit.parse_molecule()
:
>>> from chemfp import rdkit_toolkit as T
>>> import time
>>>
>>> smiles = "c1ccccc1Oc1ccccc1"
>>> if 1:
... t1 = time.time()
... for i in range(10000):
... mol = T.parse_molecule(smiles, "smi")
... print("Standard time:", time.time()-t1)
...
Standard time: 1.6667978763580322
>>> parser = T.make_id_and_molecule_parser("smi")
>>> if 1:
... t1 = time.time()
... for i in range(10000):
... id, mol = parser(smiles)
... print("Specialized time:", time.time()-t1)
...
Specialized time: 1.5279979705810547
The make_id_and_molecule_parser()
function parameters
are almost identical to the ones in
parse_id_and_molecule()
, and with the same
meaning. The only difference is that
make_id_and_molecule_parser()
does not support the
record parameter. Instead, it returns a function which takes the
record and returns the (id, toolkit molecule) pair:
>>> from chemfp import rdkit_toolkit
>>> parser = rdkit_toolkit.make_id_and_molecule_parser(
... "smi", reader_args={"delimiter": "whitespace"}, errors="ignore")
>>> parser("c1ccccc1O methane 16.04246")
('methane', <rdkit.Chem.rdchem.Mol object at 0x11254bad0>)
>>> parser("Q q-ane")
[11:33:57] SMILES Parse Error: syntax error while parsing: Q
[11:33:57] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
('q-ane', None)
Warning
The function that toolkit.make_id_and_molecule_parser()
returns may reuse the underlying molecule object. Calling the
function again may change the molecule returned in previous call:
>>> from chemfp import openeye_toolkit
>>> parser = openeye_toolkit.make_id_and_molecule_parser("smi")
>>> id, mol = parser("C")
>>> mol.NumAtoms()
1
>>>
>>> parser("CCC")
(None, <openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x1151f5030> >)
>>> mol.NumAtoms()
3
RDKit doesn’t support molecule reuse so rdkit_toolkit
returns
a new molecule. Open Babel does support reuse and
openbabel_toolkit
will reuse the molecule. However, my tests
using Open Babel show a barely detectable performance improvement if I
reuse a molecule vs. creating a new one each time. Future versions of
chemfp may change the default, and may add an implementation option to
specify if a new molecule should be returned each time.
In multithreaded code you should create a new parser for each thread.
You might have noticed there is no “make_molecule_parser()”. While it would be useful, it takes time to develop, test, and document, and it wasn’t useful enough for this release. Let me know if you would like it in the future.
Molecule API: Get and set the molecule id¶
In this section you’ll learn how to get and set the molecule id for a toolkit molecule.
Sometimes you want to get or set toolkit molecule id. This should be pretty rare because the input routines all support a way to get the identifier in parallel with the molecule, and the output routines all support a way to specify an identifier.
One exception is if you read molecules from an SD file where you want to use one of the SD tag values as the identifier rather than the title line at the top of the record. This can occur with the ChEBI data set:
>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID")
>>> next(reader)
('CHEBI:90', <rdkit.Chem.rdchem.Mol object at 0x1152648f0>)
>>> id, mol = _
>>> id
'CHEBI:90'
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1152648f0>
>>> mol.GetProp("_Name")
''
>>> print(reader.location.record[:34])
b'\n Marvin 01211310252D \n'
>>> print(reader.location.record.decode("utf8")[:200])
Marvin 01211310252D
22 24 0 0 0 0 999 V2000
-2.8644 -0.2905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.8656 -1.1176 0.0000 C 0 0 0 0 0 0 0
Note: the location.record
is a byte
string. The decode("utf8")
step is to make it easier to display by
interpreting the bytes as a utf8-encoded string.
The above used the RDKit-specific way to get the special “_Name”
property and show that it’s the empty string. The location object for
the rdkit_toolkit SDF reader is able to show the raw text for the
current record. In the above I used the location.record
to show that the record indeed has no
title line.
I might want to tie that id directly to the molecule. For example, a lot of people write code which assume that the molecule’s name or title is the identifier, because only ChEBI and a scant handful of other databases use an alternative convention. You can use chemfp to get the appropriate id, then set the correct molecular property.
If I know it’s an RDKit molecule then I could do:
mol.SetProp("_Name", id)
This is not portable. OEChem and Open Babel call this a “title”, and use the molecule’s “GetTitle()” and “SetTitle()” accession methods to get and set it. For those toolkits I would need to do:
mol.SetTitle(id)
As part of chemfp’s limited molecule API, each chemfp toolkit layer
implements a portable helper function named get_id()
to get the toolkit-appropriate “identifier”, and
set_id()
to set it. The following shows an example of
converting the title of a SMILES record to upper-case, and generating
the corresponding canonical SMILES:
>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel", "cdk"):
... T = chemfp.get_toolkit(toolkit_name)
... mol = T.parse_smi("c1ccccc1O phenol")
... T.set_id(mol, T.get_id(mol).upper())
... smiles = T.create_smi(mol)
... print(toolkit_name, "->", repr(smiles))
...
rdkit -> 'Oc1ccccc1 PHENOL\n'
openeye -> 'c1ccc(cc1)O PHENOL\n'
openbabel -> 'Oc1ccccc1 PHENOL\n'
cdk -> 'C1=CC=C(C=C1)O PHENOL\n'
Please note that this could be written more succinctly by passing the
id
directly to the create_string()
toolkit
function, as:
>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel", "cdk"):
... T = chemfp.get_toolkit(toolkit_name)
... id, mol = T.parse_id_and_molecule("c1ccccc1O phenol", "smi")
... smiles = T. create_string(mol, "smi", id=id.upper())
... print(toolkit_name, "->", repr(smiles))
...
rdkit -> 'Oc1ccccc1 PHENOL\n'
openeye -> 'c1ccc(cc1)O PHENOL\n'
openbabel -> 'Oc1ccccc1 PHENOL\n'
cdk -> 'C1=CC=C(C=C1)O PHENOL\n'
Note: I may add support for an optional id_tag, as in:
T.get_id(mol, id_tag="ChEBI id") # Currently not valid chemfp code!
If you think this would be useful, please let me know about your use case.
Finally, if you want the output record as a UTF-8 encoded byte string
rather than a Unicode string then use
create_bytes()
instead of
create_string()
.
Molecule API: Copy a molecule¶
In this section you’ll learn how to make a copy of a native toolkit molecule.
The chemfp file readers may clear and reuse the underlying toolkit molecule. This is a problem if you want to load all of the molecules from a data set into memory:
>>> from chemfp import openeye_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> mols = list(openeye_toolkit.read_molecules_from_string(content, "smi"))
>>> mols
[<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x109776d20> >,
<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x109776d20> >]
>>> mols[0] is mols[1]
True
>>> openeye_toolkit.create_string(mols[0], "smistring")
''
>>> openeye_toolkit.create_string(mols[1], "smistring")
''
You can see that the openeye_toolkit
reader reuses the same
OEGraphMol, and that the molecule is cleared at the end of parsing.
In the future there may be a reader_args parameter to tell the reader
to make a new molecule for each term. Until that possible future
happens, one work-around is to make a copy of the molecule using the
respective chemfp toolkit’s copy_molecule()
function:
>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in openeye_toolkit.read_molecules_from_string(content, "smi")]
>>> mols
[<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10b31e930> >,
<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10b31e6f0> >]
>>> mols[0] is mols[1]
False
>>> T.create_string(mols[0], "smistring")
'C'
>>> T.create_string(mols[1], "smistring")
'O=O'
The various writers may also modify the molecule, for example, by temporarily changing the molecule id or by reperceiving aromaticity. If this is a problem then you can use the copy_molecule() as a way to work around it.
This is definitely a work-around solution because it’s currently
impossible to know if a copy is needed or not. The fail-safe solution
is to always copy, which will lead to extra copies and slower code
when using the rdkit_toolkit
. Other more complicated
workarounds might be faster, but the real solution that I hope to
implement in the future is to specify the requested behavior as a
parameter.
Add fingerprints to an SD file using a toolkit¶
In this section you’ll learn how to add a fingerprint as a tag to the structures in an SD file using a chemistry toolkit.
The FPS and FPB fingerprint file formats store the record id and the fingerprint, but not the original structure. The most common way to tie the structure to a fingerprint is to use an SD file, and store the fingerprint as one of the tag values. (Another is to create a SMILES file variant, also called a CSV file, with the fingerprint as a new column.)
The following will parse an SD file, and for each molecule it will compute the MACCS fingerprints and add the base64-encoded fingerprint to the molecule using the unimaginative tag name “FP”. It will save the results to the file named “example.sdf”, which is equally unimaginative:
import sys
import base64
import chemfp
# RDKit and Open Babel accept a byte string.
# OEChem does not. Always convert to Unicode.
def b64encode(s):
return base64.b64encode(s).decode("ascii")
# Select your toolkit of choice
fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
#fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
#fptype = chemfp.get_fingerprint_type("CDK-MACCS")
T = fptype.toolkit
reader_args = {"rdkit.sdf.removeHs": False}
with T.read_molecules("Compound_099000001_099500000.sdf.gz",
reader_args=reader_args) as reader:
with T.open_molecule_writer("example.sdf") as writer:
for mol in reader:
fp = fptype.compute_fingerprint(mol)
T.add_tag(mol, "FP", b64encode(fp))
writer.write_molecule(mol)
This is a very general purpose solution. It’s easy to change the fingerprint type, or switch the input to a SMILES file or other supported structure file format.
What it doesn’t do is preserve all of the details of the input records. It converts the input record into a molecule, and back out to a new record, and the intermediate record doesn’t keep all of the details.
For example, if I use OpenEye-MACCS166 and compare the first record of the original with the first record of the transformed output then the diff comparison is:
2c2
< -OEChem-04292009532D
---
> -OEChem-06182011512D
221a222,224
> > <FP>
> AAAEAAAAMAABwEBOk+GQU9yga24b
>
This says that the second line changed, and three new lines were added at line 221
The second line contains a date stamp, so this isn’t a big change, and the three new lines are the ones I requested. This doesn’t look like much of a change, but that’s because OEChem was used to make the record in the first place. Open Babel and RDKit have their own set of differences from the OEChem output defaults. For example, RDKit will sort the SD tags alphabetically.
I wanted to compare the original OEChem-based PubChem record to the output record from RDKit. I commented/uncommented the fingerprint names to use RDKit instead of OEChem. When I did this originally (since fixed), I noticed that the atom and bond counts line changed.
The first problem I noticed, before I fixed it, is that the atom and bond counts line changed. The original record has 46 atoms and 49 bonds:
46 49 0 1 0 0 0 0 0999 V2000
while the RDKit output said there are only 28 atoms and 31 bonds:
28 31 0 0 1 0 0 0 0 0999 V2000
What happened is that RDKit by default will convert explicit hydrogens to implicit hydrogens as part of the input process, while OEChem does not.
I can disable that in RDKit using the removeHs reader_arg, which is in the code I showed earlier:
reader_args = {"rdkit.sdf.removeHs": False}
With removeHs disabled, the RDKit atom counts match the original atom counts. There are still a few differences in the molblock.
RDKit places a “0” in the obsolete 4th field of the counts line, while OEChem leaves it empty.
RDKit uses the CHG property block and does not include duplicate charge information in the atom line. The PubChem file only stores charge information in the atom line.
RDKit leaves the last three fields empty, while PubChem uses 0. These fields are respectively ‘obsolete’, used for “SSS queries only”, and used for “Reaction, Query”.
That aside, the MACCS fingerprints should be the same, right?
They are not. The RDKit (and Open Babel and CDK) MACCS keys implementations assume that all hydrogens are implicit. If there are explicit hydrogens then they will likely give a different fingerprint. If you run the above code using RDKit, with and without removeHs, you’ll see two different values for FP:
AAAEAAAAMAABxABOk+GwU9zhb24f # RDKit, removeHs=False
AAAEAAAAMAABwABOk2GwUdzhZ24f # RDKit, removeHs=True
I’m left with the unfortunate situation where I can’t preserve the explicit hydrogens without affecting the MACCS fingerprints. I think the right solution is to fix the SMARTS patterns that RDKit and others use (which is a goal of chemfp’s own RDMACCS fingerprints).
Another solution for this is to use the text_toolkit to preserve the input SDF record syntax, and combine it with a chemistry toolkit to get the molecule you want.