Text toolkit examples¶
The text toolkit separates record parsing from chemical parsing. It understands the basic text structure of SDF and SMILES-based files and records, but not chemistry. It’s designed with the following use cases in mind:
add tag data to an input SDF record but keep everything else unchanged. This preserves data which might be lost by converting to then from a chemical toolkit molecule.
synchronize operations between multiple toolkits; For example, consider a hybrid fingerprint using both OEChem and RDKit. The individual RDKit and OEChem SDF readers may get out of synch when on toolkit can’t parse a record which the other can. In that case, use the text toolkit to read the records then pass the record to the chemistry toolkit.
extract tags from an SD file. Chemfp’s sdf2fps uses the text toolkit to get the id and the tag value which contains the fingerprint.
The text toolkit implements the chemistry toolkit API, except that instead of real molecule objects it uses a thin wrapper around the text for each wrapper. This chapter uses many of the concepts developed in the chapter on Toolkit API examples.
Toolkits may modify the molecular structure¶
In this section you’ll learn that a chemistry toolkit might change details of a structure record so the input record and output record have some differences, even though the molecular “essence” is preserved. This is meant as an example for why you might not want to work through a chemistry toolkit molecule for everything.
The section Add fingerprints to an SD file using a toolkit gave an example of using a toolkit to read an SD file, compute a MACCS fingerprint, add the fingerprint as a new SD tag, and save the result to a new SD file. This is a very common task.
A problem is that toolkits can apply various normalizations, like aromaticity perception, which change atom and bond aromaticity assignments. RDKit by default will also convert explicit hydrogens into implicit hydrogens. In that section, the input record had 46 atoms while RDKit generated an output record with 28 atoms. RDKit may also ‘sanitize’ the structures further (for example, convert ‘neutral 5 coordinate Ns with double bonds to Os to the zwitterionic form’).
While it’s possible to configure RDKit to keep implicit hydrogens, the RDKit MACCS fingerprinter assumes there are no explicit hydrogens. You would need to make a copy of the molecule, remove the explicit hydrogens yourself, generate the fingerprint, and then add the fingerprint to the molecule which still has the explicit hydrogens.
Bear in mind that the number of explicit atoms and bonds is based on the molecular graph model, which is only one possible representation for the actual chemical molecule. While I said there was a semantic change, the 46 atom structure and the 28 atom structure are really the same structure, just at different levels of conceptualization.
Toolkits may modify SDF syntax¶
In this section you’ll see that passing a structure file through a chemistry toolkit and back to the same format will likely make syntax changes to the record. While not as significant as the previous section, it may help persuade you that there are cases where you want to work with the original record as text rather than as a molecule.
You will need Compound_099000001_099500000.sdf.gz from PubChem.
I’ll read an SD file to get the first record as a toolkit molecule, save the molecule to SDF format, and compare the original record with the new one. This is called a round-trip test. Will there be differences?
import chemfp
# Select your toolkit of choice
T = chemfp.get_toolkit("openeye")
#T = chemfp.get_toolkit("rdkit")
#T = chemfp.get_toolkit("openbabel")
reader_args = {"rdkit.*.removeHs": False}
with T.read_molecules("Compound_099000001_099500000.sdf.gz",
reader_args=reader_args) as reader:
with T.open_molecule_writer("example.sdf") as writer:
for mol in reader:
writer.write_molecule(mol)
break # only process the first molecule
If I use the “openeye” toolkit and compare its output to the first record of the input then the difference is trivial:
2c2
< -OEChem-04292009532D
---
> -OEChem-06182012582D
This difference is shown in the diff utility’s default
format. The “2c2” means there was a change in line 2, and the changed
line is also on line 2. The “<” indicates the line in the first file
(in this case the original PubChem file) and the “>” indicates the
line in the second file (in this case “example.sdf”). The “---
” is
to make it easier for humans to see break between the two files.
But what does that line mean? The “CTfile”
(“connection table file”) spec from MDL, err, I mean Accelry, err, I
mean Symyx, err, I mean BIOVIA, gives the full details. The first two
characters (both blank here) are the user’s initials, the next 8
characters (OpenEye uses “-
” to pad out “OEChem”) are the program
name.
The next six character are the date, followed by 4 characters for the time. The PubChem record was created on 29 April 2020 at 09:33 while I did the transformation on 18 June 2020 at 12:58. The last two characters are the dimensionality; in this case the structure contains 2D coordinates.
PubChem used OEChem to make the file in the first place, so it’s not too suprising that there weren’t any differences. What about Open Babel? I changed the toolkit to “openbabel” and re-did the comparison. The first few lines of the diff are:
2c2
< -OEChem-04292009532D
---
> OpenBabel06182013012D
4c4
< 46 49 0 1 0 0 0 0 0999 V2000
---
> 46 49 0 0 1 0 0 0 0 0999 V2000
The 2c2 change you know already, and you can see it was a few minutes between when I ran the OEChem code and the Open Babel code.
The change to line 4 is meaningless. If you look closely you’ll see that OEChem has a blank in column 12 where Open Babel has a “0”. The specification say that this field is obsolete, so I think you can do whatever you want there.
The next few lines are:
65c65
< 8 9 1 6 0 0 0
---
> 8 9 1 0 0 0 0
67c67
< 8 29 1 0 0 0 0
---
> 8 29 1 1 0 0 0
This says that OEChem interprets the bond between atoms 8 and 9 as “6” = “down” stereochemistry, while Open Babel says it’s not stereo. On the other hand, OEChem interprets the bond bond atoms 8 and 29 as having no stereochemistry while Open Babel says it has “1” = “up” stereochemistry. Looks to me like two valid interpretations of the same thing.
The rest of the differences are trivial and semantically meaningless: Open Babel uses two spaces between the “>” and “<” of a data header line, while OEChem uses one space:
101c101
< > <PUBCHEM_COMPOUND_CID>
---
> > <PUBCHEM_COMPOUND_CID>
104c104
Finally, I’ll use RDKit for the conversion. By default RDKit removes hydrogens, which would leave the result with 15 atoms. Unlike Open Babel, that action is configurable. I told RDKit to never remove hydrogens for any of its supported formats, via the reader_args:
reader_args = {"rdkit.*.removeHs": False}
(I didn’t actually need the “rdkit.*.” namespace prefix, but the “rdkit” helps as a reminder that this is an RDKit-specific option.)
There are the familiar changes in the second and fouth lines:
2c2
< -OEChem-04292009532D
---
> RDKit 2D
4c4
< 46 49 0 1 0 0 0 0 0999 V2000
---
> 46 49 0 0 1 0 0 0 0 0999 V2000
RDKit doesn’t include the timestamp so leaves that fields blank. (Then
again, just how useful is the timestamp? On the third hand, the chemfp
fingerprint formats include a timestamp as part of the metadata, so
it’s odd that I question having it in another format. On the fourth
hand, OEChem supports the SuppressTimestamps
option to disable
including the timestamp.)
While I love knowing these sorts of details, none of these (except for the explicit hydrogens) affect the semantic interpretation. Still, I can think of cases where you want to preserve the original syntax, like if you have fragile code which expects a “0” at a certain field and will crash if there’s a blank.
The text toolkit “molecules”¶
In this section you’ll learn about the molecule-like object used by
the text_toolkit
.
The text_toolkit implements the standard toolkit API, which means it
reads and writes “molecules”. Remember that it isn’t really a chemical
molecule but more like a thin layer around a molecule
record. Internally these are subclasses of a TextRecord
,
though I’ll often refer to them as “text molecules” to distinguish
them from the the actual record as a text string.
Every text molecule has an id
attribute, which may
be None if there is no identifier, and a record
attribute containing the actual record as a string:
>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
>>> mol
SmiRecord(id='benzene', record=b'c1ccccc1O benzene', smiles='c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> mol.id # a Unicode string
'benzene'
>>> mol.record # a byte string
b'c1ccccc1O benzene'
>>> text_toolkit.create_string(mol, "smistring")
'c1ccccc1O'
>>> text_toolkit.create_string(mol, "smi")
'c1ccccc1O benzene\n'
>>> text_toolkit.create_bytes(mol, "smistring")
b'c1ccccc1O'
>>> text_toolkit.create_bytes(mol, "smistring.zlib")
b'x\x9cK6L\x06\x01C\x7f\x00\x0fh\x03\x04'
>>>
>>> sdf_record = (
... 'methane\n' +
... '\n' +
... '\n' +
... ' 1 0 0 0 0 0 0 0 0 0999 V2000\n' +
... ' 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n' +
... 'M END\n' +
... '$$$$\n')
>>>
>>> sdf_mol = text_toolkit.parse_molecule(sdf_record, "sdf")
>>> sdf_mol
SDFRecord(id_bytes=b'methane'(id='methane'),
record=b'methane\n\n\n 1 0 0 0 0 0 0 0 0 0999 V2000\n 0.0 ...',
encoding='utf8', encoding_errors='strict')
>>> sdf_mol.id
'methane'
>>> sdf_mol.record[-20:]
b'0 0 0\nM END\n$$$$\n'
The record is always uncompressed.
Each of the SMILES-based records has its own unique class:
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
SmiRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "can")
CanRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usm")
UsmRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smistring")
SmiStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "canstring")
CanStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usmstring")
UsmStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')
and for SMILES records you can access the SMILES directly through the
smiles
attribute:
>>> text_mol = text_toolkit.parse_molecule("C methane", "smistring")
>>> text_mol.smiles
'C'
Each text molecule also has a record_format
attribute, which is the format name for the record.
>>> text_mol.record_format
'smistring'
>>> sdf_mol.record_format
'sdf'
The record_format
values are “smi”, “can”, …,
“usmstring” for the SMILES formats or “sdf” for a file in SDF
format. The record_format
will never have a
compression suffix.
Unlike the chemistry-backed toolkits, the text_toolkit has no real understanding of chemistry, only a limited knowledge of the format structure. It will parse an generate garbage:
>>> text_mol = text_toolkit.parse_molecule("garbage", "smi")
>>> text_toolkit.create_string(text_mol, "smi", id="and trash",
... writer_args={"delimiter": "tab"})
'garbage\tand trash\n'
The encoding and encoding_errors parameters describe the character encoding of the record bytes, and how to handle errors in converting to or from that encoding. For details see the section Unicode and other character encoding.
The text toolkit implements the toolkit API¶
In this section you’ll learn that the text toolkit is a pretty
complete implementation of chemfp’s toolkit API
.
The text toolkit implements all of the standard toolkit API, except that it doesn’t know how to convert between SMILES and SDF format. Here are some examples:
>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("C", "smistring")
>>> text_toolkit.get_id(mol) is None
True
>>> text_toolkit.set_id(mol, u"methane")
>>> text_toolkit.get_id(mol)
'methane'
>>> text_toolkit.create_string(mol, "smi")
'C methane\n'
>>> content = "C methane\nO=O molecular oxygen\n"
>>> with text_toolkit.read_ids_and_molecules_from_string(
... content, "smi") as reader:
... for id, mol in reader:
... print("#%d %r" % (reader.location.recno, id))
...
#1 'methane'
#2 'molecular oxygen'
>>>
>>> writer = text_toolkit.open_molecule_writer("light.sdf")
>>> for mol in text_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz"):
... mass = text_toolkit.get_tag(mol, "PUBCHEM_EXACT_MASS")
... mass = float(mass)
... if mass > 100.0:
... continue
... cid = text_toolkit.get_tag(mol, "PUBCHEM_COMPOUND_CID")
... print("Found", cid, mass)
... writer.write_molecule(mol)
...
Found 99109812 99.068414
Found 99109899 97.052764
Found 99118867 98.073165
Found 99118868 98.073165
Found 99123566 97.089149
Found 99151119 84.057515
Found 99151121 84.057515
Found 99162605 98.073165
Found 99162607 98.073165
>>> writer.close()
>>> for lineno, line in enumerate(open("light.sdf"), 1):
... print(repr(line))
... if lineno == 4:
... break
...
'99109812\n'
' -OEChem-04292009532D\n'
'\n'
' 16 17 0 1 0 0 0 0 0999 V2000\n'
What you can’t do with the text_toolkit is convert from a SMILES-based format to SDF, or vice-versa. If you try you’ll either get an exception or a meaningless molecule representation.
While you can seemingly convert between the SMILES formats, the text toolkit doesn’t actually modify the SMILES term, so an input of “[238U]” will have a “canstring” (non-isomeric SMILES) of “[238U]”:
>>> U = text_toolkit.parse_molecule("[235U]", "smistring")
>>> text_toolkit.create_string(U, "canstring")
'[235U]'
I don’t know if I should make this more strict in the future, and prohibit conversion between “smi”, “can”, and “usm” formats.
Synchronizing readers from different toolkits through the text toolkit¶
In this section you’ll learn how to keep two different toolkit parsers synchronized by using the text toolkit to parse the records, then pass the record over to each toolkit to convert it to a molecule.
A structure file may have a couple of records which cannot be parsed
by a toolkit, usually due to odd chemistry definitions. It’s usually
fine to skip those records, which is the purpose of the
errors="ignore"
setting. (See
Handling errors when reading molecules from a string for more information about
the errors parameter.)
Consider the following SMILES file with three lines:
% cat strange.smi
C methane
C--C ethane not for RDKit
CC ethane for everyone
The first and last are valid SMILES, but “C--C
” is
invalid. However, Open Babel will accept it, and OEChem will accept it
because the default flavor does not add the “Strict” flavor flag. (See
OpenEye-specific SMILES reader_args and writer_args for more information about
OEChem flavors). As a result:
>>> from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit
>>> for id, mol in openeye_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
... print("openeye found", repr(id))
...
openeye found 'methane'
openeye found 'ethane not for RDKit'
openeye found 'ethane for everyone'
>>>
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
... print("rdkit found", repr(id))
...
rdkit found 'methane'
[15:12:30] SMILES Parse Error: syntax error while parsing: C--C
[15:12:30] SMILES Parse Error: Failed parsing SMILES 'C--C' for input: 'C--C'
rdkit found 'ethane for everyone'
>>>
>>> for id, mol in openbabel_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
... print("openbabel found", repr(id))
...
openbabel found 'methane'
openbabel found 'ethane not for RDKit'
openbabel found 'ethane for everyone'
Sometime you want to work with multiple toolkits using the same input molecule. For example, you might want to compute a hybrid fingerprint, or make a model prediction where the descriptors come from different toolkits.
To do that, use the text_toolkit.read_ids_and_molecules()
to
read each record as a text molecule, and pass the actual record to the
toolkit’s parse_molecule()
to get a molecule. Because
I specifed the “ignore” error handler, the molecule will be None if
the record could not be parsed. (See Specify alternate error behavior
for more details.):
from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit
from chemfp import text_toolkit
for id, text_mol in text_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
if openeye_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
print("openeye parsed", repr(id))
else:
print("openeye could not parse", repr(id))
if rdkit_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
print("rdkit parsed", repr(id))
else:
print("rdkit could not parse", repr(id))
if openbabel_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
print("openbabel parsed", repr(id))
else:
print("openbabel could not parse", repr(id))
The output from running the above is:
openeye parsed 'methane'
rdkit parsed 'methane'
openbabel parsed 'methane'
openeye parsed 'ethane not for RDKit'
[15:13:45] SMILES Parse Error: syntax error while parsing: C--C
[15:13:45] SMILES Parse Error: Failed parsing SMILES 'C--C' for input: 'C--C'
rdkit could not parse 'ethane not for RDKit'
openbabel parsed 'ethane not for RDKit'
openeye parsed 'ethane for everyone'
rdkit parsed 'ethane for everyone'
openbabel parsed 'ethane for everyone'
The above works, but there’s a lot of duplicate code, I don’t like the
layout for the output, and there’s bit of extra overhead to
re-interpret the parse_molecule()
for each call. I’ll make a
space-delimited file as output, and use the toolkit’s
make_id_and_molecule_parser()
to create a specialized
parser for each available toolkit:
import chemfp
from chemfp import text_toolkit
reader = text_toolkit.read_ids_and_molecules("strange.smi")
format = reader.metadata.record_format
column_headers = []
parsers = []
for toolkit_name in ("openeye", "rdkit", "openbabel"):
column_headers.append(toolkit_name)
try:
toolkit = chemfp.get_toolkit(toolkit_name)
except ValueError:
parsers.append(None)
else:
parser = toolkit.make_id_and_molecule_parser(format, errors="ignore")
parsers.append(parser)
column_headers.append("ID")
print(*column_headers, sep="\t") # print the header
for id, text_mol in reader:
columns = []
for parser in parsers:
if parser is None:
columns.append("N/A")
else:
id, mol = parser(text_mol.record)
if mol is not None:
columns.append("Yes")
else:
columns.append("No")
columns.append(id)
print(*columns, sep="\t")
This writes a tab-delimited file to stdout, ready for import into any spreadsheet program:
openeye rdkit openbabel ID
Yes Yes Yes methane
[15:15:21] SMILES Parse Error: syntax error while parsing: C--C
[15:15:21] SMILES Parse Error: Failed parsing SMILES 'C--C' for input: 'C--C'
Yes No Yes ethane not for RDKit
Yes Yes Yes ethane for everyone
(There will also be error messages from RDKit sent to stderr.)
Add multiple toolkit fingerprints to an SD file¶
In this section you’ll learn how to use multiple toolkits to generate fingerprints for each molecule in an SD file, and add the fingerprints results back to the record as new SD tags.
In Add fingerprints to an SD file using a toolkit you learned how to use a toolkit to read a file as molecules, compute a fingerprint for each molecule, and add the fingerprint to the molecule as an SD tag, and save the result to a new SD file. The processing pipeline converted the input to a toolkit molecule and out again, and in doing so changed other parts of the record besides the new SD tag.
Sometimes you want to preserve the input as much as you can. For that case you can use the text reader to get text molecules, pass each text molecule’s record that to the toolkit to compute the fingerprint, add the new fingerprint as a tag for the text molecule, and save the result to a file.
I’ll do that one better; I’ll generate fingerprints using multiple toolkit and add all of them to the output file. Here’s an example of what the end of a new record will look like. Note: although the fingerprints are actually on one line, I’ve folded the long fingerprints across multiple lines so it doesn’t overflow this page.
> <rdkit512>
3ffef7cefffffffefbfedffbbdffefffbfffffffffbbbffffffffffdfddffdfffffdf7fffeffffe7
feeffffffffffbffef9ffffffd7fffeff6deffff3feffdff
> <rdkit1024>
3ffca78efffdfeecdbfedefbbd97657f8dfbf0f35aba3fff2fff7ffdaddffdfffff9f1befafe3fa7
7eeefbdff7f78bfbcf97fb37996eef67e25ef37f1bcd7db50b76f242efff93baf9d6533b91ebefef
bbdd7ffeefb3bf9bf3abca45d55f7da79df5e6fb96dfc647eccfef67dfbbfb36ad9fecebf53f9acb
f6ca4efc3eeafdff
> <oecircular4>
00000000000000080000000001080000000008000000080800000000000000000000000000000000
00000000100000000000000000000000000400000000000080000000000000000000000008002000
00000300000000000040000200100000000000000000000000000000008000010002000000054800
00000000000800000000040000000000200000002800000000000000000000000000000000000000
00000000400000000000000080000000000000000000000000000010001000000000400000000000
00000000008080000000000080000000000010004000000100000020010002200000000001000000
00000000000002021000002000000000000000100000004000000000000000000000000000200080
00000000000000000000000000000000000000000000000000000000000000001080000000100000
10000000000080000000000000000000000000000400000080000000000000000000000000000000
00200000000000000000040000100000000000001000010000000000000200000080000000000001
00000804000008000000000000000000000000000004004000000000000000000000008000000001
20000010000000000100000000200000010000040000000000000010000000800000000000000000
0000000000000000004002000000000000000000000000000000000400000000
> <obecfp2>
00000000001000000000000000000000000000000000000000000000000000000000000000000000
00001000002000000000000000000000000000000000000000000000000000000000000000000000
00020000000000000000000000000000000000400000000000000000000000000000000000000000
00000000800000000000000001000000000000000000000000000000000200000000000000000000
00004000000000000800000000000000000000000000000000000000000000000004000020000000
00000000000000000000000000000000000000000000028000000000000000000000000000000000
00000000000000000000000000000000000000200000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000040000000000000000040
00000080000000000000000000000004000800000000000000000000000000000000000000000000
00000000000002000000000000000000000000800000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000004000000000020000000000000080000
00000002000000000000000000000080000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000800001000000000
$$$$
I’ll break it down into stages. The first is some preamble code to import the modules and configure the input and output files:
import chemfp
# I'll use chemfp's text-based SD parser, so the output SD records
# will be identical to the input records, except to append the new
# tags at the end of each record.
from chemfp import text_toolkit, bitops
input_filename = "Compound_099000001_099500000.sdf.gz"
output_filename = "output.sdf.gz"
Next is to get the right SDF parser
(a
function which converts an SDF record into a identifier and a native toolkit molecule)
and fingerprinter
(a
function which converts a toolkit molecule into a fingerprint) for
each fingerprint type.
# The list of tag names and the corresponding fingerprint types.
wanted_fingerprint_types = (
("rdkit512", "RDKit-Fingerprint fpSize=512"),
("rdkit1024", "RDKit-Fingerprint fpSize=1024"),
("oecircular4", "OpenEye-Circular maxradius=4"),
("obecfp2", "OpenBabel-ECFP2"),
)
build_data = [] # I'll use this to build the fingerprint data.
toolkit_sdf_record_parsers = {} # I'll use this to convert an SD record into a molecule.
for output_tag, fingerprint_type_string in wanted_fingerprint_types:
# First, get the corresponding fingerprint type.
fingerprint_type = chemfp.get_fingerprint_type(fingerprint_type_string)
# Figure out which toolkit to use to parse the SD records.
toolkit = fingerprint_type.toolkit
# For each unique toolkit, get a function that turns an SD record into a molecule.
# (If multiple fingerprints use the same toolkit then I only
# need to parse it once.)
if toolkit.name not in toolkit_sdf_record_parsers:
# The "ignore" means to return None on error, rather than raise an exception.
toolkit_sdf_record_parsers[toolkit.name] = toolkit.make_id_and_molecule_parser("sdf", errors="ignore")
# Get a function which turns a molecule into a fingerprint.
fingerprinter = fingerprint_type.make_fingerprinter()
# Store this information for record processing.
build_data.append( (output_tag, toolkit.name, fingerprinter) )
Finally, use the text toolkit to read text molecules for each record, then use the SDF parser to get the id and molecule from the record text, then the fingerprinter to get the fingerprint from the molecule:
# Use the text toolkit to read and write SDF records.
with text_toolkit.open_molecule_writer(output_filename) as writer:
for text_mol in text_toolkit.read_molecules(input_filename):
# The text "molecule" .record is the actual text.
record = text_mol.record
# Make the fingerprints for each record and append the tag.
# For extra performance, cache parsed molecules for future use.
toolkit_mols = {}
for output_tag, toolkit_name, fingerprinter in build_data:
# There's no need to reparse the record if I've seen it before.
if toolkit_name in toolkit_mols:
toolkit_mol = toolkit_mols[toolkit_name]
else:
# Parse the record and save the molecule for later.
toolkit_id, toolkit_mol = toolkit_sdf_record_parsers[toolkit_name](record)
toolkit_mols[toolkit_name] = toolkit_mol
if toolkit_mol is None:
# There's no molecule, so no fingerprint. Save the empty string.
text_mol.add_tag(output_tag, "")
else:
# Make a fingeprint and save it to the tag as a hex-encoded string.
fp = fingerprinter(toolkit_mol)
text_mol.add_tag(output_tag, bitops.hex_encode(fp))
# Write the text molecule to the output stream.
writer.write_molecule(text_mol)
Text toolkit and SDF files¶
In this section you’ll learn about the specialized SDF reader API to read SDF records and tag values directly instead of through a text record.
The text toolkit support for the toolkit API lets you use the same
code for SDF and SMILES, and switch between text-based and
molecule-based parsers. Genericness comes at a cost. The
TextRecord
class is a wrapper around the actual record, so
at the least there is some overhead for creating a wrapper for each
record.
The text toolkit has special support for reading SDF records as raw byte strings, which are not wrapped in any object. There several SDF reader variations depending on if you want to read from a file or a string, and if you want to read the record, the (id, record) pair, or an (id, tag value) pair. These functions are:
read_sdf_records()
- iterate over the records in an SD fileread_sdf_records_from_string()
- the same, but from a stringread_sdf_ids_and_records()
- iterate over the (id, record string) pairs from an SD fileread_sdf_ids_and_records_from_string()
- the same, but from a stringread_sdf_ids_and_values()
- iterator over the (id, value) pairs from an SD fileread_sdf_ids_and_values_from_string()
- the same, but from a string
(Note: while I write this as (id, value), those are just labels. By default it returns (SD title, SD title) pairs, or you can specify an alternate id_tag and value_tag to get the pairs you want.)
There are also special functions to work with the tag data and title of an SDF record, which take the record string as input:
get_sdf_tag()
- get a named tag from an SDF recordadd_sdf_tag()
- return a new SDF record with the new tag and value at the end of the tag blockget_sdf_tag_pairs()
- return a list of (tag name, tag value) pairsget_sdf_id()
- return the first line of the SDF recordset_sdf_id()
- return a new SDF record with the new title line
The next few sections will cover some examples of how to use these specialized functions.
Read id and tag value pairs from an SD file¶
In this section you’ll learn how read the (id, tag value) for each record in an SD file using a specialized SDF reader. You will need Compound_099000001_099500000.sdf.gz from PubChem.
The specialized SDF readers are faster than the more generic text_toolkit support for the toolkit API. As an example, I’ll extract the identifer and molecular weight field from a PubChem file using the (slower) chemfp toolkit API:
from chemfp import text_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with text_toolkit.read_ids_and_molecules(filename) as reader:
for id, text_mol in reader:
mw = text_mol.get_tag("PUBCHEM_EXACT_MASS")
print(id, mw)
Next I’ll extract it using the (faster)
read_sdf_ids_and_values()
function, which returns an iterator
of the (id, tag value) pairs. Just like with a toolkit’s
read_ids_and_molecules()
, by default the id is the
title line of the SD record, or I can use the id_tag parameter to
get it from one of the SD tags. The value_tag has the same meaning;
by default the value is the record’s title, or I can specify an
alternate tag name containing the value to use:
from chemfp import text_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with text_toolkit.read_sdf_ids_and_values(filename, value_tag="PUBCHEM_EXACT_MASS") as reader:
for id, mw in reader:
print(id, mw)
Both of these generate output starting with:
99000039 374.13789
99000230 449.162057
99002251 335.126991
99003537 374.210661
99003538 374.210661
99005028 315.183444
99005031 315.183444
My timings using the larger file
Compound_145500001_146000000.sdf.gz
show that the first, generic
implementation takes 14.0 seconds while the second, specialized
implementation takes 10.3 seconds, which is about 25% faster, which
would save a lot of time when parsing all of PubChem. (The difference
is even larger - nearly 50% faster! - without the gzip overhead.)
That’s why the sdf2fps command-line tool uses this
function to extract the ids and fingerprint values from PubChem files.
Extract the id and atom and bond counts from an SD file¶
In this section you’ll use a specialized SDF reader iterate over the records of an SD file. You will need Compound_099000001_099500000.sdf.gz from PubChem.
The “records” returned by read_sdf_records()
,
read_sdf_records_from_string()
,
read_sdf_ids_and_records()
, and
read_sdf_ids_and_records_from_string()
are the actual record
content as a string, and not wrapped in a TextRecord or other class.
For example, the following will read each record from an SD file and use a regular expression to extract the title line, the number of atoms from the first 3 characters of line 4, and the number of bonds as the second 3 characters of line 4:
from chemfp import text_toolkit
import re
pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")
filename = "Compound_099000001_099500000.sdf.gz"
for record in text_toolkit.read_sdf_records(filename):
m = pat.match(record)
id = m.group(1).decode("utf8")
num_atoms = int(m.group(2))
num_bonds = int(m.group(3))
print(id, num_atoms, num_bonds)
The output starts:
99000039 46 49
99000230 58 60
99001517 43 44
99002251 42 43
99003537 54 57
99003538 54 57
99005028 48 49
99005031 48 49
(Bear in mind that there may also be implicit hydrogens, so unless you know that all hydrogens are explicit or implicit, these numbers may only be roughly useful.)
Records are byte strings¶
The example code, while short, is still a bit tricky. The reader returns the SD records as byte strings, not Unicode strings. Why? First and foremost, using Python to read bytes from a file is significantly faster than reading Unicode. If all you care about is reading a couple of fields from the record then it’s faster to work with bytes and convert only those fields.
Second, this is a low-level API meant to give the actual byte representation of the data. Among other things, you should be able to know exactly where the record is located in the file. You can even do things like handle mixed encodings, where one tag value is UTF-8 encoded and another is Latin-1 encoded and cannot be read as a value UTF-8.
Python 3 makes a strong distinction between a byte string and a Unicode string. For Python 3, because the record a byte string, you’ll have to use a byte-based regular expression to parse it, as in:
pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")
You’ll also have to convert the title bytes to Unicode if you want to print the result, as in:
id = m.group(1).decode("utf8")
Thankfully, int() knows how to read the ASCII digits from a byte string, so I didn’t have to do extra work there.
SDF-specific parser parameters¶
In this section you’ll learn that the specialized SDF readers support the standard errors and location , and have a few special parameters of their own. You will need Compound_099000001_099500000.sdf.gz from PubChem.
All six of the read_sdf_*
functions support the same errors and
location parameters as the standard toolkit API, with the same
meaning. For example, the following shows where each record is located
in the uncompressed file:
from chemfp import text_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
with text_toolkit.read_sdf_ids_and_records(filename) as reader:
loc = reader.location
for id, record in reader:
start_byte, end_byte = loc.offsets
print(f"{id} at line {loc.lineno} (bytes {start_byte}-{end_byte})")
The output starts:
99000039 at line 1 (bytes 0-6715)
99000230 at line 223 (bytes 6715-14570)
99001517 at line 462 (bytes 14570-20699)
99002251 at line 660 (bytes 20699-26832)
99003537 at line 866 (bytes 26832-34262)
99003538 at line 1107 (bytes 34262-41691)
See Handling errors when reading molecules from a string for more information
about the errors parameter, and Location information: record position and content for
a description of the how to use a Location
to the record’s
first line number and start/end offsets in the file.
The six functions do not have a format option, because the format
must be “sdf” or “sdf.gz”. Instead, there is a compression
parameter. The default of None selects the compression type based on
the filename, if the filename is available, or assumes the input is
uncompressed. Use “gz” if the input is gzip’ed, “zst” if the input use
Zstandard compression (and the zstandard
Python package is
available) and “none” or “” if the input is uncompressed.
The block_size is a tunable parameter, with a default value of 320 KB. The underlying reader reads a block of text then tries to extract records. When it gets to the end of a block, it reads a new block, and prepends the remaining part of the old block to the new one before looking for new records.
For performance reasons, the block_size should be several times
larger than the largest record. During error recovery, the reader will
read up to 320 KB or 5*block_size
, whichever is larger, in order
to find the next “$$$$” line and resynchronize.
Working with SD records as strings¶
In this section you’ll learn about the helper functions to work with SD record id and tag data when the SD record is a string. You will need Compound_099000001_099500000.sdf.gz from PubChem.
I’ll use one of the specialized SD file readers,
read_sdf_records()
, to get the first record from an SD file:
>>> from chemfp import text_toolkit
>>> record = next(text_toolkit.read_sdf_records("Compound_099000001_099500000.sdf.gz"))
>>> print(record[:73])
b'99000039\n -OEChem-04292009532D\n\n 46 49 0 1 0 0 0 0 0999 V2000\n'
>>> print(record.decode("utf8")[:110])
99000039
-OEChem-04292009532D
46 49 0 1 0 0 0 0 0999 V2000
7.8451 3.0179 0.0000 O 0
I can use get_sdf_tag()
and get_sdf_tag_pairs()
to get
information about the tags in the record:
>>> for tag_name, tag_value in text_toolkit.get_sdf_tag_pairs(record):
... print(tag_name, "=", repr(tag_value[:40]))
...
b'PUBCHEM_COMPOUND_CID' = b'99000039'
b'PUBCHEM_COMPOUND_CANONICALIZED' = b'1'
b'PUBCHEM_CACTVS_COMPLEXITY' = b'611'
b'PUBCHEM_CACTVS_HBOND_ACCEPTOR' = b'4'
...
b'PUBCHEM_CACTVS_TAUTO_COUNT' = b'-1'
b'PUBCHEM_COORDINATE_TYPE' = b'1\n5\n255'
b'PUBCHEM_BONDANNOTATIONS' = b'12 13 8\n12 17 8\n13 18 8\n16 19 8\n'
>>> text_toolkit.get_sdf_tag(record, "PUBCHEM_IUPAC_OPENEYE_NAME")
u'2-(2-hydroxyethylsulfanylmethyl)-4-nitro-phenol'
or use add_sdf_tag()
to create a new record with a given tag and
value added to the end of the tag block:
>>> print(record[-210:].decode("utf8"))
> <PUBCHEM_BONDANNOTATIONS>
12 13 8
12 17 8
13 18 8
16 19 8
16 23 8
17 20 8
18 21 8
19 22 8
19 24 8
20 21 8
22 25 8
23 26 8
24 27 8
25 26 8
27 28 8
7 22 8
7 28 8
8 9 6
$$$$
>>> new_record = text_toolkit.add_sdf_tag(record, b"VOLUME", b"123.45")
>>> print(new_record[-229:].decode("utf8"))
> <PUBCHEM_BONDANNOTATIONS>
12 13 8
12 17 8
13 18 8
16 19 8
16 23 8
17 20 8
18 21 8
19 22 8
19 24 8
20 21 8
22 25 8
23 26 8
24 27 8
25 26 8
27 28 8
7 22 8
7 28 8
8 9 6
> <VOLUME>
123.45
$$$$
I can also get the title line of the SD record using get_sdf_id()
:
>>> text_toolkit.get_sdf_id(record)
b'99000039'
or create a new string which is the old string with the title line replaced by a new value:
>>> new_record = text_toolkit.set_sdf_id(record, b"987ZYX")
>>> text_toolkit.get_sdf_id(new_record)
b'987ZYX'
Note that I used byte strings, like b"VOLUME"
and
b"987ZYX"
. In general the values must be of the same string type
as the record. On the flip side, if you have a Unicode record then you
must pass in Unicode strings as values:
>>> unicode_record = record.decode("utf8")
>>> new_record = text_toolkit.set_sdf_id(unicode_record, u"Hello")
>>> new_record[:6]
'Hello\n'
Unicode and other character encoding¶
In this section you’ll learn a bit about how the text toolkit deals with different character encodings. This is a hard topic and I won’t cover it in full details. If you have a problem with Unicode encodings (and hopefuly a support contract) then contact me and I’ll help that way.
The SDF format is 8-bit clean. The specification itself uses ASCII but fields like the title, the tag name, and the tag value can contain nearly any byte value. (Some values like newline and ‘<’ and ‘>’ in the tag name, have special meaning and must not be used.)
Unfortunately, different software handle those non-ASCII values differently. An older Unix system might use the Latin-1 character set, which is able to handle many European and some non-European languages, but doesn’t have the Euro currency symbol. Microsoft Windows code page 1252 is effectively a superset of Latin-1, with the Euro symbol and a several other additional symbols.
There are of course many other symbols. The consensus for new systems is to use UTF-8 encoded Unicode, which is compatible with 8-bit clean ASCII and can handle most of the world’s languages, plus a large number of symbols. This encoding may use one, two, or more bytes to represent each symbol.
The Python3 bindings of OpenEye, RDKit, and Open Babel’s have all decided to interpret SD files as UTF-8 encoded. This consensus is great … so long as your files are also compatible with UTF-8. But what if they aren’t? What if you have to read Latin-1 encoded file, or worse, a file where different fields have multiple encodings?
To demonstrate the problem, I’ll construct a problematic file for β-methylphenethylamine, with an experimental melting point of 140-142°C, stored in a Latin-1 encoded SD file. For now I’ll use use ‘Beta’ for the name, and ‘DEGREE’ for the temperature, as placeholders for the two non-ASCII characters.
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("NCC(c1ccccc1)C", "smi")
>>> T.set_id(mol, "Beta-methylphenethylamine")
>>> T.add_tag(mol, "MP", "140-142DEGREEC")
>>> unicode_record = T.create_string(mol, "sdf")
>>> print(unicode_record)
Beta-methylphenethylamine
RDKit
10 10 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 2 0
5 6 1 0
6 7 2 0
7 8 1 0
8 9 2 0
3 10 1 0
9 4 1 0
M END
> <MP>
140-142DEGREEC
$$$$
Next, I’ll replace the ‘DEGREE’ with the corresponding Unicode characters. (I’ll use the long Unicode name to be explicit.)
>>> unicode_record = unicode_record.replace(u"DEGREE", u"\N{DEGREE SIGN}")
>>> print(unicode_record)
Beta-methylphenethylamine
RDKit
10 10 0 0 0 0 0 0 0 0999 V2000
....
M END
> <MP>
140-142°C
$$$$
Finally, I’ll save it to the file “latin1.sdf”, using the Latin-1 encoding:
>>> open("latin1.sdf", "wb").write(unicode_record.encode("latin1"))
948
(The “948” indicates that 948 bytes were written to the file.)
This is not valid UTF-8. In my terminal, the MP tag value looks like:
> <MP>
140-142�C
where the “�” is the special symbol for REPLACEMENT CHARACTER, meaning that the actual character cannot be shown.
What happens if I read the file using each of the native toolkit APIs? First, OEChem:
>>> from openeye.oechem import *
>>> ifs = oemolistream("latin1.sdf")
>>> mol = OEGraphMol()
>>> OEReadMolecule(ifs, mol)
True
>>> OEGetSDData(mol, "MP") # OEChem on Python 2.7
'140-142\xb0C'
>>> OEGetSDData(mol, "MP") # OEChem on Python 3.8
'140-142\udcb0C'
OEGetSDData() returns a text/Unicode strings, but the byte “\xb0” is not a valid UTF-8 encoding. Instead, OEChem uses the Unicode codepoint “\udcb0”. This is a surrogate for the actual character, and something I don’t fully understand. Various sources say this is a UTF-16 behavior which isn’t correct UTF-8. Python doesn’t like it:
>>> print('140-142\udcb0C')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb0' in position 7: surrogates not allowed
Next, Open Babel:
>>> from openbabel import openbabel as ob
>>> conv = ob.OBConversion()
>>> mol = ob.OBMol()
>>> conv.ReadFile(mol, "latin1.sdf")
True
>>> mol.GetData("MP").GetValue()
'140-142\udcb0C'
Open Babel gives exactly the same results as OEChem.
Finally, RDKit:
>>> from rdkit import Chem
>>> supplier = Chem.ForwardSDMolSupplier("latin1.sdf")
>>> mol = next(supplier)
>>> mol.GetProp("MP")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte
RDKit doesn’t give a surrogate value for the illegal UTF-8 character. Instead, it complains. Which also means there is no way to get that data from Python.
What do you do if you have to read a Latin-1 encoded SD file? One solution is to use an external tool like iconv to translate the file to UTF8.
% iconv -f latin1 -t utf-8 < latin1.sdf > utf8.sdf
Another is to use Python to convert the entire file from Latin-1 to UTF8 then pass the transcoded contents to the toolkit:
>>> content = open("latin1.sdf", "rb").read()
>>> content = content.decode("latin1").encode("utf8")
>>>
>>> import chemfp
>>> for tk in ("openbabel", "openeye", "rdkit"):
... T = chemfp.get_toolkit(tk)
... for mol in T.read_molecules_from_string(content, "sdf"):
... print(tk, T.get_tag(mol, "MP"))
openbabel 140-142°C
openeye 140-142°C
rdkit 140-142°C
But if all you want is some of the tag data values, and not the molecule, then you can ask the text_toolkit to read the record as a “latin1” encoded file:
>>> from chemfp import text_toolkit
>>> for mol in text_toolkit.read_molecules("latin1.sdf", encoding="latin1"):
... print(mol.get_tag("MP"))
...
140-142°C
The content is converted on-demand, that is, only when
get_id()
or
text_toolkit.get_tag()
/SDFRecord.get_tag()
are
called. The text_toolkit’s “molecule” stores the encoding so it knows
how to decode the fields:
>>> mol.encoding
'latin1'
By the way, if you omit the ‘encoding=”latin1”’ parameter then you’ll get an exception:
>>> for mol in text_toolkit.read_molecules("latin1.sdf"):
... print(mol.get_tag("MP"))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/_text_toolkit.py", line 209, in get_tag
return get_sdf_record_tag(self.record, tag, self.encoding, self.encoding_errors)
File "chemfp/_text_toolkit.py", line 1445, in get_sdf_record_tag
return value.decode(encoding, encoding_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte
Mixed encodings and raw bytes¶
In this section you’ll learn how to get access to the id and tag data as byte strings rather than Unicode strings. This might be used if you have a perverse file which uses multiple encodings. If you run into that case, let me know - I’ll give you a sympathy prize for having to deal with it.
In the previous section you learned a few ways to read an Latin-1 encoded SD file. What happens if the title line contains an id which is UTF-8 encoded while the tag data contains a Latin-1 encoded value? (Or if you have to deal with a ‘clever’ programmer who put in semi-binary data into a data field. Because that’s the sort of thing we clever programmers sometimes do.)
The techniques I mentioned in that previous section won’t work because they assume the entire file has the same encoding.
Instead, use the text_toolkit to read the file, but access it through the byte API rather than the string API.
I need an example file. I’ll start with the “latin1.sdf” file I created for the previous section, which uses a Latin-1 encoded degree symbol in the “MP” tag data. I’ll modify it so the “Beta” in the title line is replaced by the UTF-8 encoded “β” character.
>>> content = open("latin1.sdf", "rb").read()
>>> mixed_content = content.replace(b"Beta", u"\N{GREEK SMALL LETTER BETA}".encode("utf8"))
>>> open("mixed.sdf", "wb").write(mixed_content)
946
On a UTF-8 terminal the title line and the MP tag data line are respectively:
β-methylphenethylamine
140-142�C
On a Latin-1 terminal they are:
β-methylphenethylamine
140-142°C
How do I get their “real” values? I’ll use the text_toolkit to read the first record from the file:
>>> from chemfp import text_toolkit
>>> mol = next(text_toolkit.read_molecules("mixed.sdf"))
>>> mol
SDFRecord(id_bytes=b'\xce\xb2-methylphenethylamine'(id='β-methylphenethylamine'),
record=b'\xce\xb2-methylphenethylamine\n RDKit \n\n 10 10 0 ...',
encoding='utf8', encoding_errors='strict')
The title line is in utf8 so that’s not a problem
>>> print(mol.id)
β-methylphenethylamine
But I won’t be able to read the “MP” field because it’s not UTF-8 encoded:
>>> mol.get_tag("MP")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte
Instead, I’ll use SDFRecord.get_tag_as_bytes()
to get the
underlying bytes for the named tag, rather than as converted to a
Unicode string:
>>> mol.get_tag_as_bytes(b"MP")
'140-142\xb0C'
Once I have the bytes, I can decode them as Latin-1:
>>> print(mol.get_tag_as_bytes(b"MP").decode("latin1"))
140-142°C
Note that this function requires the tag name be the byte string which is found in the file. A Unicode name will raise an exception:
>>> mol.get_tag_as_bytes(u"MP")
Traceback (most recent call last):
...
ValueError: tag must be a byte string or None
Use method SDFRecord.get_tag_pairs_as_bytes()
to get the list
of all (tag, data) pairs, where both the tag and data are return as
byte strings.
>>> mol.get_tag_pairs_as_bytes()
[(b'MP', b'140-142\xb0C')]
Finally, use id_bytes
to get the raw bytes for the
identifier:
>>> mol.id_bytes
b'\xce\xb2-methylphenethylamine'
For example, if I read the file as Latin-1 then the Unicode value for
the “MP” tag will be what I expected, but the id won’t be
correct. Instead, I can get the SDFRecord.id_bytes
and decode
it manually as UTF-8:
>>> mol2 = next(text_toolkit.read_molecules("mixed.sdf", encoding="latin1"))
>>> print(mol2.get_tag("MP"))
140-142°C
>>> mol2.id
'β-methylphenethylamine'
>>>
>>> print(mol2.id_bytes.decode("utf8"))
β-methylphenethylamine