Experimental support for CDK

I've just released chemfp 3.5a1, which is an EXPERIMENTAL version with support for CDK. I'm looking for friendly users to try it out. It depends on JPype for the Python/Java interface. To install them, for Python 3.7 or later, on Linux-based OSes, do:

python -m pip install JPype1
python -m pip install --upgrade chemfp -i https://chemfp.com/experimental_cdk

Chemfp is a Python package for fingerprint generation and high-performance similarity search. It depends on a third-party package to handle chemistry, which traditionally were Open Babel, RDKit, and Open Eye's OEChem/OEGraphSim.

The new CDK support in principle includes:

  • fingerprint types: CDK-AtomPairs2D, CDK-Daylight, CDK-ECFP0, CDK-ECFP2, CDK-ECFP4, CDK-ECFP6, CDK-EState, CDK-Extended, CDK-FCFP0, CDK-FCFP2, CDK-FCFP4, CDK-FCFP6, CDK-GraphOnly, CDK-Hybridization, CDK-MACCS, CDK-Pubchem, CDK-ShortestPath, CDK-Substructure

  • input formats: smi, can, usm, sdf, smistring, canstring, usmstring, molfile, inchi, inchistring, pdb, cif, cml, ctx, crystclust, gamess, xyz

  • output formats: smi, can, usm, sdf, smistring, canstring, usmstring, molfile, inchi, inchikey, inchistring, inchikeystring, pdb, cdkjava, cml, crystclust, xyz

(In chemfp's format nomenclature, "smi", "can", and "usm" read a SMILES with optional id as input, but "smi" generates a canonical isomeric SMILES as output, "can" generates a canonical non-isomeric SMILES, and "usm" generate a non-canonical, isomeric SMILES. The "*string" variants ignore any possible id.)

NOTE: The CDK support is only lightly-tested. You'll need to double-check everything before you can trust the results, and there are likely many ways to make it crash. My goals with this experiment are to see:

1) if there's further interest in it (especially from those willing to pay for a license);

2) did I get the chemistry handling correct (ShortestPath fingerprint with input from the SMILES "c1ccccc1O" results in "java.lang.NullPointerException: atom has unset atom type");

3) which structure formats should I support / should I ignore? ;

4) how to handle Java packages from Python.

5) how CDK works, and more importantly, how it fails.

Chemfp is NOT a no-cost/open-source package. The base license agreement is https://chemfp.com/BaseLicense.txt . There is also a no-cost academic license, the option to get the source code with a proprietary license, or a more expensive option to get the source code under the MIT license. If that's okay with you then you can install it with:

python -m pip install JPype1
python -m pip install --upgrade chemfp -i https://chemfp.com/experimental_cdk

The JPype1 package installs the adapter that chemfp uses to work with the CDK jar file, which must be on your CLASSPATH.

The https://chemfp.com/experimental_cdk/ site has pre-built packages for Python 3.7, 3.8, and 3.9 on Linux-based OSes.

In case the license is an issue, or you want to jump right to the relevant source code, I placed a copy of those CDK-specific files at https://chemfp.com/experimental_cdk/chemfp/chemfp-cdk-toolkit.tar.gz .

These are distributed under the MIT license.

Acknowledgements

Many thanks to Noel O'Boyle for cinfony, which greatly helped me understand how to work with CDK from Python via Jpype, and thanks to John Mayfield for his pointers and feedback.

Examples

I figure people here might not have used chemfp before, so I'll give some examples of using it from the command-line and from Python.

The default reads a SMILES file from stdin and generates "CDK-Daylight" fingerprints in FPS format to stdout. The "Daylight" fingerprint is a a cdk.fingerprint.Fingerprinter() instance. I followed Noel's/cinfony's lead in calling it "Daylight" instead of "Fingerprint". Let me know if that's an issue.

% echo "c1ccccc1 phenol" | cdk2fps
#FPS1
#num_bits=1024
#type=CDK-Daylight/2.0 size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0
#software=CDK/2.3 chemfp/3.5a1
#date=2020-11-20T10:50:58
0000000000000000000000800000000000000000020000000000200000000020000000000000000000000000000000000000000000000000000000000000000000000000000000400000000000000000000000000000000000000000400000000000000000000000000000000000000000000000000000000000000000000000    phenol

I'll generate MACCS keys for ChEBI:

% cdk2fps --MACCS ChEBI_complete.sdf.gz
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
 ...

Oh, that's right - ChEBI doesn't use the title line of the SDF record, so I need to get it from the one of the tag fields:

% cdk2fps --MACCS ChEBI_complete.sdf.gz --id-tag "ChEBI ID"
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
#software=CDK/2.3 chemfp/3.5a1
#source=ChEBI_complete.sdf.gz
#date=2020-11-20T10:56:12
000000000000302180000003828101500ccda3501e  CHEBI:90
00000000000000000202008000890c400050b4821c  CHEBI:165
0000000000080000800002032012002c080e825708  CHEBI:598
00000000000000008200008480892dc00dc4a7d21e  CHEBI:776
000000000001000000004000400400002080200017  CHEBI:943
000000000000200080000002800002040c0482d608  CHEBI:1148
 ...

This fails after a while with:

000000000008000102000481028381100081a1501e  CHEBI:26377
Traceback (most recent call last):
 ...
 File "/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/types.py", line 845, in _compute_molecule_fingerprints
   yield id, fingerprinter(mol)
 File "/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/cdk_types.py", line 226, in cdk_maccs_fingerprinter
   fp = fingerprinter.getBitFingerprint(mol)
java.lang.NullPointerException: java.lang.NullPointerException: Aromaticity model requires implicit hydrogen count is set.

because the current chemfp design expects that a generating a fingerprint would never fail.

The "cdk2fps" program supports all of the CDK fingerprint types which generate binary strings, except that JPype can't find the KlekotaRoth fingerprinter, and ShortestPath gives me the exception "java.lang.NullPointerException: atom has unset atom type" even if I read "c1ccccc1O" as SMILES input.

Once you have FPS files, you can search them with "simsearch". To learn more about the chemfp command-line programs, see https://chemfp.readthedocs.io/en/latest/using-tools.html .

Chemfp also has a toolkit I/O API, which is a wrapper layer around the underlying toolkit that gives a consistent way to read and write structure files, and string containing structures. Here's an example of converting phenol into "sdf" format:

>>> from chemfp import cdk_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1O", "smi")
>>> print(T.create_string(mol, "sdf"))

 CDK     1120201035

 7  7  0  0  0  0  0  0  0  0999 V2000
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
 1  2  2  0  0  0  0
 2  3  1  0  0  0  0
 3  4  2  0  0  0  0
 4  5  1  0  0  0  0
 5  6  2  0  0  0  0
 1  6  1  0  0  0  0
 6  7  1  0  0  0  0
M  END
$$$$

The CDK SDF writer supports many options, which can be specified in an optional "writer_args" dictionary. Chemfp support introspection for the different formats, so here's how to get the default writer arguments:

>>> T.get_format("sdf").get_default_writer_args()
{'WriteAromaticBondTypes': False, 'WriteMajorIsotopes': True,
'writeProperties': True, 'WriteQueryFormatValencies': False,
'ProgramName': 'CDK', 'ForceWriteAs2DCoordinates': False,
'WriteDefaultProperties': True, 'writeV3000': False}

The "writeV3000" looks interesting, so I'll switch it to True:

>>> print(T.create_string(mol, "sdf", writer_args={"writeV3000": True}))

 CDK     1120201042

 0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 2 3 4
M  V30 4 1 4 5
M  V30 5 2 5 6
M  V30 6 1 1 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

Yup, that's a V3000 record.

Chemfp support introspection of the different fingerprinters. In chemfp, a "fingerprint family" is the class of possible fingerprint types, where a "fingerprint type" is a specific instantiation of that family. There's only one MACCS fingerprint type in the MACCS fingerprint family, while the "Daylight" family has a few possible parameters.

>>> import chemfp
>>> chemfp.get_fingerprint_families("cdk")
[FingerprintFamily(<CDK-AtomPairs2D/2.0>), FingerprintFamily(<CDK-Daylight/2.0>),
FingerprintFamily(<CDK-ECFP0/2.0>), FingerprintFamily(<CDK-ECFP2/2.0>),
FingerprintFamily(<CDK-ECFP4/2.0>), FingerprintFamily(<CDK-ECFP6/2.0>),
FingerprintFamily(<CDK-EState/2.0>), FingerprintFamily(<CDK-Extended/2.0>),
FingerprintFamily(<CDK-FCFP0/2.0>), FingerprintFamily(<CDK-FCFP2/2.0>),
FingerprintFamily(<CDK-FCFP4/2.0>), FingerprintFamily(<CDK-FCFP6/2.0>),
FingerprintFamily(<CDK-GraphOnly/2.0>), FingerprintFamily(<CDK-Hybridization/2.0>),
FingerprintFamily(<CDK-MACCS/2.0>), FingerprintFamily(<CDK-Pubchem/2.0>),
FingerprintFamily(<CDK-ShortestPath/2.0>), FingerprintFamily(<CDK-Substructure/2.0>)]

>>> family = chemfp.get_fingerprint_family("CDK-Daylight")
>>> family.get_defaults()
{'size': 1024, 'searchDepth': 7, 'pathLimit': 42000, 'hashPseudoAtoms': 0}

Fingerprint types can be created in one of several ways. Here's the most common:

>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("CDK-MACCS")

This has a number of methods, like:

>>> fptype.parse_id_and_molecule_fingerprint("c1ccccc1O phenol", "smi")
('phenol', b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e')

Or, if you have a molecule already, you can compute its fingerprint:

>>>  from chemfp import cdk_toolkit as T
>>>  mol = T.parse_molecule("c1ccccc1P", "smistring")
>>>  fptype.compute_fingerprint(mol)
b'\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'

The fingerprint type has a reference to the relevant toolkit:

>>> fptype.toolkit
<module 'chemfp.cdk_toolkit' from '/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/cdk_toolkit.py'>

Which means if you want to convert an input SMILES file into an SD file, and include a tag based on the toolkit's MACCS generation code, then you can do:

import chemfp
# fptype = chemfp.get_fingerprint_type("RDKit-MACCS166") # for RDKit
# fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS") # for Open Babel
# fptype = chemfp.get_fingerprint_type("CDK-MACCS") # for OEChem/OEGraphSim
fptype = chemfp.get_fingerprint_type("CDK-MACCS")
T = fptype.toolkit
with T.read_molecules("ZINC_20_structures.smi") as reader:
 with T.open_molecule_writer("ZINC_20_structures.sdf") as writer:
   for mol in reader:
     fp = fptype.compute_fingerprint(mol)
     T.add_tag(mol, T.name + "_MACCS", fp.hex())
     writer.write_molecule(mol)

If for some reason you wanted to use Open Babel's MACCS code then comment out the "CDK-MACCS" line and un-comment the "OpenBabel-MACCS" line, and it will work.

Here's the first output record:

% cat ZINC_20_structures.sdf
ZINC00000036
 CDK     1120201039

11 11  0  0  1  0  0  0  0  0999 V2000
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
 1  2  2  0  0  0  0
 2  3  1  0  0  0  0
 3  4  2  0  0  0  0
 4  5  1  0  0  0  0
 5  6  2  0  0  0  0
 1  6  1  0  0  0  0
 4  7  1  0  0  0  0
 7  8  1  0  0  0  0
 8  9  2  0  0  0  0
 8 10  1  0  0  0  0
 7 11  1  0  0  0  0
M  CHG  1  10  -1
M  END
<cdk_MACCS>
00000000000001008000000000000004000482521e

$$$$

Do try it out and give me friendly feedback!