Fingerprint family and type examples¶
This chapter describes how to use the fingerprint family and fingerprint type API added in chemfp 2.0.
Fingerprint families and types¶
In this section you’ll learn the difference between a fingerprint family and a fingerprint type. You will need Compound_099000001_099500000.sdf.gz from PubChem to work through all of the examples.
Chemfp distinguishes between a “fingerprint family” and a “fingerprint type.” A fingerprint family describes the general approach for doing a fingerprint, like “the OpenEye path-based fingerprint method”, while a fingerprint type describes the specific parameters used for a given approach, such as “the OpenEye path-based fingerprint method using path lengths between 0 and 5 bonds, where the atom types are based on the atomic number and aromaticity, and the bond type is based on the bond order, mapped to a 256 bit fingerprint.”
(In object-oriented terms, a fingerprint family is the class and a fingerprint type is an instance of the class.)
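The class/instance analogy can be sketched in plain Python. The ToyFamily and ToyType names below are hypothetical and are not part of chemfp; the sketch only mimics the pattern of calling a family with keyword arguments to get a type with specific parameters.

```python
# Hypothetical illustration only: ToyFamily and ToyType are not chemfp classes.

class ToyType:
    """One specific set of parameters for a fingerprint method."""
    def __init__(self, family_name, params):
        self.family_name = family_name
        self.params = dict(params)

    def get_type(self):
        # Render the family name followed by each "key=value" term.
        terms = [f"{key}={value}" for key, value in self.params.items()]
        return " ".join([self.family_name] + terms)


class ToyFamily:
    """The general method, together with its default parameters."""
    def __init__(self, name, defaults):
        self.name = name
        self.defaults = dict(defaults)

    def __call__(self, **kwargs):
        # Reject parameter names the family does not know about.
        unknown = set(kwargs) - set(self.defaults)
        if unknown:
            raise ValueError(f"Unsupported parameter(s): {sorted(unknown)}")
        # Start from the defaults and override with the caller's values.
        return ToyType(self.name, {**self.defaults, **kwargs})


family = ToyFamily("Toy-Path/1", {"numbits": 4096, "minbonds": 0, "maxbonds": 5})
default_type = family()            # all defaults
small_type = family(numbits=256)   # override one parameter
print(default_type.get_type())
print(small_type.get_type())
```

One family can produce many types, and each type remembers the full set of parameters it was made with.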
I’ll use chemfp.get_fingerprint_family() to get the FingerprintFamily for “OpenEye-Path”. On the laptop where I’m writing the documentation, this resolves to what chemfp calls version “2”:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("OpenEye-Path")
>>> family
FingerprintFamily(<OpenEye-Path/2>)
The fingerprint family can be called like a function to return a FingerprintType. If you call it with no arguments it will use the default parameters for that family. I’ll do that, then use get_type() to get the fingerprint type string, which is the canonical representation of the fingerprint family name, version, and parameters:
>>> fptype = family()
>>> fptype.get_type()
'OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral'
You do not need the fingerprint family to get to this point. If you know you want an OpenEye path fingerprint type, you can also get it from the openeye_toolkit module as the path object, or you can use the short-cut:
>>> fptype = chemfp.openeye.path
A 4096 bit fingerprint is rather large. I’ll make a new OpenEye-Path fingerprint type, but this time with only 256 bits. That’s small enough that the resulting fingerprint will fit on a line of documentation. All of the other parameters will be unchanged:
>>> fptype = family(numbits=256)
>>> fptype
OpenEyePathFingerprintType_v2(<OpenEye-Path/2 numbits=256 minbonds=0 maxbonds=5
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral>)
>>> print(fptype.get_metadata())
#num_bits=256
#type=OpenEye-Path/2 numbits=256 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#date=2023-05-16T12:15:56
This time I used FingerprintType.get_metadata() to give information about the fingerprint. This returns a new Metadata instance which describes the fingerprint type, and if you print a Metadata it displays the metadata information as an FPS header.
Another way to make this fingerprint type is via the path object in the openeye_toolkit module, such as the following:
>>> fptype = chemfp.openeye.path(numbits=256)
Once you have the fingerprint type you can create fingerprints, including directly from a SMILES string, as in the following:
>>> fp = fptype.from_smistring("c1ccccc1O")
>>> fp.hex()
'0012250160901000080c002810000400201000900054880442000e8040201000'
and from a structure file:
>>> for id, fp in fptype.read_molecule_fingerprints("Compound_099000001_099500000.sdf.gz"):
...     print(id, fp.hex())
...     if int(id) > 99003537: break
...
99000039 b7f1ff7cf3f377ebf37ff6ffefb5c9fffe69fffbfdfefedf77f5dffee0f7f907
99000230 ffd5f775cffbd790f97f5f797fbefdcd3fcf73efdf5fdfbf7fe6d9df60fd5303
99001517 4c66ed1831f29586d8465c387396d77ddf1743e452053ecd5baf54bd74b27900
99002251 ba5ff7e5fbfd3ce77decb9aef9a5b5eef7615cd3df5efc0e7f78effc7dfd9a07
99003537 defbbff7f4f57f6fbdfffab35ffddb77fef7dfddfafffffddff77fedeb97f107
99003538 defbbff7f4f57f6fbdfffab35ffddb77fef7dfddfafffffddff77fedeb97f107
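Each fingerprint is a Python byte string, so the standard bytes methods apply. The following sketch uses only the phenol hex string from the earlier example, without chemfp, to show how hex digits, bytes, and bits relate. The popcount() helper is my own illustration and is not a chemfp function.

```python
# The hex string is copied from the phenol example above.
hex_fp = "0012250160901000080c002810000400201000900054880442000e8040201000"

fp = bytes.fromhex(hex_fp)   # decode the hex text back into a byte string
num_bits = len(fp) * 8       # 8 bits per byte

def popcount(data):
    """Count the number of bits set to 1 in a byte string."""
    return sum(bin(byte).count("1") for byte in data)

print(len(fp), "bytes,", num_bits, "bits,", popcount(fp), "bits set")
assert fp.hex() == hex_fp    # hex() round-trips back to the original text
```

The 64 hex digits decode to 32 bytes, which matches the numbits=256 setting used to make the fingerprint type.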
Even though I used the fingerprint family to get the type, I did that more for pedagogical reasons. Most times you can get the fingerprint type directly using chemfp.get_fingerprint_type(). You can call it with a fingerprint type string, or pass the parameters as a dictionary in the optional second parameter:
>>> fptype = chemfp.get_fingerprint_type("OpenEye-Path numbits=256")
>>> fptype = chemfp.get_fingerprint_type("OpenEye-Path", {"numbits": 256})
See get_fingerprint_type() and get_type() for examples of how to use get_fingerprint_type().
You can also get ready-made fingerprint type objects as attributes of each of the cheminformatics toolkit wrapper modules, which is chemfp.openeye_toolkit for the OEChem/OEGraphSim toolkits:
>>> from chemfp import openeye_toolkit as T
>>> T.path
OpenEyePathFingerprintType_v2(<OpenEye-Path/2 numbits=4096 minbonds=0
maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral>)
Use help() or ? in the Jupyter notebook to learn more about the fingerprint type parameters. The fingerprint types are callable, which gives you a quick way to change the values you want:
>>> T.path(numbits=128, maxbonds=3)
OpenEyePathFingerprintType_v2(<OpenEye-Path/2 numbits=128 minbonds=0
maxbonds=3 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral>)
Fingerprint family¶
In this section you’ll learn about the attributes and methods of a fingerprint family. This API is rarely needed when using chemfp!
The get_fingerprint_family() function takes the fingerprint family name (with or without a version) and returns a FingerprintFamily instance:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
It will raise a ValueError if you ask for a fingerprint family or version which doesn’t exist:
>>> chemfp.get_fingerprint_family("whirl")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/__init__.py", line 2513, in get_fingerprint_family
    return _family_registry.get_family(family_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "chemfp/types.py", line 1872, in get_family
    return self._resolve_name(family_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "chemfp/types.py", line 1855, in _resolve_name
    raise FingerprintTypeUnknownError(family_name, self, toolkit_name)
chemfp.types.FingerprintTypeUnknownError: Unknown fingerprint type: 'whirl'
The fingerprint family has several attributes to ask for the name or parts of the name:
>>> family
FingerprintFamily(<RDKit-Fingerprint/3>)
>>> family.name
'RDKit-Fingerprint/3'
>>> (family.base_name, family.version)
('RDKit-Fingerprint', '3')
It also has a toolkit attribute, which is the underlying chemfp toolkit that can create molecules for this fingerprint:
>>> family.toolkit
<module 'chemfp.rdkit_toolkit' from 'chemfp/rdkit_toolkit.py'>
>>> family.toolkit.name
'rdkit'
See the chapter Toolkit API examples for many examples of how to use a toolkit.
The get_defaults() method returns the default arguments used to create a fingerprint type, which is handy when you’ve forgotten what all of the arguments are:
>>> family.get_defaults()
{'fpSize': 2048, 'minPath': 1, 'maxPath': 7, 'countSimulation': 0,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
If you call the family as a function, you’ll get a FingerprintType. You can check to see that the fingerprint type’s keyword arguments match the defaults:
>>> fptype = family()
>>> fptype.fingerprint_kwargs
{'fpSize': 2048, 'minPath': 1, 'maxPath': 7, 'countSimulation': 0,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
Call the fingerprint family with keyword arguments to use something other than the default parameters:
>>> fptype = family(fpSize=1024, maxPath=6)
>>> fptype.fingerprint_kwargs
{'fpSize': 1024, 'minPath': 1, 'maxPath': 6, 'countSimulation': 0,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
If you have the keyword arguments as a dictionary you can use the “**” syntax to apply the dictionary as keyword arguments, but I think it’s clearer to call the FingerprintFamily.from_kwargs() method to create the fingerprint type:
>>> kwargs = {"fpSize": 512, "maxPath": 5}
>>> fptype = family(**kwargs) # Acceptable
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=512 minPath=1 maxPath=5'
>>> fptype = family.from_kwargs(kwargs) # Better
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=512 minPath=1 maxPath=5'
(Currently family(**kwargs) forwards the call to family.from_kwargs(kwargs) so there is a slight performance advantage to using from_kwargs().)
Sometimes the fingerprint parameters come as strings, for example, from command-line arguments or a web form. In chemfp a dictionary of text keys and values is called “text settings”. The fingerprint family has a helper function to process them and create a kwargs dictionary with the correct data types as values:
>>> family.get_kwargs_from_text_settings({
... "fpSize": "128",
... "countBounds": "1,3,7,10",
... })
{'fpSize': 128, 'minPath': 1, 'maxPath': 7, 'countSimulation': 0,
'countBounds': [1, 3, 7, 10], 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
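To make the idea of decoding text settings concrete, here is a minimal sketch in plain Python. The CONVERTERS table, the reduced DEFAULTS dictionary, and the decode_text_settings() helper are all hypothetical; this is not how chemfp implements get_kwargs_from_text_settings(), only an illustration of mapping each parameter name to a decoder.

```python
# Hypothetical sketch of text-settings decoding; not chemfp's implementation.

def parse_int_list(text):
    """Convert a string like "1,3,7,10" into [1, 3, 7, 10]."""
    return [int(term) for term in text.split(",")]

# One decoder per supported parameter name (a reduced set for illustration).
CONVERTERS = {
    "fpSize": int,
    "maxPath": int,
    "countBounds": parse_int_list,
}
DEFAULTS = {"fpSize": 2048, "maxPath": 7, "countBounds": None}

def decode_text_settings(settings):
    """Return a kwargs dictionary with native Python values."""
    kwargs = dict(DEFAULTS)  # start from the defaults
    for name, text in settings.items():
        if name not in CONVERTERS:
            raise ValueError(f"Unsupported fingerprint parameter name {name!r}")
        kwargs[name] = CONVERTERS[name](text)
    return kwargs

kwargs = decode_text_settings({"fpSize": "128", "countBounds": "1,3,7,10"})
print(kwargs)
```

Parameters that are not mentioned in the text settings keep their default values, and an unknown name is reported rather than silently dropped.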
Note: This method is not as advanced as the corresponding code in the toolkit Format API. It does not understand namespaces. It will also raise an exception if called with an unsupported parameter:
>>> family.get_kwargs_from_text_settings({
... "unsupported parameter": "-12.34",
... })
Traceback (most recent call last):
...
ValueError: Unsupported fingerprint parameter name 'unsupported parameter'
If you have text settings then you probably want to call chemfp.get_fingerprint_type_from_text_settings() directly instead of going through the fingerprint family:
>>> fptype = chemfp.get_fingerprint_type_from_text_settings("RDKit-Fingerprint",
... {"fpSize": "512", "countBounds": "1,4,8", "maxPath": "6"})
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=512 minPath=1 maxPath=6'
See Create a fingerprint using text settings for more examples of how to use this function.
Fingerprint family discovery¶
In this section you’ll learn how to get the available fingerprint families, both as a set of name strings and a list of FingerprintFamily instances.
Even though chemfp knows about the OpenEye fingerprints, those fingerprints might not be available on your system if you don’t have OEChem and OEGraphSim installed and licensed. Chemfp has a discovery system which will probe to see which fingerprint types are available and determine their version numbers.
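As a rough sketch of what probing for toolkits can look like, the following checks whether each toolkit’s Python package can be found, and caches the answer. This is not chemfp’s discovery code. The probe_toolkits() function is hypothetical, and the module names in TOOLKIT_MODULES are assumptions about each toolkit’s top-level package name.

```python
# A rough sketch of toolkit probing; this is not chemfp's discovery code.
import functools
import importlib.util

# Assumed top-level Python package name for each toolkit.
TOOLKIT_MODULES = {
    "rdkit": "rdkit",
    "openbabel": "openbabel",
    "openeye": "openeye",
}

@functools.lru_cache(maxsize=None)
def probe_toolkits():
    """Return {toolkit name: True/False}; cached after the first call."""
    available = {}
    for toolkit_name, module_name in TOOLKIT_MODULES.items():
        # find_spec() returns None when a top-level module cannot be found.
        spec = importlib.util.find_spec(module_name)
        available[toolkit_name] = spec is not None
    return available

print(probe_toolkits())
```

A real probe has more to do than this, since a package can be installed without a valid license, and the fingerprint version numbers still need to be determined.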
If you just want the available family names, use chemfp.get_fingerprint_family_names():
>>> import chemfp
>>> chemfp.get_fingerprint_family_names()
{'RDKit-Fingerprint/1', 'RDKit-Pattern/1', 'OpenBabel-ECFP4',
'CDK-ECFP4', 'OpenBabel-ECFP0', 'RDKit-Torsion/2',
'OpenBabel-ECFP6', 'CDK-Substructure', 'RDKit-Fingerprint/3',
'RDKit-AtomPair/1', 'CDK-FCFP2', 'CDK-FCFP6',
'ChemFP-Substruct-OpenEye', 'CDK-ECFP0', 'OpenBabel-ECFP8',
'CDK-Extended', 'RDMACCS-OpenBabel', 'OpenBabel-ECFP10',
'RDMACCS-OpenEye', 'RDKit-Torsion/3', 'CDK-KlekotaRoth',
'RDKit-Morgan', 'OpenEye-Tree', 'CDK-FCFP0',
'ChemFP-Substruct-RDKit/1', 'RDMACCS-RDKit/2', 'CDK-Hybridization',
'RDKit-Torsion', 'RDKit-SECFP/1', 'CDK-MACCS', 'CDK-ShortestPath',
'RDKit-AtomPair', 'RDKit-MACCS166', 'RDKit-Morgan/1',
'RDKit-Pattern/2', 'RDKit-MACCS166/2', 'OpenBabel-FP2', 'CDK-FCFP4',
'CDK-ECFP6', 'RDKit-Torsion/4', 'RDMACCS-RDKit/1', 'CDK-ECFP2',
'CDK-AtomPairs2D', 'OpenBabel-FP4', 'RDKit-Avalon', 'CDK-Pubchem',
'ChemFP-Substruct-CDK', 'OpenBabel-ECFP2', 'RDKit-Fingerprint/2',
'RDKit-SECFP', 'OpenEye-MACCS166', 'RDKit-Torsion/1',
'OpenEye-MoleculeScreen', 'RDKit-AtomPair/3', 'OpenBabel-FP3',
'OpenBabel-MACCS', 'OpenEye-SMARTSScreen', 'RDKit-AtomPair/2',
'OpenEye-Circular', 'ChemFP-Substruct-OpenBabel', 'RDMACCS-RDKit',
'RDKit-Pattern', 'OpenEye-Path', 'CDK-GraphOnly',
'OpenEye-MDLScreen', 'RDKit-Fingerprint', 'RDKit-Pattern/3',
'ChemFP-Substruct-RDKit', 'CDK-Daylight', 'CDK-EState',
'RDKit-Pattern/4', 'RDMACCS-CDK', 'RDKit-Avalon/1',
'RDKit-Morgan/2', 'RDKit-MACCS166/1'}
To have the function return only the fingerprint families for a given toolkit, use the toolkit_name parameter. The following returns the Open Babel fingerprints:
>>> chemfp.get_fingerprint_family_names(toolkit_name="openbabel")
{'OpenBabel-ECFP0/1', 'OpenBabel-ECFP8', 'OpenBabel-ECFP4/1',
'OpenBabel-ECFP10/1', 'OpenBabel-ECFP4', 'OpenBabel-ECFP6/1',
'OpenBabel-ECFP8/1', 'OpenBabel-FP2', 'RDMACCS-OpenBabel',
'OpenBabel-FP4', 'RDMACCS-OpenBabel/1', 'OpenBabel-ECFP0',
'ChemFP-Substruct-OpenBabel/1', 'OpenBabel-ECFP2',
'ChemFP-Substruct-OpenBabel', 'OpenBabel-FP2/1',
'OpenBabel-ECFP2/1', 'OpenBabel-ECFP6', 'OpenBabel-FP4/1',
'OpenBabel-FP3/1', 'OpenBabel-MACCS/2', 'OpenBabel-MACCS',
'RDMACCS-OpenBabel/2', 'OpenBabel-ECFP10', 'OpenBabel-FP3'}
The function returns a set of names, both with and without any versions. Most likely you want to sort it before displaying it more nicely:
>>> for name in sorted(chemfp.get_fingerprint_family_names(
...                        include_unavailable=False)):
...     print(name)
...
ChemFP-Substruct-OpenBabel
ChemFP-Substruct-OpenBabel/1
ChemFP-Substruct-RDKit
ChemFP-Substruct-RDKit/1
OpenBabel-ECFP0
OpenBabel-ECFP0/1
OpenBabel-ECFP10
OpenBabel-ECFP10/1
OpenBabel-ECFP2
OpenBabel-ECFP2/1
OpenBabel-ECFP4
OpenBabel-ECFP4/1
OpenBabel-ECFP6
OpenBabel-ECFP6/1
OpenBabel-ECFP8
OpenBabel-ECFP8/1
OpenBabel-FP2
OpenBabel-FP2/1
OpenBabel-FP3
OpenBabel-FP3/1
OpenBabel-FP4
OpenBabel-FP4/1
OpenBabel-MACCS
OpenBabel-MACCS/2
RDKit-AtomPair
RDKit-AtomPair/2
RDKit-AtomPair/3
RDKit-Avalon
RDKit-Avalon/1
RDKit-Fingerprint
RDKit-Fingerprint/2
RDKit-Fingerprint/3
RDKit-MACCS166
RDKit-MACCS166/2
RDKit-Morgan
RDKit-Morgan/1
RDKit-Morgan/2
RDKit-Pattern
RDKit-Pattern/4
RDKit-SECFP
RDKit-SECFP/1
RDKit-Torsion
RDKit-Torsion/2
RDMACCS-OpenBabel
RDMACCS-OpenBabel/1
RDMACCS-OpenBabel/2
RDMACCS-RDKit
RDMACCS-RDKit/1
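RDMACCS-RDKit/2
The above lists only the fingerprint family names which are available for this Python installation, because the include_unavailable parameter was set to False. Set include_unavailable to True to also get the names of the families which chemfp knows about but which can’t be used in the given Python installation.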
The FingerprintFamily includes the attributes to get the name and version but it doesn’t have a way to get the default number of bits. Instead, I’ll use the FingerprintFamily to make a FingerprintType with the default parameters, then ask the new fingerprint type for its number of bits.
This means I need a list of FingerprintFamily instances, which is conveniently available from chemfp.get_fingerprint_families(). (Remember, this may take a few seconds the first time it’s called, because it tries to load all of the available fingerprints. Once determined, this information is cached.)
As a result, you can make a list of all available fingerprint methods (in this setup, only Open Babel and RDKit are installed), and their default number of bits with the following:
>>> for family in sorted(chemfp.get_fingerprint_families(),
...                      key=lambda family: family.name):
...     print(family.name, family().num_bits)
...
ChemFP-Substruct-OpenBabel/1 881
ChemFP-Substruct-RDKit/1 881
OpenBabel-ECFP0/1 4096
OpenBabel-ECFP10/1 4096
OpenBabel-ECFP2/1 4096
OpenBabel-ECFP4/1 4096
OpenBabel-ECFP6/1 4096
OpenBabel-ECFP8/1 4096
OpenBabel-FP2/1 1021
OpenBabel-FP3/1 55
OpenBabel-FP4/1 307
OpenBabel-MACCS/2 166
RDKit-AtomPair/2 2048
RDKit-AtomPair/3 2048
RDKit-Avalon/1 512
RDKit-Fingerprint/2 2048
RDKit-Fingerprint/3 2048
RDKit-MACCS166/2 166
RDKit-Morgan/1 2048
RDKit-Morgan/2 2048
RDKit-Pattern/4 2048
RDKit-SECFP/1 2048
RDKit-Torsion/2 2048
RDMACCS-OpenBabel/1 166
RDMACCS-OpenBabel/2 166
RDMACCS-RDKit/1 166
RDMACCS-RDKit/2 166
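Several of these sizes are not a multiple of 8. Chemfp represents a fingerprint as a byte string, so the storage size rounds up to a whole number of bytes. Here is the arithmetic for a few of the sizes from the listing above; the num_bytes_for() helper is my own name for the calculation, not a chemfp function.

```python
# Default sizes copied from the listing above.
default_num_bits = {
    "ChemFP-Substruct-RDKit/1": 881,
    "OpenBabel-FP2/1": 1021,
    "OpenBabel-FP3/1": 55,
    "OpenBabel-MACCS/2": 166,
    "RDKit-Fingerprint/3": 2048,
}

def num_bytes_for(num_bits):
    """Smallest number of whole bytes that can hold num_bits bits."""
    return (num_bits + 7) // 8

for name, num_bits in default_num_bits.items():
    print(f"{name}: {num_bits} bits -> {num_bytes_for(num_bits)} bytes")
```

For example, a 166-bit MACCS fingerprint takes 21 bytes, with the last two bits of the final byte unused.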
Use the toolkit_name parameter to get only those fingerprint families for a given toolkit:
>>> print(*chemfp.get_fingerprint_families(toolkit_name="rdkit"), sep="\n")
FingerprintFamily(<ChemFP-Substruct-RDKit/1>)
FingerprintFamily(<RDKit-AtomPair/2>)
FingerprintFamily(<RDKit-AtomPair/3>)
FingerprintFamily(<RDKit-Avalon/1>)
FingerprintFamily(<RDKit-Fingerprint/2>)
FingerprintFamily(<RDKit-Fingerprint/3>)
FingerprintFamily(<RDKit-MACCS166/2>)
FingerprintFamily(<RDKit-Morgan/1>)
FingerprintFamily(<RDKit-Morgan/2>)
FingerprintFamily(<RDKit-Pattern/4>)
FingerprintFamily(<RDKit-SECFP/1>)
FingerprintFamily(<RDKit-Torsion/2>)
FingerprintFamily(<RDMACCS-RDKit/1>)
FingerprintFamily(<RDMACCS-RDKit/2>)
Finally, use chemfp.has_fingerprint_family() to test if a fingerprint family is available, in this case where CDK is available:
>>> chemfp.has_fingerprint_family("CDK-Pubchem")
True
>>> chemfp.has_fingerprint_family("CDK-Pubchem/2.9")
True
>>> chemfp.has_fingerprint_family("CDK-Pubchem/2.0")
False
It understands both versioned and unversioned names.
get_fingerprint_type() and get_type()¶
In this section you’ll learn how to get a fingerprint type given its type string, and how to specify fingerprint parameters as a dictionary.
The easiest way to get a specific FingerprintType from a chemfp fingerprint type string is with chemfp.get_fingerprint_type():
>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint")
>>> fptype
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7>)
(This is also available as chemfp.rdkit_toolkit.rdk.)
The fingerprint type has a FingerprintType.get_type() method, which returns the canonical fingerprint type string:
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7'
This is canonical because chemfp ensures that all fingerprint type strings with the same parameter values have the same type string.
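One ingredient of a canonical string is a fixed parameter order. The following simplified sketch shows that idea in plain Python. The split_type_string() and canonical_type_string() helpers are hypothetical, not chemfp functions, and they skip the other things chemfp does, such as resolving the version and handling default values.

```python
# Simplified sketch of type-string canonicalization; hypothetical helpers.

def split_type_string(type_string):
    """Split "Name/version key=value ..." into (name, {key: value})."""
    name, *terms = type_string.split()
    params = {}
    for term in terms:
        key, _, value = term.partition("=")
        params[key] = value
    return name, params

def canonical_type_string(type_string, parameter_order):
    """Rewrite the parameters in one fixed order so equal settings compare equal."""
    name, params = split_type_string(type_string)
    ordered = [f"{key}={params[key]}" for key in parameter_order if key in params]
    return " ".join([name] + ordered)

order = ["fpSize", "minPath", "maxPath"]
a = canonical_type_string("RDKit-Fingerprint/3 maxPath=7 fpSize=2048 minPath=1", order)
b = canonical_type_string("RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7", order)
print(a)
print(a == b)
```

With a canonical form, two type strings can be compared with a simple string equality test.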
I left out the version number in the fingerprint name when I asked for the fingerprint, so chemfp gives me the most recent supported version. I could have included the version in the name, which is useful if you want to prevent a version mismatch between your data sets. If the version doesn’t exist, the function will raise a ValueError:
>>> fptype3 = chemfp.get_fingerprint_type("RDKit-Fingerprint/3")
>>> fptype3
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7>)
>>> fptype2 = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
>>> fptype2
RDKitFingerprintType_v2(<RDKit-Fingerprint/2 minPath=1 maxPath=7
fpSize=2048 nBitsPerHash=2 useHs=1>)
>>> fptype1 = chemfp.get_fingerprint_type("RDKit-Fingerprint/1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/__init__.py", line 2596, in get_fingerprint_type
    return types.registry.get_fingerprint_type(type, fingerprint_kwargs)
  File "chemfp/types.py", line 1909, in get_fingerprint_type
    family = self._resolve_name(name)
  File "chemfp/types.py", line 1853, in _resolve_name
    return self._family_resolvers[base_name].resolve(version)
  File "chemfp/types.py", line 2090, in resolve
    raise obj.copy() # known fingerprint type/version but not available
chemfp.types.FingerprintUnavailableError: Unable to use RDKit-Fingerprint/1:
This version of RDKit does not support the RDKit-Fingerprint/1 fingerprint
Why does chemfp support two different RDKit-Fingerprint versions? Because the toolkit has two different APIs to generate the given fingerprint type. Version 3 maps to the newer “generator” API, while version 2 maps to the older function API. The two APIs take different parameters, and new features are only added to the new API.
Chemfp supports both of them so that old fingerprint datasets are still usable so long as RDKit supports the older API.
The version 1 fingerprint type was used for the RDKit fingerprint generation algorithm used before 2014 or so. Neither chemfp nor RDKit supports it any more.
The pre-generated chemfp.rdkit_toolkit.rdk fingerprint type is always the current version for the toolkit. It is also available without directly importing the rdkit_toolkit module by using chemfp.rdkit.rdk:
>>> chemfp.rdkit.rdk
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7>)
or the version-specific attributes chemfp.rdkit.rdk_v3 and chemfp.rdkit.rdk_v2:
>>> chemfp.rdkit.rdk_v3
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=2048 minPath=1 maxPath=7>)
>>> chemfp.rdkit.rdk_v2
RDKitFingerprintType_v2(<RDKit-Fingerprint/2 minPath=1 maxPath=7
fpSize=2048 nBitsPerHash=2 useHs=1>)
I can also specify some or all of the parameters myself in the type string, instead of accepting the default values:
>>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024 maxPath=6")
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=1024 minPath=1 maxPath=6'
Or by passing the parameters in as a Python dictionary, though you still need at least the base name of the fingerprint family:
>>> fp_kwargs = dict(maxPath=4, fpSize=512)
>>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", fp_kwargs)
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=512 minPath=1 maxPath=4'
If a parameter is specified in both the type string and the dictionary then the dictionary value will be used:
>>> fptype = chemfp.get_fingerprint_type(
... "RDKit-Fingerprint fpSize=1024 minPath=2", {"fpSize": 128})
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=128 minPath=2 maxPath=7'
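The precedence rule can be pictured as three layers merged in order, with the defaults first, then the type string, then the dictionary. Here is a small sketch of that rule. The resolve_parameters() helper and the reduced DEFAULTS dictionary are hypothetical and are not chemfp’s implementation.

```python
# Hypothetical sketch of parameter precedence; not chemfp's implementation.
DEFAULTS = {"fpSize": 2048, "minPath": 1, "maxPath": 7}

def resolve_parameters(type_string, fingerprint_kwargs=None):
    """Merge defaults, then type-string terms, then the dictionary."""
    name, *terms = type_string.split()
    from_string = {}
    for term in terms:
        key, _, value = term.partition("=")
        from_string[key] = int(value)  # every parameter in this toy is an integer
    # Later dictionaries win, so the explicit kwargs have the highest priority.
    merged = {**DEFAULTS, **from_string, **(fingerprint_kwargs or {})}
    return name, merged

name, params = resolve_parameters(
    "RDKit-Fingerprint fpSize=1024 minPath=2", {"fpSize": 128})
print(name, params)
```

The fpSize from the dictionary replaces the one from the type string, while minPath from the type string and maxPath from the defaults are kept.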
Finally, I can use Python’s call syntax on the default fingerprint type object, which accepts keyword arguments and returns a new fingerprint type with updated values:
>>> chemfp.rdkit.rdk(fpSize=1024, minPath=3, maxPath=4)
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=1024 minPath=3 maxPath=4>)
Create a fingerprint using text settings¶
In this section you’ll learn how to get a fingerprint type using text settings, which is a dictionary of parameters where the values are all encoded as strings.
The fingerprint keyword arguments (“kwargs”) are a dictionary whose keys are fingerprint parameter names and whose values are native Python objects for those parameters. Here is a fingerprint kwargs dictionary for the RDKit-Fingerprint:
{'maxPath': 7, 'fpSize': 1024, 'countSimulation': True}
Text settings are a dictionary where the dictionary keys are still parameter names but where the dictionary values are string-encoded parameter values. Here are the equivalent text settings for the above kwargs dictionary:
{'maxPath': '7', 'fpSize': '1024', 'countSimulation': "true"}
A text settings dictionary typically comes from command-line parameters or a configuration file, where everything is a string. The fingerprint family has a method to convert text settings to kwargs:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> kwargs = family.get_kwargs_from_text_settings(
... {'maxPath': '7', 'fpSize': '1024', 'countSimulation': "1"})
>>> kwargs
{'fpSize': 1024, 'minPath': 1, 'maxPath': 7, 'countSimulation': 1,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
The kwargs can then be used to get the specified fingerprint type from the family:
>>> fptype = family.from_kwargs(kwargs)
>>> fptype
RDKitFingerprintType_v3(<RDKit-Fingerprint/3 fpSize=1024 minPath=1
maxPath=7 countSimulation=1>)
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=1024 minPath=1 maxPath=7 countSimulation=1'
It’s a bit tedious to go through all those steps to process some text settings. Instead, call chemfp.get_fingerprint_type_from_text_settings():
>>> fptype = chemfp.get_fingerprint_type_from_text_settings(
... "RDKit-Fingerprint", {'maxPath': '7', 'fpSize': '1024', 'countSimulation': "1"})
>>> fptype.get_type()
'RDKit-Fingerprint/3 fpSize=1024 minPath=1 maxPath=7 countSimulation=1'
The parameters in the text settings have priority should the fingerprint type string and the text settings both specify the same parameter name, as in this example where the fingerprint type string specifies a 1024 bit fingerprint while the text settings specify a 4096 bit fingerprint:
>>> fptype = chemfp.get_fingerprint_type_from_text_settings(
... "RDKit-Fingerprint fpSize=1024", {"fpSize": "4096"})
>>> fptype.num_bits
4096
At present there is no support for parameter namespaces, and unknown parameter names will raise an exception:
>>> fptype = chemfp.get_fingerprint_type_from_text_settings(
... "RDKit-Fingerprint", {"fpSize": "4096", "spam": "eggs"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/__init__.py", line 2632, in get_fingerprint_type_from_text_settings
    return types.registry.get_fingerprint_type_from_text_settings(type, settings)
  File "chemfp/types.py", line 1941, in get_fingerprint_type_from_text_settings
    raise FingerprintTypeParameterError(type, parsed, str(err)) from None
chemfp.types.FingerprintTypeParameterError: Unsupported fingerprint parameter name 'spam': 'RDKit-Fingerprint'
This may change in the future; let me know what’s best for you.
For now, if you want to remove unexpected names from a dictionary then use the fingerprint family’s get_defaults() to get the default kwargs as a dictionary, and use the keys to filter out the unknown parameters:
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> defaults = family.get_defaults()
>>> defaults
{'fpSize': 2048, 'minPath': 1, 'maxPath': 7, 'countSimulation': 0,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
>>> settings = {"maxPath": "8", "unknown": "mystery"}
>>> new_settings = dict((k, v) for (k,v) in settings.items() if k in defaults)
>>> new_settings
{'maxPath': '8'}
If you have a fingerprint type object you can alternatively get its current settings:
>>> chemfp.rdkit.rdk.fingerprint_kwargs
{'fpSize': 2048, 'minPath': 1, 'maxPath': 7, 'countSimulation': 0,
'countBounds': None, 'numBitsPerFeature': 2, 'useHs': 1,
'branchedPaths': 1, 'useBondOrder': 1, 'fromAtoms': None}
FingerprintType properties and methods¶
In this section you’ll learn about the FingerprintType properties and methods.
I’ll start by getting CDK’s Daylight-like fingerprint using the default parameters and the special chemfp.cdk_toolkit.daylight attribute:
>>> import chemfp
>>> fptype = chemfp.cdk.daylight
>>> # fptype = chemfp.get_fingerprint_type("CDK-Daylight") # alternative
>>> fptype
CDKDaylightFingerprintType_v20(<CDK-Daylight/2.0 size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0>)
>>> fptype.get_type()
'CDK-Daylight/2.0 size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0'
The “CDK-Daylight/2.0” is the fingerprint name, which is decomposed into the base_name “CDK-Daylight” and the version “2.0”:
>>> fptype.name
'CDK-Daylight/2.0'
>>> fptype.base_name, fptype.version
('CDK-Daylight', '2.0')
The number of bits for the fingerprint is num_bits, and fingerprint_kwargs is the fingerprint parameters as a dictionary of Python values:
>>> fptype.num_bits
1024
>>> fptype.fingerprint_kwargs
{'size': 1024, 'searchDepth': 7, 'pathLimit': 42000, 'hashPseudoAtoms': 0}
Each fingerprint type has a toolkit, which is the chemfp toolkit that can make molecules used as input to the fingerprint type. (This would be None if there were no toolkit.) Given a fingerprint type it’s easy to figure out the toolkit.name of the toolkit it’s associated with:
>>> fptype.toolkit.name
'cdk'
The software attribute gives information about the software used to generate the fingerprint. For RDKit, Open Babel, and CDK this is the same as the toolkit.software string. On the other hand, OpenEye distributes OEChem and OEGraphSim as two different libraries. These map quite naturally to chemfp’s concepts of toolkit and fingerprint type, so the “software” field for its fingerprint type and toolkit differ:
>>> oefptype = chemfp.openeye.tree
>>> # oefptype= chemfp.get_fingerprint_type("OpenEye-Tree") # alternative
>>> oefptype.software
'OEGraphSim/2.5.4.1 (20220607) chemfp/4.1'
>>> oefptype.toolkit.software
'OEChem/20220607'
Finally, FingerprintType.get_fingerprint_family() returns the fingerprint family for a given fingerprint type:
>>> oefptype.get_fingerprint_family()
FingerprintFamily(<OpenEye-Tree/2>)
Convert a structure record to a fingerprint¶
In this section you’ll learn how to use a fingerprint type to convert a structure record into a fingerprint.
The FingerprintType method parse_molecule_fingerprint() parses a structure record and returns the fingerprint as a byte string. The following uses Open Babel to get the MACCS fingerprint for phenol:
>>> import chemfp
>>> fptype = chemfp.openbabel.maccs
>>> # fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS") # alternative
>>> fptype
OpenBabelMACCSFingerprintType_v2(<OpenBabel-MACCS/2>)
>>> fp = fptype.from_smistring("c1ccccc1O")
>>> fp
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e'
>>> fp.hex()
'00000000000000000000000000000140004480101e'
The from_smistring() method is one of several short-cut methods for generating a fingerprint from a structure record. Only the most common formats are supported this way:
fptype.from_inchi(): from an InChI file record, with optional identifier
fptype.from_inchistring(): from an InChI string, with no identifier
fptype.from_sdf(): from an SDF record
fptype.from_smi(): from a SMILES file record, with optional identifier
fptype.from_smistring(): from a SMILES string, with no identifier
fptype.from_smiles(): alias for from_smistring()
These methods also accept reader arguments as keyword parameters (rather than passing in a generic reader_args), with a useful help().
As with chemfp’s other structure readers, they accept an errors parameter, to describe what to do in case of problems. The default is “strict”, which raises an exception:
>>> fptype.from_smiles("Q")
==============================
*** Open Babel Error in ParseSimple
SMILES string contains a character 'Q' which is invalid
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/openbabel_types.py", line 209, in from_smistring
    mol = self.toolkit.parse_smistring(
  File "chemfp/openbabel_toolkit.py", line 930, in parse_smistring
    id, mol = _smistring_fmt._format_config.parse_id_and_molecule(
  File "chemfp/_openbabel_toolkit.py", line 1022, in parse_id_and_molecule
    error_handler.error(f"Open Babel cannot parse the SMILES {myrepr(smi)}")
  File "chemfp/io.py", line 146, in error
    raise ParseError(msg, location)
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q'
The “ignore” value has the function return None, though Open Babel still prints its own warnings to stderr at the C++ level:
>>> fptype.from_smiles("Q", errors="ignore")
==============================
*** Open Babel Error in ParseSimple
SMILES string contains a character 'Q' which is invalid
Use the “suppress_log_output” context manager in the appropriate toolkit wrapper so these messages are not displayed:
>>> with fptype.toolkit.suppress_log_output():
...     fp = fptype.from_smiles("Q", errors="ignore")
...
>>> fp is None
True
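The “strict” and “ignore” conventions are easy to mimic in your own code. The following toy parser is a hypothetical stand-in for a real toolkit parser. Its character check is nothing like real SMILES validation and is only there to give the function something to reject.

```python
# A toy parser showing the "strict" and "ignore" conventions. The function and
# its validity check are hypothetical stand-ins for a real toolkit parser.

ALLOWED = set("CNOPSFIBrlcnops()[]=#@+-1234567890Hh/\\%.")

def parse_smiles_toy(smiles, errors="strict"):
    """Return the input if every character is allowed; otherwise follow `errors`."""
    bad = [ch for ch in smiles if ch not in ALLOWED]
    if not bad:
        return smiles
    if errors == "strict":
        raise ValueError(f"cannot parse the SMILES {smiles!r}")
    if errors == "ignore":
        return None
    raise ValueError(f"unsupported errors value {errors!r}")

# With "ignore", failures come back as None and must be filtered by the caller.
results = [parse_smiles_toy(s, errors="ignore") for s in ["c1ccccc1O", "Q", "C#N"]]
parsed = [r for r in results if r is not None]
print(parsed)
```

The “ignore” mode is convenient for bulk processing, but the caller has to remember to test for None before using the result.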
If the format is configurable, and its name is available as a string, then use parse_molecule_fingerprint():
>>> fptype.parse_molecule_fingerprint("c1ccccc1O", "smistring")
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e'
Its parameters are identical to the toolkit’s parse_molecule() function, including passing in reader_args and errors.
See Parse and create SMILES for information about using parse_molecule() and the distinction between “smistring”, “smi” and other SMILES formats. See Specify alternate error behavior for more about the errors parameter.
Convert a structure record to an id and fingerprint¶
In this section you’ll learn how to use a fingerprint type to extract the id from a structure record, convert the structure record into a fingerprint, and return the (id, fingerprint) pair.
The previous section showed how to convert a structure record into a fingerprint. Sometimes you’ll also want the identifier. The FingerprintType method parse_id_and_molecule_fingerprint() does both in the same call.
>>> fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
>>> fptype.parse_id_and_molecule_fingerprint("c1ccccc1O phenol", "smi")
('phenol', b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00\x04\x00\x10\x1a')
(There is no shortcut available for this case.)
If the identifier is not present then the function may return None or the empty string, depending on the format and underlying implementation.
The parameters to parse_id_and_molecule_fingerprint
are identical
to the toolkit.parse_id_and_molecule()
function. For example,
the following shows the difference in using two different delimiter
types in the reader_args:
>>> record = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
>>> fptype.parse_id_and_molecule_fingerprint(record, "smi", reader_args={"delimiter": "to-eol"})
('vitamin a', b'\x00\x00\x00\x08\x00\x00\x02\x00\x02\n\x02\x80\x04\x98\x0c\x00\x00\x140\x14\x18')
>>> fptype.parse_id_and_molecule_fingerprint(record, "smi", reader_args={"delimiter": "space"})
('vitamin', b'\x00\x00\x00\x08\x00\x00\x02\x00\x02\n\x02\x80\x04\x98\x0c\x00\x00\x140\x14\x18')
The id_tag and errors parameters are also supported, though I won’t give examples. See Read ids and molecules using an SD tag for the id to learn how to use the id_tag and Specify a SMILES delimiter through reader_args and Multi-toolkit reader_args and writer_args for examples of using reader_args.
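To see why the two delimiter settings return different ids for the same record, here is a pure-Python sketch of the splitting rule only. It is not chemfp's implementation. The helper name split_smiles_record is hypothetical, and it assumes that "to-eol" treats everything after the first run of whitespace as the id while "space" takes only the second space-separated field:

```python
def split_smiles_record(record, delimiter="to-eol"):
    """Split one SMILES record into (smiles, id).

    Hypothetical helper for illustration; not part of chemfp.
    """
    record = record.rstrip("\r\n")
    if delimiter == "to-eol":
        # The SMILES is the first whitespace-delimited term;
        # the id is the rest of the line, spaces included.
        parts = record.split(None, 1)
    elif delimiter == "space":
        # Fields are separated by a space; the id is the second field only.
        parts = record.split(" ")[:2]
    else:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    if not parts or not parts[0]:
        raise ValueError("missing SMILES")
    smiles = parts[0]
    id = parts[1] if len(parts) > 1 else None
    return smiles, id


record = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
print(split_smiles_record(record, "to-eol")[1])  # vitamin a
print(split_smiles_record(record, "space")[1])   # vitamin
```

The fingerprint is the same in both cases because only the id field changes; the SMILES term is identical.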
Make a specialized id and molecule fingerprint parser¶
In this section you’ll learn how to make a specialized function for computing the fingerprints given many individual structure records.
Sometimes the structure input comes as a set of individual strings, with one record per string. For example, the input might come from a database query, where the cursor returns each field of each row as its own term, and you want to convert each of them into a fingerprint.
One way to do this is through successive calls to FingerprintType.parse_molecule_fingerprint():
>>> import chemfp
>>>
>>> smiles_list = ["C", "O=O", "C#N"]
>>>
>>> fptype = chemfp.rdkit.maccs166
>>> #fptype = chemfp.get_fingerprint_type("RDKit-MACCS166") # alternative
>>>
>>> for smiles in smiles_list:
...     fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
...     print(fp.hex(), smiles)
...
000000000000000000000000000000000000008000 C
000000000000000000000000200000080000004008 O=O
000000000001000000000000000000000000000001 C#N
There is some overhead in this because the parameters, like format (“smistring” in this case) are (re)validated for each call, and sometimes extra work is done to ensure that the call is thread-safe. (The overhead is higher if there are complex reader args, and if the underlying fingerprinter is very fast.)
Another solution is to use make_id_and_molecule_fingerprint_parser() to create a specialized parser function for a given set of parameters. The parameters are only validated once, and the returned parser function takes only the record as input and returns the (id, fingerprint) pair:
>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser("smi")
>>> id_and_fp_parser("c1ccccc1O phenol")
('phenol', b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e')
The parameters to make_id_and_molecule_fingerprint_parser
are
identical to toolkit.make_id_and_molecule_parser()
.
I’ll use the new function to parse the smiles_list
from earlier:
>>> import chemfp
>>>
>>> smiles_list = ["C", "O=O", "C#N"]
>>>
>>> fptype = chemfp.rdkit.maccs166
>>> id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser("smistring")
>>>
>>> for smiles in smiles_list:
...     id, fp = id_and_fp_parser(smiles)
...     print(fp.hex(), smiles)
...
000000000000000000000000000000000000008000 C
000000000000000000000000200000080000004008 O=O
000000000001000000000000000000000000000001 C#N
For OpenEye-MACCS166, creating and using a specialized parser is about 10% faster than using parse_molecule_fingerprint() when the query is icosane (C20H42). For OpenBabel-MACCS it’s about 5%, for CDK-MACCS it’s slightly less than 5%, and for RDKit-MACCS166 it’s around 1%.
The performance differences are in part due to the performance differences of the SMILES parsers in the underlying toolkits and in part because of differences in how the toolkits handle parsing. Chemfp does not guarantee that the function returned by make_id_and_molecule_parser() may be called by different threads at the same time. (Instead, make a function for each thread.) This means the OEChem version can reuse a single molecule object, which reduces some memory allocation overhead, while the RDKit and Open Babel implementations always create a new molecule each time, adding some overhead.
In addition, RDKit’s native MACCS implementation maps key 1 to bit 1, while the other toolkits and chemfp map key 1 to bit 0. Chemfp normalizes RDKit-MACCS by shifting all of the bits left, and this translation code hasn’t yet been optimized (though it appears to take only about 2% of the overall time).
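The advice above to make one parser function per thread can be sketched with Python's threading.local. This is only one possible arrangement, not something chemfp provides: get_parser_for_this_thread is a hypothetical helper, and make_stub_parser is a stand-in for a call like fptype.make_id_and_molecule_fingerprint_parser("smi") so the example runs without a chemistry toolkit:

```python
import threading

_thread_state = threading.local()


def get_parser_for_this_thread(make_parser):
    """Return the calling thread's parser, creating it on first use.

    make_parser is any zero-argument callable that builds a new parser.
    """
    parser = getattr(_thread_state, "parser", None)
    if parser is None:
        parser = make_parser()
        _thread_state.parser = parser
    return parser


created = []  # one entry for each parser that gets built


def make_stub_parser():
    # Stand-in for fptype.make_id_and_molecule_fingerprint_parser("smi")
    created.append(1)
    return lambda record: (record.split()[-1], b"")


def worker():
    for _ in range(3):
        parser = get_parser_for_this_thread(make_stub_parser)
        parser("C methane")


threads = [threading.Thread(target=worker) for _ in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(created))  # 2, one parser per worker thread
```

Each worker calls the helper three times but only builds a parser on its first call, so no parser function is ever shared between threads.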
You may have noticed that there’s a parse_molecule_fingerprint(), a parse_id_and_molecule_fingerprint(), and a make_id_and_molecule_fingerprint_parser(), but there isn’t a make_molecule_fingerprint_parser(). This is simply a matter of time. I haven’t needed that function, it is quite easy to emulate given what’s available, and I was getting bored of writing test cases. Let me know if it would be useful for your code.
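As an illustration of how small the emulation is, a make_molecule_fingerprint_parser() could be sketched as a thin wrapper which discards the id. This helper is hypothetical and not part of chemfp. It assumes the function returned by make_id_and_molecule_fingerprint_parser() always returns an (id, fingerprint) pair, and it uses a stub fingerprint type so the example runs without chemfp installed:

```python
def make_molecule_fingerprint_parser(fptype, format, **kwargs):
    """Hypothetical helper: build a parser which returns only the fingerprint."""
    # The parameters are validated once, when the specialized parser is made.
    id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser(format, **kwargs)

    def parse_fingerprint(record):
        id, fp = id_and_fp_parser(record)  # assumes a 2-element return value
        return fp

    return parse_fingerprint


class StubFingerprintType:
    """Stand-in for a chemfp FingerprintType, for demonstration only."""

    def make_id_and_molecule_fingerprint_parser(self, format, **kwargs):
        def parser(record):
            smiles, _, id = record.partition(" ")
            # The "fingerprint" here is just the SMILES bytes, not a real one.
            return (id or None), smiles.encode("ascii")
        return parser


parse_fp = make_molecule_fingerprint_parser(StubFingerprintType(), "smi")
print(parse_fp("c1ccccc1O phenol"))  # b'c1ccccc1O'
```

With a real fingerprint type you would pass something like chemfp.get_fingerprint_type("RDKit-MACCS166") in place of the stub.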
Read a structure file and compute fingerprints¶
In this section you’ll learn how to use a fingerprint type to read a structure file, compute fingerprints for each one, and iterate over the resulting (id, fingerprint) pairs. You will need Compound_099000001_099500000.sdf.gz from PubChem.
The read_molecule_fingerprints()
method of a
FingerprintType
reads a structure file and computes the
fingerprint for each molecule. It will also extract the record
identifier. It returns an iterator of the (id, fingerprint) pairs. For
example, the following uses OEChem/OEGraphSim to compute the MACCS166
fingerprint for a PubChem file, and prints the identifier, the number
of keys set in the fingerprint, and the hex-encoded fingerprint:
import chemfp
from chemfp import bitops
## Uncomment the fingerprint type you want to use.
fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
#fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
#fptype = chemfp.get_fingerprint_type("CDK-MACCS")
for id, fp in fptype.read_molecule_fingerprints("Compound_099000001_099500000.sdf.gz"):
    print("%s %3d %s" % (id, bitops.byte_popcount(fp), bitops.hex_encode(fp)))
The first few lines of chemfp output are:
99000039 46 000004000000300001c0404e93e19053dca06b6e1b
99000230 67 000000880100648f0445a7fe2aeab1738f2a5b7e1b
99002251 45 00000000001132000088404985e01152dca46b7e1b
99003537 44 00000000200020000156149a90e994938c30592e1b
99003538 44 00000000200020000156149a90e994938c30592e1b
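The popcount column can be cross-checked without chemfp. The following is a minimal pure-Python sketch, where popcount is my own helper name rather than a chemfp function, applied to the hex-encoded fingerprint from the first output line:

```python
def popcount(fp: bytes) -> int:
    """Count the number of on-bits in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")


# Hex-encoded fingerprint for record 99000039, copied from the output above
fp = bytes.fromhex("000004000000300001c0404e93e19053dca06b6e1b")
print(len(fp) * 8, popcount(fp))  # 168 46
```

The 166 MACCS keys are stored in 21 bytes, which is why the fingerprint has room for 168 bits.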
There’s also the top-level helper function
chemfp.read_molecule_fingerprints()
, which does the fingerprint
type lookup and the call to read_molecule_fingerprints
:
import chemfp
from chemfp import bitops
for id, fp in chemfp.read_molecule_fingerprints("CDK-MACCS",
                                                "Compound_099000001_099500000.sdf.gz"):
    print(f"{id} {bitops.byte_popcount(fp):3d} {fp.hex()}")
The helper function accepts either a type string, as shown here, or a Metadata instance with a valid type property.
The helper function does not support fingerprint kwargs, so in that case you have to go through the FingerprintType.
The read_molecule_fingerprints
method takes the same parameters as
the toolkit.read_ids_and_molecules()
, including id_tag,
errors, and location. I won’t cover those details again here.
Instead, see Read ids and molecules from an SD file at the same time.
Structure-based fingerprint reader location¶
In this section you’ll learn more about the location
attribute of
the structure-based fingerprint iterator returned by
read_molecule_fingerprints and read_molecule_fingerprints_from_string.
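Four related functions implement structure-based fingerprint readers: chemfp.read_molecule_fingerprints(), chemfp.read_molecule_fingerprints_from_string(), FingerprintType.read_molecule_fingerprints(), and FingerprintType.read_molecule_fingerprints_from_string().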
They all return a FingerprintIterator
. Just like with the
BaseMoleculeReader
classes, the FingerprintIterator has a
location
attribute that can be used to get more
information about the internal reader state. The toolkit section has
more details about how to get the current record number (see
Location information: filename, record_format, recno and output_recno) and, if supported by the parser implementation
for a format, the line number and byte ranges for the record (see
Location information: record position and content).
It’s also possible to get the current molecule object using the location’s “mol” attribute. This isn’t so important for the toolkit API since all of the molecule readers return the molecule object. It’s more useful in the fingerprint iterator, which doesn’t.
NOTE: accessing the molecule this way is somewhat slow, because it requires several Python function calls. It should mostly be used for error reporting; the following is meant as an example of use, and not a recommended best practice.
The following uses the location’s mol
to report the SMILES string
for every molecule whose MACCS fingerprint sets at most 6 keys:
import chemfp
from chemfp import bitops
from openeye.oechem import OECreateSmiString, OEThrow, OEErrorLevel_Fatal
OEThrow.SetLevel(OEErrorLevel_Fatal) # Disable warnings
fptype = chemfp.openeye.maccs166
#fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166") # alternative
with fptype.read_molecule_fingerprints("Compound_099000001_099500000.sdf.gz") as reader:
    location = reader.location
    for id, fp in reader:
        popcount = bitops.byte_popcount(fp)
        if popcount > 6:
            continue
        smiles = OECreateSmiString(location.mol)
        print("%s %3d %s" % (id, popcount, smiles))
The output from the above is:
99116624 6 C(C(Cl)(Cl)Cl)(F)Cl
99116625 6 C(C(Cl)(Cl)Cl)(F)Cl
99118955 6 C(C(C(Cl)(Cl)Cl)(F)Cl)(C(F)(F)F)(F)F
99118956 6 C(C(C(Cl)(Cl)Cl)(F)Cl)(C(F)(F)F)(F)F
The above code imports the OEChem toolkit to disable warnings about “Stereochemistry corrected on atom number”, and to call OECreateSmiString directly.
While chemfp has no cross-platform method to silence warnings, it does have a cross-toolkit solution to generate the SMILES string, which is only slightly more complicated than using the native API.
I need to use the fingerprint type object to get the underlying “toolkit”, which is a portability layer on top of the actual cheminformatics toolkit with functions to parse a string into a molecule and vice versa:
>>> import chemfp
>>> fptype = chemfp.openeye.maccs166
>>> fptype.toolkit
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> T = fptype.toolkit
>>> mol = T.parse_smistring("OC")
>>> T.create_smistring(mol)
'CO'
There are also more generic functions for when you need to specify the record format as a string parameter:
>>> mol = T.parse_molecule("OC", "smistring")
>>> print(T.create_string(mol, "sdf", id="hello!"))
hello!
-OEChem-05162315572D
2 1 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
$$$$
I’ll use the toolkit’s create_smistring()
method to make the SMILES
string for each molecule which passes the filter:
import chemfp
from chemfp import bitops
fptype = chemfp.openeye.maccs166
#fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166") # alternative
T = fptype.toolkit
with fptype.read_molecule_fingerprints("Compound_099000001_099500000.sdf.gz") as reader:
    location = reader.location
    for id, fp in reader:
        popcount = bitops.byte_popcount(fp)
        if popcount > 6:
            continue
        smiles = T.create_smistring(location.mol)
        print(f"{id} {popcount:3d} {smiles}")
When should you use a toolkit-specific API and when should you use the portable one?
That depends on you. There’s definitely a portability vs. performance tradeoff because the create_string function and its more specialized variants like create_smistring will always require an extra function call over the native API. If you work with a given toolkit a lot then you’re going to be more familiar with it than this chemfp API. Plus, calling a function to create another function is somewhat unusual.
On the other hand, it’s trivial to change the above code to work with any of the fingerprint types that chemfp supports.
Read fingerprints from a string containing structures¶
In this section you’ll learn how to use a fingerprint type to read a string containing a set of structure records, compute fingerprints for each one, and iterate over the resulting (id, fingerprint) pairs.
The read_molecule_fingerprints_from_string()
method of the FingerprintType
takes as input a string
containing structure records and returns an iterator over the (id,
fingerprint) pairs.
>>> import chemfp
>>> fptype = chemfp.openbabel.maccs
>>> # fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS") # alternative
>>> content = "C methane\n" + "CC ethane\n"
>>> print(content, end="")
C methane
CC ethane
>>> reader = fptype.read_molecule_fingerprints_from_string(content, "smi")
>>> for (id, fp) in reader:
...     print(id, fp.hex())
...
methane 000000000000000000000000000000000000008000
ethane 000000000000000000000000000000000000108000
>>>
If you don’t have a type object you should use the top-level helper
function chemfp.read_molecule_fingerprints_from_string()
:
import chemfp
content = ("C methane\n"
"CC ethane\n")
reader = chemfp.read_molecule_fingerprints_from_string("OpenBabel-MACCS",
content, "smi")
for (id, fp) in reader:
    print(id, fp.hex())
The helper function accepts both a type string, as shown here, and a
Metadata
object. A future version will support passing in the
fingerprint type itself.
The method takes the same parameters as
toolkit.read_ids_and_molecules_from_string()
, including the
id_tag, errors, location, and reader_args. See
Read from a string instead of a file for more about that function.
Structure-based fingerprint reader errors¶
In this section you’ll learn how to use the errors option for the “read molecule fingerprints” functions, including how to use the experimental support for a callback error handler.
The four structure reader functions
(chemfp.read_molecule_fingerprints()
,
chemfp.read_molecule_fingerprints_from_string()
,
FingerprintType.read_molecule_fingerprints()
, and
FingerprintType.read_molecule_fingerprints_from_string()
) take
the standard errors option. By default it is “strict”, which means
that it raises an exception when there are errors, and stops
processing.
>>> import chemfp
>>> content = ("C methane\n" +
... "Q Q-ane\n" +
... "O=O molecular oxygen\n")
>>> with chemfp.read_molecule_fingerprints_from_string(
...         "RDKit-MACCS166", content, "smi") as reader:
...     for (id, fp) in reader:
...         print(id, fp.hex())
...
methane 000000000000000000000000000000000000008000
[17:21:59] SMILES Parse Error: syntax error while parsing: Q
[17:21:59] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
... traceback lines omitted ...
File "chemfp/io.py", line 139, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q', file '<string>', line 2, record #2: first line is 'Q Q-ane'
The default is “strict” because you should be the one to decide if you
really want to ignore errors, not me. Specify errors="ignore"
to
ignore errors, or use “report” to have chemfp write its own error
messages to stderr:
>>> with chemfp.read_molecule_fingerprints_from_string(
...         "RDKit-MACCS166", content, "smi", errors="ignore") as reader:
...     for (id, fp) in reader:
...         print(id, fp.hex())
...
methane 000000000000000000000000000000000000008000
[17:23:46] SMILES Parse Error: syntax error while parsing: Q
[17:23:46] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
molecular oxygen 000000000000000000000000200000080000004008
Of course, this depends on the underlying toolkit implementation. Some toolkit/format combinations don’t let chemfp know there was an error, such as most of the OEChem-based formats.
Use your own error handler¶
In this section you’ll learn about the API for writing your own error handler.
In the previous section you learned about the “strict”, “report”, and “ignore” error handlers. What if you want something different? The errors object can be any object with an error method implementing the signature:
def error(self, msg, location=None):
    ...
You might send the results to a log file, or display it in a GUI, or send it to a speech synthesizer and hear all of the error messages go by.
The following creates an error handler which counts the number of errors, and for each one reports the error number, the filename (which is “<string>” if the input is from a string), and the error message:
>>> import chemfp
>>> class ErrorCounter(object):
...     def __init__(self):
...         self.num_errors = 0
...     def error(self, msg, location=None):
...         self.num_errors += 1
...         print(f"Failure #{self.num_errors} from file {location.filename}: {msg}")
...
>>> error_handler = ErrorCounter()
>>> content = ("C methane\n" +
... "Q Q-ane\n" +
... "O=O molecular oxygen\n")
>>> with chemfp.read_molecule_fingerprints_from_string(
...         "RDKit-MACCS166", content, "smi", errors=error_handler) as reader:
...     for (id, fp) in reader:
...         print(id, fp.hex())
...
methane 000000000000000000000000000000000000008000
[17:38:52] SMILES Parse Error: syntax error while parsing: Q
[17:38:52] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Failure #1 from file <string>: RDKit cannot parse the SMILES 'Q'
molecular oxygen 000000000000000000000000200000080000004008
Let me know if you use the API and have ideas for improvements.
The toolkit documentation includes another example of how to write an error handler.
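As one more variation, here is a sketch of a handler which collects the errors for later inspection instead of printing them. The class name CollectingErrorHandler is my own. The only parts taken from this chapter are the error(msg, location=None) signature and the location’s filename and recno attributes. A SimpleNamespace stands in for a real location object so the example runs on its own:

```python
import types


class CollectingErrorHandler:
    """Collect (filename, recno, msg) tuples instead of printing them."""

    def __init__(self):
        self.errors = []

    def error(self, msg, location=None):
        # location may be None, so read its attributes defensively
        filename = getattr(location, "filename", None)
        recno = getattr(location, "recno", None)
        self.errors.append((filename, recno, msg))


handler = CollectingErrorHandler()

# Stand-in for the location object that a reader would pass in
fake_location = types.SimpleNamespace(filename="<string>", recno=2)
handler.error("RDKit cannot parse the SMILES 'Q'", fake_location)
handler.error("message without a location")

for filename, recno, msg in handler.errors:
    print(filename, recno, msg)
```

In real use the instance would be passed as errors=handler to one of the reader functions, and handler.errors would be examined after the loop finishes.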
Compute a fingerprint for a native toolkit molecule¶
In this section you’ll learn how to compute a fingerprint given a toolkit molecule.
All of the previous sections assumed the inputs were structure
record(s), either as a string or from a file. What if you already have
a native toolkit molecule and want to compute its fingerprint? In
that case, use the FingerprintType.from_mol()
method:
>>> import chemfp
>>> fptype = chemfp.openbabel.maccs
>>> mol = fptype.toolkit.parse_smistring("c1ccccc1O")
>>> mol
<openbabel.openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x10b134db0> >
>>> fptype.from_mol(mol)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e'
This can be useful when you want to compute multiple fingerprint types for the same molecule. For example, I’ll compare Open Babel’s MACCS implementation with chemfp’s own MACCS implementation for Open Babel:
import chemfp
from chemfp import openbabel_toolkit as T
from chemfp import bitops
# There will not be a "chemfp.openbabel.rdmaccs" way to
# get the RDMACCS-OpenBabel type, so I'll refer to both by name.
fptype1 = chemfp.get_fingerprint_type("OpenBabel-MACCS")
fptype2 = chemfp.get_fingerprint_type("RDMACCS-OpenBabel")
with T.read_ids_and_molecules("Compound_099000001_099500000.sdf.gz") as reader:
    for id, mol in reader:
        fp1 = fptype1.from_mol(mol)
        fp2 = fptype2.from_mol(mol)
        if fp1 != fp2:
            bits1 = set(bitops.byte_to_bitlist(fp1))
            bits2 = set(bitops.byte_to_bitlist(fp2))
            print(id, "in OB:", sorted(bits1-bits2), "in RDMACCS:", sorted(bits2-bits1))
        else:
            print(id, "equal")
Most (8151 of 10969) of the output were lines of the form:
99000039 in OB: [] in RDMACCS: [124]
I was curious, so I investigated the differences. Key 125 (the MACCS keys start at 1 while chemfp bit indexing starts at 0) is defined as “Aromatic Ring > 1”. Open Babel doesn’t support this bit because it only allows key definitions based on SMARTS, and this query cannot be represented as SMARTS, while chemfp’s RDMACCS has special support for that bit that uses the Open Babel API.
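To make the bit-versus-key numbering concrete, here is a pure-Python sketch which lists the on-bits of a byte-string fingerprint. The helper name on_bits is my own. It assumes the layout where bit i is stored in byte i // 8 at position i % 8, least-significant bit first, which agrees with the MACCS outputs in this chapter:

```python
def on_bits(fp: bytes) -> list:
    """Return the sorted list of 0-based bit indices which are set."""
    return [byteno * 8 + bitno
            for byteno, byte in enumerate(fp)
            for bitno in range(8)
            if byte & (1 << bitno)]


# MACCS fingerprint for methane, as shown in earlier sections
methane = bytes.fromhex("000000000000000000000000000000000000008000")
bits = on_bits(methane)
keys = [bit + 1 for bit in bits]  # MACCS key numbers start at 1
print(bits, keys)  # [159] [160]
```

The same off-by-one is why bit 124 in the output above corresponds to MACCS key 125.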
Note: from_mol()
is thread-safe. If an underlying
chemistry toolkit object is not thread-safe then chemfp will duplicate
that object before computing the fingerprint.
Fingerprint many native toolkit molecules¶
In this section you’ll learn how to generate a fingerprint given many native toolkit molecules.
Sometimes you have a list of molecules and you want to compute fingerprints for each one. In the following I’ll load the molecules from an SD file using OEChem:
>>> import chemfp
>>>
>>> fptype = chemfp.openeye.maccs166
>>> T = fptype.toolkit
>>>
>>> with T.read_molecules("Compound_099000001_099500000.sdf.gz") as reader:
...     mols = [T.copy_molecule(mol) for mol in reader]
...
... various OEChem warnings omitted ...
>>> len(mols)
10740
NOTE: for performance reasons, some of the toolkit implementations
will reuse a molecule object. I call toolkit.copy_molecule()
to
force a copy of each one. A future version of chemfp will likely
support a new reader_args parameter to ask the reader implementation
to always return a new molecule.
You know from the previous section how to compute the fingerprint one
molecule at a time using
FingerprintType.from_mol()
:
>>> fps = [fptype.from_mol(mol) for mol in mols]
You can also process all of them at once using
FingerprintType.from_mols()
:
>>> fps = list(fptype.from_mols(mols))
The plural in the name from_mols()
is the hint that it
can take multiple molecules. It returns a generator, so I used
Python’s list()
to convert it to an actual list.
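A generator does not have to be turned into one big list. If the fingerprints are only needed a few at a time, they can be pulled from the generator in batches. The following sketch uses a hypothetical in_batches helper and a stand-in generator in place of fptype.from_mols(mols), and assumes the real generator computes each fingerprint only when asked for it:

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for fptype.from_mols(mols): a generator of 5 fake fingerprints
fake_fps = (bytes([i]) for i in range(5))

batch_sizes = [len(batch) for batch in in_batches(fake_fps, 2)]
print(batch_sizes)  # [2, 2, 1]
```

With a real fingerprint type the loop would be written as for batch in in_batches(fptype.from_mols(mols), 1000).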
Why call from_mols instead of from_mol?
The main reason is that it expresses your intent more clearly than setting up a for-loop. But to be honest, the original reason was that I expected it would be faster than calling from_mol many times, because the underlying code could skip some overhead.
By design, from_mol
is thread-safe, which means chemfp sometimes
makes extra objects to keep that promise. On the other hand,
from_mols
, which processes a sequential series of molecules, can
reuse internal objects across the series instead of creating new
ones. In principle this should be a bit faster. In practice, nearly
all of the time is spent in generating the fingerprints. The overhead
adds less than 1%.
Make a specialized molecule fingerprinter¶
In this section you’ll learn how to make a specialized function to compute a fingerprint for a molecule. However, there is very little reason for you to use this function.
The FingerprintType.from_mol() method is thread-safe. Some of the underlying toolkit implementations can use code which isn’t thread-safe. For example, OEGraphSim writes its fingerprint information to an OEFingerPrint instance, and replaces its previous value. A thread-safe implementation would make a new OEFingerPrint for each call, while a non-thread-safe implementation could reuse it and save a small bit of allocation overhead.
The FingerprintType.make_fingerprinter() method returns a non-thread-safe fingerprinter function, which is potentially faster because it doesn’t need to keep the thread-safe promise.
Here’s an example of the two APIs. First, a bit of preamble to get things set up with a couple of molecules:
>>> import chemfp
>>> from chemfp import bitops
>>>
>>> fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> mol1 = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> mol2 = fptype.toolkit.parse_molecule("O=O", "smistring")
The thread-safe API calls the from_mol()
method:
>>> bitops.byte_popcount(fptype.from_mol(mol1))
12
>>> bitops.byte_popcount(fptype.from_mol(mol2))
1
The non-thread-safe version uses make_fingerprinter
to create a
new fingerprinter function, which I’ve assigned to calc_fingerprint,
and then call directly:
>>> calc_fingerprint = fptype.make_fingerprinter()
>>> bitops.byte_popcount(calc_fingerprint(mol1))
12
>>> bitops.byte_popcount(calc_fingerprint(mol2))
1
The keen-eyed will note that I could have written the first code as:
>>> from_mol = fptype.from_mol
>>> bitops.byte_popcount(from_mol(mol1))
12
>>> bitops.byte_popcount(from_mol(mol2))
1
and gotten the same answer, which means there is little API need for a special “make_fingerprinter()” function, except for performance.
I timed the performance differences using the following:
import chemfp
import time
def main():
    fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
    T = fptype.toolkit
    with T.read_molecules("Compound_099000001_099500000.sdf.gz") as reader:
        mols = list(reader)
    from_mol = fptype.from_mol
    calc_fingerprint = fptype.make_fingerprinter()
    t1 = time.time()
    fps1 = [from_mol(mol) for mol in mols]
    t2 = time.time()
    fps2 = [calc_fingerprint(mol) for mol in mols]
    t3 = time.time()
    assert fps1 == fps2
    print("from_mol():", t2-t1)
    print("make_fingerprinter():", t3-t2)
    print("ratio:", (t2-t1)/(t3-t2))
    print("1/ratio:", (t3-t2)/(t2-t1))
main()
With the Open Babel 3.1.0 fingerprints and Python 3.11 the make_fingerprinter() was about 2% faster.
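A difference of a couple of percent is easy to lose in run-to-run noise when each variant is timed only once. One common way to reduce the noise is to repeat each measurement and keep the best time. The following is a sketch of that idea and not the script used for the number above. best_time is a hypothetical helper, and the two workloads are stand-ins for the from_mol and make_fingerprinter list comprehensions:

```python
import time


def best_time(func, repeat=5):
    """Call func() repeat times and return the smallest elapsed time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        t1 = time.perf_counter()  # monotonic, high-resolution timer
        func()
        t2 = time.perf_counter()
        best = min(best, t2 - t1)
    return best


# Stand-in workloads; with chemfp these might be
#   lambda: [from_mol(mol) for mol in mols]
#   lambda: [calc_fingerprint(mol) for mol in mols]
def workload_a():
    return [i * i for i in range(10000)]


def workload_b():
    return [i + i for i in range(10000)]


time_a = best_time(workload_a)
time_b = best_time(workload_b)
print("a:", time_a, "b:", time_b)
```

Taking the minimum rather than the mean reflects the view that other activity on the machine can only make a run slower, never faster.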