chemfp.countops module
This module implements count fingerprints and associated operations.
The API is currently experimental and not fully stable. This means the next release may have breaking changes, without the migration policy which applies to the stable portions of the chemfp API.
In particular, parameter names like “count_fp” may be renamed to “fp” or vice versa. The current scheme is inconsistent.
This API is meant for early (and hopefully friendly!) users to experiment with the API and provide feedback. Let me know about your experience.
A count fingerprint contains zero or more features. It is a sparse representation which only stores information about the features with a non-zero count. Each feature has an index and a count. Features are ordered by increasing index. Count fingerprints are immutable. The current implementation uses an unsigned 64-bit integer for the index and an unsigned 32-bit integer for the count, though this detail may change in the future.
The CountFingerprint has a list-like API:
>>> from chemfp import countops
>>> fp = countops.CountFingerprint.from_features(
... [(1,2), (3,1), (8, 4)])
>>> list(fp)
[(1, 2), (3, 1), (8, 4)]
>>> len(fp)
3
>>> fp[1]
(3, 1)
Its repr() shows the number of features, its str() is the fingerprint as an FPC-encoded string, and there’s a way to get the sum of the counts:
>>> fp
CountFingerprint(#features=3)
>>> str(fp)
'1:2,3,8:4'
>>> fp.get_total_count()
7
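The FPC encoding shown above can be mimicked in a few lines of plain Python. This is only a sketch inferred from the example output ('1:2,3,8:4'), where a count of 1 is written as the bare index; `encode_fpc` is a hypothetical helper, not part of chemfp, and it does no validation.

```python
def encode_fpc(features):
    """Encode (index, count) pairs as an FPC-style string (illustrative only)."""
    terms = []
    for index, count in features:
        # A count of 1 is written as the bare index, as in '1:2,3,8:4'.
        terms.append(str(index) if count == 1 else f"{index}:{count}")
    return ",".join(terms)


print(encode_fpc([(1, 2), (3, 1), (8, 4)]))  # prints 1:2,3,8:4
```

In real code use str(fp) on a CountFingerprint instead.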
There are also ways to access data with ctypes, NumPy, and Pandas:
>>> fp.as_ctypes()
<__main__.CountFeature_64_32_Array_3 object at 0x107aa2f30>
>>> fp.as_ctypes()[0]
CountFeature_64_32(index=1, count=2)
>>> fp.as_numpy_array()
array([(1, 2), (3, 1), (8, 4)], dtype=[('index', '<u8'), ('count', '<u4')])
>>> fp.as_numpy_array()["index"]
array([1, 3, 8], dtype=uint64)
>>> fp.to_pandas()
   index  count
0      1      2
1      3      1
2      8      4
There are ways to create a CountFingerprint from an FPC string, with or without the id:
>>> countops.parse_fpc("1:2,3,4:5\tID123")
CountFingerprint(#features=3)
>>> countops.parse_id_and_fpc("1:2,3,4:5\tID123")
('ID123', CountFingerprint(#features=3))
There are also functions to convert a count fingerprint into a byte fingerprint:
>>> countops.create_folded_bytes(fp, 64)
b'\n\x01\x00\x00\x00\x00\x00\x00'
>>> countops.create_superimposed_bytes(fp, 64)
b'\xc0\x04\x10\x00\x80@\x00\x80'
For more advanced uses, see the CountConverter classes FoldedCountConverter and SuperimposedCountConverter.
Finally, there are functions to work with the binary string representation of RDKit’s count fingerprints:
>>> from rdkit import DataStructs
>>> fp = DataStructs.ULongSparseIntVect(2**35)
>>> fp[2**32] = 100
>>> fp.ToBinary()[:10]
b'\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00'
>>> countops.parse_rdkit_binary(fp.ToBinary())
CountFingerprint(#features=1)
>>> list(countops.parse_rdkit_binary(fp.ToBinary()))
[(4294967296, 100)]
>>> countops.parse_rdkit_binary_header(fp.ToBinary())
RDKitBinaryHeader(version=1, index_size=8, num_bits=34359738368,
num_features=1)
as well as to create those strings from chemfp’s count fingerprint:
>>> fp = countops.parse_fpc("1:2,3,4:5")
>>> dict(fp)
{1: 2, 3: 1, 4: 5}
>>> rdk_bin = countops.create_rdkit_binary_UIntSparseIntVect(fp)
>>> from rdkit import DataStructs
>>> rdk_fp = DataStructs.UIntSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{1: 2, 3: 1, 4: 5}
Core count fingerprint datatypes
- class chemfp.countops.CountFeature_64_32
A feature in a chemfp count fingerprint
This is a direct view of the underlying data, which is an unsigned 64-bit integer for the index and an unsigned 32-bit integer for the count.
DO NOT MODIFY ITS VALUES. Doing so will likely break internal invariants.
- index: int
The feature index.
- count: int
The feature count.
- class chemfp.countops.CountFingerprint
chemfp’s count fingerprint data type
A count fingerprint contains a list of sparse features. Each feature has an index and a count.
Use parse_fpc() or parse_id_and_fpc() to create a CountFingerprint from a string containing an FPC-encoded count fingerprint.
Use parse_rdkit_binary() to create a CountFingerprint from a byte string containing the “ToBinary()” output from an RDKit count fingerprint.
Use the class method CountFingerprint.from_features() to create a CountFingerprint from an iterator of (index, count) pairs.
A count fingerprint is immutable. It cannot be modified.
- GetNonzeroElements() Dict[int, int]
An alias for dict(self)
An experimental migration feature which exists for compatibility with RDKit. It may be removed in the future. You should use dict(self).
- Returns:
a dictionary mapping index to count
- GetTotalValue() int
An alias for CountFingerprint.get_total_count()
Returns the sum of the feature counts.
An experimental migration feature which exists for compatibility with RDKit. It may be removed in the future. You should use get_total_count().
- Returns:
an integer
- __eq__()
Return True if the two lists of features are identical
- __getitem__()
Return an (index, count) pair or list of such pairs
- __iter__()
Iterate through the features as (index, count) pairs
- __len__()
Return the number of features
- __repr__()
Return a string like ‘CountFingerprint(#features=3)’
- __str__()
Return the fingerprint as a FPC-encoded string
- as_ctypes()
Return a ctypes view of the underlying features.
Each (index, count) pair is represented as a ctypes structure CountFeature_64_32 with fields index (c_uint64) and count (c_uint32).
This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the features to NumPy then use as_numpy_array() instead.
- as_numpy_array() _typing.NumPyArray
Get a view of the features as a NumPy array
- classmethod from_features(features: _typing.Iterable[int, int]) CountFingerprint
Create a CountFingerprint from an iterable of (index, count) pairs
The indices must be in increasing order from 0 to 2**64-1. Each count must be a positive integer in the range 1 to 2**32-1.
- Returns:
a CountFingerprint
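The constraints on the (index, count) pairs can be written out as a small check. This is a sketch of the documented rules only, assuming "increasing" means strictly increasing (no repeated index); `check_features` is a hypothetical helper and does not reflect how chemfp itself validates or reports errors.

```python
MAX_INDEX = 2**64 - 1  # largest index for an unsigned 64-bit integer
MAX_COUNT = 2**32 - 1  # largest count for an unsigned 32-bit integer


def check_features(features):
    """Return the pairs as a list, or raise ValueError (hypothetical helper)."""
    checked = []
    previous = -1
    for index, count in features:
        if not 0 <= index <= MAX_INDEX:
            raise ValueError(f"index {index} is out of range")
        if index <= previous:
            # Assumption: indices must be strictly increasing.
            raise ValueError(f"index {index} is not in increasing order")
        if not 1 <= count <= MAX_COUNT:
            raise ValueError(f"count {count} is out of range")
        checked.append((index, count))
        previous = index
    return checked


print(check_features([(1, 2), (3, 1), (8, 4)]))  # prints [(1, 2), (3, 1), (8, 4)]
```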
- get_total_count() int
Return the sum of the feature counts
- to_pandas(*, columns: tuple[str, str] | _typing.Sequence[str] = ('index', 'count')) _typing.PandasDataFrame
Return the feature indices and counts as a Pandas DataFrame
The first column contains the indices, and the second column contains the counts. The default column headers are “index” and “count”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
Parse a string to create a count fingerprint
- class chemfp.countops.FPCParseError(errno: int, msg: str, byte_offset: int, location: _Location = None)
Exception type used when parsing an FPC-encoded fingerprint or line.
The public attributes are:
- errno: int
The chemfp error code used at the C level.
- msg: str
The chemfp error message for the error code.
- byte_offset: int
The offset to the byte which caused the error, or to the end of line if the string is incomplete. If the input is a text string then this is the offset in the UTF-8 encoded byte string.
- location: chemfp.io.Location
The input string is stored in the Location’s “record” attribute.
- __str__() str
Describe the error as a human-readable string
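Because byte_offset refers to the UTF-8 encoded form of a text string, it can differ from the character position when non-ASCII characters come before it. The sketch below shows one way to convert between the two; `char_offset` is a hypothetical helper, and the sample text is arbitrary rather than a real chemfp error case.

```python
def char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a character offset (hypothetical helper)."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset splits one.
    return len(prefix.decode("utf-8", errors="ignore"))


text = "naïve:x"                 # 'ï' takes two bytes in UTF-8
byte_offset = text.encode("utf-8").index(b"x")
print(byte_offset)               # prints 7
print(char_offset(text, byte_offset))  # prints 6
```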
- chemfp.countops.parse_fpc(s: str | bytes, stop_at_tab: bool = True) CountFingerprint
Parse an FPC-encoded count fingerprint as a count fingerprint.
The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.
Example:
>>> from chemfp import countops
>>> countops.parse_fpc("1,2:3\tID0001\n")
CountFingerprint(#features=2)
- Parameters:
s (str or bytes) – An FPC-encoded fingerprint
stop_at_tab (bool) – If True, stop parsing at the first tab
- Returns:
a CountFingerprint
- chemfp.countops.parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, CountFingerprint]
Parse an FPC-encoded count fingerprint as an id and count fingerprint.
The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.
Example:
>>> from chemfp import countops
>>> countops.parse_id_and_fpc("1,2:3\tID0001\n")
('ID0001', CountFingerprint(#features=2))
- Parameters:
s (str or bytes) – A string containing a fingerprint line from an FPC file.
require_newline (bool) – If True, the string must end with a newline.
- Returns:
a tuple of id string and CountFingerprint
- chemfp.countops.parse_fpc_as_dict(s: str | bytes, stop_at_tab: bool = True) Dict[int, int]
Parse an FPC-encoded count fingerprint as a count dictionary.
The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.
Example:
>>> from chemfp import countops
>>> countops.parse_fpc_as_dict("1,2:3\tID0001\n")
{1: 1, 2: 3}
- Parameters:
s (str or bytes) – An FPC-encoded fingerprint
stop_at_tab (bool) – If True, stop parsing at the first tab
- Returns:
a dict
- chemfp.countops.parse_id_and_fpc_as_dict(s: str | bytes, require_newline: bool = False) tuple[str, Dict[int, int]]
Parse a line from an FPC file as the id and count dictionary.
The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.
Example:
>>> from chemfp import countops
>>> countops.parse_id_and_fpc_as_dict("1,2:3\tID0001\n")
('ID0001', {1: 1, 2: 3})
- Parameters:
s (str or bytes) – A string containing a fingerprint line from an FPC file.
require_newline (bool) – If True, the string must end with a newline.
- Returns:
a tuple of id string and count dictionary
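To make the line layout concrete, here is a pure-Python sketch of what the dictionary-returning parsers do with a line from an FPC file. It assumes only what the examples show: comma-separated "index" or "index:count" terms, a tab, then the id. `decode_fpc_line` is a hypothetical name, and the sketch has none of the error handling or offset reporting of the real parser.

```python
def decode_fpc_line(line):
    """Split 'FPC<tab>id' into (id, {index: count}); illustrative only."""
    line = line.rstrip("\n")               # the terminal newline is optional
    fpc, _, record_id = line.partition("\t")
    counts = {}
    for term in fpc.split(","):
        index, sep, count = term.partition(":")
        # A term without ':' has an implicit count of 1.
        counts[int(index)] = int(count) if sep else 1
    return record_id, counts


print(decode_fpc_line("1,2:3\tID0001\n"))  # prints ('ID0001', {1: 1, 2: 3})
```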
- chemfp.countops.parse_rdkit_binary(rdkit_binary: bytes) CountFingerprint
Convert the RDKit fingerprint bytes to a chemfp count fingerprint.
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
Example:
>>> from rdkit.DataStructs import UIntSparseIntVect
>>> rdk_fp = UIntSparseIntVect(2**30)
>>> rdk_fp[100] = 8
>>> rdk_fp.ToBinary()[:8]
b'\x01\x00\x00\x00\x04\x00\x00\x00'
>>> from chemfp import countops
>>> chemfp_fp = countops.parse_rdkit_binary(rdk_fp.ToBinary())
>>> chemfp_fp
CountFingerprint(#features=1)
>>> list(chemfp_fp)
[(100, 8)]
NOTE: the indices in the binary string must be in increasing order and have a non-zero count. If not, the returned count fingerprint may be invalid. A future implementation may validate the input.
- Parameters:
rdkit_binary (bytes) – An RDKit fingerprint binary string.
- Returns:
a CountFingerprint
Create a string given a count fingerprint
- chemfp.countops.create_fpc(id: str, fp: CountFingerprint) str
Return the id and fingerprint as a line for an FPC file
This is the same as using f"{fp}\t{id}\n" except that this function will raise a ValueError if the id contains a newline, carriage return, tab, or NUL.
- Parameters:
id (a text string) – the record identifier
fp (a CountFingerprint) – a chemfp count fingerprint
- Returns:
the record as a line for an FPC file
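The id check described above can be sketched as follows. This is an illustration of the documented behavior, not chemfp's code: `make_fpc_line` is a hypothetical name, it takes an already-encoded FPC string rather than a CountFingerprint, and the wording of the error message is invented.

```python
# Characters which would break the tab-separated, one-record-per-line layout.
FORBIDDEN = ("\n", "\r", "\t", "\0")


def make_fpc_line(record_id, fpc_string):
    """Format one FPC file line, rejecting ids that cannot be stored safely."""
    for ch in FORBIDDEN:
        if ch in record_id:
            raise ValueError(f"id contains forbidden character {ch!r}")
    return f"{fpc_string}\t{record_id}\n"


print(repr(make_fpc_line("ID123", "1:2,3,8:4")))  # prints '1:2,3,8:4\tID123\n'
```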
- chemfp.countops.create_fpcstring(fp: CountFingerprint) str
Return the fingerprint as a FPC-encoded string
This is the same as using str(fp).
- Parameters:
fp (a CountFingerprint) – a chemfp count fingerprint
- Returns:
the count fingerprint features as an FPC-encoded string
- chemfp.countops.create_rdkit_binary_UIntSparseIntVect(fp: CountFingerprint, num_bits=4294967295) bytes
Convert a count fingerprint to an RDKit UIntSparseIntVect binary string
The returned byte string can be passed to the UIntSparseIntVect constructor to create the corresponding RDKit fingerprint. For example:
>>> from chemfp import countops
>>> from rdkit import DataStructs
>>> fp = countops.CountFingerprint.from_features([(3, 5)])
>>> rdk_bin = countops.create_rdkit_binary_UIntSparseIntVect(fp, 16)
>>> rdk_bin[:12]
b'\x01\x00\x00\x00\x04\x00\x00\x00\x10\x00\x00\x00'
>>> rdk_fp = DataStructs.UIntSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{3: 5}
Each RDKit fingerprint stores its maximum allowed index, specified with num_bits.
If num_bits is smaller than 2**32-1 then every feature index must be less than num_bits. If num_bits is exactly 2**32-1 then the feature index 2**32-1 is also allowed.
If an index is too large then this function raises a ValueError.
- Parameters:
fp (a CountFingerprint) – a chemfp count fingerprint
num_bits (int) – the maximum number of bits for the RDKit fingerprint
- Returns:
a byte string used to create a UIntSparseIntVect
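The num_bits rule has one special case, which a short sketch may make clearer. `index_allowed` is a hypothetical helper that restates the rule for 4-byte (UIntSparseIntVect) and 8-byte (ULongSparseIntVect) indices; it returns a bool instead of raising ValueError, and it does not cover how chemfp treats a num_bits larger than the maximum.

```python
def index_allowed(index, num_bits, index_bytes=4):
    """Return True if `index` fits an RDKit fingerprint with `num_bits` bits."""
    max_value = 2 ** (8 * index_bytes) - 1   # 2**32-1 or 2**64-1
    if num_bits == max_value:
        # Special case: the largest representable index is also allowed.
        return 0 <= index <= max_value
    return 0 <= index < num_bits


print(index_allowed(15, 16))                 # prints True
print(index_allowed(16, 16))                 # prints False
print(index_allowed(2**32 - 1, 2**32 - 1))   # prints True
```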
- chemfp.countops.create_rdkit_binary_ULongSparseIntVect(fp: CountFingerprint, num_bits=18446744073709551615) bytes
Convert a count fingerprint to an RDKit ULongSparseIntVect binary string
The returned byte string can be passed to the ULongSparseIntVect constructor to create the corresponding RDKit fingerprint. For example:
>>> from chemfp import countops
>>> from rdkit import DataStructs
>>> fp = countops.CountFingerprint.from_features([(3, 5)])
>>> rdk_bin = countops.create_rdkit_binary_ULongSparseIntVect(fp, 16)
>>> rdk_bin[:12]
b'\x01\x00\x00\x00\x08\x00\x00\x00\x10\x00\x00\x00'
>>> rdk_fp = DataStructs.ULongSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{3: 5}
Each RDKit fingerprint stores its maximum allowed index, specified with num_bits.
If num_bits is smaller than 2**64-1 then every feature index must be less than num_bits. If num_bits is exactly 2**64-1 then the feature index 2**64-1 is also allowed.
If an index is too large then this function raises a ValueError.
- Parameters:
fp (a CountFingerprint) – a chemfp count fingerprint
num_bits (int) – the maximum number of bits for the RDKit fingerprint
- Returns:
a byte string used to create a ULongSparseIntVect
Convert a count fingerprint to a byte fingerprint
- class chemfp.countops.CountConverter
Base class for converters from count fingerprint to byte fingerprints
A CountConverter is a base class which cannot be used directly. Instead, use FoldedCountConverter or SuperimposedCountConverter, or derive your own subclass and implement “get_type” and “create_bytes”.
Every converter has the attribute:
- num_bits: positive integer
The number of bits in the output fingerprint.
- create_bytes(fp: CountFingerprint) bytes
Convert a chemfp count fingerprint to a byte string
Calling the method on the base class will raise a NotImplementedError. It must be implemented in the subclass.
- get_type() str
Return a canonical description of this converter as a string
This should be appropriate for the “type” metadata field.
Calling the method on the base class will raise a NotImplementedError. It must be implemented in the subclass.
- parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes
Parse an FPC-encoded count fingerprint to get a byte fingerprint
The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.
This is equivalent to using countops.parse_fpc() to parse the input FPC string to get a CountFingerprint then converting that fingerprint to bytes with CountConverter.create_bytes().
- Parameters:
s (str or bytes) – A string containing an FPC-encoded fingerprint
stop_at_tab (bool) – If True, stop parsing at the first tab
- Returns:
bytes
- parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]
Parse a line from an FPC file to get the id and converted byte fingerprint
The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.
This is equivalent to using countops.parse_id_and_fpc() to parse the input FPC string to get the id and a CountFingerprint then converting that fingerprint to bytes with CountConverter.create_bytes().
- Parameters:
s (str or bytes) – A string containing a fingerprint line from an FPC file.
require_newline (bool) – If True, the string must end with a newline.
- Returns:
a tuple of id string and bytes
- parse_rdkit_binary(rdkit_binary: bytes)
Parse an RDKit fingerprint binary string and convert to a byte fingerprint.
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
The result is a binary fingerprint as a byte string.
- Parameters:
rdkit_binary (bytes) – An RDKit fingerprint binary string.
- Returns:
bytes
- class chemfp.countops.FoldedCountConverter
Convert a count fingerprint to a binary fingerprint using folding.
This converter implements modulo folding based on each feature index. The count is ignored.
- num_bits: positive integer
The number of bits in the output fingerprint.
- hash: bool
If True, use the index to seed a PRNG and generate the value used to fold, instead of folding on the index. Use this if the index values are not well distributed after folding.
For example, if the index contains atom type triplets with 5 bits for atom 1, 5 bits for atom 2, and 5 bits for atom 3, then modulo folding on 1024 bits is the same as using the 10 bits for atoms 1 and 2 while ignoring the 5 bits for atom 3.
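The atom-triplet case can be checked with a little arithmetic. The sketch below assumes a hypothetical packing in which atom 1 occupies the lowest 5 bits, atom 2 the next 5, and atom 3 the top 5; `pack_triplet` is invented for the illustration and is not a chemfp function.

```python
def pack_triplet(atom1, atom2, atom3):
    """Pack three 5-bit atom types into a 15-bit index (hypothetical layout)."""
    return atom1 | (atom2 << 5) | (atom3 << 10)


a = pack_triplet(3, 7, 1)
b = pack_triplet(3, 7, 30)   # differs from `a` only in atom 3

# Modulo 1024 keeps the low 10 bits, so the atom 3 bits are discarded
# and the two different features fold to the same output bit.
print(a % 1024, b % 1024)         # prints 227 227
print(a % 1024 == (a & 0x3FF))    # prints True
```

Hashing the index first, as with hash=True, is meant to avoid this kind of systematic collision.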
- create_bytes(fp: CountFingerprint) bytes
Convert a chemfp count fingerprint to a folded binary fingerprint
- get_type() str
Return a canonical type string for the folded parameters
- parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes
Parse an FPC-encoded count fingerprint as folded bytes.
The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.
The FPC-encoded count fingerprint is folded directly to byte fingerprint.
- Parameters:
s (str or bytes) – A string containing an FPC-encoded fingerprint
stop_at_tab (bool) – If True, stop parsing at the first tab
- Returns:
bytes
- parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]
Parse a line from an FPC file as the id and folded byte fingerprint
The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.
The implementation parses the input FPC string directly to extract the id and create the folded fingerprint bytes.
- Parameters:
s (str or bytes) – A string containing a fingerprint line from an FPC file.
require_newline (bool) – If True, the string must end with a newline.
- Returns:
a tuple of id string and bytes
- parse_rdkit_binary(rdkit_binary: bytes)
Parse an RDKit fingerprint binary string and convert to a folded fingerprint.
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
The RDKit count fingerprint is folded directly to a byte fingerprint.
- Parameters:
rdkit_binary (bytes) – An RDKit fingerprint binary string.
- Returns:
bytes
- class chemfp.countops.SuperimposedCountConverter
Convert a count fingerprint to a binary fingerprint using random superimposed coding.
Each feature has an index and a count. For each feature, the index is used to seed a pseudo-random number generator to generate min(count, max_count)*bits_per_count random values from 0 to num_bits-1 (duplicates are allowed).
Using bits_per_count values > 1 appears to make the Tanimoto scores less reliable.
If the on-bit density of the output binary fingerprint is low (~5% or less) then the binary Tanimoto between two output fingerprints is a good approximation to the count Tanimoto between the two input count fingerprints.
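The procedure described above can be sketched in pure Python. This is a stand-in only: it uses Python's random.Random as the PRNG, so its output will not match chemfp's bytes, and the least-significant-bit-first layout is an assumption. `superimpose` is a hypothetical name.

```python
import random


def superimpose(features, num_bits=2048, bits_per_count=1, max_count=1000):
    """Sketch of random superimposed coding; NOT bit-compatible with chemfp."""
    bits = bytearray((num_bits + 7) // 8)
    for index, count in features:
        rng = random.Random(index)   # stand-in PRNG seeded by the feature index
        for _ in range(min(count, max_count) * bits_per_count):
            bit = rng.randrange(num_bits)        # duplicates are allowed
            bits[bit // 8] |= 1 << (bit % 8)     # assumed bit order
    return bytes(bits)


small = superimpose([(5, 3)], num_bits=64)
large = superimpose([(5, 7)], num_bits=64)
# Same index, larger count: the first 3 random values are identical,
# so every bit set in `small` is also set in `large`.
print(all(s & l == s for s, l in zip(small, large)))  # prints True
```

In this sketch, two fingerprints sharing a feature share the bits for the smaller of the two counts, which is why the binary Tanimoto can track the count Tanimoto when few of the generated bits collide.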
- num_bits: positive integer
The number of bits in the output fingerprint.
- bits_per_count: positive integer
The number of bits to set for each count of each feature. The default is 1. It does not seem useful to set more than 1 bit per count.
- max_count: positive integer
If the feature count is larger than max_count then use max_count instead. The largest possible value is 2**32-1. The default is 1000, which prevents features like 0:1000000000 from generating 1 billion random values.
- create_bytes(fp: CountFingerprint) bytes
Convert a chemfp count fingerprint to a superimposed binary fingerprint
- get_type() str
Return a canonical type string for the superimposed parameters
- parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes
Parse an FPC-encoded count fingerprint as superimposed bytes.
The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.
- Parameters:
s (str or bytes) – An FPC-encoded fingerprint
stop_at_tab (bool) – If True, stop parsing at the first tab
- Returns:
bytes
- parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]
Parse a line from an FPC file as the id and superimposed byte fingerprint
The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.
The implementation parses the input FPC string directly to extract the id and create the superimposed fingerprint bytes.
- Parameters:
s (str or bytes) – A string containing a fingerprint line from an FPC file.
require_newline (bool) – If True, the string must end with a newline.
- Returns:
a tuple of id string and bytes
- parse_rdkit_binary(rdkit_binary: bytes) bytes
Parse an RDKit fingerprint binary string to get the superimposed binary fingerprint
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
The RDKit count fingerprint binary is superimposed directly to a byte fingerprint.
- Parameters:
rdkit_binary (bytes) – An RDKit fingerprint binary string.
- Returns:
bytes
- chemfp.countops.create_folded_bytes(fp: CountFingerprint, num_bits: int = 2048, hash: bool = False) bytes
Convert a chemfp count fingerprint to a folded byte fingerprint
For each feature index, modulo fold it to num_bits and set the corresponding bit in the output byte fingerprint to 1.
If hash is True, use the index to seed a PRNG and generate the value used to fold, instead of folding on the index. Use this if the index values are not well distributed after folding.
This is equivalent to:
FoldedCountConverter(num_bits, hash).create_bytes(fp)
- Parameters:
fp (a CountFingerprint) – a count fingerprint
num_bits (a positive integer) – the number of bits for the output fingerprint
hash (either True or False) – if True, fold a hash of the index instead of the index
- Returns:
the folded fingerprint as a byte string
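Modulo folding without hashing is simple enough to sketch directly. The function below is a hypothetical pure-Python equivalent for the hash=False case, assuming bit i is stored in byte i//8 at position i%8. That layout is inferred from the 64-bit example in the module introduction, which this sketch reproduces for the same features.

```python
def fold(features, num_bits=2048):
    """Modulo-fold feature indices into a byte fingerprint (illustrative only)."""
    bits = bytearray((num_bits + 7) // 8)
    for index, _count in features:       # the count is ignored when folding
        bit = index % num_bits
        bits[bit // 8] |= 1 << (bit % 8)
    return bytes(bits)


print(fold([(1, 2), (3, 1), (8, 4)], 64))  # prints b'\n\x01\x00\x00\x00\x00\x00\x00'
```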
- chemfp.countops.create_superimposed_bytes(fp: CountFingerprint, num_bits=2048, bits_per_count=1, max_count=1000) bytes
Superimpose a count fingerprint to a byte fingerprint
Use random superimposed coding to convert the CountFingerprint to a byte fingerprint with num_bits bits.
Each feature has an index and a count. For each feature, the index is used to seed a pseudo-random number generator to generate min(count, max_count)*bits_per_count random values from 0 to num_bits-1 (duplicates are allowed).
Using bits_per_count values > 1 appears to make the Tanimoto scores less reliable.
This is equivalent to:
SuperimposedCountConverter(num_bits, bits_per_count, max_count).create_bytes(fp)
- Parameters:
fp (a CountFingerprint) – a chemfp count fingerprint
num_bits (a positive integer) – the number of bits in the output byte fingerprint
bits_per_count (a positive integer (should be 1)) – a multiplier to the number of generated values
max_count (a positive integer) – the upper bound for the count to use
- Returns:
the superimposed fingerprint as a byte string
Functions which work on count fingerprints
- chemfp.countops.count_tanimoto(fp1: CountFingerprint, fp2: CountFingerprint) float
Compute the Tanimoto between two chemfp count fingerprints
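This computes the multiset Tanimoto, defined as the sum of the minimum counts over the indices the two fingerprints have in common, divided by the sum of the maximum counts over all indices present in either fingerprint (an index missing from one fingerprint has a count of 0 there).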
If there are no features then the Tanimoto is 0.0.
Example:
>>> from chemfp import countops
>>> fp1 = countops.parse_fpc("1:1,2:8")
>>> fp2 = countops.parse_fpc("1:3,3:5")
>>> countops.count_tanimoto(fp1, fp2)
0.0625
- Parameters:
fp1 (a CountFingerprint) – a chemfp count fingerprint
fp2 (a CountFingerprint) – a second chemfp count fingerprint
- Returns:
the Tanimoto as a float
- chemfp.countops.dict_tanimoto(fp1: Dict[int, int], fp2: Dict[int, int]) float
Compute the Tanimoto between two count dictionaries
A count dictionary is a dictionary mapping an integer index to an integer count.
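This computes the multiset Tanimoto, defined as the sum of the minimum counts over the indices the two dictionaries have in common, divided by the sum of the maximum counts over all indices present in either dictionary (an index missing from one dictionary has a count of 0 there).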
If there are no features then the Tanimoto is 0.0.
Example:
>>> from chemfp import countops
>>> countops.dict_tanimoto({1: 1, 2: 8}, {1: 3, 3: 5})
0.0625
- Parameters:
fp1 (a count dictionary) – a count fingerprint represented as a dictionary
fp2 (a count dictionary) – a second count fingerprint represented as a dictionary
- Returns:
the Tanimoto as a float
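The multiset Tanimoto is easy to work through by hand. For {1: 1, 2: 8} and {1: 3, 3: 5} the only shared index is 1, so the numerator is min(1, 3) = 1; the denominator is max(1, 3) + max(8, 0) + max(0, 5) = 16, giving 1/16 = 0.0625. The sketch below is a plain-Python restatement of that definition; `multiset_tanimoto` is a hypothetical name, not the chemfp implementation.

```python
def multiset_tanimoto(counts1, counts2):
    """Multiset Tanimoto of two {index: count} dictionaries (illustrative)."""
    indices = counts1.keys() | counts2.keys()
    # A missing index contributes a count of 0.
    min_sum = sum(min(counts1.get(i, 0), counts2.get(i, 0)) for i in indices)
    max_sum = sum(max(counts1.get(i, 0), counts2.get(i, 0)) for i in indices)
    if max_sum == 0:
        return 0.0   # no features at all
    return min_sum / max_sum


print(multiset_tanimoto({1: 1, 2: 8}, {1: 3, 3: 5}))  # prints 0.0625
```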
Work with RDKit fingerprint binary strings
- class chemfp.countops.RDKitBinaryHeader(version: int, index_size: int, num_bits: int, num_features: int)
Header information for an RDKit sparse fingerprint binary string
These are strings generated by any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
The public attributes are:
- version
The version number. Only version 1 is supported.
- index_size
The number of bytes for the feature index. Only 4 and 8 are supported.
- num_bits
The number of bits in the sparse fingerprint.
- num_features
The number of features in the sparse fingerprint.
- get_feature_size() int
Return the number of bytes used for each feature record
- get_header_size() int
Return the number of bytes needed for the header
- get_total_size() int
Return the number of bytes needed for the entire string
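To show how the header fields relate to the byte string, here is a sketch using the standard struct module. The layout is an assumption based on the example bytes in this document: a little-endian 4-byte version, a 4-byte index size, then num_bits and num_features each stored in index_size bytes, followed by one (index, 4-byte count) record per feature. The names and size formulas are illustrative and may not match what the chemfp methods return.

```python
import struct
from collections import namedtuple

Header = namedtuple("Header", "version index_size num_bits num_features")


def parse_header(data):
    """Parse the assumed header layout; hypothetical stand-in, not chemfp's parser."""
    version, index_size = struct.unpack_from("<iI", data, 0)
    if version != 1 or index_size not in (4, 8):
        raise ValueError("unsupported version or index size")
    code = "I" if index_size == 4 else "Q"
    num_bits, num_features = struct.unpack_from("<" + code * 2, data, 8)
    return Header(version, index_size, num_bits, num_features)


def header_size(h):
    return 8 + 2 * h.index_size          # version + index size + two index-sized fields


def feature_size(h):
    return h.index_size + 4              # index plus a 4-byte count


def total_size(h):
    return header_size(h) + h.num_features * feature_size(h)


# Hand-built bytes for one feature (index 2**32, count 100) under the assumed layout.
data = struct.pack("<iIQQ", 1, 8, 2**35, 1) + struct.pack("<Qi", 2**32, 100)
h = parse_header(data)
print(h)                            # prints Header(version=1, index_size=8, num_bits=34359738368, num_features=1)
print(total_size(h) == len(data))   # prints True
```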
- chemfp.countops.parse_rdkit_binary_header(rdkit_binary: bytes, check_total_size: bool = False)
Parse the header information from the ToBinary() output of an RDKit fingerprint.
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
Example:
>>> from rdkit.DataStructs import ULongSparseIntVect
>>> rdk_fp = ULongSparseIntVect(2**42)
>>> rdk_fp[2**41+100] = 12345
>>> from chemfp import countops
>>> countops.parse_rdkit_binary_header(rdk_fp.ToBinary())
RDKitBinaryHeader(version=1, index_size=8, num_bits=4398046511104, num_features=1)
The header contains a version, the number of bytes per index, the maximum number of bits for a fingerprint, and the total number of features.
By default only the header values are parsed and validated. The header must be version 1 with an index size of either 4 or 8.
If check_total_size is True then also check that the length of the rdkit_binary byte string is correct. The default is False.
- Parameters:
rdkit_binary (bytes) – An RDKit fingerprint binary string.
check_total_size (bool) – If True, check if rdkit_binary is the right size
- Returns:
an RDKitBinaryHeader
- chemfp.countops.create_fpcstring_from_rdkit_binary(rdkit_binary: bytes) str
Convert an RDKit fingerprint binary string to an FPC-encoded fingerprint
This can parse the byte string returned from any of RDKit’s:
IntSparseIntVect().ToBinary()
UIntSparseIntVect().ToBinary()
LongSparseIntVect().ToBinary()
ULongSparseIntVect().ToBinary()
Example:
>>> from rdkit.DataStructs import ULongSparseIntVect
>>> rdk_fp = ULongSparseIntVect(2**42)
>>> rdk_fp[1_000_000_000_000] = 1912
>>> from chemfp import countops
>>> countops.create_fpcstring_from_rdkit_binary(rdk_fp.ToBinary())
'1000000000000:1912'
NOTE: the indices in the binary string must be in increasing order and have a non-zero count. If not, the returned string may be an invalid FPC string. A future implementation may validate the input.
- Returns:
a string in FPC format