chemfp.countops module

This module implements count fingerprints and associated operations.

The API is currently experimental and not fully stable. This means the next release may have breaking changes, without the migration policy which applies to the stable portions of the chemfp API.

In particular, parameter names like “count_fp” may be renamed to “fp” or vice versa. The current scheme is inconsistent.

This API is meant for early (and hopefully friendly!) users to experiment with the API and provide feedback. Let me know about your experience.

A count fingerprint contains zero or more features. It is a sparse representation which only stores information about the features with a non-zero count. Each feature has an index and a count. Features are ordered by increasing index. Count fingerprints are immutable. The current implementation uses an unsigned 64-bit integer for the index and an unsigned 32-bit integer for the count, though this detail may change in the future.

The CountFingerprint has a list-like API:

>>> from chemfp import countops
>>> fp = countops.CountFingerprint.from_features(
...    [(1,2), (3,1), (8, 4)])
>>> list(fp)
[(1, 2), (3, 1), (8, 4)]
>>> len(fp)
3
>>> fp[1]
(3, 1)

Its repr() shows the number of features, its str() is the fingerprint as an FPC-encoded string, and there’s a way to get the sum of the counts:

>>> fp
CountFingerprint(#features=3)
>>> str(fp)
'1:2,3,8:4'
>>> fp.get_total_count()
7

There are also ways to access data with ctypes, NumPy, and Pandas:

>>> fp.as_ctypes()
<__main__.CountFeature_64_32_Array_3 object at 0x107aa2f30>
>>> fp.as_ctypes()[0]
CountFeature_64_32(index=1, count=2)
>>> fp.as_numpy_array()
array([(1, 2), (3, 1), (8, 4)], dtype=[('index', '<u8'), ('count', '<u4')])
>>> fp.as_numpy_array()["index"]
array([1, 3, 8], dtype=uint64)
>>> fp.to_pandas()
   index  count
0      1      2
1      3      1
2      8      4

There are ways to create a CountFingerprint from an FPC string, with or without the id:

>>> countops.parse_fpc("1:2,3,4:5\tID123")
CountFingerprint(#features=3)
>>> countops.parse_id_and_fpc("1:2,3,4:5\tID123")
('ID123', CountFingerprint(#features=3))

There are also functions to convert a count fingerprint into a byte fingerprint:

>>> countops.create_folded_bytes(fp, 64)
b'\n\x01\x00\x00\x00\x00\x00\x00'
>>> countops.create_superimposed_bytes(fp, 64)
b'\xc0\x04\x10\x00\x80@\x00\x80'

For more advanced uses, see the CountConverter classes FoldedCountConverter and SuperimposedCountConverter.

Finally, there are functions to work with the binary string representation of RDKit’s count fingerprints:

>>> from rdkit import DataStructs
>>> fp = DataStructs.ULongSparseIntVect(2**35)
>>> fp[2**32] = 100
>>> fp.ToBinary()[:10]
b'\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00'
>>> countops.parse_rdkit_binary(fp.ToBinary())
CountFingerprint(#features=1)
>>> list(countops.parse_rdkit_binary(fp.ToBinary()))
[(4294967296, 100)]
>>> countops.parse_rdkit_binary_header(fp.ToBinary())
RDKitBinaryHeader(version=1, index_size=8, num_bits=34359738368,
num_features=1)

as well as to create those strings from chemfp’s count fingerprint:

>>> fp = countops.parse_fpc("1:2,3,4:5")
>>> dict(fp)
{1: 2, 3: 1, 4: 5}
>>> rdk_bin = countops.create_rdkit_binary_UIntSparseIntVect(fp)
>>> from rdkit import DataStructs
>>> rdk_fp = DataStructs.UIntSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{1: 2, 3: 1, 4: 5}

Core count fingerprint datatypes

class chemfp.countops.CountFeature_64_32

A feature in a chemfp count fingerprint

This is a direct view of the underlying data, which is an unsigned 64-bit integer for the index and an unsigned 32-bit integer for the count.

DO NOT MODIFY ITS VALUES. Doing so will likely break internal invariants.

index: int

The feature index.

count: int

The feature count.

class chemfp.countops.CountFingerprint

chemfp’s count fingerprint data type

A count fingerprint contains a list of sparse features. Each feature has an index and a count.

Use parse_fpc() or parse_id_and_fpc() to create a CountFingerprint from a string containing an FPC-encoded count fingerprint.

Use parse_rdkit_binary() to create a CountFingerprint from a byte string containing the “ToBinary()” output from an RDKit count fingerprint.

Use the class method CountFingerprint.from_features() to create a CountFingerprint from an iterator of (index, count) pairs.

A count fingerprint is immutable. It cannot be modified.

GetNonzeroElements() Dict[int, int]

An alias for dict(self)

An experimental migration feature which exists for compatibility with RDKit. It may be removed in the future. You should use dict(self).

Returns:

a dictionary mapping index to count

GetTotalValue() int

An alias for CountFingerprint.get_total_count()

Returns the sum of the feature counts.

An experimental migration feature which exists for compatibility with RDKit. It may be removed in the future. You should use get_total_count().

Returns:

an integer

__eq__()

Return True if the two lists of features are identical

__getitem__()

Return an (index, count) pair or list of such pairs

__iter__()

Iterate through the features as (index, count) pairs

__len__()

Return the number of features

__repr__()

Return a string like ‘CountFingerprint(#features=3)’

__str__()

Return the fingerprint as an FPC-encoded string

as_ctypes()

Return a ctypes view of the underlying features.

Each (index, count) pair is represented as a ctypes structure CountFeature_64_32 with fields index (c_uint64) and count (c_uint32).

This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the features to NumPy then use as_numpy_array() instead.

as_numpy_array() _typing.NumPyArray

Get a view of the features as a NumPy array

classmethod from_features(features: _typing.Iterable[int, int]) CountFingerprint

Create a CountFingerprint from an iterable of (index, count) pairs

The indices must be in increasing order and in the range 0 to 2**64-1. Each count must be an integer in the range 1 to 2**32-1.

Returns:

a CountFingerprint
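The constraints on the input pairs can be sketched in plain Python. This is only an illustration of the rules stated above, not chemfp's implementation. The helper name check_features is hypothetical, and treating a repeated index as an error is an assumption.

```python
MAX_INDEX = 2**64 - 1   # largest value of an unsigned 64-bit index
MAX_COUNT = 2**32 - 1   # largest value of an unsigned 32-bit count


def check_features(features):
    """Return the (index, count) pairs as a list, or raise ValueError.

    Hypothetical helper. Assumes the indices must be strictly increasing.
    """
    checked = []
    prev_index = -1
    for index, count in features:
        if not (0 <= index <= MAX_INDEX):
            raise ValueError(f"index {index} is out of range")
        if index <= prev_index:
            raise ValueError(f"index {index} is not in increasing order")
        if not (1 <= count <= MAX_COUNT):
            raise ValueError(f"count {count} for index {index} is out of range")
        checked.append((index, count))
        prev_index = index
    return checked
```

A check like this can be run on your own data before calling from_features(), for example to find the first out-of-order index.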

get_total_count() int

Return the sum of the feature counts

to_pandas(*, columns: tuple[str, str] | _typing.Sequence[str] = ('index', 'count')) _typing.PandasDataFrame

Return the feature indices and counts as a Pandas DataFrame

The first column contains the indices, and the second column contains the counts. The default column headers are “index” and “count”. Use columns to specify different headers.

Parameters:

columns (a list of two strings) – column names for the returned DataFrame

Returns:

a pandas DataFrame

Parse a string to create a count fingerprint

class chemfp.countops.FPCParseError(errno: int, msg: str, byte_offset: int, location: _Location = None)

Exception type used when parsing an FPC-encoded fingerprint or line.

The public attributes are:

errno: int

The chemfp error code used at the C level.

msg: str

The chemfp error message for the error code.

byte_offset: int

The offset to the byte which caused the error, or to the end of line if the string is incomplete. If the input is a text string then this is the offset in the UTF-8 encoded byte string.

location: chemfp.io.Location

The input string is stored in the Location’s “record” attribute.

__str__() str

Describe the error as a human-readable string

chemfp.countops.parse_fpc(s: str | bytes, stop_at_tab: bool = True) CountFingerprint

Parse an FPC-encoded count fingerprint as a count fingerprint.

The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.

Example:

>>> from chemfp import countops
>>> countops.parse_fpc("1,2:3\tID0001\n")
CountFingerprint(#features=2)
Parameters:
  • s (str or bytes) – An FPC-encoded fingerprint

  • stop_at_tab (bool) – If True, stop parsing at the first tab

Returns:

a CountFingerprint

chemfp.countops.parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, CountFingerprint]

Parse an FPC-encoded count fingerprint as an id and count fingerprint.

The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.

Example:

>>> from chemfp import countops
>>> countops.parse_id_and_fpc("1,2:3\tID0001\n")
('ID0001', CountFingerprint(#features=2))
Parameters:
  • s (str or bytes) – A string containing a fingerprint line from an FPC file.

  • require_newline (bool) – If True, the string must end with a newline.

Returns:

a tuple of id string and CountFingerprint

chemfp.countops.parse_fpc_as_dict(s: str | bytes, stop_at_tab: bool = True) Dict[int, int]

Parse an FPC-encoded count fingerprint as a count dictionary.

The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.

Example:

>>> from chemfp import countops
>>> countops.parse_fpc_as_dict("1,2:3\tID0001\n")
{1: 1, 2: 3}
Parameters:
  • s (str or bytes) – An FPC-encoded fingerprint

  • stop_at_tab (bool) – If True, stop parsing at the first tab

Returns:

a dict
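To make the FPC encoding and the stop_at_tab behaviour concrete, here is a minimal pure-Python sketch. It is not chemfp's parser. It assumes the encoding is a comma-separated list of "index:count" terms in which a missing ":count" means a count of 1, as the examples in this module suggest, and it does not handle an empty fingerprint or check the index order. The function names are hypothetical.

```python
def _parse_uint(text):
    # Accept only ASCII digits, so stray whitespace or a tab is an error.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected an unsigned integer, got {text!r}")
    return int(text)


def parse_fpc_sketch(s, stop_at_tab=True):
    """Parse an FPC-encoded string into a {index: count} dictionary."""
    if isinstance(s, bytes):
        s = s.decode("utf-8")
    if stop_at_tab:
        # Ignore the first tab and everything after it (such as the id).
        s = s.split("\t", 1)[0]
    counts = {}
    for term in s.split(","):
        index_str, sep, count_str = term.partition(":")
        index = _parse_uint(index_str)
        # A term without ":count" has an implicit count of 1.
        count = _parse_uint(count_str) if sep else 1
        counts[index] = count
    return counts
```

With stop_at_tab=False the sketch raises a ValueError for a string like "1,2:3\tID0001\n", because the text after the tab is not part of an FPC-encoded fingerprint.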

chemfp.countops.parse_id_and_fpc_as_dict(s: str | bytes, require_newline: bool = False) tuple[str, Dict[int, int]]

Parse a line from an FPC file as the id and count dictionary.

The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.

Example:

>>> from chemfp import countops
>>> countops.parse_id_and_fpc_as_dict("1,2:3\tID0001\n")
('ID0001', {1: 1, 2: 3})
Parameters:
  • s (str or bytes) – A string containing a fingerprint line from an FPC file.

  • require_newline (bool) – If True, the string must end with a newline.

Returns:

a tuple of id string and count dictionary
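The require_newline rule can be sketched as follows. This is an illustration only, with a hypothetical helper name. It assumes a line has the form "<fpc>\t<id>" with an optional terminal newline, and that the id is everything after the first tab. The actual parser may be stricter.

```python
def split_fpc_line(line, require_newline=False):
    """Split an FPC file line into (id, fpc_string)."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    if line.endswith("\n"):
        line = line[:-1]                 # drop the terminal newline
    elif require_newline:
        raise ValueError("missing terminal newline")
    fpc, tab, record_id = line.partition("\t")
    if not tab:
        raise ValueError("missing tab between the fingerprint and the id")
    return record_id, fpc
```

Setting require_newline=True is useful when reading from a file where a missing final newline may indicate a truncated record.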

chemfp.countops.parse_rdkit_binary(rdkit_binary: bytes) CountFingerprint

Convert the RDKit fingerprint bytes to a chemfp count fingerprint.

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

Example:

>>> from rdkit.DataStructs import UIntSparseIntVect
>>> rdk_fp = UIntSparseIntVect(2**30)
>>> rdk_fp[100] = 8
>>> rdk_fp.ToBinary()[:8]
b'\x01\x00\x00\x00\x04\x00\x00\x00'
>>> from chemfp import countops
>>> chemfp_fp = countops.parse_rdkit_binary(rdk_fp.ToBinary())
>>> chemfp_fp
CountFingerprint(#features=1)
>>> list(chemfp_fp)
[(100, 8)]

NOTE: the indices in the binary string must be in increasing order and have a non-zero count. If not, the returned count fingerprint may be invalid. A future implementation may validate the input.

Parameters:

rdkit_binary (bytes) – An RDKit fingerprint binary string.

Returns:

a CountFingerprint

Create a string given a count fingerprint

chemfp.countops.create_fpc(id: str, fp: CountFingerprint) str

Return the id and fingerprint as a line for an FPC file

This is the same as using f"{fp}\t{id}\n" except that this function will raise a ValueError if the id contains a newline, carriage return, tab, or NUL.

Parameters:
  • id (a text string) – the record identifier

  • fp (a CountFingerprint) – a chemfp count fingerprint

Returns:

the record as a line for an FPC file
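The id check can be sketched in a few lines of plain Python. This is not chemfp's code. The helper name is hypothetical, and it takes the already FPC-encoded fingerprint string rather than a CountFingerprint so that it runs without chemfp installed.

```python
def create_fpc_line(record_id, fpc_string):
    """Build one FPC file line from an id and an FPC-encoded fingerprint."""
    # Newline, carriage return, tab, and NUL would corrupt the line format.
    for ch in "\n\r\t\0":
        if ch in record_id:
            raise ValueError(f"id must not contain {ch!r}")
    return f"{fpc_string}\t{record_id}\n"
```

The check matters because a tab or newline inside the id would change how the line is split when the file is read back.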

chemfp.countops.create_fpcstring(fp: CountFingerprint) str

Return the fingerprint as an FPC-encoded string

This is the same as using str(fp).

Parameters:

fp (a CountFingerprint) – a chemfp count fingerprint

Returns:

the count fingerprint features as an FPC-encoded string
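As a rough illustration of the encoding, the following sketch formats (index, count) pairs the way the examples in this module appear to: terms are joined by commas, and a count of 1 is written as the bare index. The function name is hypothetical and this is not chemfp's implementation.

```python
def format_fpc(features):
    """Format (index, count) pairs as an FPC-encoded string."""
    terms = []
    for index, count in features:
        # A count of 1 is implicit, so only the index is written.
        terms.append(str(index) if count == 1 else f"{index}:{count}")
    return ",".join(terms)


print(format_fpc([(1, 2), (3, 1), (8, 4)]))   # prints 1:2,3,8:4
```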

chemfp.countops.create_rdkit_binary_UIntSparseIntVect(fp: CountFingerprint, num_bits=4294967295) bytes

Convert a count fingerprint to an RDKit UIntSparseIntVect binary string

The returned byte string can be passed to the UIntSparseIntVect constructor to create the corresponding RDKit fingerprint. For example:

>>> from chemfp import countops
>>> from rdkit import DataStructs
>>> fp = countops.CountFingerprint.from_features([(3, 5)])
>>> rdk_bin = countops.create_rdkit_binary_UIntSparseIntVect(fp, 16)
>>> rdk_bin[:12]
b'\x01\x00\x00\x00\x04\x00\x00\x00\x10\x00\x00\x00'
>>> rdk_fp = DataStructs.UIntSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{3: 5}

Each RDKit fingerprint stores its maximum allowed index, specified with num_bits.

If num_bits is smaller than 2**32-1 then every feature index must be less than num_bits. If num_bits is exactly 2**32-1 then the feature index 2**32-1 is also allowed.

If an index is too large then this function raises a ValueError.

Parameters:
  • fp (a CountFingerprint) – a chemfp count fingerprint

  • num_bits (int) – the maximum number of bits for the RDKit fingerprint

Returns:

a byte string used to create a UIntSparseIntVect
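The num_bits rule can be written out as a small check. This is a sketch of the rule as described above, not chemfp's code. The function name and the max_num_bits parameter are hypothetical.

```python
def check_index_limit(indices, num_bits, max_num_bits=2**32 - 1):
    """Raise ValueError if an index is too large for num_bits.

    Use max_num_bits=2**32-1 for the UIntSparseIntVect rule and
    max_num_bits=2**64-1 for the ULongSparseIntVect rule.
    """
    if num_bits == max_num_bits:
        limit = max_num_bits     # the largest possible index is also allowed
    else:
        limit = num_bits - 1     # every index must be less than num_bits
    for index in indices:
        if index > limit:
            raise ValueError(
                f"index {index} is too large for num_bits={num_bits}")
```

The special case exists because the largest num_bits value equals the largest index value, so an exclusive upper bound could never admit that index.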

chemfp.countops.create_rdkit_binary_ULongSparseIntVect(fp: CountFingerprint, num_bits=18446744073709551615) bytes

Convert a count fingerprint to an RDKit ULongSparseIntVect binary string

The returned byte string can be passed to the ULongSparseIntVect constructor to create the corresponding RDKit fingerprint. For example:

>>> from chemfp import countops
>>> from rdkit import DataStructs
>>> fp = countops.CountFingerprint.from_features([(3, 5)])
>>> rdk_bin = countops.create_rdkit_binary_ULongSparseIntVect(fp, 16)
>>> rdk_bin[:12]
b'\x01\x00\x00\x00\x08\x00\x00\x00\x10\x00\x00\x00'
>>> rdk_fp = DataStructs.ULongSparseIntVect(rdk_bin)
>>> rdk_fp.GetNonzeroElements()
{3: 5}

Each RDKit fingerprint stores its maximum allowed index, specified with num_bits.

If num_bits is smaller than 2**64-1 then every feature index must be less than num_bits. If num_bits is exactly 2**64-1 then the feature index 2**64-1 is also allowed.

If an index is too large then this function raises a ValueError.

Parameters:
  • fp (a CountFingerprint) – a chemfp count fingerprint

  • num_bits (int) – the maximum number of bits for the RDKit fingerprint

Returns:

a byte string used to create a ULongSparseIntVect

Convert a count fingerprint to a byte fingerprint

class chemfp.countops.CountConverter

Base class for converters from count fingerprint to byte fingerprints

A CountConverter is a base class which cannot be used directly. Instead, use FoldedCountConverter or SuperimposedCountConverter or derive your own subclass and implement “get_type” and “create_bytes”.

Every converter has the attribute:

num_bits: positive integer

The number of bits in the output fingerprint.

create_bytes(fp: CountFingerprint) bytes

Convert a chemfp count fingerprint to a byte string

Calling the method on the base class will raise a NotImplementedError. It must be implemented in the subclass.

get_type() str

Return a canonical description of this converter as a string

This should be appropriate for the “type” metadata field.

Calling the method on the base class will raise a NotImplementedError. It must be implemented in the subclass.

parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes

Parse an FPC-encoded count fingerprint to get a byte fingerprint

The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.

This is equivalent to using countops.parse_fpc() to parse the input FPC string to get a CountFingerprint then converting that fingerprint to bytes with CountConverter.create_bytes().

Parameters:
  • s (str or bytes) – A string containing an FPC-encoded fingerprint

  • stop_at_tab (bool) – If True, stop parsing at the first tab

Returns:

bytes

parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]

Parse a line from an FPC file to get the id and converted byte fingerprint

The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.

This is equivalent to using countops.parse_id_and_fpc() to parse the input FPC string to get the id and a CountFingerprint then converting that fingerprint to bytes with CountConverter.create_bytes().

Parameters:
  • s (str or bytes) – A string containing a fingerprint line from an FPC file.

  • require_newline (bool) – If True, the string must end with a newline.

Returns:

a tuple of id string and bytes

parse_rdkit_binary(rdkit_binary: bytes)

Parse an RDKit fingerprint binary string and convert to a byte fingerprint.

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

The result is a binary fingerprint as a byte string.

Parameters:

rdkit_binary (bytes) – An RDKit fingerprint binary string.

Returns:

bytes

class chemfp.countops.FoldedCountConverter

Convert a count fingerprint to a binary fingerprint using folding.

This converter implements modulo folding based on each feature index. The count is ignored.

num_bits: positive integer

The number of bits in the output fingerprint.

hash: bool

If True, use the index to seed a PRNG and generate the value used to fold, instead of folding on the index. Use this if the index values are not well distributed after folding.

For example, if the index contains atom type triplets with 5 bits for atom 1, 5 bits for atom 2, and 5 bits for atom 3, then modulo folding on 1024 bits is the same as using the 10 bits for atoms 1 and 2 while ignoring the 5 bits for atom 3.
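The effect is easy to reproduce with plain integers. The sketch below assumes a hypothetical packing with atom 1 in the lowest 5 bits and atom 3 in the highest 5 bits. Which atom is lost depends on the actual bit layout of the index.

```python
def pack_triplet(atom1, atom2, atom3):
    """Pack three 5-bit atom types into one 15-bit feature index."""
    return atom1 | (atom2 << 5) | (atom3 << 10)


a = pack_triplet(3, 7, 1)
b = pack_triplet(3, 7, 30)   # differs from a only in atom 3

# 1024 == 2**10, so modulo folding keeps only the low 10 bits.
# Both indices therefore fold to the same output bit.
folded_a = a % 1024
folded_b = b % 1024
print(a != b, folded_a == folded_b)   # prints True True
```

With such an index layout, folding on a hash of the index lets every part of the index influence the output bit.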

create_bytes(fp: CountFingerprint) bytes

Convert a chemfp count fingerprint to a folded binary fingerprint

get_type() str

Return a canonical type string for the folded parameters

parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes

Parse an FPC-encoded count fingerprint as folded bytes.

The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.

The FPC-encoded count fingerprint is folded directly to byte fingerprint.

Parameters:
  • s (str or bytes) – A string containing an FPC-encoded fingerprint

  • stop_at_tab (bool) – If True, stop parsing at the first tab

Returns:

bytes

parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]

Parse a line from an FPC file as the id and folded byte fingerprint

The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.

The implementation parses the input FPC string directly to extract the id and create the folded fingerprint bytes.

Parameters:
  • s (str or bytes) – A string containing a fingerprint line from an FPC file.

  • require_newline (bool) – If True, the string must end with a newline.

Returns:

a tuple of id string and bytes

parse_rdkit_binary(rdkit_binary: bytes)

Parse an RDKit fingerprint binary string and convert to a folded fingerprint.

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

The RDKit count fingerprint is folded directly to a byte fingerprint.

Parameters:

rdkit_binary (bytes) – An RDKit fingerprint binary string.

Returns:

bytes

class chemfp.countops.SuperimposedCountConverter

Convert a count fingerprint to a binary fingerprint using random superimposed coding.

Each feature has an index and a count. For each feature, the index is used to seed a pseudo-random number generator to generate min(count, max_count)*bits_per_count random values from 0 to num_bits-1 (duplicates are allowed).

Using bits_per_count values > 1 appears to make the Tanimoto scores less reliable.

If the on-bit density of the output binary fingerprint is low (~5% or less) then the binary Tanimoto between two output fingerprints is a good approximation to the count Tanimoto between the two input count fingerprints.

num_bits: positive integer

The number of bits in the output fingerprint.

bits_per_count: positive integer

The number of bits to set for each count of each feature. The default is 1. It does not seem useful to set more than 1 bit per count.

max_count: positive integer

If the feature count is larger than max_count then use max_count instead. The largest possible value is 2**32-1. The default is 1000, which prevents features like 0:1000000000 from generating 1 billion random values.

create_bytes(fp: CountFingerprint) bytes

Convert a chemfp count fingerprint to a superimposed binary fingerprint

get_type() str

Return a canonical type string for the superimposed parameters

parse_fpc(s: str | bytes, stop_at_tab: bool = True) bytes

Parse an FPC-encoded count fingerprint as superimposed bytes.

The input can be a text or byte string. By default parsing will stop at the end of string or the first tab. If stop_at_tab is False then the entire string must be an FPC-encoded fingerprint.

Parameters:
  • s (str or bytes) – An FPC-encoded fingerprint

  • stop_at_tab (bool) – If True, stop parsing at the first tab

Returns:

bytes

parse_id_and_fpc(s: str | bytes, require_newline: bool = False) tuple[str, bytes]

Parse a line from an FPC file as the id and superimposed byte fingerprint

The input can be a text or byte string. By default the terminal newline may be present but is not required. If require_newline is True then the terminal newline must be present.

The implementation parses the input FPC string directly to extract the id and create the superimposed fingerprint bytes.

Parameters:
  • s (str or bytes) – A string containing a fingerprint line from an FPC file.

  • require_newline (bool) – If True, the string must end with a newline.

Returns:

a tuple of id string and bytes

parse_rdkit_binary(rdkit_binary: bytes) bytes

Parse an RDKit fingerprint binary string to get the superimposed binary fingerprint

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

The RDKit count fingerprint binary is superimposed directly to a byte fingerprint.

Parameters:

rdkit_binary (bytes) – An RDKit fingerprint binary string.

Returns:

bytes

chemfp.countops.create_folded_bytes(fp: CountFingerprint, num_bits: int = 2048, hash: bool = False) bytes

Convert a chemfp count fingerprint to a folded byte fingerprint

For each feature index, modulo fold it to num_bits and set the corresponding bit in the output byte fingerprint to 1.

If hash is True, use the index to seed a PRNG and generate the value used to fold, instead of folding on the index. Use this if the index values are not well distributed after folding.

This is equivalent to:

FoldedCountConverter(num_bits, hash).create_bytes(fp)
Parameters:
  • fp (a CountFingerprint) – a count fingerprint

  • num_bits (a positive integer) – the number of bits for the output fingerprint

  • hash (either True or False) – if True, fold a hash of the index instead of the index

Returns:

the folded fingerprint as a byte string
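Modulo folding without hashing can be sketched in pure Python. This is an illustration rather than chemfp's implementation, and the function name is hypothetical. It assumes num_bits is a multiple of 8 and that bit i is stored in byte i // 8 at position i % 8, counting from the least significant bit. That layout reproduces the create_folded_bytes(fp, 64) output shown in the module introduction.

```python
def fold_to_bytes(features, num_bits):
    """Modulo fold (index, count) pairs into a byte fingerprint."""
    out = bytearray(num_bits // 8)
    for index, _count in features:       # the count is ignored
        bit = index % num_bits
        out[bit // 8] |= 1 << (bit % 8)  # set the bit, least significant first
    return bytes(out)


print(fold_to_bytes([(1, 2), (3, 1), (8, 4)], 64))
# prints b'\n\x01\x00\x00\x00\x00\x00\x00'
```

The hash=True variant is not covered here because it depends on chemfp's PRNG.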

chemfp.countops.create_superimposed_bytes(fp: CountFingerprint, num_bits=2048, bits_per_count=1, max_count=1000) bytes

Superimpose a count fingerprint to a byte fingerprint

Use random superimposed coding to convert the CountFingerprint to a byte fingerprint with num_bits bits.

Each feature has an index and a count. For each feature, the index is used to seed a pseudo-random number generator to generate min(count, max_count)*bits_per_count random values from 0 to num_bits-1 (duplicates are allowed).

Using bits_per_count values > 1 appears to make the Tanimoto scores less reliable.

This is equivalent to:

SuperimposedCountConverter(
    num_bits, bits_per_count, max_count).create_bytes(fp)
Parameters:
  • fp (a CountFingerprint) – a chemfp count fingerprint

  • num_bits (a positive integer) – the number of bits in the output byte fingerprint

  • bits_per_count (a positive integer (should be 1)) – a multiplier to the number of generated values

  • max_count (a positive integer) – the upper bound for the count to use

Returns:

the superimposed fingerprint as a byte string
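The idea behind random superimposed coding can be sketched with Python's standard library. This is an illustration under stated assumptions, not chemfp's implementation. It uses random.Random as the PRNG, which is not the generator chemfp uses, so the output bytes will not match chemfp's. The function name and the bit layout (bit i in byte i // 8, least significant bit first) are also assumptions.

```python
import random


def superimpose_to_bytes(features, num_bits=2048, bits_per_count=1,
                         max_count=1000):
    """Superimpose (index, count) pairs onto a byte fingerprint."""
    out = bytearray((num_bits + 7) // 8)
    for index, count in features:
        rng = random.Random(index)         # seed the PRNG with the index
        num_values = min(count, max_count) * bits_per_count
        for _ in range(num_values):
            bit = rng.randrange(num_bits)  # duplicates are allowed
            out[bit // 8] |= 1 << (bit % 8)
    return bytes(out)
```

Because the generator is seeded with the index alone, a feature with a smaller count sets a subset of the bits set by the same feature with a larger count. That is what lets the shared bits track the minimum of the two counts.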

Functions which work on count fingerprints

chemfp.countops.count_tanimoto(fp1: CountFingerprint, fp2: CountFingerprint) float

Compute the Tanimoto between two chemfp count fingerprints

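This computes the multiset Tanimoto, defined as the sum of the minimum counts over the indices the two fingerprints have in common, divided by the sum of the maximum counts over the indices present in either fingerprint. An index missing from a fingerprint has a count of 0.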

If there are no features then the Tanimoto is 0.0.

Example:

>>> from chemfp import countops
>>> fp1 = countops.parse_fpc("1:1,2:8")
>>> fp2 = countops.parse_fpc("1:3,3:5")
>>> countops.count_tanimoto(fp1, fp2)
0.0625
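Parameters:
  • fp1 (a CountFingerprint) – a chemfp count fingerprint

  • fp2 (a CountFingerprint) – a second chemfp count fingerprint

Returns: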

the Tanimoto as a float
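Because the features of a count fingerprint are ordered by increasing index, the multiset Tanimoto can be computed with a single merge pass over the two feature lists. The following is a sketch of that approach in pure Python, with a hypothetical function name. It is not necessarily how chemfp computes the score.

```python
def sorted_pairs_tanimoto(fp1, fp2):
    """Multiset Tanimoto of two lists of (index, count) pairs.

    Both lists must be sorted by increasing index.
    """
    i = j = 0
    min_sum = max_sum = 0
    while i < len(fp1) and j < len(fp2):
        index1, count1 = fp1[i]
        index2, count2 = fp2[j]
        if index1 == index2:
            # Shared index: contributes to both sums.
            min_sum += min(count1, count2)
            max_sum += max(count1, count2)
            i += 1
            j += 1
        elif index1 < index2:
            max_sum += count1   # only in fp1, the other count is 0
            i += 1
        else:
            max_sum += count2   # only in fp2, the other count is 0
            j += 1
    # Any leftover features are present in only one fingerprint.
    max_sum += sum(count for _, count in fp1[i:])
    max_sum += sum(count for _, count in fp2[j:])
    if max_sum == 0:
        return 0.0
    return min_sum / max_sum


print(sorted_pairs_tanimoto([(1, 1), (2, 8)], [(1, 3), (3, 5)]))   # prints 0.0625
```

In the example, only index 1 is shared, giving a numerator of min(1, 3) = 1 and a denominator of 3 + 8 + 5 = 16.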

chemfp.countops.dict_tanimoto(fp1: Dict[int, int], fp2: Dict[int, int]) float

Compute the Tanimoto between two count dictionaries

A count dictionary is a dictionary mapping an integer index to an integer count.

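This computes the multiset Tanimoto, defined as the sum of the minimum counts over the indices the two dictionaries have in common, divided by the sum of the maximum counts over the indices present in either dictionary. An index missing from a dictionary has a count of 0.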

If there are no features then the Tanimoto is 0.0.

Example:

>>> from chemfp import countops
>>> countops.dict_tanimoto({1: 1, 2: 8}, {1: 3, 3: 5})
0.0625
Parameters:
  • fp1 (a count dictionary) – a count fingerprint represented as a dictionary

  • fp2 (a count dictionary) – a second count fingerprint represented as a dictionary

Returns:

the Tanimoto as a float
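For reference, the definition can be written directly with dictionaries. This is a plain-Python sketch with a hypothetical function name, not chemfp's implementation.

```python
def dict_tanimoto_sketch(d1, d2):
    """Multiset Tanimoto of two {index: count} dictionaries."""
    min_sum = max_sum = 0
    for index in d1.keys() | d2.keys():
        count1 = d1.get(index, 0)   # a missing index has a count of 0
        count2 = d2.get(index, 0)
        min_sum += min(count1, count2)
        max_sum += max(count1, count2)
    return min_sum / max_sum if max_sum else 0.0


print(dict_tanimoto_sketch({1: 1, 2: 8}, {1: 3, 3: 5}))   # prints 0.0625
```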

Work with RDKit fingerprint binary strings

class chemfp.countops.RDKitBinaryHeader(version: int, index_size: int, num_bits: int, num_features: int)

Header information for an RDKit sparse fingerprint binary string

These are strings generated by any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

The public attributes are:

version

The version number. Only version 1 is supported.

index_size

The number of bytes for the feature index. Only 4 and 8 are supported.

num_bits

The number of bits in the sparse fingerprint.

num_features

The number of features in the sparse fingerprint.

get_feature_size() int

Return the number of bytes used for each feature record

get_header_size() int

Return the number of bytes needed for the header

get_total_size() int

Return the number of bytes needed for the entire string

chemfp.countops.parse_rdkit_binary_header(rdkit_binary: bytes, check_total_size: bool = False)

Parse the header information from the ToBinary() output of an RDKit sparse fingerprint.

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

Example:

>>> from rdkit.DataStructs import ULongSparseIntVect
>>> rdk_fp = ULongSparseIntVect(2**42)
>>> rdk_fp[2**41+100] = 12345
>>> from chemfp import countops
>>> countops.parse_rdkit_binary_header(rdk_fp.ToBinary())
RDKitBinaryHeader(version=1, index_size=8,
num_bits=4398046511104, num_features=1)

The header contains a version, the number of bytes per index, the maximum number of bits for a fingerprint, and the total number of features.

By default this function only parses and validates the header values. The version must be 1 and the index size must be either 4 or 8.

If check_total_size is True then also check that the length of the rdkit_binary byte string is correct. The default is False.

Parameters:
  • rdkit_binary (bytes) – An RDKit fingerprint binary string.

  • check_total_size (bool) – If True, check if rdkit_binary is the right size

Returns:

an RDKitBinaryHeader
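The header layout can be sketched with the standard struct module. This is an illustration, not chemfp's parser. The layout is inferred from the documented fields and the example bytes in this module: a 4-byte version, a 4-byte index size, then num_bits and num_features, each stored in index_size bytes, all little-endian. Reading the values as unsigned, and the size of a feature record (an index followed by a 4-byte count), are assumptions. The names Header and parse_header_sketch are hypothetical.

```python
import struct
from collections import namedtuple

Header = namedtuple("Header", "version index_size num_bits num_features")


def parse_header_sketch(rdkit_binary, check_total_size=False):
    """Parse the header of an RDKit sparse fingerprint binary string."""
    if len(rdkit_binary) < 8:
        raise ValueError("too short for the version and index size")
    version, index_size = struct.unpack_from("<II", rdkit_binary, 0)
    if version != 1:
        raise ValueError(f"unsupported version: {version}")
    if index_size not in (4, 8):
        raise ValueError(f"unsupported index size: {index_size}")
    header_size = 8 + 2 * index_size
    if len(rdkit_binary) < header_size:
        raise ValueError("too short for the header")
    fmt = "<II" if index_size == 4 else "<QQ"
    num_bits, num_features = struct.unpack_from(fmt, rdkit_binary, 8)
    if check_total_size:
        # Assumption: each record is an index plus a 4-byte count.
        expected = header_size + num_features * (index_size + 4)
        if len(rdkit_binary) != expected:
            raise ValueError(
                f"expected {expected} bytes, got {len(rdkit_binary)}")
    return Header(version, index_size, num_bits, num_features)
```

Checking the total size is a cheap way to detect a truncated string before reading the feature records.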

chemfp.countops.create_fpcstring_from_rdkit_binary(rdkit_binary: bytes) str

Convert an RDKit fingerprint binary string to an FPC-encoded fingerprint

This can parse the byte string returned from any of RDKit’s:

  • IntSparseIntVect().ToBinary()

  • UIntSparseIntVect().ToBinary()

  • LongSparseIntVect().ToBinary()

  • ULongSparseIntVect().ToBinary()

Example:

>>> from rdkit.DataStructs import ULongSparseIntVect
>>> rdk_fp = ULongSparseIntVect(2**42)
>>> rdk_fp[1_000_000_000_000] = 1912
>>> from chemfp import countops
>>> countops.create_fpcstring_from_rdkit_binary(rdk_fp.ToBinary())
'1000000000000:1912'

NOTE: the indices in the binary string must be in increasing order and have a non-zero count. If not, the returned string may be an invalid FPC string. A future implementation may validate the input.

Returns:

a string in FPC format