chemfp.arena module¶
Algorithms and data structures for working with a FingerprintArena.
This is an internal chemfp module. It should not be imported by programs which use the public API. (Let me know if anything else should be part of the public API.)
This module contains class definitions for objects which are returned as part of the public API. A FingerprintArena stores fingerprints in a contiguous block of memory, along with their associated ids. A FingerprintList provides a list-like view of the fingerprints.
- class chemfp.arena.FingerprintArena(metadata: _typing.Metadata, alignment: int, start_padding: int, end_padding: int, storage_size: int, arena: _typing.ArenaBytes, popcount_indices: bytes, arena_ids: _typing.IdList, start: int = 0, end: _typing.Optional[int] = None, id_lookup: _typing.OptionalIdLookupFunc = None, num_bits: _typing.Optional[int] = None, num_bytes: _typing.Optional[int] = None, license_key: bytes = b'')¶
Bases:
FingerprintReader
Store fingerprints in a contiguous block of memory for fast searches
A FingerprintArena implements the chemfp.FingerprintReader API.
The fingerprints are stored in a contiguous block of memory, so the per-molecule overhead is very low. The block is named arena. The first fingerprint starts at the offset start_padding and each fingerprint takes storage_size bytes, so fingerprint i is located at:
self.arena[self.start_padding + i * self.storage_size : self.start_padding + (i+1) * self.storage_size]
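The layout arithmetic above can be sketched in pure Python. This is only an illustration and does not use chemfp: the helper name `fingerprint_at` and the sample block are hypothetical, and only the offset formula comes from the text.

```python
def fingerprint_at(arena: bytes, i: int, start_padding: int,
                   storage_size: int, num_bytes: int) -> bytes:
    """Return the num_bytes of fingerprint i, dropping any storage padding."""
    begin = start_padding + i * storage_size
    return arena[begin:begin + num_bytes]


# Hypothetical block: 4 bytes of start padding, then three 2-byte
# fingerprints, each stored in 4 bytes (2 data bytes + 2 zero bytes).
arena = (b"\x00" * 4
         + b"\x01\x02\x00\x00"
         + b"\x03\x04\x00\x00"
         + b"\x05\x06\x00\x00")

second = fingerprint_at(arena, 1, start_padding=4, storage_size=4, num_bytes=2)
```

Passing `num_bytes=storage_size` would return the full stored slot, including the zero padding.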
The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If self.popcount_indices is a non-empty byte string then it contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the BitBound search algorithm.
A FingerprintArena is its own context manager, but it does nothing on context exit. The derived FPBFingerprintArena may use a memory-mapped FPB file, which will be closed by the context manager or by an explicit call to close().
The public attributes are:
- metadata: chemfp.Metadata¶
A chemfp.Metadata with information about the fingerprints.
- alignment: int¶
The fingerprint memory alignment.
- num_bits: int¶
The number of bits in each fingerprint. Must satisfy (num_bytes-1)*8 < num_bits <= num_bytes*8.
- num_bytes: int¶
The number of bytes in each fingerprint (must be <= storage_size).
- start_padding: int¶
The number of bytes to the first fingerprint in the arena block.
- end_padding: int¶
The number of bytes after the last fingerprint in the arena block.
- storage_size: int¶
The number of bytes used to store a fingerprint. This may be larger than the fingerprint num_bytes due to padding. The padding is supposed to be zeros.
- arena: bytes¶
A contiguous block of memory containing the fingerprints and possible other data. This is typically a byte string or memory-map.
- popcount_indices: bytes¶
A byte string containing the offset indices for fingerprints sorted by popcount. You should use get_popcount_offsets() instead of this string. It is b"" if there is no index.
- start: int¶
The index for the first fingerprint in the arena. May be greater than 0 if this is a subarena starting after the start of the arena. (It is not a byte position!)
- end: int¶
If a subarena, one more than the index of the last fingerprint relative to the start of the parent arena. Will be the number of total fingerprints if this is not a subarena.
- fingerprints: FingerprintList¶
Provides list-like access to the fingerprints, in index order, as a FingerprintList.
- close()¶
Close any resources associated with this arena
If the arena uses a memory-mapped file (e.g., an FPB file) then this will close the file.
- copy(indices: _typing.Optional[_typing.Sequence[int]] = None, reorder: _typing.Optional[bool] = None, metadata: _typing.OptionalMetadata = None, ids: _typing.Optional[_typing.Sequence[str]] = None) FingerprintArena ¶
Create a new arena using either all or some of the fingerprints in this arena
By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or “sub-arena” of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids.
The indices parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged.
If indices are specified then the default reorder value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If reorder is False then the new arena will preserve the order given by the indices.
If indices are not specified, then the default is to preserve the order type of the original arena. Use reorder=True to always reorder the fingerprints in the new arena by popcount, and reorder=False to always leave them in the current ordering.
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18]
(b'9425031', b'9425015', b'9425040', b'9425033')
>>> len(arena)
19
>>> new_arena = arena.copy(indices=[1, 5, 10, 18])
>>> len(new_arena)
4
>>> new_arena.ids
[b'9425031', b'9425015', b'9425040', b'9425033']
>>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False)
>>> new_arena.ids
[b'9425033', b'9425040', b'9425015', b'9425031']
If metadata is not None then it will be the metadata of the new copy.
Use ids to specify the identifiers for the new copy. This is especially useful as a way to preserve the initial fingerprint index in the original arena.
- Parameters:
indices (iterable containing integers, or None) – indices of the records to copy into the new arena
reorder (True to reorder, False to leave in input order, None for default action) – describes how to order the fingerprints
metadata (a chemfp.Metadata or None) – the metadata to use in the new copy
ids (a list of values, or None to keep the original identifiers) – replacement identifiers to use in the copy
- Returns:
a new FingerprintArena
- count_tanimoto_hits_arena(queries: _typing.FingerprintArena, threshold: float = 0.7) _typing.List[int] ¶
Count the fingerprints which are sufficiently similar to each query fingerprint
Returns a list containing a count for each query fingerprint in the queries arena. The count is the number of fingerprints in the arena which are at least threshold similar to the query fingerprint.
The order of results is the same as the order of the queries.
- Parameters:
queries (a FingerprintArena) – query fingerprints
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- Returns:
list of integer counts, one for each query
- count_tanimoto_hits_fp(query_fp: bytes, threshold: float = 0.7) int ¶
Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.
- Parameters:
query_fp (byte string) – query fingerprint
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- Returns:
integer count
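As a reference for what the two count methods above compute, here is a brute-force sketch in pure Python. It is not chemfp's implementation (which is far faster and can use the popcount index); the function names are hypothetical, and scoring two empty fingerprints as 0.0 is an assumption.

```python
def popcount(data: bytes) -> int:
    """Number of on bits in a byte string."""
    return bin(int.from_bytes(data, "little")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length fingerprints."""
    common = popcount(bytes(x & y for x, y in zip(fp1, fp2)))
    union = popcount(fp1) + popcount(fp2) - common
    # Assumption: two empty fingerprints score 0.0 rather than raising.
    return common / union if union else 0.0


def count_hits_bruteforce(targets, query_fp, threshold=0.7):
    """Count the targets that are at least `threshold` similar to query_fp."""
    return sum(1 for fp in targets if tanimoto(query_fp, fp) >= threshold)


targets = [b"\x0f", b"\x03", b"\xf0"]  # scores vs. the query: 3/4, 2/3, 0
query_fp = b"\x07"
num_hits = count_hits_bruteforce(targets, query_fp, threshold=0.7)
```

Only the first target reaches the 0.7 threshold here; lowering the threshold to 0.6 also admits the second one.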
- count_tversky_hits_fp(query_fp: bytes, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) int ¶
Count the fingerprints which are sufficiently similar to the query fingerprint
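Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp, using the Tversky index with weights alpha and beta.
- Parameters:
query_fp (byte string) – query fingerprint
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
alpha (float) – Tversky alpha weight (default: 1.0)
beta (float) – Tversky beta weight (default: 1.0)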
- Returns:
integer count
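The Tversky index generalizes the Tanimoto: with alpha = beta = 1.0 it equals the Tanimoto, and with alpha = beta = 0.5 it equals the Dice coefficient. The sketch below is a pure-Python illustration with a hypothetical `tversky` function. Which weight chemfp applies to the query-only bits versus the target-only bits is an assumption here, as is the 0.0 score for an empty denominator.

```python
def popcount(data: bytes) -> int:
    """Number of on bits in a byte string."""
    return bin(int.from_bytes(data, "little")).count("1")


def tversky(query_fp: bytes, target_fp: bytes,
            alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index: c / (c + alpha*|Q only| + beta*|T only|)."""
    common = popcount(bytes(q & t for q, t in zip(query_fp, target_fp)))
    # Assumption: alpha weights bits found only in the query,
    # beta weights bits found only in the target.
    query_only = popcount(query_fp) - common
    target_only = popcount(target_fp) - common
    denominator = common + alpha * query_only + beta * target_only
    return common / denominator if denominator else 0.0


query_fp, target_fp = b"\x07", b"\x0f"   # 3 bits in common, 1 target-only bit
as_tanimoto = tversky(query_fp, target_fp)             # alpha = beta = 1.0
as_dice = tversky(query_fp, target_fp, 0.5, 0.5)
ignoring_target_extras = tversky(query_fp, target_fp, 1.0, 0.0)
```

Setting one weight to 0.0 makes the score asymmetric, which is how the index is typically used for substructure-like matching.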
- get_bit_counts() array ¶
Count the number of on bits for each position in the fingerprint
This function returns an array.array of num_bits integers. Use get_bit_counts_as_numpy() to return a NumPy array.
- Returns:
an array.array of length num_bits with 4-byte signed integers
- get_bit_counts_as_numpy() _typing.NumPyArray ¶
Count the number of on bits for each position in the fingerprint
This function returns a NumPy array of num_bits integers. Use get_bit_counts() to return an array.array.
- Returns:
a NumPy array of length num_bits and type int32
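To make the meaning of the per-position counts concrete, the following pure-Python sketch tallies how many fingerprints have each bit set. The function name `bit_counts` is hypothetical, and the bit numbering (bit 0 is the least significant bit of the first byte) is an assumption about the fingerprint encoding.

```python
from array import array


def bit_counts(fingerprints, num_bits: int) -> array:
    """Return an array of num_bits signed integers, one count per bit."""
    counts = array("i", [0]) * num_bits
    for fp in fingerprints:
        for bitno in range(num_bits):
            # Assumption: bit 0 is the lowest bit of byte 0.
            if (fp[bitno // 8] >> (bitno % 8)) & 1:
                counts[bitno] += 1
    return counts


fingerprints = [b"\x01\x80", b"\x03\x00"]   # bits {0, 15} and bits {0, 1}
counts = bit_counts(fingerprints, num_bits=16)
```

The result has one entry per bit position, so `counts[0]` is the number of fingerprints with bit 0 set.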
- get_by_id(id) Tuple[str, bytes] | None ¶
Given the record identifier, return the (id, fingerprint) pair.
If the id is not present then return None.
- get_fingerprint(i) bytes ¶
Return the fingerprint at index i
Raises an IndexError if index i is out of range.
- get_fingerprint_by_id(id) bytes | None ¶
Given the record identifier, return its fingerprint
If the id is not present then return None.
- get_index_by_id(id) int | None ¶
Given the record identifier, return the record index.
If the id is not present then return None.
- get_popcount_offsets() array | None ¶
Return the popcount index as an array.array("I") of offsets.
If c is an index into the array arr then the fingerprint at offset i, where arr[c] <= i < arr[c+1], has a popcount of c. For example, the fingerprints with 10 bits set are in the range arr[10] <= i < arr[11]. If arr[c] == arr[c+1] then there are no fingerprints with popcount c.
If the arena does not have a popcount index (which occurs if the arena is not sorted by popcount) then this method returns None.
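The offsets array described above can be illustrated with a small pure-Python sketch. The helper `build_popcount_offsets` is hypothetical and is not how chemfp builds its index; the array length of num_bits + 2 is an assumption chosen so that arr[c+1] exists for every possible popcount c.

```python
from array import array


def build_popcount_offsets(popcounts, num_bits: int) -> array:
    """Build offsets for fingerprints already sorted by popcount.

    `popcounts` lists the popcount of each fingerprint in arena order.
    """
    offsets = array("I", [0]) * (num_bits + 2)
    for popcount in popcounts:
        offsets[popcount + 1] += 1          # tally fingerprints per popcount
    for c in range(1, num_bits + 2):
        offsets[c] += offsets[c - 1]        # running total gives the offsets
    return offsets


# Four 4-bit fingerprints with popcounts 0, 2, 2 and 3.
arr = build_popcount_offsets([0, 2, 2, 3], num_bits=4)
two_bit_range = (arr[2], arr[3])   # indices i with arr[2] <= i < arr[3]
```

In this example no fingerprint has a popcount of 1, so `arr[1] == arr[2]`.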
- property ids: Sequence[str]¶
Return the identifiers in this arena or subarena.
- iter_arenas(arena_size: _typing.OptionalInt = 1000) _typing.FingerprintArenaIterator ¶
Base class for all chemfp objects holding fingerprint records
All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.
- knearest_tanimoto_search_arena(queries: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0) _typing.SearchResults ¶
Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResults, where the hits in each SearchResult are sorted by similarity score.
- Parameters:
queries (a FingerprintArena) – query fingerprints
k (positive integer) – maximum number of neighbors to find
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
- Returns:
a SearchResults
- knearest_tanimoto_search_fp(query_fp: bytes, k: int = 3, threshold: float = 0.0) _typing.SearchResult ¶
Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.
- Parameters:
query_fp (byte string) – query fingerprint
k (positive integer) – maximum number of neighbors to find
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
- Returns:
a SearchResult
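The k-nearest selection described above (filter by threshold, then keep the k best) can be sketched with a brute-force scan. This is an illustration only, not chemfp's algorithm: the function names are hypothetical and the result is a plain list of (index, score) pairs rather than a SearchResult.

```python
import heapq


def popcount(data: bytes) -> int:
    """Number of on bits in a byte string."""
    return bin(int.from_bytes(data, "little")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity; two empty fingerprints score 0.0 (assumption)."""
    common = popcount(bytes(x & y for x, y in zip(fp1, fp2)))
    union = popcount(fp1) + popcount(fp2) - common
    return common / union if union else 0.0


def knearest_bruteforce(targets, query_fp, k=3, threshold=0.0):
    """Return up to k (index, score) pairs, highest score first."""
    hits = []
    for index, fp in enumerate(targets):
        score = tanimoto(query_fp, fp)
        if score >= threshold:              # apply the threshold first
            hits.append((index, score))
    # Then keep the k highest-scoring hits, sorted from best to worst.
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [b"\x0f", b"\x03", b"\xf0", b"\x07"]   # scores: 3/4, 2/3, 0, 1
top_two = knearest_bruteforce(targets, b"\x07", k=2)
```

With a threshold of 0.7 and k=3 the same two hits come back, because only two targets pass the threshold.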
- knearest_tversky_search_fp(query_fp: bytes, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0) _typing.SearchResult ¶
Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, using the Tversky index with weights alpha and beta, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.
- Parameters:
query_fp (byte string) – query fingerprint
k (positive integer) – maximum number of neighbors to find
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
alpha (float) – Tversky alpha weight (default: 1.0)
beta (float) – Tversky beta weight (default: 1.0)
- Returns:
a SearchResult
- random_choice(rng=None) Tuple[str, bytes] ¶
Return a randomly selected (id, fp) pair
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
- Parameters:
rng (None, int, or a random.Random()) – method to use for random sampling
- Returns:
a 2-element tuple of identifier string and fingerprint bytes
- sample(num_samples: int | float, reorder: bool = True, rng=None) FingerprintArena ¶
Return a new arena containing num_samples randomly selected fingerprints, without replacement
If num_samples is an integer then it must be between 0 and the size of the arena. If num_samples is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include.
By default the new arena is sorted by popcount. Set reorder to False to return the fingerprints in random order.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
- Parameters:
num_samples (int or float) – number of fingerprints to select
reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
rng (None, int, or a random.Random()) – method to use for random sampling
- Returns:
a FingerprintArena
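The rng and num_samples rules described above can be sketched as follows. The helpers `resolve_rng` and `sample_indices` are hypothetical, and truncating a fractional sample size with int() is an assumption about the rounding rule.

```python
import random


def resolve_rng(rng=None):
    """Map the rng argument to an object with a .sample() method."""
    if rng is None:
        return random                 # the module-level random.sample()
    if isinstance(rng, int):
        return random.Random(rng)     # an integer is used as a seed
    return rng                        # otherwise trust rng.sample()


def sample_indices(arena_size: int, num_samples, rng=None):
    """Pick fingerprint indices at random, without replacement."""
    if isinstance(num_samples, float):
        # Assumption: a proportion is truncated to a whole count.
        num_samples = int(arena_size * num_samples)
    return resolve_rng(rng).sample(range(arena_size), num_samples)


first = sample_indices(100, 10, rng=42)
again = sample_indices(100, 10, rng=42)    # same seed, same selection
quarter = sample_indices(100, 0.25, rng=1)
```

Passing an integer seed makes the selection reproducible, which is useful when the same subset is needed across runs.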
- save(destination: str | bytes | Path | None | BinaryIO, format: str | None = None, level: None | int | Literal['min', 'default', 'max'] = None)¶
Save the fingerprints to a given destination and format
The output format is specified by format. If format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.
- Parameters:
destination (a filename, file object, or None) – the output destination
format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files
- Returns:
None
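The format-selection rule described above can be sketched as a small helper. The function `guess_output_format` and its extension table are hypothetical; in particular, treating a file object or None as “fps” is an assumption based on the fallback rule in the text, not on chemfp's code.

```python
import os

# Longer extensions first so ".fps.gz" is not mistaken for another format.
KNOWN_EXTENSIONS = (".fps.gz", ".fps.zst", ".fps", ".fpb")


def guess_output_format(destination, format=None) -> str:
    """Pick an output format name following the rules in the text."""
    if format is not None:
        return format                      # an explicit format always wins
    if isinstance(destination, (str, bytes, os.PathLike)):
        filename = os.fsdecode(destination)
        for extension in KNOWN_EXTENSIONS:
            if filename.endswith(extension):
                return extension[1:]       # drop the leading "."
    return "fps"                           # unrecognized: fall back to FPS


fpb_format = guess_output_format("targets.fpb")
fallback_format = guess_output_format("targets.txt")
```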
- threshold_tanimoto_search_arena(queries: _typing.FingerprintArena, threshold: float = 0.7) _typing.SearchResults ¶
Find the fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find all of the fingerprints in this arena which are at least threshold similar. The hits are returned as a SearchResults, where the hits in each SearchResult are in arbitrary order.
- Parameters:
queries (a FingerprintArena) – query fingerprints
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- Returns:
a SearchResults
- threshold_tanimoto_search_fp(query_fp, threshold=0.7) _typing.SearchResult ¶
Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.
- Parameters:
query_fp (byte string) – query fingerprint
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- Returns:
a SearchResult
- threshold_tversky_search_fp(query_fp: bytes, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) _typing.SearchResult ¶
Find the fingerprints which are sufficiently similar to the query fingerprint
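Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp, using the Tversky index with weights alpha and beta. The hits are returned as a SearchResult, in arbitrary order.
- Parameters:
query_fp (byte string) – query fingerprint
threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
alpha (float) – Tversky alpha weight (default: 1.0)
beta (float) – Tversky beta weight (default: 1.0)
- Returns:
a SearchResult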
- to_numpy_array() _typing.NumPyArray ¶
Get the fingerprint bytes in a chemfp arena as a NumPy uint8 array.
A chemfp arena stores fingerprints in a contiguous byte string. This function returns a 2D NumPy array which is a view of that string. The array has len(arena) rows and arena.storage_size columns.
The storage size may be larger than the minimum number of bytes in the fingerprint because of zero padding used to improve performance. For example, the 166-bit MACCS keys use 24 bytes of storage when only 21 bytes are needed, because then chemfp can use the fast POPCNT instruction when computing the Tanimoto.
To remove extra padding bytes, use NumPy indexing to copy the fingerprint bytes to a new array:
arr[:,0:arena.num_bytes]
The last column of this new array may contain padding bits if the number of bits in a fingerprint is not a multiple of 8.
Warning
Do not attempt to access the contents of a NumPy view of a FPBFingerprintArena (the arena from an FPB file) after the FPB file has been closed as that will likely cause a segmentation fault or other severe failure.
- Returns:
a NumPy array of type uint8
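The MACCS arithmetic mentioned above (21 bytes needed, 24 bytes stored) can be checked with a short sketch. The helper names are hypothetical, and the 8-byte alignment is an assumption that happens to reproduce the numbers in the text. Plain byte strings stand in for the NumPy rows.

```python
def min_num_bytes(num_bits: int) -> int:
    """Smallest number of bytes that can hold num_bits bits."""
    return (num_bits + 7) // 8


def padded_storage_size(num_bytes: int, alignment: int) -> int:
    """Round num_bytes up to the next multiple of alignment."""
    return ((num_bytes + alignment - 1) // alignment) * alignment


def strip_padding(rows, num_bytes: int):
    """Keep only the first num_bytes of each stored row."""
    return [row[:num_bytes] for row in rows]


maccs_bytes = min_num_bytes(166)
# Assumption: an 8-byte alignment, which matches the 24 bytes in the text.
maccs_storage = padded_storage_size(maccs_bytes, alignment=8)

stored_rows = [b"\xff" * 21 + b"\x00" * 3, b"\x01" * 21 + b"\x00" * 3]
unpadded_rows = strip_padding(stored_rows, maccs_bytes)
```

`strip_padding` plays the same role as the `arr[:,0:arena.num_bytes]` indexing on the NumPy array.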
- to_numpy_bitarray(bitlist=None) _typing.NumPyArray ¶
Get the fingerprint bits in a chemfp arena as a NumPy uint8 array.
This function returns a 2D NumPy array with len(arena) rows and one column for each bit. The default returns arena.num_bits columns, where column 0 is the first bit, etc. Use bitlist to specify the indices of which columns to return. Negative indices are supported; -1 is the last bit, -2 is the second to last. Out of range indices raise an IndexError.
- Parameters:
bitlist (iterable of integers) – bit column indices to use (default: all bits)
- Returns:
a NumPy array of type uint8
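The column selection described above, including negative indices, can be sketched with nested lists in place of a NumPy array. The function names are hypothetical and the bit numbering (bit 0 is the lowest bit of the first byte) is an assumption.

```python
def fingerprint_bits(fp: bytes, num_bits: int):
    """Expand a fingerprint into a list of 0/1 values, one per bit."""
    # Assumption: bit 0 is the least significant bit of byte 0.
    return [(fp[i // 8] >> (i % 8)) & 1 for i in range(num_bits)]


def bit_columns(fingerprints, num_bits: int, bitlist=None):
    """Return one row per fingerprint, limited to the bitlist columns."""
    rows = []
    for fp in fingerprints:
        bits = fingerprint_bits(fp, num_bits)
        if bitlist is not None:
            # List indexing accepts negative indices and raises
            # IndexError when an index is out of range.
            bits = [bits[i] for i in bitlist]
        rows.append(bits)
    return rows


fingerprints = [b"\x01\x80", b"\x03\x00"]   # bits {0, 15} and bits {0, 1}
selected = bit_columns(fingerprints, 16, bitlist=[0, 1, -1])
```

Here the index -1 selects bit 15, the last of the 16 bits.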
- train_test_split(train_size: None | int | float = None, test_size: None | int | float = None, reorder: bool = True, rng=None) Tuple[FingerprintArena, FingerprintArena] ¶
Return arenas containing train_size and test_size randomly selected fingerprints, without replacement
If train_size is an integer then it must be between 0 and the size of the arena. If train_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If train_size is None then it is set to the complement of test_size. If both train_size and test_size are None then the default train_size is 0.75.
If test_size is an integer then it must be between 0 and the size of the arena. If test_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If test_size is None then it is set to the complement of train_size. If both test_size and train_size are None then the default test_size is 0.25.
By default the new arenas are sorted by popcount. Set reorder to False to return the fingerprints in random order.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
This method’s API is modelled on scikit-learn’s model_selection.train_test_split() function, described at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Parameters:
train_size (int, float, or None) – number of fingerprints for the training set arena
test_size (int, float, or None) – number of fingerprints for the test set arena
reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
rng (None, int, or a random.Random()) – method to use for random sampling
- Returns:
a training set FingerprintArena and a test set FingerprintArena
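The rules for resolving train_size and test_size can be sketched as below. The helper `resolve_split_sizes` is hypothetical; truncating fractional sizes with int() and raising ValueError when the two sizes exceed the arena size are assumptions, not documented chemfp behavior.

```python
def resolve_split_sizes(arena_size: int, train_size=None, test_size=None):
    """Return (train_count, test_count) following the rules in the text."""

    def as_count(size):
        # A float is a proportion of the arena; an int is already a count.
        # Assumption: proportions are truncated to a whole count.
        if isinstance(size, float):
            return int(arena_size * size)
        return size

    if train_size is None and test_size is None:
        train_size = 0.75                       # documented default
    if train_size is None:
        test_count = as_count(test_size)
        train_count = arena_size - test_count   # complement of test_size
    elif test_size is None:
        train_count = as_count(train_size)
        test_count = arena_size - train_count   # complement of train_size
    else:
        train_count = as_count(train_size)
        test_count = as_count(test_size)
    if train_count + test_count > arena_size:
        raise ValueError("train and test sizes exceed the arena size")
    return train_count, test_count


default_split = resolve_split_sizes(100)
from_test_fraction = resolve_split_sizes(100, test_size=0.25)
```

When both sizes are given they do not need to add up to the arena size; the remaining fingerprints are simply left out of both sets in this sketch.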
- class chemfp.arena.FingerprintList(start_padding: int, storage_size: int, arena, start: int, end: int, num_bytes: int)¶
Bases:
Sequence
A read-only list-like view of the arena fingerprints
This implements the standard Python list API, including indexing and iteration.
Note: fingerprint searches like “fp in fingerprint_list” and “fingerprint_list.index(fp)” are not fast.
- random_choice(rng=None) bytes ¶
Return a randomly selected fingerprint.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
- Parameters:
rng (None, int, or a random.Random()) – method to use for random sampling
- Returns:
the fingerprint as a byte string
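A read-only, list-like view over an arena block can be sketched with collections.abc.Sequence. The class `ArenaView` is hypothetical and only mirrors the constructor arguments shown in the FingerprintList signature; how chemfp combines start with the index is an assumption. It also shows why membership tests and index() are slow: the Sequence mixins scan the fingerprints one by one.

```python
from collections.abc import Sequence


class ArenaView(Sequence):
    """Read-only list-like view of the fingerprints in a byte block."""

    def __init__(self, start_padding, storage_size, arena, start, end, num_bytes):
        self.start_padding = start_padding
        self.storage_size = storage_size
        self.arena = arena
        self.start = start
        self.end = end
        self.num_bytes = num_bytes

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[j] for j in range(*i.indices(len(self)))]
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("fingerprint index out of range")
        # Assumption: start is an index offset relative to the parent arena.
        offset = self.start_padding + (self.start + i) * self.storage_size
        return self.arena[offset:offset + self.num_bytes]


arena = (b"\x00" * 4 + b"\x01\x02\x00\x00"
         + b"\x03\x04\x00\x00" + b"\x05\x06\x00\x00")
view = ArenaView(4, 4, arena, start=0, end=3, num_bytes=2)
found_at = view.index(b"\x05\x06")   # linear scan provided by Sequence
```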