chemfp.search module¶
Search a FingerprintArena or generate a simarray, and work with the results.
This module implements the low-level APIs for similarity search of a FingerprintArena. These are used by the high-level chemfp.simsearch() function for sparse searches, and the chemfp.simarray() function for complete array generation.
Low-level simsearch API¶
The low-level functions comparable to a sparse chemfp.simsearch() search are:
Count the number of hits¶
count_tanimoto_hits_fp()
- search an arena using a single fingerprint
count_tanimoto_hits_arena()
- search an arena using an arena
count_tanimoto_hits_symmetric()
- search an arena using itself
partial_count_tanimoto_hits_symmetric()
- (advanced use; see the doc string)
count_tversky_hits_fp()
- search an arena using a single fingerprint
count_tversky_hits_arena()
- search an arena using an arena
count_tversky_hits_symmetric()
- search an arena using itself
partial_count_tversky_hits_symmetric()
- (advanced use; see the doc string)
Find all hits at or above a given threshold, sorted arbitrarily¶
threshold_tanimoto_search_fp()
- search an arena using a single fingerprint
threshold_tanimoto_search_arena()
- search an arena using an arena
threshold_tanimoto_search_symmetric()
- search an arena using itself
partial_threshold_tanimoto_search_symmetric()
- (advanced use; see the doc string)
threshold_tversky_search_fp()
- search an arena using a single fingerprint
threshold_tversky_search_arena()
- search an arena using an arena
threshold_tversky_search_symmetric()
- search an arena using itself
partial_threshold_tversky_search_symmetric()
- (advanced use; see the doc string)
fill_lower_triangle()
- copy the upper triangle terms to the lower triangle
Find the k-nearest hits at or above a given threshold¶
These are sorted by decreasing similarity.
knearest_tanimoto_search_fp()
- search an arena using a single fingerprint
knearest_tanimoto_search_arena()
- search an arena using an arena
knearest_tanimoto_search_symmetric()
- search an arena using itself
knearest_tversky_search_fp()
- search an arena using a single fingerprint
knearest_tversky_search_arena()
- search an arena using an arena
knearest_tversky_search_symmetric()
- search an arena using itself
Sparse search result types¶
The threshold and k-nearest search results use a SearchResult when a fingerprint is used as a query, or a SearchResults when an arena is used as a query. These internally use a compressed sparse row format.
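As a rough picture of what "compressed sparse row" means here, the sketch below packs per-query hit lists into three flat arrays using only plain Python. The function and variable names (to_csr, get_row, indptr, indices, data) are illustrative assumptions borrowed from common CSR terminology; they are not chemfp attributes, and chemfp's internal storage may differ in detail.

```python
# Illustrative only: a plain-Python picture of compressed sparse row (CSR)
# storage. The names below are hypothetical and are not chemfp attributes.

def to_csr(rows):
    """Pack per-query hit lists [(target_index, score), ...] into CSR arrays."""
    indptr = [0]    # indptr[i]:indptr[i+1] is the slice holding row i's hits
    indices = []    # target (column) index of each hit
    data = []       # score of each hit
    for hits in rows:
        for target_index, score in hits:
            indices.append(target_index)
            data.append(score)
        indptr.append(len(indices))
    return indptr, indices, data


def get_row(indptr, indices, data, i):
    """Recover the (target_index, score) hits for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


# Three queries: the second one has no hits.
rows = [[(2, 0.9), (5, 0.7)], [], [(0, 1.0)]]
indptr, indices, data = to_csr(rows)
```

Only the hits are stored, so a query with no hits costs a single repeated entry in indptr.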
Save or load the sparse search results¶
Use load_npz() to load a SearchResults or SearchResult saved in SciPy “npz” format along with the extra arrays that chemfp uses for metadata and ids.
There is also a save_npz() function to save those to “npz” format, but it’s probably easier to use the SearchResults.save() or SearchResult.save() method.
save_npz()
- save a SearchResults or SearchResult
load_npz()
- load a SearchResults or SearchResult
Low-level simarray API¶
This module also implements the low-level API to generate a NumPy array containing all pairwise fingerprint comparisons. The main ones to use are the class constructor methods:
SimarrayProcessor.from_query_fp()
- generate all scores for a query fingerprint and the targets
SimarrayProcessor.from_symmetric()
- generate all scores in the target fingerprints with itself
SimarrayProcessor.from_NxM()
- generate all scores between the queries and the targets
which all return a SimarrayProcessor.
- exception chemfp.search.DTypeValueError(msg, dtype, metric_config)¶
Bases:
ValueError
Exception type raised when a metric does not support the given dtype
- Parameters:
msg (str) – the error message
dtype (a string or NumPy dtype) – the user-specified dtype
- property metric_name: str¶
Return the metric name
- property supported_dtypes: list[str]¶
Return the list of supported dtypes for this metric
- class chemfp.search.MatrixType(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases:
IntEnum
Enumeration values which describe the matrix type
- static from_npz_name(s: str) MatrixType ¶
Return the MatrixType enum value given a chemfp npz type name
Raise KeyError if it does not exist.
- get_npz_name() str ¶
Return the chemfp npz type name given a MatrixType enum value
Raise KeyError if it does not exist.
- static get_npz_names() list[str] ¶
Return the list of supported chemfp npz type names
- static is_square(value: MatrixType) bool ¶
Return True if the matrix type is one of the NxN forms, including upper-triangular
- class chemfp.search.SearchResult(search_results: SearchResults, row: int)¶
Bases:
object
Search results for a query fingerprint against a target arena.
The result contains a list of hits. Each hit contains a target index, a score, and an optional target id. The hits can be reordered based on score or index.
- as_buffer() memoryview ¶
Return a Python buffer object for the underlying indices and scores.
This provides a byte-oriented view of the raw data. You probably want to use as_ctypes() or as_numpy_array() to get the indices and scores in a more structured form.
Warning
Do not attempt to access the buffer contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
- Returns:
a Python buffer object
- as_ctypes()¶
Return a ctypes view of the underlying indices and scores
Each (index, score) pair is represented as a ctypes structure named Hit with fields index (c_int) and score (c_double).
For example, to get the score of the 5th entry use:
result.as_ctypes()[4].score
This method returns an array of type (Hit*len(search_result)). Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the ctypes array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the search results to NumPy then use as_numpy_array() instead.
- Returns:
a ctypes array of type Hit*len(self)
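To make the layout concrete, here is a stand-alone sketch of a ctypes structure with the two fields described above, plus an array of type Hit*5. The Hit class defined here is a reconstruction for illustration only, and the values are made up; with chemfp you would obtain the array from as_ctypes() rather than build it yourself.

```python
import ctypes


# Stand-alone reconstruction of the layout described above; chemfp supplies
# its own Hit type, so this class is only an illustration.
class Hit(ctypes.Structure):
    _fields_ = [("index", ctypes.c_int), ("score", ctypes.c_double)]


# An array of type Hit*5, filled with made-up (index, score) pairs.
hits = (Hit * 5)(
    Hit(10, 0.95), Hit(3, 0.90), Hit(42, 0.85), Hit(7, 0.80), Hit(19, 0.75)
)

# Same access pattern as result.as_ctypes()[4].score
fifth_score = hits[4].score

# Writes go straight to the underlying memory, which is why the real view
# must be used with care.
hits[0].score = 0.5
```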
- as_numpy_array() _typing.NumPyArray ¶
Return a NumPy array view of the underlying indices and scores
The view uses a structured type with fields ‘index’ (i4) and ‘score’ (f8), mapped directly onto chemfp’s own data structure. For example, to get the score of the 4th entry use:
result.as_numpy_array()["score"][3]
-or-
result.as_numpy_array()[3][1]
Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the NumPy array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
As a short-hand to get just the indices or just the scores, use get_indices_as_numpy_array() or get_scores_as_numpy_array().
- Returns:
a NumPy array with a structured data type
- count(min_score: float | None = None, max_score: float | None = None, interval: Literal['[]', '[)', '(]', '()'] = '[]') int ¶
Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
an integer count
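The interval rules can be spelled out in a few lines of plain Python. The helper below, count_in_range, is a hypothetical stand-in that works on an ordinary list of scores; it is meant only to show the end-point semantics, not how chemfp implements count().

```python
import math


def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Count scores in the range, honoring open/closed end points."""
    if interval not in ("[]", "[)", "(]", "()"):
        raise ValueError("unsupported interval: %r" % (interval,))
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    include_lo = interval[0] == "["
    include_hi = interval[1] == "]"
    n = 0
    for score in scores:
        above = (score >= lo) if include_lo else (score > lo)
        below = (score <= hi) if include_hi else (score < hi)
        if above and below:
            n += 1
    return n


scores = [0.4, 0.5, 0.7, 0.9]
closed = count_in_range(scores, 0.5, 0.9)              # 0.5, 0.7, 0.9
open_ = count_in_range(scores, 0.5, 0.9, "()")         # 0.7 only
half_open = count_in_range(scores, 0.5, 0.9, "[)")     # 0.5, 0.7
```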
- cumulative_score(min_score: float | None = None, max_score: float | None = None, interval: Literal['[]', '[)', '(]', '()'] = '[]')¶
The sum of the scores which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
a floating point value
- format_ids_and_scores_as_bytes(ids: list[str] | None = None, precision: Literal[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] = 4)¶
Format the ids and scores as the byte string needed for simsearch output
If there are no hits then the result is the empty string b””, otherwise it returns a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], …
If the ids is not specified then the ids come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings.
The precision sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive.
This function is 3-4x faster than the Python equivalent, which is roughly:
ids = ids if (ids is not None) else self.get_ids()
formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores()))
- Parameters:
ids (a list of Unicode strings, or None to use the default) – the identifiers to use for each hit.
precision (an integer from 1 to 10, inclusive) – the precision to use for each score
- Returns:
a byte string
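For experimenting with the output layout outside of chemfp, the following stand-alone sketch formats the pairs as text and encodes the result at the end. The function name format_ids_and_scores and the choice of UTF-8 for the final encoding are assumptions made for this example; the source only states that the ids are Unicode strings and that the result is a byte string.

```python
def format_ids_and_scores(ids, scores, precision=4):
    """Return tab-separated id/score pairs as a byte string."""
    if not 1 <= precision <= 10:
        raise ValueError("precision must be between 1 and 10, inclusive")
    template = "%s\t%." + str(precision) + "f"
    text = "\t".join(template % pair for pair in zip(ids, scores))
    return text.encode("utf8")  # assumption: UTF-8 output encoding


line = format_ids_and_scores(["CHEMBL1", "CHEMBL2"], [1.0, 0.25])
```

With no hits the join produces an empty string, so the result is b"" as described above.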
- get_ids() list[str] | None ¶
The list of target identifiers (if available), in the current ordering
- Returns:
a list of strings
- get_ids_and_scores() list[tuple[str, float]] ¶
The list of (target identifier, target score) pairs, in the current ordering
Raises a TypeError if the target IDs are not available.
- Returns:
a Python list of 2-element tuples
- get_indices() list[int] ¶
The list of target indices, in the current ordering.
This returns a copy of the indices. See get_indices_as_numpy_array() to get a NumPy array view of the indices.
- Returns:
an array.array() of type ‘i’
- get_indices_and_scores() list[tuple[int, float]] ¶
The list of (target index, target score) pairs, in the current ordering
- Returns:
a Python list of 2-element tuples
- get_indices_as_numpy_array() _typing.NumPyArray ¶
Return a NumPy array view of the underlying indices.
This is a short-cut for self.as_numpy_array()["index"]. See that method's documentation for details and the warning.
- Returns:
a NumPy array of type ‘i4’
- get_scores() list[float] ¶
The list of target scores, in the current ordering
This returns a copy of the scores. See get_scores_as_numpy_array() to get a NumPy array view of the scores.
- Returns:
an array.array() of type ‘d’
- get_scores_as_numpy_array() _typing.NumPyArray ¶
Return a NumPy array view of the underlying scores.
This is a short-cut for self.as_numpy_array()["score"]. See that method's documentation for details and the warning.
- Returns:
a NumPy array of type ‘f8’
- iter_ids() Iterator[str] ¶
Iterate over target identifiers (if available), in the current ordering
- max() float ¶
Return the value of the largest score
Returns 0.0 if there are no results.
- Returns:
a float
- min() float ¶
Return the value of the smallest score
Returns 0.0 if there are no results.
- Returns:
a float
- property query_id: str | None¶
Return the corresponding query id, if available, else None
- reorder(ordering: Literal['increasing-score', 'decreasing-score', 'increasing-score-plus', 'decreasing-score-plus', 'increasing-index', 'decreasing-index', 'move-closest-first', 'reverse'] = 'decreasing-score-plus') None ¶
Reorder the hits based on the requested ordering.
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
ordering (string) – the name of the ordering to use
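The fully specified orderings can be mimicked on a plain list of (index, score) pairs, as in the sketch below. The helper reorder_hits is hypothetical and returns a new list, whereas reorder() works in place and returns None. The orderings whose tie-breaking or final arrangement is not spelled out above (increasing-score, decreasing-score, move-closest-first) are deliberately left out.

```python
def reorder_hits(hits, ordering="decreasing-score-plus"):
    """Return a new list of (index, score) pairs in the requested ordering."""
    if ordering == "increasing-score-plus":
        # increasing score, ties broken by increasing index
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if ordering == "decreasing-score-plus":
        # decreasing score, ties broken by increasing index
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if ordering == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda hit: hit[0], reverse=True)
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError("ordering not covered by this sketch: %r" % (ordering,))


hits = [(3, 0.5), (1, 0.9), (2, 0.5)]
best_first = reorder_hits(hits)  # default is decreasing-score-plus
```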
- save(destination: str | bytes | Path | None | BinaryIO, format: Literal[None, 'npz'] = None, compressed: bool = True)¶
Save the SearchResult to the given destination
Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResult is stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a single SearchResult.
Use chemfp.search.load_npz() to read the similarity search result back into a SearchResult instance.
- Parameters:
destination (a filename, binary file object, or None for stdout) – where to write the results
format (None or 'npz') – the output format name (default: always ‘npz’)
compressed – if True (the default), use zipfile compression
- property target_ids: Sequence[str] | None¶
Return all the original target ids, not just the hit ids for this query
- property target_start: int¶
Return the start offset of the hit into the original arena
- to_pandas(*, columns: tuple[str, str] = ('target_id', 'score')) _typing.PandasDataFrame ¶
Return a pandas DataFrame with the target ids and scores
The first column contains the ids, the second column contains the scores. The default column headers are “target_id” and “score”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
- class chemfp.search.SearchResults(num_rows: int, num_cols: int, query_arena=None, query_ids=None, target_arena=None, target_arena_ids=None, target_start=0, num_bits=2147483647, alpha=1.0, beta=1.0, matrix_type=MatrixType.NxM)¶
Bases:
SearchResults
Search results for a list of query fingerprints against a target arena
This acts like a list of SearchResult elements, with the ability to iterate over each search result, look them up by index, and get the number of scores.
In addition, there are helper methods to iterate over each hit and to get the hit indices, scores, and identifiers directly as Python lists, sort the list contents, and more.
The public attributes are:
- num_rows: int¶
The number of queries.
- num_columns: int¶
The number of targets.
- alpha: float¶
The Tversky alpha value (1.0 for Tanimoto search).
- beta: float¶
The Tversky beta value (1.0 for Tanimoto search).
- query_ids: a list[str]-like object¶
A list of query ids, one for each result. This comes from the query arena’s ids.
- target_arena_ids: a list[str]-like object¶
The full list of ids for the underlying target arena, which may be larger than the subarena used for the search.
- count_all(min_score: float | None = None, max_score: float | None = None, interval: Literal['[]', '[)', '(]', '()'] = '[]') int ¶
Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
an integer count
- cumulative_score_all(min_score: float | None = None, max_score: float | None = None, interval: Literal['[]', '[)', '(]', '()'] = '[]') float ¶
The sum of all scores in all rows which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
a floating point value
- iter_ids() Iterable[str] ¶
For each hit, yield the list of target identifiers
- iter_ids_and_scores() Iterable[tuple[str, float]] ¶
For each hit, yield the list of (target id, score) tuples
- iter_indices() Iterable[int] ¶
For each hit, yield the list of target indices
- iter_indices_and_scores() Iterable[tuple[int, float]] ¶
For each hit, yield the list of (target index, score) tuples
- iter_scores() Iterable[list[float]] ¶
For each hit, yield the list of target scores
- property matrix_type: MatrixType | int¶
Get the MatrixType enum value, or an integer if out of range
- property matrix_type_name: str¶
Get a string describing the matrix type, or ‘unknown’ for unknown types
- reorder_all(ordering: Literal['increasing-score', 'decreasing-score', 'increasing-score-plus', 'decreasing-score-plus', 'increasing-index', 'decreasing-index', 'move-closest-first', 'reverse'] = 'decreasing-score-plus') None ¶
Reorder the hits for all of the rows based on the requested order.
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
ordering (string) – the name of the ordering to use
- save(destination: str | bytes | Path | None | BinaryIO, format: Literal[None, 'npz'] = None, compressed: bool = True)¶
Save the SearchResults to the given destination
Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.
Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.
- Parameters:
destination (a filename, binary file object, or None for stdout) – where to write the results
format (None or 'npz') – the output format name (default: always ‘npz’)
compressed – if True (the default), use zipfile compression
- property shape: tuple[int, int]¶
the tuple (number of rows, number of columns)
The number of columns is the size of the target arena.
- property target_ids: Sequence[str] | None¶
The target ids
Be aware that these indices always start from 0 and go up to the number of columns, while the indices from get_indices() and related functions are relative to the initial position in the target arena. If the targets were a subarena slice which did not start from 0 then the indices will differ by target_start.
Use target_arena_ids to get the ids indexed across their full range.
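The offset can be illustrated with ordinary lists. The sketch below follows one reading of the paragraph above, namely that hit indices refer to positions in the full arena while target_ids covers only the searched slice; treat that reading, and all the data values, as assumptions rather than confirmed chemfp behavior.

```python
# Made-up data: a six-element arena and a search against the slice arena[2:5].
target_arena_ids = ["id0", "id1", "id2", "id3", "id4", "id5"]
target_start = 2
target_ids = target_arena_ids[target_start:5]   # ids of the subarena only

# Assumed to be an index as reported by get_indices(), i.e. a position in
# the full arena.
hit_index = 3

# Two equivalent lookups under that assumption.
id_from_arena = target_arena_ids[hit_index]
id_from_subarena = target_ids[hit_index - target_start]
```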
- to_csr(dtype: _typing.OptionalNumPyDType = None) _typing.SciPyCSRMatrix ¶
Return the results as a SciPy compressed sparse row matrix.
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that SciPy (and NumPy) be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
- to_numpy_array(dtype: _typing.OptionalNumPyDType = None) _typing.NumPyArray ¶
Return the results as a (dense) NumPy array
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that NumPy be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
- to_pandas(*, columns: tuple[str, str, str] = ('query_id', 'target_id', 'score'), empty: _typing.EmptySpecifier = ('*', None)) _typing.PandasDataFrame ¶
Return a pandas DataFrame with query_id, target_id and score columns
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame’s groupby() method to group results by query id, for example:
>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...                       k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()
- Parameters:
columns (a list of three strings) – column names for the returned DataFrame
empty (a list of two strings, or None) – the target id and score used for queries with no hits, or None to not include a row for that case
- Returns:
a pandas DataFrame
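The row layout, including the handling of queries with no hits, can be sketched without pandas. The helper flatten_hits below is hypothetical and produces a list of (query_id, target_id, score) tuples, which is the shape of the table described above; it does not build a DataFrame.

```python
def flatten_hits(query_ids, hits_per_query, empty=("*", None)):
    """Turn per-query hit lists into (query_id, target_id, score) rows."""
    rows = []
    for query_id, hits in zip(query_ids, hits_per_query):
        if hits:
            rows.extend((query_id, target_id, score) for target_id, score in hits)
        elif empty is not None:
            # A query without hits gets one placeholder row.
            target_id, score = empty
            rows.append((query_id, target_id, score))
        # With empty=None a query without hits adds no row at all.
    return rows


query_ids = ["q1", "q2"]
hits_per_query = [[("t7", 0.8), ("t2", 0.6)], []]
default_rows = flatten_hits(query_ids, hits_per_query)
no_placeholder = flatten_hits(query_ids, hits_per_query, empty=None)
```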
- class chemfp.search.SimarrayContent¶
Bases:
object
Base class for objects containing simarray data.
These include the low-level SimarrayProcessor, the high-level SimarrayResult, and the SimarrayFileContent for simarray data read from a file.
The public attributes are:
- out: a NumPy array¶
A NumPy array containing either the comparison values, or if from a metadata file, a (0,) or (0,0)-sized array.
- query_ids: a list[str]-like object or None¶
The query identifiers, or None if not available.
- target_ids: a list[str]-like object¶
The target identifiers.
- num_bits: int¶
The number of bits in the fingerprint.
- dtype_str: str¶
The chemfp simarray dtype name.
One of “float64”, “float32”, “rational64”, “rational32”, “uint16” or “abcd”.
Use out.dtype to get the NumPy dtype.
- metric: SimarrayMetric¶
Information about the metric used, as a
SimarrayMetric
instance.
- matrix_type: str¶
A string describing the matrix type.
One of the strings “N”, “NxM”, “NxN”, or “upper-triangular”, describing the contents of self.out. It is:
“N” if a 1-D vector containing the comparisons between a single query fingerprint and set of targets;
“NxM” if a 2-D array containing the comparisons between a set of queries and a set of targets;
“NxN” if a 2-D array containing the full comparisons between a set of fingerprints and itself;
“upper-triangular” if a 2-D array containing the diagonal and upper-triangle comparisons between a set of fingerprints and itself. The lower triangle is left as the default zero value.
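If a full matrix is needed from an “upper-triangular” result, the lower triangle can be filled by mirroring, assuming the score is symmetric (as Tanimoto is). The sketch below uses nested Python lists and a hypothetical helper name, mirror_upper_triangle, purely to show the index pattern; it is not a chemfp function.

```python
def mirror_upper_triangle(matrix):
    """Return a copy with the upper triangle copied into the lower triangle."""
    n = len(matrix)
    full = [row[:] for row in matrix]   # leave the input untouched
    for i in range(n):
        for j in range(i + 1, n):
            full[j][i] = full[i][j]
    return full


# Diagonal and upper triangle filled in, lower triangle left at zero.
upper = [
    [1.0, 0.5, 0.2],
    [0.0, 1.0, 0.4],
    [0.0, 0.0, 1.0],
]
full = mirror_upper_triangle(upper)
```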
- get_metadata(shape: Tuple[int, int] | Tuple[int, int, int] | None = None) dict ¶
Return a dictionary containing the metadata.
By default the “shape” comes from out.shape. Specify shape to use a different value, which must have the same dimensionality as out.shape. This is used to create a metadata file which records the dimensions of the full array.
- get_out_description() str ¶
A string describing the contents of out
- save(destination: str | bytes | Path | None | BinaryIO, *, format: Literal['npy', 'bin'] | None = None, include_values: bool = True, include_ids: bool = True, metadata: dict | None = None, query_ids: Sequence[str] | None = None, target_ids: Sequence[str] | None = None) None ¶
Save the array to the specified destination
The destination may be None for stdout, a seekable binary file object, or a filename as a string, bytes, or Path.
If the output format is “npy” then up to four consecutive arrays will be written to the output file, starting with the similarity NumPy array itself. If include_values is True (the default) then all of the comparison values are written. If False then an empty array (with shape (0,) or (0,0)) is written.
The next array will contain the search parameters as an array containing a single JSON-encoded string. See simarray_io for examples.
Use metadata to specify the metadata dictionary to use. If not specified the value will come from self.get_metadata().
If include_ids is True then the third array contains the target identifiers, or the fingerprint identifiers if an NxN matrix; and if an NxM matrix then the fourth array contains the query identifiers.
Use query_ids and target_ids to specify the corresponding identifiers, instead of the default of self.query_ids and self.target_ids.
If the output format is “bin” then the comparison matrix will be written to the destination as raw binary bytes.
- Parameters:
destination (None for stdout, a binary output file object, or a filename/path) – the output destination
format ("npy" for npy format, "bin" for binary format, None uses the filename extension and defaults to "npy".) – the output format
include_values (bool) – if True (the default), include the similarity values, else use a zero-sized vector or matrix. Both preserve the numpy dtype.
include_ids (bool) – if True (the default), also write the query and target ids
shape (a 1-element or 2-element tuple) – a value to use as the metadata shape instead of out.shape
query_ids (a sequence of strings, typically from arena.ids) – the query identifiers to use instead of self.query_ids
target_ids (a sequence of strings, typically from arena.ids) – the target identifiers to use instead of self.target_ids
- Returns:
None
- class chemfp.search.SimarrayDType(name: str, numpy_dtype)¶
Bases:
object
Information about a simarray dtype
- name¶
The chemfp simarray dtype name, which is one of “float64”, “float32”, “rational64”, “rational32”, “uint16” or “abcd”.
- numpy_dtype¶
A NumPy dtype corresponding to the given simarray dtype name.
- class chemfp.search.SimarrayMetric(metric_config: _SimarrayMetricConfig, as_distance: bool, is_similarity: bool, is_distance: bool)¶
Bases:
object
A description of the specific metric used for the simarray calculation.
The available metric names are “Tanimoto”, “Dice”, “cosine”, “Hamming”, “Sheffield”, “Willett”, and “Daylight”.
“Tanimoto”, “Dice”, and “cosine” compute a similarity score when as_distance is False, and 1.0 - the score when as_distance is True.
“Hamming” always computes a distance score.
is_similarity is True if the value is a similarity score, is_distance is True if the value is a distance. Both are False for “Sheffield”, “Willett” and “Daylight”, which compute 4 values per pair, written as “a”, “b”, “c”, and “d”.
For “Sheffield”, “a” is the number of 1-bits in common, “b” is the number of 1-bits in the first fingerprint which are not in the second fingerprint, “c” is the number of 1-bits in the second fingerprint which are not in the first fingerprint, and “d” is the number of 0-bits in common (a+b+c+d = the number of bits in the fingerprint).
For “Willett”, “a” is the number of 1-bits in the first fingerprint, “b” is the number of 1-bits in the second fingerprint, “c” is the number of 1-bits in common, and “d” is the number of 0-bits in common (a+b-c+d = the number of bits in the fingerprint).
For “Daylight”, “a” is the number of 1-bits unique to the first fingerprint, “b” is the number of 1-bits unique to the second fingerprint, “c” is the number of 1-bits in common, and “d” is the number of 0-bits in common (a+b+c+d = the number of bits in the fingerprint).
- Parameters:
name (a string) – the metric name (eg, “Tanimoto” or “Hamming”)
as_distance (bool) – True if a similarity score was used to compute a distance, else False
is_similarity (bool) – True if the computed value is a similarity score, else False
is_distance (bool) – True if the computed value is a distance, else False
- property default_dtype: Literal['float64', 'float32', 'rational64', 'rational32', 'uint16', 'abcd']¶
Return the default chemfp dtype for this metric
- get_description() str ¶
Return a human-readable string describing the metric
The string describes if it is a similarity or distance value, and it describes how a similarity score was converted to a distance.
Examples:
"Tanimoto similarity" (for name="Tanimoto" and as_distance=False) "1-Tanimoto distance" (for name="Tanimoto" and as_distance=True) "Hamming distance" (for name="Hamming") "Sheffield values" (for name="Sheffield")
- get_method() str ¶
Return a terse string describing the metric name and as_distance value
For similarity scores this returns the metric name if a similarity score was used as a similarity (like “Tanimoto”) and adds an extra parameter if the similarity score was used to compute a distance (like “Tanimoto as_distance=1”).
For other metrics, this returns the metric name.
- property supported_dtypes: list[Literal['float64', 'float32', 'rational64', 'rational32', 'uint16', 'abcd']]¶
Return the list of supported chemfp dtypes for this metric
- to_dict() dict[str, bool | str] ¶
Return the metric details as a dictionary
The dictionary elements are “name”, “as_distance”, “is_similarity”, and “is_distance”, with values from the corresponding instance attribute.
- class chemfp.search.SimarrayProcessor¶
Bases:
SimarrayContent
,SimarrayProcessor
Set up and process the scores between a query and target fingerprints as a NumPy array
- Use one of the three class methods to create a SimarrayProcessor:
from_query_fp()
- generate all scores for a query fingerprint and the targets
from_symmetric()
- generate all scores in the target fingerprints with itself
from_NxM()
- generate all scores between the queries and the targets
Once you have a processor, use SimarrayProcessor.next() to have it compute at least N scores, or all the remaining scores. It returns the number of scores computed, which will be 0 when there is nothing left to process.
By default the processor generates Tanimoto similarity scores. Use metric to specify an alternative, which may be one of “Tanimoto”, “cosine”, “Dice”, “Hamming”, “Sheffield”, “Willett”, or “Daylight”.
The Tanimoto, Dice, and cosine similarity metrics can be turned into a distance value using as_distance, which computes 1.0-similarity or the equivalent scaled version for the uint16 dtype.
The “Sheffield”, “Willett” and “Daylight” metrics use a structured NumPy type with uint16 fields a, b, c, and d.
For the Sheffield metric:
a = the number of 1-bits in common (the intersection popcount)
b = the number of 1-bits in the query fingerprint but not the target
c = the number of 1-bits in the target fingerprint but not the query
d = the number of 0-bits in common
Note that a+b+c+d = the number of bits.
For the Willett metric:
a = the number of 1-bits in the query fingerprint (the query popcount)
b = the number of 1-bits in the target fingerprint (the target popcount)
c = the number of 1-bits in common (the intersection popcount)
d = the number of 0-bits in common
Note that a+b-c+d = the number of bits.
For the Daylight metric:
a = the number of 1-bits unique to the query fingerprint
b = the number of 1-bits unique to the target fingerprint
c = the number of 1-bits in common (the intersection popcount)
d = the number of 0-bits in common
Note that a+b+c+d = the number of bits.
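The three conventions above can be compared side by side with a small pure-Python sketch. It does not use chemfp; `abcd_counts` and `popcount` are hypothetical helper names, and the fingerprints are plain byte strings of equal length.

```python
def popcount(value: int) -> int:
    """Number of 1-bits in a non-negative integer."""
    return bin(value).count("1")


def abcd_counts(query: bytes, target: bytes, convention: str) -> tuple:
    """Return (a, b, c, d) for the named convention, as described in the text."""
    if len(query) != len(target):
        raise ValueError("fingerprints must have the same length")
    num_bits = 8 * len(query)
    q = int.from_bytes(query, "big")
    t = int.from_bytes(target, "big")
    common = popcount(q & t)             # 1-bits set in both
    only_query = popcount(q & ~t)        # 1-bits set only in the query
    only_target = popcount(t & ~q)       # 1-bits set only in the target
    zeros = num_bits - popcount(q | t)   # 0-bits in common
    if convention == "Sheffield":
        return (common, only_query, only_target, zeros)
    if convention == "Willett":
        return (popcount(q), popcount(t), common, zeros)
    if convention == "Daylight":
        return (only_query, only_target, common, zeros)
    raise ValueError(f"unknown convention: {convention!r}")


query, target = b"\x0f\x01", b"\x3c\x01"
for name in ("Sheffield", "Willett", "Daylight"):
    print(name, abcd_counts(query, target, name))
```

For Sheffield and Daylight the four values sum to the number of bits, while for Willett the intersection is counted in both popcounts, which is why it is subtracted once.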
By default the scores are stored in a new NumPy matrix with dtype ‘float64’. Use dtype to specify an alternative output data type. The valid dtypes are float64, float32, rational64, rational32, uint16 and abcd. The rational data types are structured NumPy dtypes with fields (p, q) using uint32 for rational64 and uint16 for rational32. The uint16 is the score or distance scaled to give a value between 0 and 65535 (rounded down). The abcd dtype is the structured NumPy dtype used to store the Sheffield, Willett, and Daylight metrics.
Alternatively, you can specify an existing matrix or matrix view via the out parameter, in which case the existing matrix dtype is used to infer the desired output value.
No metric supports all of the output dtype values. The supported dtypes for each metric are:
Tanimoto and Dice: float64, float32, rational64, rational32, uint16
cosine: float64, float32, uint16
Hamming: uint16
Sheffield, Willett, and Daylight: abcd
Note! The rational32 dtype cannot be used for Dice similarity on 32768-bit fingerprints.
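One plausible reading of the rational32 restriction is simple arithmetic: a Dice score written as an unreduced fraction has numerator 2c, which no longer fits in a uint16 field when c can reach 32768. The sketch below illustrates that arithmetic only. Whether chemfp stores the fraction unreduced is an assumption, and the function names are hypothetical.

```python
UINT16_MAX = 65535  # largest value a uint16 field can hold


def tanimoto_rational(query_popcount: int, target_popcount: int, intersection: int):
    """Tanimoto as (p, q) = (intersection, union), left unreduced (assumption)."""
    return intersection, query_popcount + target_popcount - intersection


def dice_rational(query_popcount: int, target_popcount: int, intersection: int):
    """Dice as (p, q) = (2*intersection, sum of popcounts), left unreduced (assumption)."""
    return 2 * intersection, query_popcount + target_popcount


def fits_uint16(p: int, q: int) -> bool:
    return p <= UINT16_MAX and q <= UINT16_MAX


# Worst case for a 32768-bit fingerprint: every bit set in both fingerprints.
n = 32768
print(tanimoto_rational(n, n, n), fits_uint16(*tanimoto_rational(n, n, n)))
print(dice_rational(n, n, n), fits_uint16(*dice_rational(n, n, n)))
```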
Use num_threads to specify how many threads to use. The default of -1 means to use the value returned by chemfp.get_num_threads(). Specify num_threads=1 for single-threaded (non-OpenMP) execution.
- property dtype_str: Literal['float64', 'float32', 'rational64', 'rational32', 'uint16', 'abcd']¶
A string describing the output array type
- classmethod from_NxM(queries: _typing.FingerprintArena, targets: _typing.FingerprintArena, *, metric: _typing.MetricNames = 'Tanimoto', as_distance: bool = False, out: _typing.OptionalNumPyArray = None, dtype: _typing.SimarrayDType = None, num_threads: _typing.NumThreadsType = -1) SimarrayProcessor ¶
Create a SimarrayProcessor to compare each query fingerprint to each target fingerprint
By default this computes the Tanimoto similarity for each pair and returns the results in a 2D NumPy array.
Use metric and as_distance to change which values are computed. See the SimarrayProcessor docstring for more details.
Each metric has a default output dtype, and some metrics support optional dtypes, such as float32 or uint16 to use 4 or 2 bytes instead of the default 8 bytes of float64, or rational64 or rational32 to get both the numerator and denominator. Use dtype to pass in the specific type to use, or use out to pass in the output array, in which case out.dtype determines what to compute.
The implementation can use OpenMP to compute the values in parallel. Use num_threads = 1 to use the non-OpenMP implementation; otherwise it specifies the number of threads to use. The default value of -1 means to use the default number of threads, which is returned from get_num_threads().
- Parameters:
queries (a FingerprintArena) – the query fingerprints
targets (a FingerprintArena) – the target fingerprints
metric (string) – a metric name, like “Tanimoto”, “cosine”, or “Sheffield”
as_distance (bool) – if True, convert similarity scores to a distance using 1.0-score.
out (a NumPy array, or None) – the output NumPy array to use, instead of creating a new array
dtype (a string name or NumPy dtype, or None to use the default type) – the specific output data type to compute for the given metric
num_threads (a positive integer or -1 for the default number of threads) – the number of threads to use
- Returns:
a SimarrayProcessor ready for processing the NxM array.
- classmethod from_query_fp(query_fp: bytes, targets: _typing.FingerprintArena, *, metric: _typing.MetricNames = 'Tanimoto', as_distance: bool = False, out: _typing.OptionalNumPyArray = None, dtype: _typing.SimarrayDType = None, num_threads: _typing.NumThreadsType = -1)¶
Create a SimarrayProcessor given a query fingerprint and target arena
By default this computes the Tanimoto similarity between the query fingerprint and each target arena fingerprint and returns the results in a 1-D NumPy array.
Use metric and as_distance to change which values are computed. See the SimarrayProcessor docstring for more details.
Each metric has a default output dtype, and some metrics support optional dtypes, such as float32 or uint16 to use 4 or 2 bytes instead of the default 8 bytes of float64, or rational64 or rational32 to get both the numerator and denominator. Use dtype to pass in the specific type to use, or use out to pass in the output array, in which case out.dtype determines what to compute.
The implementation can use OpenMP to compute the values in parallel. Use num_threads = 1 to use the non-OpenMP implementation; otherwise it specifies the number of threads to use. The default value of -1 means to use the default number of threads, which is returned from get_num_threads().
- Parameters:
query_fp (a byte string) – the query fingerprint
targets (a FingerprintArena) – the target fingerprints
metric (string) – a metric name, like “Tanimoto”, “cosine”, or “Sheffield”
as_distance (bool) – if True, convert similarity scores to a distance using 1.0-score.
out (a NumPy array, or None) – the output NumPy array to use, instead of creating a new array
dtype (a string name or NumPy dtype, or None to use the default type) – the specific output data type to compute for the given metric
num_threads (a positive integer or -1 for the default number of threads) – the number of threads to use
- Returns:
a SimarrayProcessor ready for processing the query fingerprint against the targets.
- classmethod from_symmetric(arena: _typing.FingerprintArena, *, metric: _typing.MetricNames = 'Tanimoto', as_distance: bool = False, include_lower_triangle: bool = True, out: _typing.OptionalNumPyArray = None, dtype: _typing.SimarrayDType = None, num_threads: _typing.NumThreadsType = -1) SimarrayProcessor ¶
Create a SimarrayProcessor to compare each arena fingerprint to every other fingerprint
By default this computes the Tanimoto similarity for each pair, for the full symmetric matrix. Use metric and as_distance to change which values are computed. See the SimarrayProcessor docstring for more details.
Use include_lower_triangle = False to omit the lower triangle. The diagonal terms will still be included in the output array.
Each metric has a default output dtype, and some metrics support optional dtypes, such as float32 or uint16 to use 4 or 2 bytes instead of the default 8 bytes of float64, or rational64 or rational32 to get both the numerator and denominator. Use dtype to pass in the specific type to use, or use out to pass in the output array, in which case out.dtype determines what to compute.
The implementation can use OpenMP to compute the values in parallel. Use num_threads = 1 to use the non-OpenMP implementation; otherwise it specifies the number of threads to use. The default value of -1 means to use the default number of threads, which is returned from get_num_threads().
- Parameters:
arena (a FingerprintArena) – the fingerprints used to generate the symmetric array
metric (string) – a metric name, like “Tanimoto”, “cosine”, or “Sheffield”
as_distance (bool) – if True, convert similarity scores to a distance using 1.0-score.
include_lower_triangle (bool) – If False, do not set the lower triangle values
out (a NumPy array, or None) – the output NumPy array to use, instead of creating a new array
dtype (a string name or NumPy dtype, or None to use the default type) – the specific output data type to compute for the given metric
num_threads (a positive integer or -1 for the default number of threads) – the number of threads to use
- Returns:
a SimarrayProcessor ready for processing the symmetric array.
- property matrix_type: Literal['N', 'NxM', 'NxN', 'upper-triangular']¶
Return a string describing the matrix type
One of “N”, “NxM”, “NxN”, or “upper-triangular”
- property metric: SimarrayMetric¶
Information about the metric used, as a SimarrayMetric
- next(min_count: int | None = None) int ¶
Compute at least min_count scores and return the number computed.
The method returns the number of scores actually computed, which will be at least 1 unless no scores are left to compute.
If min_count is None then the method will process all remaining scores.
- Parameters:
min_count (an integer or None) – the number of scores to compute
- Returns:
an integer count
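The return contract of next() suggests a simple polling loop: keep calling it until it returns 0. The sketch below shows that loop against a hypothetical stand-in class, because it is meant to run without chemfp installed. `StandInProcessor` and `drain` are illustrative names, and the stand-in only mimics the documented behaviour (at least min_count scores per call, everything when min_count is None, 0 when done).

```python
class StandInProcessor:
    """Hypothetical stand-in that mimics only the documented next() contract."""

    def __init__(self, total: int, chunk: int):
        self.remaining = total   # scores still to compute
        self.chunk = chunk       # scores computed per internal work unit

    def next(self, min_count=None) -> int:
        if min_count is None:
            done = self.remaining
        else:
            # Round up to whole work units, so "at least" min_count are computed.
            units = -(-min_count // self.chunk)
            done = min(self.remaining, units * self.chunk)
        self.remaining -= done
        return done


def drain(processor, min_count: int) -> int:
    """Call next() until it reports that nothing is left; return the total."""
    total = 0
    while True:
        n = processor.next(min_count)
        if n == 0:
            break
        total += n   # a progress display could be updated here
    return total


print(drain(StandInProcessor(total=1000, chunk=64), min_count=100))
```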
- process_all(progress=True, batch_size=200_000_000) None ¶
Process everything, using a progress bar.
Use progress=False to not use a progress bar, or pass in a callable object as an alternative tqdm-like constructor.
The progress bar will be updated after roughly every batch_size scores.
- property query_ids¶
Return self.queries.ids if there are queries, else return None
- property target_ids¶
Return self.targets.ids
- exception chemfp.search.UnsupportedMetricValueError(name, available_metrics, suggested_name=None)¶
Bases: ValueError
Exception raised if a specified metric name is not known
- chemfp.search.contains_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena) SearchResults ¶
Find the target fingerprints which contain the query fingerprints as a subset
A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResults where SearchResults[i] contains all of the target fingerprints in target_arena that contain the fingerprint for entry query_arena[i].
The SearchResult scores are all 0.0.
There is currently no direct way to limit the arena search range, though you can create and search a subarena by using Python’s slice notation.
- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – the query fingerprints
target_arena (a chemfp.arena.FingerprintArena) – the target fingerprints
- Returns:
a chemfp.search.SearchResults instance, of the same size as query_arena
- chemfp.search.contains_fp(query_fp: bytes, target_arena: _typing.FingerprintArena) SearchResult ¶
Find the target fingerprints which contain the query fingerprint bits as a subset
A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResult containing all of the target fingerprints in target_arena that contain the query_fp.
The SearchResult scores are all 0.0.
There is currently no direct way to limit the arena search range. Instead create a subarena by using Python’s slice notation on the arena then search the subarena.
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- Returns:
a SearchResult instance
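The containment rule used by contains_fp() and contains_arena() reduces to one bitwise test: the query is a subset of the target when ANDing them gives back the query. Here is a minimal pure-Python sketch of that rule. It is not chemfp's implementation, and `contains` and `contains_fp_sketch` are hypothetical names.

```python
def contains(query_fp: bytes, target_fp: bytes) -> bool:
    """True if every on bit of query_fp is also on in target_fp."""
    q = int.from_bytes(query_fp, "big")
    t = int.from_bytes(target_fp, "big")
    return (q & t) == q


def contains_fp_sketch(query_fp: bytes, targets) -> list:
    """Return the ids of the (id, fingerprint) pairs that contain query_fp."""
    return [target_id for target_id, fp in targets if contains(query_fp, fp)]


targets = [("t1", b"\x0f"), ("t2", b"\x05"), ("t3", b"\xf0"), ("t4", b"\xff")]
print(contains_fp_sketch(b"\x05", targets))
```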
- chemfp.search.count_tanimoto_hits_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, threshold: float = 0.7, num_threads: _typing.NumThreadsType = -1) _typing.SearchCounts ¶
For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
    import chemfp
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tanimoto_hits_arena(queries, targets, threshold=0.1)
    print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
an array of counts
- chemfp.search.count_tanimoto_hits_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, threshold: float = 0.7)¶
Count the number of hits in target_arena at least threshold similar to the query_fp
Example:
    import chemfp
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(chemfp.search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.1))
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena (a FingerprintArena) – the target arena
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
an integer count
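To make "at least threshold similar" concrete, here is a brute-force reference sketch of a Tanimoto hit count in pure Python. It shows only the definition, not chemfp's optimized search. The helper names are hypothetical, and scoring two all-zero fingerprints as 0.0 is an assumption made for this sketch.

```python
def popcount(value: int) -> int:
    return bin(value).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Intersection popcount divided by union popcount."""
    x = int.from_bytes(fp1, "big")
    y = int.from_bytes(fp2, "big")
    union = popcount(x | y)
    if union == 0:
        return 0.0   # assumption: two empty fingerprints score 0.0
    return popcount(x & y) / union


def count_tanimoto_hits(query_fp: bytes, target_fps, threshold: float = 0.7) -> int:
    """Count the targets whose score is at or above the threshold."""
    return sum(1 for fp in target_fps if tanimoto(query_fp, fp) >= threshold)


targets = [b"\x0f", b"\x07", b"\xf0", b"\x1f"]
print(count_tanimoto_hits(b"\x0f", targets, threshold=0.7))
```

The comparison is inclusive, so a target scoring exactly the threshold counts as a hit.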
- chemfp.search.count_tanimoto_hits_symmetric(arena: _typing.FingerprintArena, threshold: float = 0.7, *, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalCountsCallback = None, num_threads: _typing.NumThreadsType = -1) _typing.SearchCounts ¶
For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.2)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
an array of counts
- chemfp.search.count_tversky_hits_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, num_threads: _typing.NumThreadsType = -1) _typing.SearchCounts ¶
For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
    import chemfp
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_arena(queries, targets, threshold=0.1, alpha=0.5, beta=0.5)
    print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
an array of counts
- chemfp.search.count_tversky_hits_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, num_threads: _typing.NumThreadsType = -1) int ¶
Count the number of hits in target_arena at least threshold similar to the query_fp (Tversky)
Example:
    import chemfp
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(chemfp.search.count_tversky_hits_fp(query_fp, targets, threshold=0.1))
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena – the target arena
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
an integer count
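The alpha and beta parameters weight the bits unique to each fingerprint. The sketch below uses the standard Tversky index definition. Which side chemfp applies alpha to, and its handling of a zero denominator, are not stated in this section, so both are labelled assumptions here. `tversky` is a hypothetical helper, not a chemfp function.

```python
def popcount(value: int) -> int:
    return bin(value).count("1")


def tversky(query_fp: bytes, target_fp: bytes, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Standard Tversky index: c / (alpha*only_query + beta*only_target + c)."""
    q = int.from_bytes(query_fp, "big")
    t = int.from_bytes(target_fp, "big")
    common = popcount(q & t)
    only_query = popcount(q) - common    # assumption: alpha weights the query side
    only_target = popcount(t) - common   # assumption: beta weights the target side
    denominator = alpha * only_query + beta * only_target + common
    if denominator == 0:
        return 0.0   # assumption: an undefined score is reported as 0.0
    return common / denominator


# alpha = beta = 1.0 reproduces Tanimoto; alpha = beta = 0.5 reproduces Dice.
print(tversky(b"\x0f", b"\x07", 1.0, 1.0))
print(tversky(b"\x0f", b"\x07", 0.5, 0.5))
```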
- chemfp.search.count_tversky_hits_symmetric(arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalCountsCallback = None, num_threads: _typing.NumThreadsType = -1) _typing.SearchCounts ¶
For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
an array of counts
- chemfp.search.fill_lower_triangle(results: SearchResults) None ¶
Duplicate each entry of results to its transpose
This is used after the symmetric threshold search to turn the upper-triangle results into a full matrix.
- Parameters:
results (a chemfp.search.SearchResults) – search results
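The transpose step can be pictured with plain lists. In the sketch below each row holds (column, score) hits from the upper triangle only, and every hit (i, j) is copied to row j as (i, score). This is an illustration of the idea on a hypothetical data layout, not the SearchResults implementation.

```python
def fill_lower_triangle_sketch(rows) -> None:
    """Mirror each upper-triangle hit (i, j, score) into row j, in place."""
    # Snapshot the upper-triangle hits first so mirrored entries are not mirrored again.
    upper = [list(hits) for hits in rows]
    for i, hits in enumerate(upper):
        for j, score in hits:
            rows[j].append((i, score))


# Upper-triangle hits for three fingerprints: row i only lists columns j > i.
rows = [[(1, 0.9), (2, 0.8)], [(2, 0.75)], []]
fill_lower_triangle_sketch(rows)
print(rows)
```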
- chemfp.search.from_csr(matrix: _typing.SciPyCSRMatrix, query_ids: _Optional[list[str]] = None, target_ids: _Optional[list[str]] = None, num_bits: int = 2147483647, alpha: float = 1.0, beta: float = 1.0, matrix_type: MatrixType = MatrixType.NxM)¶
Convert a SciPy compressed sparse row matrix to a SearchResults
If not specified then query_ids will be list(range(num_rows)) and target_ids will be list(range(num_cols)). The target_ids must be at least as long as the maximum index in the sparse matrix.
The num_bits, alpha and beta must match the parameters used to generate the Tanimoto or Tversky scores otherwise some of the chemfp internal optimizations may give incorrect sort results. Keep the default values if you do not know the parameters, or used another method to generate the scores.
matrix_type describes the contents of the matrix. It can be “NxM” or “no-diagonal”. The latter is only valid if query_ids and target_ids are the same object (meaning the queries and targets are the same), and if the diagonal term is not included in the matrix.
The scores must not contain infinite numbers or NaNs.
- Parameters:
matrix – a SciPy compressed sparse row (“csr”) matrix
query_ids – a list of identifiers for each row
target_ids – a list of identifiers, used to match index to target id
num_bits (an integer between 1 and 2**31-1) – the number of fingerprint bits, or a larger value
alpha (a float between 0.0 and 10.0) – the Tversky alpha value
beta (a float between 0.0 and 10.0) – the Tversky beta value
matrix_type (either 'NxM' or 'no-diagonal') – a description of the matrix contents
- Returns:
a SearchResults
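For readers unfamiliar with the compressed sparse row layout, the sketch below builds the three standard CSR arrays (indptr, indices, data) from per-query hit lists and reads one row back. It uses plain Python lists rather than SciPy, and `rows_to_csr` and `csr_row` are hypothetical helper names.

```python
def rows_to_csr(rows):
    """Flatten per-row (column, score) hits into CSR-style arrays."""
    indptr, indices, data = [0], [], []
    for hits in rows:
        for column, score in hits:
            indices.append(column)   # target index of the hit
            data.append(score)       # similarity score of the hit
        indptr.append(len(indices))  # where the next row starts
    return indptr, indices, data


def csr_row(indptr, indices, data, i):
    """Recover the (column, score) hits of row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


rows = [[(1, 0.9), (3, 0.7)], [], [(0, 0.8)]]
indptr, indices, data = rows_to_csr(rows)
print(indptr, indices, data)
print(csr_row(indptr, indices, data, 0))
```

In this example the largest column index is 3, so a list of target ids would need to cover index 3.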
- chemfp.search.get_search_results(obj)¶
- chemfp.search.get_search_results(obj: SearchResults)
- chemfp.search.get_search_results(obj: SingleQuerySimsearch)
- chemfp.search.get_search_results(obj: MultiQuerySimsearch)
- chemfp.search.get_search_results(obj: NxNSimsearch)
Convert the given object to a SearchResults, or raise a TypeError
This is a functools.singledispatch registry hook used so the high-level simsearch() result can be passed in as a SearchResults.
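As a reminder of how a functools.singledispatch hook behaves, here is a minimal sketch with hypothetical stand-in classes. `Results`, `Wrapper`, and `get_results` are illustrative names only and do not reflect chemfp's internal class layout.

```python
from functools import singledispatch


class Results:
    """Hypothetical stand-in for a low-level results object."""


class Wrapper:
    """Hypothetical stand-in for a high-level result that wraps a Results."""

    def __init__(self, results: Results):
        self.out = results


@singledispatch
def get_results(obj):
    # Fallback for any type without a registered converter.
    raise TypeError(f"cannot convert {type(obj).__name__} to Results")


@get_results.register
def _(obj: Results):
    return obj        # already the right type


@get_results.register
def _(obj: Wrapper):
    return obj.out    # unwrap the high-level object


r = Results()
print(get_results(Wrapper(r)) is r)
```

A function that accepts either kind of object can call the hook first and then work only with the low-level type.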
- chemfp.search.get_simarray_dtype(dtype: _typing.SimarrayDType)¶
Return a SimarrayDType given a chemfp dtype name or NumPy dtype
This is used to convert a simarray dtype name like “rational32” to the corresponding NumPy dtype, or to get the simarray dtype name for a user-defined dtype.
- chemfp.search.get_simarray_metric(name: str, as_distance: bool = False) SimarrayMetric ¶
Return a SimarrayMetric given the metric name and optional as_distance
This raises an UnsupportedMetricValueError if the name is unknown or unsupported.
- Parameters:
name (str) – the metric name
as_distance (bool) – If True then compute a distance from a similarity metric
- Returns:
a SimarrayMetric
- chemfp.search.get_simarray_metric_names() list[str] ¶
Return the supported metric names, as a list of strings
- chemfp.search.knearest_tanimoto_search_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, query_thresholds: _Optional[list[float]] = None, batch_size: _Optional[int] = None, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tanimoto_search_arena(queries, targets, k=3, threshold=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Use query_thresholds to specify per-query thresholds instead of using the global threshold. The global threshold must still be in range 0.0 to 1.0.
- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
query_thresholds (None or a list of Python floats, or an array of C doubles) – optionally specify per-query thresholds
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a SearchResults
- chemfp.search.knearest_tanimoto_search_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, num_threads: _typing.NumThreadsType = -1) SearchResult ¶
Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the chemfp.search.SearchResult are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tanimoto_search_fp(query_fp, targets, k=3, threshold=0.0)))
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena (a chemfp.arena.FingerprintArena) – the target arena
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a SearchResult
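A k-nearest search combines two filters: keep only hits at or above the threshold, then keep the k best in decreasing score order. The brute-force sketch below shows that combination in pure Python. It is a reference for the semantics only, and breaking ties by target index is an assumption of this sketch, since the tie order is not described here. The helper names are hypothetical.

```python
def popcount(value: int) -> int:
    return bin(value).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    x = int.from_bytes(fp1, "big")
    y = int.from_bytes(fp2, "big")
    union = popcount(x | y)
    return popcount(x & y) / union if union else 0.0


def knearest(query_fp: bytes, target_fps, k: int = 3, threshold: float = 0.0):
    """Return up to k (target_index, score) hits, best first."""
    scored = [(i, tanimoto(query_fp, fp)) for i, fp in enumerate(target_fps)]
    hits = [(i, score) for i, score in scored if score >= threshold]
    # Sort by decreasing score; ties fall back to the lower index (assumption).
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:k]


targets = [b"\x0f", b"\x07", b"\xf0", b"\x1f"]
print(knearest(b"\x0f", targets, k=2))
```

Fewer than k hits are returned when the threshold removes too many candidates.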
- chemfp.search.knearest_tanimoto_search_symmetric(arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, query_thresholds: _Optional[list[float]] = None, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search
    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.8)
    for (query_id, hits) in zip(arena.ids, results):
        hits_str = ", ".join(f"{id} {score:.2f}" for (id, score) in hits.get_ids_and_scores())
        print(f"{query_id} -> {hits_str}")
- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
query_thresholds (None or a list of Python floats, or an array of C doubles) – optionally specify per-query thresholds
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a SearchResults
- chemfp.search.knearest_tversky_search_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0, query_thresholds: _Optional[list[float]] = None, batch_size: _Optional[int] = None, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tversky_search_arena(
        queries, targets, k=3, threshold=0.5, alpha=0.5, beta=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a SearchResults
- chemfp.search.knearest_tversky_search_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0) SearchResult ¶
Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the chemfp.search.SearchResult are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tversky_search_fp(
        query_fp, targets, k=3, threshold=0.0, alpha=0.5, beta=0.5)))
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena – the target arena
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- Returns:
a SearchResult
- chemfp.search.knearest_tversky_search_symmetric(arena: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0, query_thresholds: _Optional[list[float]] = None, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search
    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tversky_search_symmetric(
        arena, k=3, threshold=0.8, alpha=0.5, beta=0.5)
    for (query_id, hits) in zip(arena.ids, results):
        hits_str = ", ".join(f"{id} {score:.2f}" for (id, score) in hits.get_ids_and_scores())
        print(f"{query_id} -> {hits_str}")
- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
k (positive integer) – the number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a SearchResults
- chemfp.search.load_npz(source: str | bytes | Path | None | BinaryIO, allow_pickle: bool = False)¶
Load a SearchResult or SearchResults from a NumPy npz file.
The npz file must follow the SciPy compressed sparse row (“csr”) matrix format. It may also contain optional arrays to store the query and target identifiers, as generated by save_npz().
WARNING: NumPy uses the pickle module to store arrays with Python objects like None. The pickle mechanism is NOT SECURE. Quoting https://docs.python.org/3/library/pickle.html
‘It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.’
If the npz file is from a trusted source and contains pickled data then use allow_pickle=True.
- Parameters:
source (a filename, binary file object with a "read" method, or a path object) – where the data will be read
allow_pickle (a boolean) – if True, allow NumPy object arrays stored as a pickle
- Returns:
a SearchResult or SearchResults, depending on the content.
- chemfp.search.partial_count_tanimoto_hits_symmetric(counts: _typing.SearchCounts, arena: _typing.FingerprintArena, threshold: float = 0.7, query_start: int = 0, query_end: _Optional[int] = None, target_start: int = 0, target_end: _Optional[int] = None, num_threads: _typing.NumThreadsType = -1) None ¶
Compute a portion of the symmetric Tanimoto counts
For most cases, use chemfp.search.count_tanimoto_hits_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array
    chemfp.set_num_threads(1)  # Globally disable OpenMP
    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tanimoto_hits_symmetric,
                            counts, arena, threshold=0.2,
                            query_start=row, query_end=min(row+10, n))
    print(counts)
- Parameters:
counts (a contiguous block of integers) – the accumulated Tanimoto counts
arena (a chemfp.arena.FingerprintArena) – the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
query_start (an integer) – the query start row
query_end (an integer, or None to mean the last query row) – the query end row
target_start (an integer) – the target start row
target_end (an integer, or None to mean the last target row) – the target end row
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
None
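The row-block idea can be illustrated without chemfp. The following pure-Python sketch is an assumption-laden stand-in, not chemfp's implementation: fingerprints are Python ints, `tanimoto` and `partial_counts` are hypothetical helpers, and each task returns its own local counts that the main thread merges (a choice made here to avoid unsynchronized updates to a shared list).

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def partial_counts(fps, threshold, query_start, query_end):
    """Count hits from the upper-triangle cells of rows query_start:query_end."""
    local = [0] * len(fps)
    for i in range(query_start, query_end):
        for j in range(i + 1, len(fps)):  # strictly above the diagonal
            if tanimoto(fps[i], fps[j]) >= threshold:
                local[i] += 1  # hit for row i
                local[j] += 1  # mirrored hit for row j, by symmetry
    return local


fps = [0b1111, 0b0111, 0b0011, 0b1100, 0b0001]
n = len(fps)
counts = [0] * n

# Process 2 rows per task, using up to 4 threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    jobs = [executor.submit(partial_counts, fps, 0.5, row, min(row + 2, n))
            for row in range(0, n, 2)]
    for job in jobs:
        for k, value in enumerate(job.result()):
            counts[k] += value

print(counts)
```

Each pair is scored once, in the row block that owns its upper-triangle cell, yet both fingerprints get credit for the hit.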
- chemfp.search.partial_count_tversky_hits_symmetric(counts: _typing.SearchCounts, arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, query_start: int = 0, query_end: _Optional[int] = None, target_start: int = 0, target_end: _Optional[int] = None, num_threads: _typing.NumThreadsType = -1) None ¶
Compute a portion of the symmetric Tversky counts
For most cases, use chemfp.search.count_tversky_hits_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    chemfp.set_num_threads(1)  # Globally disable OpenMP

    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tversky_hits_symmetric,
                            counts, arena, threshold=0.2, alpha=0.5, beta=0.5,
                            query_start=row, query_end=min(row+10, n))

    print(counts)

- Parameters:
counts (a contiguous block of integers) – the accumulated Tversky counts
arena (a chemfp.arena.FingerprintArena) – the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
query_start (an integer) – the query start row
query_end (an integer, or None to mean the last query row) – the query end row
target_start (an integer) – the target start row
target_end (an integer, or None to mean the last target row) – the target end row
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
None
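The alpha and beta parameters weight the bits unique to each of the two fingerprints. The sketch below assumes the standard Tversky index, |A∩B| / (|A∩B| + alpha·|A−B| + beta·|B−A|), with fingerprints stored as Python ints. The helper names are illustrative rather than chemfp API, and returning 0.0 for a zero denominator is an assumption of this sketch.

```python
def popcount(x):
    """Number of set bits in a non-negative Python int."""
    return bin(x).count("1")


def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index of two fingerprints stored as Python ints."""
    common = popcount(a & b)
    only_a = popcount(a & ~b)  # bits set in a but not in b
    only_b = popcount(b & ~a)  # bits set in b but not in a
    denominator = common + alpha * only_a + beta * only_b
    return common / denominator if denominator else 0.0


a, b = 0b1110, 0b0111
tanimoto_like = tversky(a, b, alpha=1.0, beta=1.0)  # same value as Tanimoto
dice_like = tversky(a, b, alpha=0.5, beta=0.5)      # same value as Dice

# With alpha != beta the score depends on which fingerprint comes first.
small, large = 0b0011, 0b0111
forward = tversky(small, large, alpha=1.0, beta=0.0)
reverse = tversky(large, small, alpha=1.0, beta=0.0)
print(tanimoto_like, dice_like, forward, reverse)
```

The last two values differ, which is why symmetry-based shortcuts only make sense when alpha equals beta.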
- chemfp.search.partial_threshold_tanimoto_search_symmetric(results: SearchResults, arena: _typing.FingerprintArena, threshold: float = 0.7, query_start: int = 0, query_end: _Optional[int] = None, target_start: int = 0, target_end: _Optional[int] = None, results_offset: int = 0, num_threads: _typing.NumThreadsType = -1) None ¶
Compute a portion of the symmetric Tanimoto search results
For most cases, use chemfp.search.threshold_tanimoto_search_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle().
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

    import chemfp
    from chemfp.search import (SearchResults, MatrixType,
                               partial_threshold_tanimoto_search_symmetric)
    from concurrent.futures import ThreadPoolExecutor

    chemfp.set_num_threads(1)

    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = SearchResults(n, n, query_ids=arena.ids, target_ids=arena.ids,
                            matrix_type=MatrixType.NO_DIAGONAL)

    with ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(
                partial_threshold_tanimoto_search_symmetric,
                results, arena, threshold=0.2,
                query_start=row, query_end=min(row+10, n))

    chemfp.search.fill_lower_triangle(results)

The hits in the chemfp.search.SearchResults are in arbitrary order.
- Parameters:
results (a chemfp.search.SearchResults instance) – the intermediate search results
arena (a chemfp.arena.FingerprintArena) – the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
query_start (an integer) – the query start row
query_end (an integer, or None to mean the last query row) – the query end row
target_start (an integer) – the target start row
target_end (an integer, or None to mean the last target row) – the target end row
results_offset (an integer) – use results[results_offset] as the base for the results
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
None
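What "filling in the lower triangle" means can be shown with plain lists. This sketch is only an illustration of the idea: each row is a list of `(target_index, score)` pairs standing in for a SearchResults row, and `fill_lower_triangle_sketch` is a hypothetical name, not the chemfp function.

```python
def fill_lower_triangle_sketch(rows):
    """Mirror each upper-triangle hit (i -> j, score) into row j as (i, score)."""
    # Snapshot the upper-triangle hits first so that entries added
    # during mirroring are not themselves mirrored again.
    upper = [(i, j, score)
             for i, hits in enumerate(rows)
             for j, score in hits
             if j > i]
    for i, j, score in upper:
        rows[j].append((i, score))


# Upper-triangle hits only: row 0 hits columns 1 and 2, row 1 hits column 2.
rows = [[(1, 0.75), (2, 0.5)], [(2, 0.4)], []]
hits_before = sum(map(len, rows))

fill_lower_triangle_sketch(rows)
hits_after = sum(map(len, rows))
print(rows)
```

After mirroring, the total number of hits doubles, because every off-diagonal hit now appears once in each of its two rows.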
- chemfp.search.partial_threshold_tversky_search_symmetric(results: SearchResults, arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, query_start: int = 0, query_end: _Optional[int] = None, target_start: int = 0, target_end: _Optional[int] = None, results_offset: int = 0, num_threads: _typing.NumThreadsType = -1) None ¶
Compute a portion of the symmetric Tversky search results
For most cases, use chemfp.search.threshold_tversky_search_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle().
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

    import chemfp
    from chemfp.search import (SearchResults, MatrixType,
                               partial_threshold_tversky_search_symmetric)
    from concurrent.futures import ThreadPoolExecutor

    chemfp.set_num_threads(1)

    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = SearchResults(
        n, n, query_ids=arena.ids, target_ids=arena.ids,
        matrix_type=MatrixType.STRICTLY_UPPER_TRIANGULAR)

    with ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(
                partial_threshold_tversky_search_symmetric,
                results, arena, threshold=0.2, alpha=0.5, beta=0.5,
                query_start=row, query_end=min(row+10, n))

    chemfp.search.fill_lower_triangle(results)

The hits in the chemfp.search.SearchResults are in arbitrary order.
- Parameters:
results (a chemfp.search.SearchResults instance) – the intermediate search results
arena (a chemfp.arena.FingerprintArena) – the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
query_start (an integer) – the query start row
query_end (an integer, or None to mean the last query row) – the query end row
target_start (an integer) – the target start row
target_end (an integer, or None to mean the last target row) – the target end row
results_offset (an integer) – use results[results_offset] as the base for the results
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
None
- chemfp.search.save_npz(destination: str | bytes | Path | None | BinaryIO, result: SearchResult | SearchResults, compressed: bool = True)¶
Save a SearchResult or SearchResults to a NumPy npz file.
Chemfp stores search results in an “npz” file following the layout for a SciPy compressed sparse row (“csr”) matrix.
An “npz” file is a zipfile containing “npy” files. Each npy file contains a NumPy array in a format which is easy and quick to load. The entries for a csr matrix are:
“format.npy”: an array containing the word “csr”
“shape.npy”: an array containing (num_rows, num_columns)
“indptr.npy”: an array of num_rows+1, pointing to the [start, end) of each row (dtype=int32)
“indices.npy”: an array of sparse element indices (dtype=int32)
“data.npy”: an array of sparse element values, in this case, scores (dtype=double)
Chemfp adds several optional additional arrays:
“chemfp.npy”: a JSON dictionary containing search result parameters
“ids.npy”: contains the query and target ids, if they are identical
“query_ids.npy”: the query ids
“target_ids.npy”: the target ids
- Parameters:
destination (a filename, binary file object with a "write" method, or a path object) – where the data will be written
result (a SearchResult or SearchResults or the result of simsearch()) – the chemfp object to save
compressed (a boolean) – if True, compress the arrays.
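The roles of the indptr, indices, and data arrays in a compressed sparse row layout can be sketched in plain Python. This is an illustration of the layout only, with Python lists standing in for NumPy arrays. `to_csr` and `csr_row` are hypothetical helpers, not chemfp or SciPy functions.

```python
def to_csr(rows):
    """Flatten per-row (column, score) hits into indptr, indices, and data lists."""
    indptr, indices, data = [0], [], []
    for hits in rows:
        for column, score in hits:
            indices.append(column)
            data.append(score)
        indptr.append(len(indices))  # end of this row, start of the next
    return indptr, indices, data


def csr_row(indptr, indices, data, row):
    """Recover the (column, score) hits for one row from the csr lists."""
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], data[start:end]))


# Three query rows and four target columns; the middle row has no hits.
rows = [[(1, 0.9), (3, 0.7)], [], [(0, 0.8)]]
indptr, indices, data = to_csr(rows)
print(indptr, indices, data)
print(csr_row(indptr, indices, data, 0))
```

Row i occupies the half-open range [indptr[i], indptr[i+1]) of indices and data, so an empty row simply repeats the previous indptr value.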
- chemfp.search.threshold_tanimoto_search_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, threshold: float = 0.7, batch_size: _Optional[int] = None, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:

    import chemfp
    import chemfp.search

    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tanimoto_search_arena(queries, targets, threshold=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))

- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
a chemfp.search.SearchResults instance
- chemfp.search.threshold_tanimoto_search_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, threshold: float = 0.7) SearchResult ¶
Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:

    import chemfp
    import chemfp.search

    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.15)))

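The semantics of a single-fingerprint threshold search can be sketched in pure Python. This is a slow illustration under stated assumptions, not how chemfp performs the search: fingerprints are equal-length byte strings, hits are `(target_index, score)` pairs, and `tanimoto_bytes` and `threshold_search_sketch` are hypothetical names.

```python
def tanimoto_bytes(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = sum(bin(x & y).count("1") for x, y in zip(fp1, fp2))
    either = sum(bin(x | y).count("1") for x, y in zip(fp1, fp2))
    return both / either if either else 0.0


def threshold_search_sketch(query_fp, target_fps, threshold=0.7):
    """Return (index, score) for every target scoring at or above threshold."""
    hits = []
    for index, target_fp in enumerate(target_fps):
        score = tanimoto_bytes(query_fp, target_fp)
        if score >= threshold:
            hits.append((index, score))
    return hits


query_fp = bytes([0b11110000, 0b00001111])
target_fps = [
    bytes([0b11110000, 0b00001111]),  # identical to the query
    bytes([0b11110000, 0b00000000]),  # shares half of the query bits
    bytes([0b00000000, 0b00000001]),  # shares a single bit
]
hits = threshold_search_sketch(query_fp, target_fps, threshold=0.5)
print(hits)
```

The comparison is inclusive, so a target scoring exactly at the threshold is reported as a hit.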
- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena (a chemfp.arena.FingerprintArena) – the target arena
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
a chemfp.search.SearchResult instance
- chemfp.search.threshold_tanimoto_search_symmetric(arena: _typing.FingerprintArena, threshold: float = 0.7, include_lower_triangle: bool = True, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:

    import chemfp
    import chemfp.search

    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.2)
    upper_triangle = chemfp.search.threshold_tanimoto_search_symmetric(
        arena, threshold=0.2, include_lower_triangle=False)
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2

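The reason batching helps with ^C can be sketched in plain Python. In the sketch below, `process_in_batches`, `process_rows`, and the `(start, end)` callback signature are all hypothetical. They illustrate the control flow only and do not describe the arguments chemfp passes to batch_callback.

```python
def process_in_batches(num_rows, batch_size, process_rows, batch_callback=None):
    """Call process_rows(start, end) on successive slices of at most batch_size rows."""
    for start in range(0, num_rows, batch_size):
        end = min(start + batch_size, num_rows)
        # Stands in for one long-running native call that cannot be interrupted.
        process_rows(start, end)
        # Control is back in the Python interpreter here, so a pending
        # KeyboardInterrupt (^C) can be raised between batches.
        if batch_callback is not None:
            batch_callback(start, end)


seen = []
process_in_batches(
    25, 10,
    process_rows=lambda start, end: None,
    batch_callback=lambda start, end: seen.append((start, end)))
print(seen)
```

A smaller batch size gives more frequent chances to interrupt, at the cost of more round trips between Python and the search code.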
- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a chemfp.search.SearchResults instance
- chemfp.search.threshold_tversky_search_arena(query_arena: _typing.FingerprintArena, target_arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, batch_size: _Optional[int] = None, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:

    import chemfp
    import chemfp.search

    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tversky_search_arena(
        queries, targets, threshold=0.5, alpha=0.5, beta=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))

- Parameters:
query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a chemfp.search.SearchResults instance
- chemfp.search.threshold_tversky_search_fp(query_fp: bytes, target_arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) SearchResult ¶
Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:

    import chemfp
    import chemfp.search

    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tversky_search_fp(
        query_fp, targets, threshold=0.15, alpha=0.5, beta=0.5)))

- Parameters:
query_fp (a byte string) – the query fingerprint
target_arena (a chemfp.arena.FingerprintArena) – the target arena
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- Returns:
a chemfp.search.SearchResult instance
- chemfp.search.threshold_tversky_search_symmetric(arena: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, include_lower_triangle: bool = True, batch_size: int = DEFAULT_BATCH_SIZE, batch_callback: _typing.OptionalResultsCallback = None, num_threads: _typing.NumThreadsType = -1) SearchResults ¶
Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:

    import chemfp
    import chemfp.search

    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    upper_triangle = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5, include_lower_triangle=False)
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2

- Parameters:
arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
include_lower_triangle (boolean) – if False and alpha == beta, compute only the upper triangle, otherwise use symmetry to compute the full matrix
batch_size (integer) – the number of rows to process before checking for a ^C
num_threads (a non-negative integer, or -1 to use the value of get_num_threads()) – The number of OpenMP threads to use (if available)
- Returns:
a chemfp.search.SearchResults instance