chemfp.highlevel.similarity module¶
This module should not be imported directly.
It contains internal implementation details of the high-level API available from the top-level chemfp module.
This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.
- class chemfp.highlevel.similarity.BaseSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)¶
Bases:
object
This is the base class for the objects returned by simsearch().
It contains the query parameters, search results, and timings.
In addition, it is a context manager for any files which may have been opened.
- close()¶
Close any associated files
- get_description()¶
Return a human-readable description of the simsearch run
- property matrix_type¶
- property matrix_type_name¶
- property out¶
an API experiment to see if “out” is a better name than “result”.
- property target_ids¶
Return the target identifiers
- class chemfp.highlevel.similarity.MultiQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)¶
Bases:
BaseSimsearch
- count_all(min_score=None, max_score=None, interval='[]')¶
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count_all(). See SearchResults.count_all().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
an integer count
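The interval rules described above can be sketched in plain Python. This is only an illustration of the semantics, not chemfp's implementation; the helper name count_in_range and the sample scores are invented for the example.

```python
import math

def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Count scores inside the requested range (illustrative helper, not a chemfp API)."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    # None means an unbounded end, as described above.
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    # "[" and "]" include the end point; "(" and ")" exclude it.
    lo_ok = (lambda s: s >= lo) if interval[0] == "[" else (lambda s: s > lo)
    hi_ok = (lambda s: s <= hi) if interval[1] == "]" else (lambda s: s < hi)
    return sum(1 for s in scores if lo_ok(s) and hi_ok(s))

scores = [0.4, 0.5, 0.7, 0.7, 1.0]                     # made-up sample scores
closed_count = count_in_range(scores, 0.4, 0.7)        # 0.4 <= score <= 0.7
open_count = count_in_range(scores, 0.4, 0.7, "()")    # 0.4 < score < 0.7
total = count_in_range(scores)                         # defaults count every hit
```

With the closed interval both end points are counted, so the two 0.7 scores and the 0.4 score are included; with the open interval only 0.5 remains.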
- cumulative_score_all(min_score=None, max_score=None, interval='[]')¶
The sum of all scores in all rows which are between min_score and max_score
Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
a floating point value
- iter_ids()¶
For each row, yield the list of target identifiers for that row’s hits
Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().
- iter_ids_and_scores()¶
For each row, yield the list of (target id, score) tuples for that row’s hits
Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().
- iter_indices()¶
For each row, yield the list of target indices for that row’s hits
Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().
- iter_indices_and_scores()¶
For each row, yield the list of (target index, score) tuples for that row’s hits
Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().
- iter_scores()¶
For each row, yield the list of scores for that row’s hits
Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().
- property query_ids¶
- reorder_all(order='decreasing-score-plus')¶
Reorder the hits for all of the rows based on the requested order.
Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
order (string) – the name of the ordering to use
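The sort-based orderings listed above can be sketched on a plain list of (index, score) pairs. The helper reorder_hits is hypothetical and returns a new list, whereas chemfp reorders its own data; "move-closest-first" is left out because the description above does not say how the remaining hits are arranged, and the tie order for the orderings without "-plus" is simply whatever Python's stable sort gives, which is an assumption of this sketch.

```python
def reorder_hits(hits, order="decreasing-score-plus"):
    """Return a reordered copy of a list of (index, score) pairs.

    Hypothetical helper for illustration only; not part of chemfp.
    """
    if order == "increasing-score":
        return sorted(hits, key=lambda hit: hit[1])
    if order == "decreasing-score":
        return sorted(hits, key=lambda hit: -hit[1])
    if order == "increasing-score-plus":
        # Ties on score are broken by increasing index.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if order == "decreasing-score-plus":
        # Ties on score are broken by increasing index.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if order == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if order == "decreasing-index":
        return sorted(hits, key=lambda hit: -hit[0])
    if order == "reverse":
        return hits[::-1]
    raise ValueError("unknown ordering: %r" % (order,))

hits = [(7, 0.8), (2, 0.9), (5, 0.8), (1, 0.6)]    # made-up (index, score) pairs
by_score = reorder_hits(hits)                       # default ordering
by_index = reorder_hits(hits, "increasing-index")
```

In by_score the two hits with score 0.8 appear as index 5 and then index 7, which is the tie-breaking rule of "decreasing-score-plus".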
- save(destination, format=None, compressed=True)¶
Save the SearchResults to the given destination
Shortcut for obj.result.save(). See SearchResults.save().
Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.
Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.
- Parameters:
destination (a filename, binary file object, or None for stdout) – where to write the results
format (None or 'npz') – the output format name (default: always ‘npz’)
compressed – if True (the default), use zipfile compression
- property shape: Tuple[int, int]¶
the tuple (number of rows, number of columns)
return the (number of queries, number of targets)
The number of columns is the size of the target arena.
- to_csr(dtype=None)¶
Return the results as a SciPy compressed sparse row matrix.
Shortcut for obj.result.to_csr(). See SearchResults.to_csr().
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that SciPy (and NumPy) be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
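For readers unfamiliar with the compressed sparse row layout, here is a minimal sketch in plain Python of how per-query hit lists map onto the three CSR arrays. It does not use SciPy or chemfp; the function rows_to_csr and the sample data are made up, and the names indptr, indices and data follow the usual CSR terminology.

```python
def rows_to_csr(rows):
    """Convert per-query lists of (target index, score) pairs to CSR arrays.

    Illustrative only: returns plain Python lists rather than NumPy arrays.
    """
    indptr = [0]
    indices = []
    data = []
    for row in rows:
        for target_index, score in row:
            indices.append(target_index)
            data.append(score)
        # indptr[i]:indptr[i+1] is the slice holding the hits of row i.
        indptr.append(len(indices))
    return indptr, indices, data

rows = [
    [(3, 0.9), (0, 0.5)],   # query 0 has two hits
    [],                     # query 1 has no hits
    [(2, 0.7)],             # query 2 has one hit
]
indptr, indices, data = rows_to_csr(rows)
```

Only the hits are stored, which is why a sparse matrix is usually far smaller than the dense (number of queries, number of targets) array.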
- to_numpy_array(dtype=None)¶
Return the results as a (dense) NumPy array
Shortcut for obj.result.to_numpy_array(). See SearchResults.to_numpy_array().
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that NumPy be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
- to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))¶
Return a pandas DataFrame with query_id, target_id and score columns
Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame’s groupby() method to group results by query id, for example:
>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...            k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()
- Parameters:
columns (a list of three strings) – column names for the returned DataFrame
empty (a 2-element tuple of (target id, score), or None) – the target id and score used for queries with no hits, or None to not include a row for that case
- Returns:
a pandas DataFrame
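The handling of queries with no hits can be sketched without pandas. The helper hit_rows below is hypothetical; it only builds the (query_id, target_id, score) tuples that would become table rows, following the description of the empty parameter above.

```python
def hit_rows(query_ids, hits_per_query, empty=("*", None)):
    """Build (query_id, target_id, score) rows, one per hit.

    Hypothetical helper: hits_per_query holds one list of
    (target_id, score) pairs for each query id.
    """
    rows = []
    for query_id, hits in zip(query_ids, hits_per_query):
        if hits:
            rows.extend((query_id, target_id, score) for target_id, score in hits)
        elif empty is not None:
            # Placeholder row for a query with no hits.
            placeholder_id, placeholder_score = empty
            rows.append((query_id, placeholder_id, placeholder_score))
    return rows

query_ids = ["Q1", "Q2"]                                 # made-up identifiers
hits_per_query = [[("T5", 0.91), ("T2", 0.84)], []]      # Q2 has no hits
default_rows = hit_rows(query_ids, hits_per_query)
skipped_rows = hit_rows(query_ids, hits_per_query, empty=None)
```

With the default, Q2 contributes one placeholder row; with empty=None it contributes nothing, so Q2 would not show up in a later groupby on the query id.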
- class chemfp.highlevel.similarity.NxNSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)¶
Bases:
BaseSimsearch
- count_all(min_score=None, max_score=None, interval='[]')¶
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count_all(). See SearchResults.count_all().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
an integer count
- cumulative_score_all(min_score=None, max_score=None, interval='[]')¶
The sum of all scores in all rows which are between min_score and max_score
Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
a floating point value
- iter_ids()¶
For each row, yield the list of target identifiers for that row’s hits
Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().
- iter_ids_and_scores()¶
For each row, yield the list of (target id, score) tuples for that row’s hits
Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().
- iter_indices()¶
For each row, yield the list of target indices for that row’s hits
Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().
- iter_indices_and_scores()¶
For each row, yield the list of (target index, score) tuples for that row’s hits
Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().
- iter_scores()¶
For each row, yield the list of scores for that row’s hits
Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().
- property query_ids¶
- reorder_all(order='decreasing-score-plus')¶
Reorder the hits for all of the rows based on the requested order.
Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
order (string) – the name of the ordering to use
- save(destination, format=None, compressed=True)¶
Save the SearchResults to the given destination
Shortcut for obj.result.save(). See SearchResults.save().
Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.
Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.
- Parameters:
destination (a filename, binary file object, or None for stdout) – where to write the results
format (None or 'npz') – the output format name (default: always ‘npz’)
compressed – if True (the default), use zipfile compression
- property shape¶
the tuple (number of rows, number of columns)
Shortcut for obj.result.shape(). See SearchResults.shape().
The number of columns is the size of the target arena.
- to_csr(dtype=None)¶
Return the results as a SciPy compressed sparse row matrix.
Shortcut for obj.result.to_csr(). See SearchResults.to_csr().
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that SciPy (and NumPy) be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
- to_numpy_array(dtype=None)¶
Return the results as a (dense) NumPy array
Shortcut for obj.result.to_numpy_array(). See SearchResults.to_numpy_array().
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.
This method requires that NumPy be installed.
- Parameters:
dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
- to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))¶
Return a pandas DataFrame with query_id, target_id and score columns
Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame’s groupby() method to group results by query id, for example:
>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...            k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()
- Parameters:
columns (a list of three strings) – column names for the returned DataFrame
empty (a 2-element tuple of (target id, score), or None) – the target id and score used for queries with no hits, or None to not include a row for that case
- Returns:
a pandas DataFrame
- class chemfp.highlevel.similarity.SingleQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)¶
Bases:
BaseSimsearch
- as_buffer()¶
Return a Python buffer object for the underlying indices and scores.
Shortcut for obj.result.as_buffer(). See SearchResult.as_buffer().
This provides a byte-oriented view of the raw data. You probably want to use as_ctypes() or as_numpy_array() to get the indices and scores in a more structured form.
Warning
Do not attempt to access the buffer contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
- Returns:
a Python buffer object
- as_ctypes()¶
Return a ctypes view of the underlying indices and scores
Shortcut for obj.result.as_ctypes(). See SearchResult.as_ctypes().
Each (index, score) pair is represented as a ctypes structure named Hit with fields index (c_int) and score (c_double).
For example, to get the score of the 5th entry use:
result.as_ctypes()[4].score
This method returns an array of type (Hit*len(search_result)). Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the ctype array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the search results to NumPy then use as_numpy_array() instead.
- Returns:
a ctypes array of type Hit*len(self)
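To make the layout concrete, here is a stand-alone mock-up of such a structure and array using only the standard ctypes module. The field names and types follow the description above, but this Hit class is defined locally for illustration and is not the class chemfp returns; the values are invented.

```python
import ctypes

class Hit(ctypes.Structure):
    # Mock-up of the (index, score) record described above; not chemfp's class.
    _fields_ = [("index", ctypes.c_int), ("score", ctypes.c_double)]

# An array type holding three hits, analogous to Hit*len(search_result).
HitArray = Hit * 3
hit_array = HitArray(Hit(10, 0.95), Hit(4, 0.80), Hit(7, 0.65))

second_score = hit_array[1].score   # read the score of the 2nd entry
hit_array[1].score = 0.5            # writes go straight into the array's memory
```

The assignment on the last line changes the array itself, which is the same behavior the warning above is about: in chemfp the equivalent write would alter the search result's own data.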
- as_numpy_array()¶
Return a NumPy array view of the underlying indices and scores
Shortcut for obj.result.as_numpy_array(). See SearchResult.as_numpy_array().
The view uses a structured type with fields ‘index’ (i4) and ‘score’ (f8), mapped directly onto chemfp’s own data structure. For example, to get the score of the 4th entry use:
result.as_numpy_array()["score"][3]
-or-
result.as_numpy_array()[3][1]
Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the NumPy array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
As a short-hand to get just the indices or just the scores, use get_indices_as_numpy_array() or get_scores_as_numpy_array().
- Returns:
a NumPy array with a structured data type
- count(min_score=None, max_score=None, interval='[]')¶
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count(). See SearchResult.count().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
an integer count
- cumulative_score(min_score=None, max_score=None, interval='[]')¶
The sum of the scores which are between min_score and max_score
Shortcut for obj.result.cumulative_score(). See SearchResult.cumulative_score().
Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
- Parameters:
min_score (a float, or None for -infinity) – the minimum score in the range.
max_score (a float, or None for +infinity) – the maximum score in the range.
interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
- Returns:
a floating point value
- format_ids_and_scores_as_bytes(ids=None, precision=4)¶
Format the ids and scores as the byte string needed for simsearch output
Shortcut for obj.result.format_ids_and_scores_as_bytes(). See SearchResult.format_ids_and_scores_as_bytes().
If there are no hits then the result is the empty byte string b"", otherwise it returns a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], …
If ids is not specified then the ids come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings.
The precision sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive.
This function is 3-4x faster than the Python equivalent, which is roughly:
ids = ids if (ids is not None) else self.get_ids()
formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores()))
- Parameters:
ids (a list of Unicode strings, or None to use the default) – the identifiers to use for each hit.
precision (an integer from 1 to 10, inclusive) – the precision to use for each score
- Returns:
a byte string
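A self-contained version of the formatting described above might look like the following. It is a sketch written for this page, not chemfp code: the function name format_ids_and_scores is made up, ids and scores are passed in explicitly, and encoding the ids as UTF-8 is an assumption.

```python
def format_ids_and_scores(ids, scores, precision=4):
    """Pure-Python sketch of the tab-separated output described above."""
    if not 1 <= precision <= 10:
        raise ValueError("precision must be between 1 and 10, inclusive")
    template = "%s\t%." + str(precision) + "f"
    # Build each "id<TAB>score" field, join the fields with tabs,
    # then encode once at the end (UTF-8 is an assumption of this sketch).
    text = "\t".join(template % pair for pair in zip(ids, scores))
    return text.encode("utf8")

# Made-up identifiers and scores.
line = format_ids_and_scores(["CHEMBL1", "CHEMBL2"], [1.0, 0.75], precision=3)
empty_line = format_ids_and_scores([], [])
```

Formatting as text and encoding once avoids mixing str ids with a bytes template, and an empty hit list naturally gives the empty byte string.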
- get_ids()¶
The list of target identifiers (if available), in the current ordering
Shortcut for obj.result.get_ids(). See SearchResult.get_ids().
- Returns:
a list of strings
- get_ids_and_scores()¶
The list of (target identifier, target score) pairs, in the current ordering
Shortcut for obj.result.get_ids_and_scores(). See SearchResult.get_ids_and_scores().
Raises a TypeError if the target IDs are not available.
- Returns:
a Python list of 2-element tuples
- get_indices()¶
The list of target indices, in the current ordering.
Shortcut for obj.result.get_indices(). See SearchResult.get_indices().
This returns a copy of the indices. See get_indices_as_numpy_array() to get a NumPy array view of the indices.
- Returns:
an array.array() of type ‘i’
- get_indices_and_scores()¶
The list of (target index, target score) pairs, in the current ordering
Shortcut for obj.result.get_indices_and_scores(). See SearchResult.get_indices_and_scores().
- Returns:
a Python list of 2-element tuples
- get_indices_as_numpy_array()¶
Return a NumPy array view of the underlying indices.
Shortcut for obj.result.get_indices_as_numpy_array(). See SearchResult.get_indices_as_numpy_array().
This is a short-cut for self.as_numpy_array()["index"]. See that method’s documentation for details and warnings.
- Returns:
a NumPy array of type ‘i4’
- get_scores()¶
The list of target scores, in the current ordering
Shortcut for obj.result.get_scores(). See SearchResult.get_scores().
This returns a copy of the scores. See get_scores_as_numpy_array() to get a NumPy array view of the scores.
- Returns:
an array.array() of type ‘d’
- get_scores_as_numpy_array()¶
Return a NumPy array view of the underlying scores.
Shortcut for obj.result.get_scores_as_numpy_array(). See SearchResult.get_scores_as_numpy_array().
This is a short-cut for self.as_numpy_array()["score"]. See that method’s documentation for details and warnings.
- Returns:
a NumPy array of type ‘f8’
- iter_ids()¶
Iterate over target identifiers (if available), in the current ordering
Shortcut for obj.result.iter_ids(). See SearchResult.iter_ids().
- max()¶
Return the value of the largest score
Shortcut for obj.result.max(). See SearchResult.max().
Returns 0.0 if there are no results.
- Returns:
a float
- min()¶
Return the value of the smallest score
Shortcut for obj.result.min(). See SearchResult.min().
Returns 0.0 if there are no results.
- Returns:
a float
- property query_id¶
Return the corresponding query id, if available, else None
Shortcut for simsearch.result.query_id. See SearchResult.query_id.
- reorder(ordering='decreasing-score-plus')¶
Reorder the hits based on the requested ordering.
Shortcut for obj.result.reorder(). See SearchResult.reorder().
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
ordering (string) – the name of the ordering to use
- save(destination, format=None, compressed=True)¶
Save the SearchResult to the given destination
Shortcut for obj.result.save(). See SearchResult.save().
Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResult is stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means it can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a single SearchResult.
Use chemfp.search.load_npz() to read the similarity search result back into a SearchResult instance.
- Parameters:
destination (a filename, binary file object, or None for stdout) – where to write the results
format (None or 'npz') – the output format name (default: always ‘npz’)
compressed – if True (the default), use zipfile compression
- property shape: Tuple[int]¶
return the (number of targets,)
- to_pandas(*, columns=['target_id', 'score'])¶
Return a pandas DataFrame with the target ids and scores
Shortcut for obj.result.to_pandas(). See SearchResult.to_pandas().
The first column contains the ids, the second column contains the scores. The default column headers are “target_id” and “score”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame