chemfp.highlevel.similarity module

This module should not be imported directly.

It contains internal implementation details of the high-level API available from the top-level chemfp module.

This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.

class chemfp.highlevel.similarity.BaseSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)

Bases: object

This is the base class for the objects returned by simsearch()

It contains the query parameters, search results, and timings.

In addition, it is a context manager for any files which may have been opened.

close()

Close any associated files

get_description()

Return a human-readable description of the simsearch run

property matrix_type
property matrix_type_name
property out

An API experiment to see if “out” is a better name than “result”.

property target_ids

Return the target identifiers

class chemfp.highlevel.similarity.MultiQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)

Bases: BaseSimsearch

count_all(min_score=None, max_score=None, interval='[]')

Count the number of hits with a score between min_score and max_score

Shortcut for obj.result.count_all(). See SearchResults.count_all().

Using the default parameters this returns the number of hits in the result.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

an integer count
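
The interval rules above can be sketched in plain Python. This only illustrates the documented semantics and is not chemfp's implementation; the helper names in_interval and count_in_range are hypothetical.

```python
import math

def in_interval(score, min_score=None, max_score=None, interval="[]"):
    """Return True if score lies in the range, honoring open/closed ends."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    # None means "no bound" on that side.
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    # "[" and "]" include the end point; "(" and ")" exclude it.
    lower_ok = (lo <= score) if interval[0] == "[" else (lo < score)
    upper_ok = (score <= hi) if interval[1] == "]" else (score < hi)
    return lower_ok and upper_ok

def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Hypothetical stand-in for counting the hits in a score range."""
    return sum(1 for s in scores if in_interval(s, min_score, max_score, interval))

scores = [0.4, 0.5, 0.7, 1.0]
print(count_in_range(scores))                   # 4: defaults count every hit
print(count_in_range(scores, 0.5, 1.0))         # 3: closed, keeps 0.5 and 1.0
print(count_in_range(scores, 0.5, 1.0, "()"))   # 1: open, keeps only 0.7
```

The same filter, summed over the scores instead of counted, gives the cumulative score for a range.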

cumulative_score_all(min_score=None, max_score=None, interval='[]')

The sum of all scores in all rows which are between min_score and max_score

Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().

Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

a floating point value

iter_ids()

For each hit, yield the list of target identifiers

Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().

iter_ids_and_scores()

For each hit, yield the list of (target id, score) tuples

Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().

iter_indices()

For each hit, yield the list of target indices

Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().

iter_indices_and_scores()

For each hit, yield the list of (target index, score) tuples

Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().

iter_scores()

For each hit, yield the list of target scores

Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().

property query_ids
reorder_all(order='decreasing-score-plus')

Reorder the hits for all of the rows based on the requested order.

Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().

The available orderings are:

  • increasing-score - sort by increasing score

  • decreasing-score - sort by decreasing score

  • increasing-score-plus - sort by increasing score, break ties by increasing index

  • decreasing-score-plus - sort by decreasing score, break ties by increasing index

  • increasing-index - sort by increasing target index

  • decreasing-index - sort by decreasing target index

  • move-closest-first - move the hit with the highest score to the first position

  • reverse - reverse the current ordering

Parameters:

ordering (string) – the name of the ordering to use
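
As a rough illustration of what the sort-based orderings mean, the sketch below applies them to a plain list of (index, score) pairs. It is not chemfp's implementation: the function name reorder_hits is hypothetical, it returns a new list instead of reordering in place, and it leaves out move-closest-first. Python's sort is stable, so the orderings without "-plus" keep tied hits in their input order here, which chemfp may not guarantee.

```python
def reorder_hits(hits, ordering="decreasing-score-plus"):
    """Return a reordered copy of a list of (index, score) pairs."""
    if ordering == "increasing-score":
        return sorted(hits, key=lambda hit: hit[1])
    if ordering == "decreasing-score":
        return sorted(hits, key=lambda hit: -hit[1])
    if ordering == "increasing-score-plus":
        # Ties on score are broken by increasing index.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if ordering == "decreasing-score-plus":
        # Highest score first; ties are still broken by increasing index.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if ordering == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda hit: -hit[0])
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError("ordering not covered by this sketch: %r" % (ordering,))

hits = [(7, 0.8), (2, 0.9), (5, 0.8), (1, 0.9)]
print(reorder_hits(hits))
# [(1, 0.9), (2, 0.9), (5, 0.8), (7, 0.8)]
print(reorder_hits(hits, "increasing-score-plus"))
# [(5, 0.8), (7, 0.8), (1, 0.9), (2, 0.9)]
```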

save(destination, format=None, compressed=True)

Save the SearchResults to the given destination

Shortcut for obj.result.save(). See SearchResults.save().

Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().

Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.

Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.

Parameters:
  • destination (a filename, binary file object, or None for stdout) – where to write the results

  • format (None or 'npz') – the output format name (default: always ‘npz’)

  • compressed – if True (the default), use zipfile compression
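
For readers unfamiliar with the compressed sparse row layout mentioned above, here is a minimal pure-Python sketch of how per-query hit lists flatten into the usual CSR triple. The function name to_csr_arrays is hypothetical, and the sketch says nothing about the entry names chemfp uses inside the npz file.

```python
def to_csr_arrays(rows):
    """Flatten per-query hit lists into CSR-style (data, indices, indptr).

    rows has one entry per query; each entry is a list of
    (target_index, score) pairs.
    """
    data, indices, indptr = [], [], [0]
    for hits in rows:
        for target_index, score in hits:
            indices.append(target_index)
            data.append(score)
        # data[indptr[i]:indptr[i+1]] holds the scores for row i
        indptr.append(len(data))
    return data, indices, indptr

rows = [
    [(3, 0.9), (0, 0.5)],   # query 0 has two hits
    [],                     # query 1 has no hits
    [(2, 0.7)],             # query 2 has one hit
]
data, indices, indptr = to_csr_arrays(rows)
print(data)     # [0.9, 0.5, 0.7]
print(indices)  # [3, 0, 2]
print(indptr)   # [0, 2, 2, 3]
```

A query with no hits contributes nothing to data or indices; it only repeats the previous value in indptr.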

property shape: Tuple[int, int]

the tuple (number of rows, number of columns)

return the (number of queries, number of targets)

The number of columns is the size of the target arena.

to_csr(dtype=None)

Return the results as a SciPy compressed sparse row matrix.

Shortcut for obj.result.to_csr(). See SearchResults.to_csr().

By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.

This method requires that SciPy (and NumPy) be installed.

Parameters:

dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)
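
The precision loss from storing a score as “float32” can be seen with the standard library alone. In this sketch the helper name to_float32 is hypothetical, and the ratio 1/3 stands in for a score computed from two integer counts, as in the int/int step mentioned above.

```python
import struct

def to_float32(value):
    """Round a Python float (a C double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

score64 = 1 / 3                 # first rounding: exact ratio -> double
score32 = to_float32(score64)   # second rounding: double -> float

print(score64 == score32)              # False: precision was lost
print(abs(score64 - score32) < 1e-7)   # True: the difference is small
print(to_float32(0.5) == 0.5)          # True: 0.5 is exact in both types
```

Comparisons against a threshold can therefore give different answers for float32 and float64 scores that sit very close to the threshold.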

to_numpy_array(dtype=None)

Return the results as a (dense) NumPy array

Shortcut for obj.result.to_numpy_array(). See SearchResults.to_numpy_array().

The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.

By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.

This method requires that NumPy be installed.

Parameters:

dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)

to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))

Return a pandas DataFrame with query_id, target_id and score columns

Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().

Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.

If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).

Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.

Use the DataFrame’s groupby() method to group results by query id, for example:

>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...        k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()

Parameters:
  • columns (a list of three strings) – column names for the returned DataFrame

  • empty (a 2-element tuple, or None) – the target id and score used for queries with no hits, or None to not include a row for that case

Returns:

a pandas DataFrame
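
The handling of queries with no hits can be sketched without pandas, using plain tuples for the rows. This illustrates the documented behavior of empty only; the function name build_rows and the ids are made up for the example.

```python
def build_rows(query_ids, hits_per_query, empty=("*", None)):
    """Build (query_id, target_id, score) rows, one per hit.

    hits_per_query[i] is the list of (target_id, score) pairs for query i.
    """
    rows = []
    for query_id, hits in zip(query_ids, hits_per_query):
        if hits:
            rows.extend((query_id, target_id, score) for target_id, score in hits)
        elif empty is not None:
            # Placeholder row for a query with no hits
            target_id, score = empty
            rows.append((query_id, target_id, score))
    return rows

query_ids = ["q1", "q2"]
hits_per_query = [[("t5", 0.9), ("t2", 0.6)], []]

print(build_rows(query_ids, hits_per_query))
# [('q1', 't5', 0.9), ('q1', 't2', 0.6), ('q2', '*', None)]
print(build_rows(query_ids, hits_per_query, empty=None))
# [('q1', 't5', 0.9), ('q1', 't2', 0.6)]
```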

class chemfp.highlevel.similarity.NxNSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)

Bases: BaseSimsearch

count_all(min_score=None, max_score=None, interval='[]')

Count the number of hits with a score between min_score and max_score

Shortcut for obj.result.count_all(). See SearchResults.count_all().

Using the default parameters this returns the number of hits in the result.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

an integer count

cumulative_score_all(min_score=None, max_score=None, interval='[]')

The sum of all scores in all rows which are between min_score and max_score

Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().

Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

a floating point value

iter_ids()

For each hit, yield the list of target identifiers

Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().

iter_ids_and_scores()

For each hit, yield the list of (target id, score) tuples

Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().

iter_indices()

For each hit, yield the list of target indices

Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().

iter_indices_and_scores()

For each hit, yield the list of (target index, score) tuples

Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().

iter_scores()

For each hit, yield the list of target scores

Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().

property query_ids
reorder_all(order='decreasing-score-plus')

Reorder the hits for all of the rows based on the requested order.

Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().

The available orderings are:

  • increasing-score - sort by increasing score

  • decreasing-score - sort by decreasing score

  • increasing-score-plus - sort by increasing score, break ties by increasing index

  • decreasing-score-plus - sort by decreasing score, break ties by increasing index

  • increasing-index - sort by increasing target index

  • decreasing-index - sort by decreasing target index

  • move-closest-first - move the hit with the highest score to the first position

  • reverse - reverse the current ordering

Parameters:

ordering (string) – the name of the ordering to use

save(destination, format=None, compressed=True)

Save the SearchResults to the given destination

Shortcut for obj.result.save(). See SearchResults.save().

Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means they can be read with scipy.sparse.load_npz().

Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.

Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.

Parameters:
  • destination (a filename, binary file object, or None for stdout) – where to write the results

  • format (None or 'npz') – the output format name (default: always ‘npz’)

  • compressed – if True (the default), use zipfile compression

property shape

the tuple (number of rows, number of columns)

Shortcut for obj.result.shape. See SearchResults.shape.

The number of columns is the size of the target arena.

to_csr(dtype=None)

Return the results as a SciPy compressed sparse row matrix.

Shortcut for obj.result.to_csr(). See SearchResults.to_csr().

By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.

This method requires that SciPy (and NumPy) be installed.

Parameters:

dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)

to_numpy_array(dtype=None)

Return the results as a (dense) NumPy array

Shortcut for obj.result.to_numpy_array(). See SearchResults.to_numpy_array().

The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.

By default the scores are stored with the dtype of “float64”. You may also use “float32” though mind the double rounding from int/int -> double -> float.

This method requires that NumPy be installed.

Parameters:

dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either “float64” or “float32”)

to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))

Return a pandas DataFrame with query_id, target_id and score columns

Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().

Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.

If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).

Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.

Use the DataFrame’s groupby() method to group results by query id, for example:

>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...        k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()

Parameters:
  • columns (a list of three strings) – column names for the returned DataFrame

  • empty (a 2-element tuple, or None) – the target id and score used for queries with no hits, or None to not include a row for that case

Returns:

a pandas DataFrame

class chemfp.highlevel.similarity.SingleQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)

Bases: BaseSimsearch

as_buffer()

Return a Python buffer object for the underlying indices and scores.

Shortcut for obj.result.as_buffer(). See SearchResult.as_buffer().

This provides a byte-oriented view of the raw data. You probably want to use as_ctypes() or as_numpy_array() to get the indices and scores in a more structured form.

Warning

Do not attempt to access the buffer contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.

Returns:

a Python buffer object

as_ctypes()

Return a ctypes view of the underlying indices and scores

Shortcut for obj.result.as_ctypes(). See SearchResult.as_ctypes().

Each (index, score) pair is represented as a ctypes structure named Hit with fields index (c_int) and score (c_double).

For example, to get the score of the 5th entry use:

result.as_ctypes()[4].score

This method returns an array of type (Hit*len(search_result)). Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!

Warning

Do not attempt to access the ctype array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.

This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the search results to NumPy then use as_numpy_array() instead.

Returns:

a ctypes array of type Hit*len(self)
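
To show what a view of type Hit*N looks like, the sketch below builds one over a plain bytearray. The Hit class here is a stand-in with the documented field names and types, not chemfp's own class, and the bytearray stands in for the memory owned by a search result.

```python
import ctypes

class Hit(ctypes.Structure):
    # Stand-in with the documented layout: index (c_int), score (c_double)
    _fields_ = [("index", ctypes.c_int), ("score", ctypes.c_double)]

# Stand-in for the memory owned by a search result holding 5 hits.
num_hits = 5
storage = bytearray(ctypes.sizeof(Hit) * num_hits)

# An array of type Hit*num_hits that shares memory with `storage`.
hits = (Hit * num_hits).from_buffer(storage)
for i in range(num_hits):
    hits[i].index = 10 + i
    hits[i].score = 1.0 - 0.125 * i

print(hits[4].index, hits[4].score)   # 14 0.5 (the 5th entry)

# A second view over the same bytes sees changes made through the first.
alias = (Hit * num_hits).from_buffer(storage)
hits[4].score = 0.25
print(alias[4].score)                 # 0.25
```

Because both arrays are views of the same bytes, a write through one is visible through the other, which is the behavior the warning above asks you to treat with care.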

as_numpy_array()

Return a NumPy array view of the underlying indices and scores

Shortcut for obj.result.as_numpy_array(). See SearchResult.as_numpy_array().

The view uses a structured type with fields ‘index’ (i4) and ‘score’ (f8), mapped directly onto chemfp’s own data structure. For example, to get the score of the 4th entry use:

result.as_numpy_array()["score"][3]
    -or-
result.as_numpy_array()[3][1]

Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!

Warning

Do not attempt to access the NumPy array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.

As a short-hand to get just the indices or just the scores, use get_indices_as_numpy_array() or get_scores_as_numpy_array().

Returns:

a NumPy array with a structured data type

count(min_score=None, max_score=None, interval='[]')

Count the number of hits with a score between min_score and max_score

Shortcut for obj.result.count(). See SearchResult.count().

Using the default parameters this returns the number of hits in the result.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

an integer count

cumulative_score(min_score=None, max_score=None, interval='[]')

The sum of the scores which are between min_score and max_score

Shortcut for obj.result.cumulative_score(). See SearchResult.cumulative_score().

Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.

  • max_score (a float, or None for +infinity) – the maximum score in the range.

  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.

Returns:

a floating point value

format_ids_and_scores_as_bytes(ids=None, precision=4)

Format the ids and scores as the byte string needed for simsearch output

Shortcut for obj.result.format_ids_and_scores_as_bytes(). See SearchResult.format_ids_and_scores_as_bytes().

If there are no hits then the result is the empty byte string b"", otherwise it returns a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], …

If ids is not specified then the ids come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings.

The precision sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive.

This function is 3-4x faster than the Python equivalent, which is roughly:

ids = ids if (ids is not None) else self.get_ids()
formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores()))

Parameters:
  • ids (a list of Unicode strings, or None to use the default) – the identifiers to use for each hit.

  • precision (an integer from 1 to 10, inclusive) – the precision to use for each score

Returns:

a byte string
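
A self-contained variant of the rough Python equivalent above, written as a free function so it can be run without a search result. It is a sketch only: it formats the fields as text and then encodes them, and the choice of UTF-8 for the ids is an assumption. The ids in the usage lines are made up.

```python
def format_ids_and_scores_as_bytes(ids, scores, precision=4):
    """Pure-Python sketch: tab-separated id/score pairs as a byte string."""
    if not 1 <= precision <= 10:
        raise ValueError("precision must be between 1 and 10, inclusive")
    fields = []
    for id_, score in zip(ids, scores):
        fields.append(id_)
        # "%.*f" takes the number of decimal digits from the argument tuple
        fields.append("%.*f" % (precision, score))
    # Assumption: ids are encoded as UTF-8
    return "\t".join(fields).encode("utf8")

print(format_ids_and_scores_as_bytes(["CHEMBL1", "CHEMBL2"], [1.0, 0.87654321]))
# b'CHEMBL1\t1.0000\tCHEMBL2\t0.8765'
print(format_ids_and_scores_as_bytes([], []))
# b''
```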

get_ids()

The list of target identifiers (if available), in the current ordering

Shortcut for obj.result.get_ids(). See SearchResult.get_ids().

Returns:

a list of strings

get_ids_and_scores()

The list of (target identifier, target score) pairs, in the current ordering

Shortcut for obj.result.get_ids_and_scores(). See SearchResult.get_ids_and_scores().

Raises a TypeError if the target IDs are not available.

Returns:

a Python list of 2-element tuples

get_indices()

The list of target indices, in the current ordering.

Shortcut for obj.result.get_indices(). See SearchResult.get_indices().

This returns a copy of the indices. See get_indices_as_numpy_array() to get a NumPy array view of the indices.

Returns:

an array.array() of type ‘i’

get_indices_and_scores()

The list of (target index, target score) pairs, in the current ordering

Shortcut for obj.result.get_indices_and_scores(). See SearchResult.get_indices_and_scores().

Returns:

a Python list of 2-element tuples

get_indices_as_numpy_array()

Return a NumPy array view of the underlying indices.

Shortcut for obj.result.get_indices_as_numpy_array(). See SearchResult.get_indices_as_numpy_array().

This is a short-cut for self.as_numpy_array()["index"]. See that method's documentation for details and warnings.

Returns:

a NumPy array of type ‘i4’

get_scores()

The list of target scores, in the current ordering

Shortcut for obj.result.get_scores(). See SearchResult.get_scores().

This returns a copy of the scores. See get_scores_as_numpy_array() to get a NumPy array view of the scores.

Returns:

an array.array() of type ‘d’

get_scores_as_numpy_array()

Return a NumPy array view of the underlying scores.

Shortcut for obj.result.get_scores_as_numpy_array(). See SearchResult.get_scores_as_numpy_array().

This is a short-cut for self.as_numpy_array()["score"]. See that method's documentation for details and warnings.

Returns:

a NumPy array of type ‘f8’

iter_ids()

Iterate over target identifiers (if available), in the current ordering

Shortcut for obj.result.iter_ids(). See SearchResult.iter_ids().

max()

Return the value of the largest score

Shortcut for obj.result.max(). See SearchResult.max().

Returns 0.0 if there are no results.

Returns:

a float

min()

Return the value of the smallest score

Shortcut for obj.result.min(). See SearchResult.min().

Returns 0.0 if there are no results.

Returns:

a float

property query_id

Return the corresponding query id, if available, else None

Shortcut for simsearch.result.query_id. See SearchResult.query_id.

reorder(ordering='decreasing-score-plus')

Reorder the hits based on the requested ordering.

Shortcut for obj.result.reorder(). See SearchResult.reorder().

The available orderings are:

  • increasing-score - sort by increasing score

  • decreasing-score - sort by decreasing score

  • increasing-score-plus - sort by increasing score, break ties by increasing index

  • decreasing-score-plus - sort by decreasing score, break ties by increasing index

  • increasing-index - sort by increasing target index

  • decreasing-index - sort by decreasing target index

  • move-closest-first - move the hit with the highest score to the first position

  • reverse - reverse the current ordering

Parameters:

ordering (string) – the name of the ordering to use

save(destination, format=None, compressed=True)

Save the SearchResult to the given destination

Shortcut for obj.result.save(). See SearchResult.save().

Currently only the “npz” format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResult is stored in the same structure as a SciPy compressed sparse row (‘csr’) matrix, which means it can be read with scipy.sparse.load_npz().

Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a single SearchResult.

Use chemfp.search.load_npz() to read the similarity search result back into a SearchResult instance.

Parameters:
  • destination (a filename, binary file object, or None for stdout) – where to write the results

  • format (None or 'npz') – the output format name (default: always ‘npz’)

  • compressed – if True (the default), use zipfile compression

property shape: Tuple[int]

return the (number of targets,)

to_pandas(*, columns=['target_id', 'score'])

Return a pandas DataFrame with the target ids and scores

Shortcut for obj.result.to_pandas(). See SearchResult.to_pandas().

The first column contains the ids, the second column contains the scores. The default column headers are “target_id” and “score”. Use columns to specify different headers.

Parameters:

columns (a list of two strings) – column names for the returned DataFrame

Returns:

a pandas DataFrame