chemfp.fps_search module¶

FPS file similarity search and search result implementations.

Chemfp implements similarity search methods which work directly on FPS files. This can be useful in a streaming environment (where the FPS data is generated on the fly and not saved) when you have at most a handful of queries. In that case, an FPS search is faster than an arena-based search because the FPS parsing overhead is about the same, but the FPS search doesn't have the arena creation or memory overhead that an in-memory search would have.
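To make the streaming idea concrete, here is a minimal sketch of reading FPS records one at a time from a text stream, without building any in-memory collection. It is an illustration only, not chemfp's parser: the helper name iter_fps_records is hypothetical, and it assumes the usual FPS layout of "#" header lines followed by lines containing a hex-encoded fingerprint, a tab, and an identifier.

```python
import io

def iter_fps_records(lines):
    """Yield (id, fingerprint bytes) pairs from FPS-formatted text lines.

    Hypothetical helper for illustration; it is not part of chemfp.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines such as "#FPS1"
        hex_fp, target_id = line.split("\t")[:2]
        yield target_id, bytes.fromhex(hex_fp)

# A tiny in-memory stand-in for data produced on the fly.
stream = io.StringIO("#FPS1\n#num_bits=16\n0f0f\tmol1\nff00\tmol2\n")
records = iter_fps_records(stream)

first = next(records)      # only the first record has been parsed so far
remaining = list(records)  # the rest are parsed on demand
```

Because the generator yields one record at a time, a search over it can score and discard each target fingerprint as it arrives.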

class chemfp.fps_search.FPSSearchResult(ids, scores, query_id=None)¶

Bases: object

Search results for a query fingerprint against a target FPS reader.

The result contains a list of hits. Each hit contains a target id and a score. The hits can be reordered based on id or score.

__getitem__(item)¶: Return the (id, score) pair for the given index, or pairs if item is a slice

__iter__()¶: Iterate through the pairs of (target id, score) using the current ordering

__len__()¶: Return the number of hits.

get_ids()¶

The list of target identifiers in the current ordering.

This returns the same list each time.

get_ids_and_scores()¶: The list of (target identifier, target score) pairs, in the current ordering

get_scores()¶

The list of target scores, in the current ordering.

This returns the same list each time.

query_id¶: The id of the query fingerprint, if available, otherwise None.

reorder(order='decreasing-score')¶

Reorder the hits based on the requested ordering.

The available orderings are:

increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-id - sort by increasing target id
decreasing-id - sort by decreasing target id
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
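The orderings above can be sketched on a plain list of (id, score) pairs. This is a rough illustration of the semantics, not chemfp's code: the function reorder_hits is hypothetical, "index" is assumed to mean the position in the list before reordering, and for move-closest-first the sketch assumes the best hit is swapped with the first one (the list above does not say how the other hits are arranged).

```python
def reorder_hits(hits, order="decreasing-score"):
    """Return a new list of (id, score) hits in the requested order.

    Hypothetical helper for illustration; it is not part of chemfp.
    """
    hits = list(hits)
    if order in ("increasing-score", "increasing-score-plus"):
        # Python's sort is stable, so ties keep their original positions.
        return sorted(hits, key=lambda hit: hit[1])
    if order in ("decreasing-score", "decreasing-score-plus"):
        return sorted(hits, key=lambda hit: -hit[1])
    if order == "increasing-id":
        return sorted(hits, key=lambda hit: hit[0])
    if order == "decreasing-id":
        return sorted(hits, key=lambda hit: hit[0], reverse=True)
    if order == "move-closest-first":
        if hits:
            # Assumption: swap the highest-scoring hit into position 0.
            best = max(range(len(hits)), key=lambda i: hits[i][1])
            hits[0], hits[best] = hits[best], hits[0]
        return hits
    if order == "reverse":
        return hits[::-1]
    raise ValueError("unknown ordering: %r" % (order,))

hits = [("b", 0.5), ("a", 0.9), ("c", 0.5)]
by_score = reorder_hits(hits, "decreasing-score-plus")
by_id = reorder_hits(hits, "increasing-id")
closest_first = reorder_hits(hits, "move-closest-first")
```

When only the single best hit matters, move-closest-first avoids a full sort.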

scores¶: The similarity scores for the hits.

to_pandas(*, columns=['target_id', 'score'])¶

Return a pandas DataFrame with the target ids and scores

The first column contains the ids, the second column contains the scores. The default column headers are “target_id” and “score”. Use columns to specify different headers.

Parameters:: columns (a list of two strings) – column names for the returned DataFrame
Returns:: a pandas DataFrame

class chemfp.fps_search.FPSSearchResults(query_ids, results)¶

Bases: object

Search results for a query arena against a target FPS reader.

__getitem__(i)¶: Return an FPSSearchResult by index

__iter__()¶: Iterate through the search results

__len__()¶: The number of search results in this collection

iter_ids()¶: For each search result, yield the list of target identifiers

iter_ids_and_scores()¶: For each search result, yield the list of target (id, score) tuples

iter_scores()¶: For each search result, yield the list of target scores

query_ids¶: A list of query ids, one for each result. This comes from the query arena’s ids.

reorder_all(order='decreasing-score')¶

Reorder the hits for all of the rows based on the requested order.

The available orderings are:

increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-id - sort by increasing target id
decreasing-id - sort by decreasing target id
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering

to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))¶

Return a pandas DataFrame with query_id, target_id and score columns.

Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.

If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as an NA value).

Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.

Parameters:

columns (a list of three strings) – column names for the returned DataFrame
empty (a 2-element tuple, or None) – the target id and score used for queries with no hits, or None to not include a row for that case

Returns:

a pandas DataFrame
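The row layout described above, including the handling of queries with no hits, can be sketched without pandas. This is only an illustration of the documented behavior under simplifying assumptions: results_to_rows is a hypothetical helper, and each result is represented as a plain list of (target id, score) pairs.

```python
def results_to_rows(query_ids, results, empty=("*", None)):
    """Flatten per-query hit lists into (query_id, target_id, score) rows.

    Hypothetical helper for illustration; it is not part of chemfp.
    """
    rows = []
    for query_id, hits in zip(query_ids, results):
        if hits:
            rows.extend((query_id, target_id, score) for target_id, score in hits)
        elif empty is not None:
            # Placeholder row for a query with no hits.
            rows.append((query_id, empty[0], empty[1]))
        # If empty is None, a query with no hits contributes no row.
    return rows

query_ids = ["q1", "q2"]
results = [[("t1", 0.9), ("t2", 0.8)], []]

default_rows = results_to_rows(query_ids, results)
skipped_rows = results_to_rows(query_ids, results, empty=None)
custom_rows = results_to_rows(query_ids, results, empty=("NONE", 0.0))
```

Rows in this shape could be handed to a table library along with the three column names.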

chemfp.fps_search.count_tanimoto_hits_fp(query_fp, target_reader, threshold=0.7)¶

Count the number of hits in target_reader at least threshold similar to the query_fp

This uses Tanimoto similarity.
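As a rough sketch of what counting hits at a threshold means, the following pure-Python example computes the Tanimoto similarity of byte-string fingerprints (bits in common divided by bits set in either) and counts the targets that reach the threshold. The helper names and the (id, fingerprint) target layout are assumptions for illustration, as is returning 0.0 when neither fingerprint has any bits set; chemfp's own implementation is not shown here.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Assumption: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0

def count_hits(query_fp, targets, threshold=0.7):
    """Count targets whose similarity to query_fp is at least threshold."""
    return sum(1 for _target_id, fp in targets
               if tanimoto(query_fp, fp) >= threshold)

query_fp = bytes.fromhex("0f0f")
targets = [
    ("mol1", bytes.fromhex("0f0f")),  # identical to the query
    ("mol2", bytes.fromhex("0f07")),  # 7 of 8 bits shared
    ("mol3", bytes.fromhex("ff00")),  # 4 shared bits out of 12 set in either
]

num_hits = count_hits(query_fp, targets, threshold=0.7)
```

The targets can be any iterable, so the count works on a stream without keeping the fingerprints in memory.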

chemfp.fps_search.count_tanimoto_hits_arena(query_arena, target_reader, threshold=0.7)¶

For each fingerprint in query_arena, count the number of hits in target_reader at least threshold similar to it

This uses Tanimoto similarity.

chemfp.fps_search.threshold_tanimoto_search_fp(query_fp, target_reader, threshold=0.7)¶

Find matches in the target reader which are at least threshold similar to the query fingerprint

Returns:: an FPSSearchResult instance containing the result.

chemfp.fps_search.threshold_tanimoto_search_arena(query_arena, target_reader, threshold)¶

Find matches in the target reader which are at least threshold similar to the query arena fingerprints

Returns:: an FPSSearchResults instance containing a list of query results.

chemfp.fps_search.knearest_tanimoto_search_fp(query_fp, target_reader, k=3, threshold=0.0)¶

Find the nearest k matches in the target reader which are at least threshold similar to the query fingerprint

This uses Tanimoto similarity.

Returns:: an FPSSearchResult instance containing the result.
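The combination of k and threshold can be sketched with a bounded selection over a stream of targets: discard anything below the threshold, then keep only the k best scores. This is an illustrative sketch using heapq.nlargest from the standard library, not chemfp's algorithm; the helper names and the (id, fingerprint) target layout are assumptions.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0  # assumption: 0.0 for no bits

def knearest(query_fp, targets, k=3, threshold=0.0):
    """Return up to k (id, score) pairs with score >= threshold, best first."""
    scored = ((target_id, tanimoto(query_fp, fp)) for target_id, fp in targets)
    kept = (hit for hit in scored if hit[1] >= threshold)
    # nlargest consumes the stream lazily and retains only k candidates.
    return heapq.nlargest(k, kept, key=lambda hit: hit[1])

query_fp = bytes.fromhex("0f0f")
targets = [
    ("a", bytes.fromhex("0f0f")),  # score 1.0
    ("b", bytes.fromhex("0f07")),  # score 0.875
    ("c", bytes.fromhex("ff00")),  # score 1/3
    ("d", bytes.fromhex("0f00")),  # score 0.5
]

top2 = knearest(query_fp, targets, k=2)
top3_strict = knearest(query_fp, targets, k=3, threshold=0.6)
```

With threshold=0.6 only two targets qualify, so fewer than k hits are returned.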

chemfp.fps_search.knearest_tanimoto_search_arena(query_arena, target_reader, k=3, threshold=0.0)¶

Find the nearest k matches in the target reader which are at least threshold similar to the query arena fingerprints

This uses Tanimoto similarity.

Returns:: an FPSSearchResults instance containing a list of query results.

chemfp.fps_search.count_tversky_hits_fp(query_fp, target_reader, threshold, alpha=1.0, beta=1.0)¶

Count the number of hits in target_reader at least threshold similar to the query_fp
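For readers unfamiliar with the alpha and beta parameters, here is a minimal sketch of the standard Tversky index on byte-string fingerprints: the number of shared bits divided by the shared bits plus alpha times the query-only bits plus beta times the target-only bits. With alpha=beta=1.0 this reduces to Tanimoto, and with alpha=beta=0.5 to Dice. The helper names are hypothetical, and returning 0.0 for a zero denominator is an assumption; chemfp's handling of such edge cases may differ.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Standard Tversky index of two equal-length byte fingerprints."""
    q = int.from_bytes(query_fp, "big")
    t = int.from_bytes(target_fp, "big")
    both = popcount(q & t)
    only_query = popcount(q & ~t)   # bits set only in the query
    only_target = popcount(t & ~q)  # bits set only in the target
    denominator = both + alpha * only_query + beta * only_target
    # Assumption: score 0.0 when the denominator is zero.
    return both / denominator if denominator else 0.0

query_fp = bytes.fromhex("0f0f")   # 8 bits set
target_fp = bytes.fromhex("0f07")  # 7 bits set, all shared with the query

as_tanimoto = tversky(query_fp, target_fp, alpha=1.0, beta=1.0)
as_dice = tversky(query_fp, target_fp, alpha=0.5, beta=0.5)
```

Unequal alpha and beta make the measure asymmetric, which is why the query and target roles matter.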