chemfp.highlevel.clustering module¶

This module should not be imported directly.

It contains internal implementation details of the high-level API available from the top-level chemfp module.

This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.

class chemfp.highlevel.clustering.ButinaClusters(arena, matrix, seed, NxN_threshold, butina_threshold, tiebreaker, false_singletons, num_butina_clusters, rescore, clusterer, result, times, _arena_close, fingerprints_filename, matrix_filename)¶

Bases: object

The result from chemfp.butina(), with query details, search results, and timing information.

The available properties are:

arena: FingerprintArena¶: The FingerprintArena, based on the input fingerprints.

matrix: SearchResults¶: The SearchResults NxN sparse search results, based on the input matrix or generated from the fingerprints.

seed: int¶: The seed for the RNG.

NxN_threshold: float¶: The NxN similarity threshold used to generate the matrix from the input fingerprints.

butina_threshold: float¶: The minimum similarity threshold used for the Butina algorithm.

tiebreaker: str¶: The specified tiebreaker method.

false_singletons: str¶: The specified method for handling false singletons.

num_butina_clusters: int¶: The specified maximum number of clusters, or None.

rescore: bool¶: The flag value used to request that reassigned fingerprints be re-scored.

clusterer: ButinaClusterer¶: The underlying ButinaClusterer object.

result: list[ButinaCluster]¶: The list of ButinaCluster clusters.

times: dict[str, float | None]¶

A dictionary with a breakdown of the times for the search. Each entry either has the elapsed time in seconds, or None if it wasn’t relevant. The keys are:

“load_arena” - the time to load the fingerprint arena
“load_matrix” - the time to load the matrix
“load” - the total load time
“NxN” - the time to compute the NxN matrix from the fingerprints
“cluster” - the time to cluster
“prune” - the time to prune the cluster to the specified size
“rescore” - the time to rescore reassigned elements
“total” - the total time for the chemfp.butina() call.
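As an illustration of how such a dictionary might be consumed, the sketch below formats only the entries that have a value. The `sample_times` dictionary is hand-written data with made-up numbers, not output from chemfp, and `format_times` is a hypothetical helper that is not part of the module.

```python
def format_times(times):
    """Return 'key: seconds' lines for the entries that were measured.

    Entries whose value is None (the step was not relevant) are skipped.
    """
    lines = []
    for key, seconds in times.items():
        if seconds is None:
            continue
        lines.append(f"{key}: {seconds:.2f} s")
    return lines


# Made-up sample values, shaped like the `times` dictionary described above.
sample_times = {
    "load_arena": 0.12,
    "load_matrix": None,  # no matrix file was loaded in this made-up run
    "load": 0.12,
    "NxN": 3.40,
    "cluster": 0.25,
    "prune": None,
    "rescore": None,
    "total": 3.77,
}

report = format_times(sample_times)
print("\n".join(report))
```

For reporting real results, get_times_description() already returns a human-readable breakdown.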

fingerprints_filename: str, bytes, or Path¶: The value of the input fingerprints, if it was a filename.

matrix_filename: str, bytes, or Path¶: The value of the input matrix, if it was a filename.

property all_clusters¶

The full list of ButinaCluster clusters

The clusters are ordered by cluster index and may include empty clusters, due to moving false singletons or pruning the number of clusters.

as_ctypes() → _typing.Sequence[_typing.ButinaAssignment]¶

Return the cluster assignments as a ctypes array.

This returns a ButinaAssignment ctypes array with one entry per input fingerprint.

as_numpy() → _typing.NumPyArray¶: Return the cluster assignments as a NumPy array.

property assignments: _typing.ButinaAssignments¶: Return the assignments as a ButinaAssignments object.

close()¶: Release any assigned resources, like a memory-mapped FPB arena

property clusters¶

The final list of clusters.

This list is ordered from largest to smallest.
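The difference between the two orderings can be pictured with plain Python lists. This is only a sketch: each cluster is represented by a hypothetical list of member ids rather than a ButinaCluster, and it assumes that the final list leaves out the empty clusters, which the descriptions above imply but do not state.

```python
# Hypothetical stand-in for all_clusters: one list of member ids per
# internal cluster index. The cluster at index 1 is empty.
all_clusters = [["A", "D"], [], ["B", "C", "E"], ["F"]]

# Drop the empty clusters, then order from largest to smallest.
# sorted() is stable, so equally sized clusters keep their index order.
clusters = sorted((c for c in all_clusters if c), key=len, reverse=True)

print(clusters)
```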

get_description(include_times: bool = False) → str¶: Return a human-readable description of the Butina clustering

get_metadata() → Dict[str, str]¶: Return a dictionary containing entries for output metadata lines

get_times_description() → str¶: Return a human-readable break-down of the Butina compute times

get_type()¶: Get the ‘type’ string describing the Butina search parameters

Save the clusters to destination in one of several formats.

The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If unspecified, infer the format from the destination filename extension. If the extension is not known, use “centroid”.

If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.

If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.

If include_members is True (the default) then include cluster members in the output.

If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately. (No tab, NUL, or newline character, and the key must not contain an equals sign.)
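The encoding rules for the metadata dictionary can be checked before saving. The following is a minimal sketch of such a check, assuming string keys and values; `check_metadata` is a hypothetical helper, not a chemfp function, and chemfp may report violations differently.

```python
def check_metadata(metadata):
    """Raise ValueError if a key or value breaks the rules stated above."""
    forbidden = ("\t", "\0", "\n")  # tab, NUL, newline
    for key, value in metadata.items():
        for text, label in ((key, "key"), (value, "value")):
            for ch in forbidden:
                if ch in text:
                    raise ValueError(
                        f"metadata {label} {text!r} contains {ch!r}"
                    )
        # Only the key is checked for an equals sign.
        if "=" in key:
            raise ValueError(f"metadata key {key!r} contains an equals sign")


# A made-up entry that satisfies the rules; this call returns without error.
check_metadata({"source": "example.fps"})
```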

If include_metadata is True (the default) then include metadata information in the output file.

If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.

Parameters:

destination (None for stdout, a filename, or file object) – the output destination
format (None or str) – the output format
renumber (bool) – if True, renumber clusters from 1
rename (bool) – if True, use simplified assignment names
include_members (bool) – if False, only output the cluster centers rather than all members
metadata (None or a dict) – extra fields for the output metadata
include_metadata (bool) – if True, include metadata in the output
precision (None or int) – the number of digits used to format the score

to_pandas(*, columns: _typing.Sequence = ['cluster', 'id', 'type', 'score'], rename: bool = True, renumber: bool = True, sort: bool = True) → _typing.PandasDataFrame¶

Return the assignments as a pandas DataFrame

The DataFrame contains four columns, with one row for each input fingerprint:

cluster is the cluster index

id is the identifier from the input matrix

type is a string like “CENTER” or “MEMBER”

score is the Tanimoto score

Use columns to specify different column labels.

By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.

By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.

If sort is True (the default) then the output is sorted by cluster index and type (CENTER will be first).
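The renumbering and sorting described above can be sketched in plain Python, without pandas. The rows are made-up sample data, `renumber_and_sort` is a hypothetical helper, and the sketch assumes the new numbers follow the order of the internal indices, which the text does not specify.

```python
def renumber_and_sort(rows):
    """rows is a list of (internal_cluster_index, id, type) tuples."""
    # Map each internal index that is actually used to 1..N.
    used = sorted({cluster for cluster, _, _ in rows})
    new_index = {old: new for new, old in enumerate(used, start=1)}
    renumbered = [(new_index[c], id_, type_) for c, id_, type_ in rows]
    # Sort by cluster index; False sorts before True, so CENTER comes first.
    return sorted(renumbered, key=lambda row: (row[0], row[2] != "CENTER"))


# Made-up assignments: internal indices 1 and 2 are empty and get skipped.
rows = [
    (3, "C", "MEMBER"),
    (0, "A", "CENTER"),
    (3, "B", "CENTER"),
    (0, "D", "MEMBER"),
]

table = renumber_and_sort(rows)
for row in table:
    print(row)
```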

Parameters:

columns (a list of four strings) – column names for the returned DataFrame
rename (bool) – if True, use simplified assignment names
renumber (bool) – if False use the internal cluster ids
sort (bool) – if True, sort the output data frame

Returns:

a pandas DataFrame