chemfp.highlevel.clustering module

This module should not be imported directly.

It contains internal implementation details of the high-level API available from the top-level chemfp module.

This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.

class chemfp.highlevel.clustering.ButinaClusters(arena, matrix, seed, NxN_threshold, butina_threshold, tiebreaker, false_singletons, num_butina_clusters, rescore, clusterer, result, times, _arena_close, fingerprints_filename, matrix_filename)

Bases: object

The result from chemfp.butina(), with query details, search results, and timing information.

The available properties are:

arena: FingerprintArena

The FingerprintArena, based on the input fingerprints.

matrix: SearchResults

The SearchResults NxN sparse search results, based on the input matrix or generated from the fingerprints.

seed: int

The seed for the RNG.

NxN_threshold: float

The NxN similarity threshold used to generate the matrix from the input fingerprints.

butina_threshold: float

The minimum similarity threshold used for the Butina algorithm.

tiebreaker: str

The specified tiebreaker method.

false_singletons: str

The specified method for handling false singletons.

num_butina_clusters: int

The specified maximum number of clusters, or None.

rescore: bool

The flag value used to request that reassigned fingerprints be re-scored.

clusterer: ButinaClusterer

The underlying ButinaClusterer object.

result: list[ButinaCluster]

The list of ButinaCluster clusters.

times: dict[str, float | None]

A dictionary with a breakdown of the times for the search. Each entry either has the elapsed time in seconds, or None if it wasn’t relevant. The keys are:

  • “load_arena” - the time to load the fingerprint arena

  • “load_matrix” - the time to load the matrix

  • “load” - the total load time

  • “NxN” - the time to compute the NxN matrix from the fingerprints

  • “cluster” - the time to cluster

  • “prune” - the time to prune the clustering to the specified number of clusters

  • “rescore” - the time to rescore reassigned elements

  • “total” - the total time for the chemfp.butina() call.
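Because any entry may be None, code that reports these timings has to skip the steps that did not apply. The following is a minimal sketch of that handling. The dictionary is built by hand with illustrative values (it is not output from a real run), and format_times is a hypothetical helper, not part of chemfp.

```python
# Illustrative values only. A real dictionary would come from the `times`
# attribute of the object returned by chemfp.butina().
times = {
    "load_arena": 0.12,
    "load_matrix": None,   # None: no matrix was loaded in this made-up run
    "load": 0.12,
    "NxN": 3.40,
    "cluster": 0.25,
    "prune": None,         # None: no cluster limit was requested
    "rescore": None,       # None: rescoring was not requested
    "total": 3.80,
}


def format_times(times):
    """Return one 'name: seconds' line for each step that was timed."""
    lines = []
    for name, elapsed in times.items():
        if elapsed is None:
            continue  # skip steps that were not relevant
        lines.append(f"{name}: {elapsed:.2f} s")
    return "\n".join(lines)


report = format_times(times)
print(report)
```

For a ready-made summary, the get_times_description() method documented below returns a human-readable breakdown directly.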

fingerprints_filename: str, bytes, or Path

The value of the input fingerprints, if it was a filename.

matrix_filename: str, bytes, or Path

The value of the input matrix, if it was a filename.

property all_clusters

The full list of ButinaCluster clusters

The clusters are ordered by cluster index and may include empty clusters, due to moving false singletons or pruning the number of clusters.

as_ctypes() _typing.Sequence[_typing.ButinaAssignment]

Return the cluster assignments as a ctypes array.

This returns a ButinaAssignment ctypes array with one entry per input fingerprint.

as_numpy() _typing.NumPyArray

Return the cluster assignments as a NumPy array.

property assignments: _typing.ButinaAssignments

Return the assignments as a ButinaAssignments object.

close()

Release any assigned resources, like a memory-mapped FPB arena

property clusters

The final list of clusters.

This list is ordered from largest to smallest.

get_description(include_times: bool = False) str

Return a human-readable description of the Butina clustering

get_metadata() Dict[str, str]

Return a dictionary containing entries for output metadata lines

get_times_description() str

Return a human-readable break-down of the Butina compute times

get_type()

Get the ‘type’ string describing the Butina search parameters

save(destination: str | bytes | Path | None | BinaryIO = None, *, format: str | None = None, renumber: bool = True, rename: bool = True, include_members: bool = True, metadata: Dict[str, str] | None = None, include_metadata: bool = True, precision: Literal[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | None = None)

Save the clusters to destination in one of several formats.

The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If unspecified, infer the format from the destination filename extension. If the extension is not known, use “centroid”.
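The fallback rule can be sketched in a few lines. This is only an illustration of the behavior described above, not chemfp's implementation: guess_format is a hypothetical helper, the extension table is an assumption (this section does not list the extensions chemfp recognizes), and treating a destination of None as “centroid” is also an assumption.

```python
import os

# Assumed extension table, for illustration only.
ASSUMED_EXTENSIONS = {
    ".centroid": "centroid",
    ".flat": "flat",
    ".csv": "csv",
    ".tsv": "tsv",
}


def guess_format(destination, format=None):
    """Pick a format name for a str or Path destination (or None)."""
    if format is not None:
        return format      # an explicit format always wins
    if destination is None:
        return "centroid"  # assumption: nothing to inspect, use the fallback
    ext = os.path.splitext(os.fspath(destination))[1].lower()
    # Unknown extensions fall back to "centroid", as described above.
    return ASSUMED_EXTENSIONS.get(ext, "centroid")


print(guess_format("clusters.csv"))
print(guess_format("clusters.txt"))
```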

If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.

If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.

If include_members is True (the default) then include cluster members in the output.

If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately. (No tab, NUL, or newline characters, and the key must not contain an equals sign.)
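A caller may want to check a metadata dictionary against these rules before saving. The sketch below applies only the constraints stated above. check_metadata is a hypothetical helper, and how chemfp itself reports invalid metadata is not described in this section.

```python
# Characters that must not appear in any metadata key or value.
FORBIDDEN = ("\t", "\0", "\n")


def check_metadata(metadata):
    """Raise ValueError if a key or value breaks the stated rules."""
    for key, value in metadata.items():
        for label, text in (("key", key), ("value", value)):
            for ch in FORBIDDEN:
                if ch in text:
                    raise ValueError(
                        f"metadata {label} {text!r} contains {ch!r}")
        # Only the key is barred from containing an equals sign.
        if "=" in key:
            raise ValueError(
                f"metadata key {key!r} contains an equals sign")


# Passes: an equals sign is allowed in a value, just not in a key.
check_metadata({"project": "demo", "note": "threshold=0.7"})
```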

If include_metadata is True (the default) then include metadata information in the output file.

If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.

Parameters:
  • destination (None for stdout, a filename, or file object) – the output destination

  • format (None or str) – the output format

  • renumber (bool) – if True, renumber clusters from 1

  • rename (bool) – if True, use simplified assignment names

  • include_members (bool) – If False, only use cluster centers rather than all members

  • metadata (None or a dict) – extra fields for the output metadata

  • include_metadata (bool) – if True, include metadata in the output

  • precision (None or int) – the number of digits used to format the score

to_pandas(*, columns: _typing.Sequence = ['cluster', 'id', 'type', 'score'], rename: bool = True, renumber: bool = True, sort: bool = True) _typing.PandasDataFrame

Return the assignments as a pandas DataFrame

The DataFrame contains four columns, with one row for each input fingerprint:

  • cluster is the cluster index

  • id is the identifier from the input matrix

  • type is a string like “CENTER” or “MEMBER”

  • score is the Tanimoto score

Use columns to specify different column labels.

By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.

By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.
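The relationship between internal indices and renumbered ids can be sketched as follows. renumber_clusters is a hypothetical helper, and the cluster sizes are made up. The order in which chemfp hands out the new ids is not stated here, so this sketch simply keeps the internal index order.

```python
def renumber_clusters(cluster_sizes):
    """Map internal cluster indices to contiguous ids starting at 1.

    cluster_sizes[i] is the number of elements in internal cluster i.
    Empty clusters receive no new id.
    """
    mapping = {}
    next_id = 1
    for internal_index, size in enumerate(cluster_sizes):
        if size == 0:
            continue  # elements were moved elsewhere; skip this index
        mapping[internal_index] = next_id
        next_id += 1
    return mapping


# Made-up sizes: internal clusters 1 and 3 are empty.
sizes = [4, 0, 2, 0, 1]
mapping = renumber_clusters(sizes)
print(mapping)  # {0: 1, 2: 2, 4: 3}
```

With renumber set to False, the same data would keep the internal indices 0, 2, and 4.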

If sort is True (the default) then the output is sorted by cluster index and type (CENTER will be first).
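The ordering can be illustrated without pandas by sorting plain tuples on the cluster index, then on whether the row is the center. The rows below are made-up values, sort_rows is a hypothetical helper, and the order of members within a cluster is an assumption (this sketch keeps their input order).

```python
# Made-up (cluster, id, type, score) rows in arbitrary order.
rows = [
    (2, "id5", "MEMBER", 0.81),
    (1, "id2", "MEMBER", 0.90),
    (2, "id4", "CENTER", 1.00),
    (1, "id1", "CENTER", 1.00),
    (1, "id3", "MEMBER", 0.75),
]


def sort_rows(rows):
    """Sort by cluster index, placing each cluster's CENTER row first."""
    # False sorts before True, so CENTER rows lead their cluster.
    # sorted() is stable, so members keep their input order.
    return sorted(rows, key=lambda row: (row[0], row[2] != "CENTER"))


sorted_rows = sort_rows(rows)
for row in sorted_rows:
    print(row)
```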

Parameters:
  • columns (a list of four strings) – column names for the returned DataFrame

  • rename (bool) – if True, use simplified assignment names

  • renumber (bool) – if False use the internal cluster ids

  • sort (bool) – if True, sort the output data frame

Returns:

a pandas DataFrame