chemfp.clustering module

This module contains parts of the public Butina API.

It is unlikely that you will import this module directly.

Instead, use the high-level chemfp.butina() function for Butina clustering.

The clustering results include objects defined in this module.

If you do want to use the low-level API, use get_butina_clusterer() to create a ButinaClusterer, and use ButinaAssignmentType to decode the assignment_type values.

This module also implements the post-processing step to prune the number of clusters, but that is not part of the public API.

class chemfp.clustering.ButinaAssignment

Bases: Structure

A view of the Butina assignment.

The assignment_type is an internal integer value. Use ButinaAssignmentType to convert it to a standard label.

The cluster_idx is the index of the cluster to which this fingerprint has been assigned.

The score is the fingerprint’s Tanimoto score, which depends on the Butina parameter used.

Do not modify its values.

assignment_type

Structure/Union member

cluster_idx

Structure/Union member

score

Structure/Union member

class chemfp.clustering.ButinaAssignmentType

Bases: object

Provide a mapping from C integer values to assignment names.

This is used to map the ButinaAssignment.assignment_type to a string label. The name_table gives the direct names while the rename_table is used in the to_pandas() calls when rename is True.

CENTER = 1
FALSE_SINGLETON = 4
MEMBER = 2
MERGED = 3
MOVED_FALSE_SINGLETON = 5
UNASSIGNED = 0
static get_name(assignment_type) → str
name_table = {0: 'UNASSIGNED', 1: 'CENTER', 2: 'MEMBER', 3: 'MERGED', 4: 'FALSE_SINGLETON', 5: 'MOVED_FALSE_SINGLETON'}
rename_table = {0: 'UNASSIGNED', 1: 'CENTER', 2: 'MEMBER', 3: 'MEMBER', 4: 'CENTER', 5: 'MEMBER'}
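
The two tables can be read as plain dictionary lookups. The sketch below copies the documented table values into local dictionaries (it does not import chemfp) and uses a hypothetical helper, label_for, to show how the simplified names collapse the five assigned states into “CENTER” and “MEMBER”.

```python
# Local copies of the documented tables; chemfp itself is not imported here.
NAME_TABLE = {0: "UNASSIGNED", 1: "CENTER", 2: "MEMBER", 3: "MERGED",
              4: "FALSE_SINGLETON", 5: "MOVED_FALSE_SINGLETON"}
RENAME_TABLE = {0: "UNASSIGNED", 1: "CENTER", 2: "MEMBER", 3: "MEMBER",
                4: "CENTER", 5: "MEMBER"}


def label_for(assignment_type, rename=True):
    """Hypothetical helper: map an integer assignment type to a label."""
    table = RENAME_TABLE if rename else NAME_TABLE
    return table[assignment_type]


# Show the full internal name next to the simplified name for each value.
for value in sorted(NAME_TABLE):
    print(value, label_for(value, rename=False), "->", label_for(value))
```

In real code the lookup would presumably go through ButinaAssignmentType.get_name() or the class attributes rather than local copies.
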
class chemfp.clustering.ButinaAssignments

Bases: object

A list-like container of ButinaAssignment elements.

There is one assignment for each fingerprint, in the same order.

as_ctypes()

Return a ctypes view of the underlying assignment data

The view is a ButinaAssignment array with attributes named assignment_type, cluster_idx, and score.
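
As a rough illustration of what an array of such structures looks like, the sketch below defines a stand-in ctypes Structure. The field names come from the documentation above; the class name MirrorAssignment, the C types, and the sample values are assumptions made only for this example and may not match the real layout.

```python
import ctypes


class MirrorAssignment(ctypes.Structure):
    # Hypothetical layout: the field names are the documented ones,
    # but the C types are assumptions made for this sketch.
    _fields_ = [
        ("assignment_type", ctypes.c_int),
        ("cluster_idx", ctypes.c_int),
        ("score", ctypes.c_double),
    ]


# An array of structures, similar in shape to what as_ctypes() describes.
# The values are illustrative only.
assignments = (MirrorAssignment * 3)(
    MirrorAssignment(1, 0, 1.0),
    MirrorAssignment(2, 0, 0.75),
    MirrorAssignment(1, 1, 1.0),
)

# Read-only traversal: collect the positions whose type is 1 (CENTER).
centers = [i for i, a in enumerate(assignments) if a.assignment_type == 1]
print(centers)
```
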

as_numpy()

Return a numpy view of the underlying assignment data

The view has a structured dtype with fields named “assignment_type”, “candidate_idx” and “score”.

to_pandas(*, columns=('assignment_type', 'cluster_idx', 'score'))

Return a pandas DataFrame containing a copy of the underlying data

Parameters:

columns (None, or three strings) – the three column titles

class chemfp.clustering.ButinaCluster

Bases: object

A cluster identified by the Butina method

It has a list-like interface to access the cluster members as an index value into the original fingerprints.

Its cluster_idx is the assigned cluster index.

as_ctypes()

Return a ctypes view of the underlying indices as c_int values

as_numpy()

Return a NumPy view of the underlying indices as int32 values

cluster_idx
get_assignments()

Return a list of assignment values for each member

Use ButinaAssignmentType to convert the value to a label.

members

Get the full list of member indices

A new list is created for each access.

move_members(search_results, debug=False)

Move cluster members based on the SearchResults.

This is used during pruning. It is not yet part of the public API.

to_pandas(*, columns=['id', 'type', 'score'], rename=True, sort=True)

Return the members as a pandas DataFrame

The DataFrame contains three columns, with one row for each member:

  • id is the identifier from the input matrix

  • type is a string like “CENTER” or “MEMBER”

  • score is the Tanimoto score

Use columns to specify different column labels.

By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.

If sort is True (the default) then the output is sorted by assignment type (CENTER will be first).

Parameters:
  • columns (three strings, or None to use the internal values) – the column headers for the data frame

  • rename (bool) – if True, use simplified assignment names

  • sort (bool) – if True, sort by assignment type

Returns:

a pandas DataFrame

class chemfp.clustering.ButinaClusterer

Bases: object

Main processor for Butina clustering

This carries out Butina clustering, where clusters can be processed iteratively, that is, n at a time.

There are two public attributes. clusters is a Python list of ButinaCluster objects, which resizes during clustering.

assignments is a ButinaAssignments with one entry for each fingerprint, in the same order as the fingerprints.

assignments
clusters
get_metadata()

Return a minimal dictionary containing entries for output metadata lines

This low-level clusterer only knows the “software”.

Returns:

a dictionary of key/value pairs

move_false_singletons_to_nearest_center(arena)

Move false singletons to the nearest cluster center

This is not yet part of the public API.

num_remaining

Return the number of unassigned fingerprints

process(timeout=None, debug=False) → int

Perform Butina clustering until done or timeout reached.

Return 1 if done, otherwise 0.

The debug is an internal flag used for debugging.

Parameters:

timeout (None for no timeout, or a non-negative float.) – stop processing after timeout seconds
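
The iterative style described for process() can be sketched as a loop that clusters in short time slices and reports progress in between. The function below relies only on the documented process(timeout=...) return value and num_remaining; the names run_in_slices and FakeClusterer are hypothetical, and FakeClusterer is a stand-in that lets the sketch run without chemfp.

```python
def run_in_slices(clusterer, slice_seconds=0.5, report=print):
    """Call process() repeatedly until it reports completion.

    Returns the number of process() calls that were made.
    """
    calls = 0
    while True:
        calls += 1
        done = clusterer.process(timeout=slice_seconds)
        report(f"call {calls}: {clusterer.num_remaining} fingerprints remaining")
        if done:
            return calls


class FakeClusterer:
    """Hypothetical stand-in that mimics the documented interface."""

    def __init__(self, total, per_call):
        self.num_remaining = total
        self._per_call = per_call

    def process(self, timeout=None, debug=False):
        # Pretend each call assigns a fixed number of fingerprints.
        self.num_remaining = max(0, self.num_remaining - self._per_call)
        return 1 if self.num_remaining == 0 else 0


print(run_in_slices(FakeClusterer(total=10, per_call=4)))
```

With a real ButinaClusterer the same loop could drive a progress bar or check for cancellation between slices.
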

prune_clusters(num_clusters, arena, callback=None, debug=False)

Prune the clusters down to a given count.

This is not yet part of the public API.

rescore_moved(arena, include_moved_false_singleton=1)

Assign scores to the moved members

This is not yet part of the public API.

save(destination=None, *, format=None, renumber=True, rename=True, include_members=True, metadata=None, include_metadata=True, precision=None)

Save the clusters to destination in one of several formats.

The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If unspecified, infer the format from the destination filename extension. If the extension is not known, use “centroid”.

If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.

If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.

If include_members is True (the default) then include cluster members in the output.

If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately. (No tab, NUL, or newline character, and the key must not contain an equals sign.)

If include_metadata is True (the default) then include metadata information in the output file.

If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.

Parameters:
  • destination (None for stdout, a filename, or file object) – the output destination

  • format (None or str) – the output format

  • renumber (bool) – if True, renumber clusters from 1

  • rename (bool) – if True, use simplified assignment names

  • include_members (bool) – If False, only use cluster centers rather than all members

  • metadata (None or a dict) – extra fields for the output metadata

  • include_metadata (bool) – if True, include metadata in the output

  • precision (None or int) – the number of digits used to format the score
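
The encoding rules for metadata described for save() can be checked before calling it. The helper below, check_metadata, is hypothetical and not part of chemfp. It assumes string keys and values, and treats only "\n" as the newline character.

```python
def check_metadata(metadata):
    """Hypothetical pre-flight check for a metadata dictionary.

    Raises ValueError if a key or value contains a tab, NUL, or newline,
    or if a key contains an equals sign. Returns the dictionary unchanged.
    """
    forbidden = ("\t", "\0", "\n")
    for key, value in metadata.items():
        for text, what in ((key, "key"), (value, "value")):
            for ch in forbidden:
                if ch in text:
                    raise ValueError(f"metadata {what} {text!r} contains {ch!r}")
        # The equals-sign restriction applies to keys only.
        if "=" in key:
            raise ValueError(f"metadata key {key!r} contains '='")
    return metadata


print(check_metadata({"project": "demo", "note": "threshold=0.7"}))
```
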

to_pandas(*, columns=['cluster', 'id', 'type', 'score'], rename=True, renumber=True, sort=True)

Return the assignments as a pandas DataFrame

The DataFrame contains four columns, with one row for each input fingerprint:

  • cluster is the cluster index

  • id is the identifier from the input matrix

  • type is a string like “CENTER” or “MEMBER”

  • score is the Tanimoto score

Use columns to specify different column labels.

By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.

By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.

If sort is True (the default) then the output is sorted by cluster index and type (CENTER will be first).

Parameters:
  • columns (a list of four strings) – column names for the returned DataFrame

  • rename (bool) – if True, use simplified assignment names

  • renumber (bool) – if False use the internal cluster ids

  • sort (bool) – if True, sort the output data frame

Returns:

a pandas DataFrame
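
The renumbering described for renumber=True can be pictured as a mapping from the internal indices that are still in use to the contiguous values 1..N. The function renumber_clusters below is a hypothetical illustration. It assumes the new numbers follow ascending internal index order, which the documentation does not state.

```python
def renumber_clusters(internal_indices):
    """Map the internal cluster indices in use to contiguous values 1..N.

    Assumption for this sketch: new numbers follow ascending internal order.
    """
    used = sorted(set(internal_indices))
    return {old: new for new, old in enumerate(used, start=1)}


# Internal indices 1, 3 and 4 are unused, for example because those
# clusters were emptied when their members moved elsewhere.
mapping = renumber_clusters([0, 0, 2, 5, 2])
print(mapping)
```
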

class chemfp.clustering.ButinaRanking

Bases: Structure

An internal object used to order the Butina clusters

The largest clusters come first, with ties broken based on the index (if tiebreaker is “first” or “last”) or at random (if tiebreaker is “randomize”).

This is not part of the public API. Do not modify these values.

entry_index

Structure/Union member

num_threshold_hits

Structure/Union member
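
The ordering described for ButinaRanking can be sketched in pure Python. This is an illustration of the ordering rule only, not the internal implementation. The function order_rankings is hypothetical, and the tiebreaker names follow the get_butina_clusterer() signature.

```python
import random


def order_rankings(num_hits, tiebreaker="first", seed=None):
    """Return entry indices ordered by decreasing hit count.

    num_hits[i] is the number of threshold hits for entry i. Ties are
    broken by lowest index ("first"), highest index ("last"), or a
    random key ("randomize").
    """
    indices = range(len(num_hits))
    if tiebreaker == "first":
        key = lambda i: (-num_hits[i], i)
    elif tiebreaker == "last":
        key = lambda i: (-num_hits[i], -i)
    elif tiebreaker == "randomize":
        rng = random.Random(seed)
        noise = [rng.random() for _ in indices]
        key = lambda i: (-num_hits[i], noise[i])
    else:
        raise ValueError(f"unknown tiebreaker: {tiebreaker!r}")
    return sorted(indices, key=key)


hits = [3, 5, 3, 1]  # entries 0 and 2 are tied
print(order_rankings(hits, "first"))
print(order_rankings(hits, "last"))
```
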

class chemfp.clustering.ButinaRankings

Bases: object

A list-like container of ButinaRanking elements

There is one ranking for each fingerprint.

This isn’t part of the public API. If you find this useful let me know! I mostly use it for debugging.

Do not modify these values.

as_ctypes()

Return a ctypes view of the underlying ranking data.

This returns an array of ButinaRanking.

as_numpy()

Return a numpy view of the underlying ranking data

The view has a structured dtype with fields named “num_threshold_hits” and “entry_idx”.

rank_index

The current position when processing the rankings

chemfp.clustering.get_butina_clusterer(matrix, threshold=0.0, seed=-1, tiebreaker='randomize', follow_neighbor=True)

Initialize and return a ButinaClusterer

Butina clustering is applied to matrix, a SearchResults containing a sparse NxN search matrix, using the given threshold.

If tiebreaker is “randomize” (the default) then the next picked center will be chosen at random from the available picks. (These are ranked by the total number of neighbors.) If “first” or “last” then the first or last neighbor, in arena index order, is picked.

Use seed to initialize the random number generator. If -1 (the default), butina will use Python’s RNG to get the initial seed. Otherwise this must be an integer between 0 and 2**64-1.

If follow_neighbor is True, assign each false singleton to the same centroid as a first nearest neighbor, selected arbitrarily.

Parameters:
  • matrix (a SearchResults) – the NxN sparse comparison matrix

  • threshold (float) – the Butina clustering threshold

  • seed (int) – the RNG seed

  • tiebreaker (str) – the method used to break ties

  • follow_neighbor (bool) – if True, use a reassignment method for false singletons

Returns:

a ButinaClusterer
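
For readers unfamiliar with the method, the general Butina idea (rank by neighbor count, pick the top unassigned entry as a center, claim its unassigned neighbors, repeat) can be sketched in a few lines. This is a minimal pure-Python illustration under simplifying assumptions, not chemfp's implementation: it takes plain neighbor sets instead of a SearchResults, always breaks ties by lowest index, and does no false-singleton reassignment. The name butina_sketch is hypothetical.

```python
def butina_sketch(neighbors):
    """Greedy Butina-style clustering over precomputed neighbor sets.

    neighbors[i] is the set of indices within the similarity threshold
    of entry i (excluding i itself). Returns a list of (center, members)
    pairs in the order the centers were picked.
    """
    # Rank once by total neighbor count, largest first, lowest index on ties.
    order = sorted(range(len(neighbors)), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # Claim only the neighbors that no earlier center has taken.
        members = sorted(n for n in neighbors[center] if n not in assigned)
        assigned.add(center)
        assigned.update(members)
        clusters.append((center, members))
    return clusters


# Edges within threshold: 0-1, 0-2, 1-2, 2-3, 3-4; entry 5 has no neighbors.
example = [{1, 2}, {0, 2}, {0, 1, 3}, {2, 4}, {3}, set()]
print(butina_sketch(example))
```

In this example entry 4 ends up alone even though it has a neighbor (entry 3) within the threshold, because that neighbor was already claimed by the cluster centered on entry 2. That appears to be the situation the follow_neighbor option addresses. Entry 5 is alone because it has no neighbors at all.
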