chemfp.clustering module¶
This module contains parts of the public Butina API.
It is unlikely that you will import this module directly. Instead, use the high-level chemfp.butina() function for Butina clustering. The clustering results include objects defined in this module.
If you do want to use the low-level API, use get_butina_clusterer() to create a ButinaClusterer, and use ButinaAssignmentType to decode the assignment_type values.
This module also implements the post-processing step to prune the number of clusters, but that is not part of the public API.
- class chemfp.clustering.ButinaAssignment¶
Bases:
Structure
A view of the Butina assignment.
The assignment_type is an internal integer value. Use ButinaAssignmentType to convert it to a standard label.
The cluster_idx is the index of the cluster to which this fingerprint has been assigned.
The score is the fingerprint’s Tanimoto score, which depends on the Butina parameter used.
Do not modify its values.
- assignment_type¶
Structure/Union member
- cluster_idx¶
Structure/Union member
- score¶
Structure/Union member
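As an illustration of what a ctypes Structure with these three members looks like, here is a minimal stand-in. The class name AssignmentSketch, the C field types (c_int, c_int, c_double), and the sample values are assumptions made for this sketch; the field types chemfp actually uses are not stated here.

```python
import ctypes


class AssignmentSketch(ctypes.Structure):
    # Hypothetical layout: the field names follow the documentation above,
    # but the C types are assumed for illustration only.
    _fields_ = [
        ("assignment_type", ctypes.c_int),
        ("cluster_idx", ctypes.c_int),
        ("score", ctypes.c_double),
    ]


# An array of structures, one record per fingerprint (made-up values).
records = (AssignmentSketch * 3)(
    AssignmentSketch(1, 0, 1.0),
    AssignmentSketch(2, 0, 0.75),
    AssignmentSketch(1, 1, 1.0),
)

# Read the fields without modifying them: find the records whose
# assignment_type is 1 (CENTER in the ButinaAssignmentType table).
centers = [i for i, rec in enumerate(records) if rec.assignment_type == 1]
```

The sketch only reads the fields, in line with the advice not to modify the values.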
- class chemfp.clustering.ButinaAssignmentType¶
Bases:
object
Provide a mapping from C integer values to assignment names
This is used to map the ButinaAssignment.assignment_type to a string label. The name_table gives the direct names while the rename_table is used in the to_pandas() calls when rename is True.
- CENTER = 1¶
- FALSE_SINGLETON = 4¶
- MEMBER = 2¶
- MERGED = 3¶
- MOVED_FALSE_SINGLETON = 5¶
- UNASSIGNED = 0¶
- static get_name(assignment_type) → str¶
- name_table = {0: 'UNASSIGNED', 1: 'CENTER', 2: 'MEMBER', 3: 'MERGED', 4: 'FALSE_SINGLETON', 5: 'MOVED_FALSE_SINGLETON'}¶
- rename_table = {0: 'UNASSIGNED', 1: 'CENTER', 2: 'MEMBER', 3: 'MEMBER', 4: 'CENTER', 5: 'MEMBER'}¶
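The difference between the two tables can be seen with a small lookup. This sketch copies the two dictionaries shown above into local constants rather than importing chemfp; the helper name label() is hypothetical and is not part of the chemfp API.

```python
# Copies of the two tables documented above.
NAME_TABLE = {0: "UNASSIGNED", 1: "CENTER", 2: "MEMBER", 3: "MERGED",
              4: "FALSE_SINGLETON", 5: "MOVED_FALSE_SINGLETON"}
RENAME_TABLE = {0: "UNASSIGNED", 1: "CENTER", 2: "MEMBER", 3: "MEMBER",
                4: "CENTER", 5: "MEMBER"}


def label(assignment_type, rename=True):
    """Hypothetical helper: map an integer code to a string label."""
    table = RENAME_TABLE if rename else NAME_TABLE
    return table[assignment_type]


codes = [1, 2, 3, 4, 5]
full_labels = [label(code, rename=False) for code in codes]
simple_labels = [label(code) for code in codes]
```

With rename enabled, MERGED and MOVED_FALSE_SINGLETON collapse to "MEMBER", and FALSE_SINGLETON collapses to "CENTER".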
- class chemfp.clustering.ButinaAssignments¶
Bases:
object
A list-like container of ButinaAssignment elements.
There is one assignment for each fingerprint, in the same order.
- as_ctypes()¶
Return a ctypes view of the underlying assignment data
The view is a ButinaAssignment array with attributes named assignment_type, cluster_idx, and score.
- as_numpy()¶
Return a numpy view of the underlying assignment data
The view has a structured dtype with fields named “assignment_type”, “candidate_idx” and “score”.
- to_pandas(*, columns=('assignment_type', 'cluster_idx', 'score'))¶
Return a pandas DataFrame containing a copy of the underlying data
- Parameters:
columns (None, or three strings) – the three column titles
- class chemfp.clustering.ButinaCluster¶
Bases:
object
A cluster identified by the Butina method
It has a list-like interface to access the cluster members as an index value into the original fingerprints.
Its cluster_idx is the assigned cluster index.
- as_ctypes()¶
Return a ctypes view of the underlying indices as c_int values
- as_numpy()¶
Return a NumPy view of the underlying indices as int32 values
- cluster_idx¶
- get_assignments()¶
Return a list of assignment values for each member
Use ButinaAssignmentType to convert the value to a label.
- members¶
Get the full list of member indices
A new list is created for each access.
- move_members(search_results, debug=False)¶
Move cluster members based on the SearchResults.
This is used during pruning. It is not yet part of the public API.
- to_pandas(*, columns=['id', 'type', 'score'], rename=True, sort=True)¶
Return the members as a pandas DataFrame
The DataFrame contains three columns, with one row for each member:
id is the identifier from the input matrix
type is a string like “CENTER” or “MEMBER”
score is the Tanimoto score
Use columns to specify different column labels.
By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.
If sort is True (the default) then the output is sorted by assignment type (CENTER will be first).
- Parameters:
columns (three strings, or None to use the internal values) – the column headers for the data frame
rename (bool) – if True, use simplified assignment names
sort (bool) – if True, sort by assignment type
- Returns:
a pandas DataFrame
- class chemfp.clustering.ButinaClusterer¶
Bases:
object
Main processor for Butina clustering
This carries out Butina clustering, where clusters can be processed iteratively, that is, n at a time.
There are two public attributes. clusters is a Python list of ButinaCluster objects, which resizes during clustering. assignments is a ButinaAssignments, in parallel to the fingerprints.
- assignments¶
- clusters¶
- get_metadata()¶
Return a minimal dictionary containing entries for output metadata lines
This low-level clusterer only knows the “software” entry.
- Returns:
a dictionary of key/value pairs
- move_false_singletons_to_nearest_center(arena)¶
Move false singletons to the nearest cluster center
This is not yet part of the public API.
- num_remaining¶
Return the number of unassigned fingerprints
- process(timeout=None, debug=False) → int¶
Perform Butina clustering until done or timeout reached.
Return 1 if done, otherwise 0.
The debug is an internal flag used for debugging.
- Parameters:
timeout (None for no timeout, or a non-negative float.) – stop processing after timeout seconds
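Because process() returns 0 when the timeout is reached before clustering finishes, it can be called repeatedly until it returns 1. The loop below sketches that pattern. StandInClusterer is a hypothetical stand-in written for this example so that the loop runs without chemfp; it only mimics the documented return values and the num_remaining attribute, and the amount of work done per call is invented.

```python
class StandInClusterer:
    """Hypothetical stand-in that mimics the documented process() contract."""

    def __init__(self, num_fingerprints, per_call=3):
        self.num_remaining = num_fingerprints
        self._per_call = per_call

    def process(self, timeout=None, debug=False):
        # With no timeout, finish everything in one call. Otherwise pretend
        # the timeout only allows a fixed amount of work per call.
        if timeout is None:
            step = self.num_remaining
        else:
            step = min(self._per_call, self.num_remaining)
        self.num_remaining -= step
        return 1 if self.num_remaining == 0 else 0


def run_in_slices(clusterer, timeout):
    """Call process() until it reports completion; return the call count."""
    calls = 0
    done = 0
    while not done:
        done = clusterer.process(timeout=timeout)
        calls += 1
        # A progress report based on clusterer.num_remaining could go here.
    return calls


calls_needed = run_in_slices(StandInClusterer(7), timeout=0.1)
```

Processing in slices like this leaves room between calls to report progress or to check for cancellation.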
- prune_clusters(num_clusters, arena, callback=None, debug=False)¶
Prune the clusters down to a given count.
This is not yet part of the public API.
- rescore_moved(arena, include_moved_false_singleton=1)¶
Assign scores to the moved members
This is not yet part of the public API.
- save(destination=None, *, format=None, renumber=True, rename=True, include_members=True, metadata=None, include_metadata=True, precision=None)¶
Save the clusters to destination in one of several formats.
The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If unspecified, infer the format from the destination filename extension. If the extension is not known, use “centroid”.
If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.
If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.
If include_members is True (the default) then include cluster members in the output.
If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately. (No tab, NUL, or newline character, and the key must not contain an equals sign.)
If include_metadata is True (the default) then include metadata information in the output file.
If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.
- Parameters:
destination (None for stdout, a filename, or file object) – the output destination
format (None or str) – the output format
renumber (bool) – if True, renumber clusters from 1
rename (bool) – if True, use simplified assignment names
include_members (bool) – If False, only use cluster centers rather than all members
metadata (None or a dict) – extra fields for the output metadata
include_metadata (bool) – if True, include metadata in the output
precision (None or int) – the number of digits used to format the score
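The encoding rules for the metadata dictionary can be checked before calling save(). The following is a minimal sketch of such a check, assuming the keys and values are Python strings; the function name metadata_problems() is hypothetical and is not part of chemfp, and chemfp may apply its own validation.

```python
# Characters the documentation says must not appear in keys or values.
FORBIDDEN_CHARS = {"\t": "tab", "\0": "NUL", "\n": "newline"}


def metadata_problems(metadata):
    """Return a list of rule violations; an empty list means no problems found."""
    problems = []
    for key, value in metadata.items():
        for role, text in (("key", key), ("value", value)):
            for char, name in FORBIDDEN_CHARS.items():
                if char in text:
                    problems.append(f"{role} {text!r} contains a {name} character")
        if "=" in key:
            problems.append(f"key {key!r} contains an equals sign")
    return problems


good = metadata_problems({"software": "example/1.0", "note": "threshold=0.7"})
bad = metadata_problems({"a=b": "two\nlines"})
```

An equals sign is allowed in a value but not in a key, which is why the first dictionary passes.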
- to_pandas(*, columns=['cluster', 'id', 'type', 'score'], rename=True, renumber=True, sort=True)¶
Return the assignments as a pandas DataFrame
The DataFrame contains four columns, with one row for each input fingerprint:
cluster is the cluster index
id is the identifier from the input matrix
type is a string like “CENTER” or “MEMBER”
score is the Tanimoto score
Use columns to specify different column labels.
By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.
By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.
If sort is True (the default) then the output is sorted by cluster index and type (CENTER will be first).
- Parameters:
columns (a list of four strings) – column names for the returned DataFrame
rename (bool) – if True, use simplified assignment names
renumber (bool) – if False use the internal cluster ids
sort (bool) – if True, sort the output data frame
- Returns:
a pandas DataFrame
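The effect of renumber and sort can be imitated on plain tuples. This sketch is not how chemfp builds the DataFrame; the function name is hypothetical, the rows are made up, and it assumes that renumbering preserves the relative order of the internal cluster indices, which the text above does not state.

```python
def renumber_and_sort(rows):
    """rows is a list of (internal_cluster_idx, id, type) tuples."""
    # Map the internal indices that are actually in use (possibly with gaps)
    # to contiguous values 1..N. Preserving their order is an assumption.
    used = sorted({cluster for cluster, _, _ in rows})
    new_number = {old: new for new, old in enumerate(used, start=1)}

    # Sort by cluster number, then put the CENTER ahead of the MEMBERs.
    type_order = {"CENTER": 0, "MEMBER": 1}
    renumbered = [(new_number[cluster], id_, type_) for cluster, id_, type_ in rows]
    return sorted(renumbered, key=lambda row: (row[0], type_order[row[2]]))


rows = [
    (3, "mol4", "MEMBER"),
    (0, "mol1", "CENTER"),
    (3, "mol3", "CENTER"),
    (0, "mol2", "MEMBER"),
]
table = renumber_and_sort(rows)
```

Here the internal indices 0 and 3 become clusters 1 and 2, and each cluster lists its center first.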
- class chemfp.clustering.ButinaRanking¶
Bases:
Structure
An internal object used to order the Butina clusters
The largest clusters come first, with ties broken based on the index (if tiebreaker is “first” or “last”) or at random (if tiebreaker is “randomize”).
This is not part of the public API. Do not modify these values.
- entry_index¶
Structure/Union member
- num_threshold_hits¶
Structure/Union member
- class chemfp.clustering.ButinaRankings¶
Bases:
object
A list-like container of ButinaRanking elements.
There is one ranking for each fingerprint.
This isn’t part of the public API. If you find this useful let me know! I mostly use it for debugging.
Do not modify these values.
- as_ctypes()¶
Return a ctypes view of the underlying assignment data.
This returns an array of ButinaRanking.
- as_numpy()¶
Return a numpy view of the underlying assignment data
The view has a structured dtype with fields named “num_threshold_hits” and “entry_idx”.
- rank_index¶
The current position when processing the rankings
- chemfp.clustering.get_butina_clusterer(matrix, threshold=0.0, seed=-1, tiebreaker='randomize', follow_neighbor=True)¶
Initialize and return a ButinaClusterer
The Butina clustering will cluster the matrix, a SearchResults containing a sparse NxN search matrix, with the given threshold.
If tiebreaker is “randomize” (the default) then the next picked center will be chosen at random from the available picks. (These are ranked by the total number of neighbors.) If “first” or “last” then the first or last neighbor, in arena index order, is picked.
Use seed to initialize the random number generator. If -1 (the default), butina will use Python’s RNG to get the initial seed. Otherwise this must be an integer between 0 and 2**64-1.
If follow_neighbor is True, assign each false singleton to the same centroid as a first nearest neighbor, selected arbitrarily.
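To make the role of the tiebreaker concrete, here is a simplified, pure-Python sketch of Butina-style center selection over a neighbor list. It is not chemfp's implementation: the function name and the input format (one set of neighbor indices per fingerprint) are assumptions, and it omits the “randomize” tiebreaker, the seed, and the follow_neighbor reassignment.

```python
def butina_sketch(neighbors, tiebreaker="first"):
    """neighbors[i] is the set of indices within the threshold of i.

    Returns a list of (center, members) pairs in pick order.
    """
    n = len(neighbors)
    # Rank every fingerprint once by its total neighbor count, largest
    # first. Ties are broken by index order, ascending or descending.
    if tiebreaker == "first":
        order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    elif tiebreaker == "last":
        order = sorted(range(n), key=lambda i: (-len(neighbors[i]), -i))
    else:
        raise ValueError("this sketch only supports 'first' and 'last'")

    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue  # already a member of an earlier cluster
        # The new cluster takes the center's still-unassigned neighbors.
        members = sorted(j for j in neighbors[center] if j not in assigned)
        assigned.add(center)
        assigned.update(members)
        clusters.append((center, members))
    return clusters


# Four fingerprints in a chain: 0-1, 1-2, 2-3. Indices 1 and 2 tie with
# two neighbors each, so the tiebreaker decides which becomes a center.
chain = [{1}, {0, 2}, {1, 3}, {2}]
first_pick = butina_sketch(chain, "first")
last_pick = butina_sketch(chain, "last")
```

With “first”, index 1 becomes the first center and index 3 ends up alone even though it has a neighbor within the threshold. That leftover appears to be the kind of case the text calls a false singleton, which follow_neighbor would reassign.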
- Parameters:
matrix (a SearchResults) – the NxN sparse comparison matrix
threshold (float) – the Butina clustering threshold
seed (int) – the RNG seed
tiebreaker (str) – the method used to break ties
follow_neighbor (bool) – if True, use a reassignment method for false singletons
- Returns: