chemfp.diversity module¶
This module contains interfaces to chemfp’s diversity selection algorithms.
Terminology¶
The selection algorithms use different concepts of “dissimilar” to iteratively pick one or more dissimilar fingerprints from an arena containing candidate fingerprints.
The picked fingerprints are dissimilar to all other candidate fingerprints, and optionally also dissimilar to fingerprints in an arena of “reference” fingerprints.
This latter case may be used to select diverse fingerprints from a vendor catalog (“the candidates”) which are also dissimilar to an in-house compound library (“the references”).
To create a given picker, use one of the get_*_picker functions or, alternatively, one of the picker class’s from_* methods. Do not call the class constructor directly.
Each picker implements a pick_n() method, along with some variations, to pick an additional n items. They also implement several iter_*() methods to iteratively get the next pick.
MaxMin picker¶
The MaxMinPicker
implements the MaxMin algorithm[1][2]. This
algorithm iteratively picks fingerprints from a set of candidates such
that the newly picked fingerprint has the smallest Tanimoto similarity
compared to any previously picked fingerprint, and optionally also the
smallest Tanimoto similarity to the reference fingerprints.
The MaxMin diversity score for a given pick is the maximum Tanimoto score between that pick and all previous picks and the reference arena. If there is no reference arena then the diversity score of the first pick is 0.0.
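The MaxMin description above can be sketched in plain Python. This is only an illustration of the selection rule, not chemfp's implementation: fingerprints are modeled as Python integers used as bitsets, and the helper names (popcount, tanimoto, maxmin_picks) are hypothetical. The first pick is passed in explicitly, and ties are broken by the smallest index, which is an arbitrary choice made for this sketch.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def maxmin_picks(candidates, first_pick, n, references=()):
    """Return up to n (index, diversity score) pairs in pick order.

    The diversity score of a pick is its maximum Tanimoto similarity
    to the references and to all earlier picks.
    """
    # scores[i] tracks the highest similarity seen so far for candidate i.
    scores = [
        max((tanimoto(fp, ref) for ref in references), default=0.0)
        for fp in candidates
    ]
    remaining = set(range(len(candidates)))
    picks = []
    pick = first_pick
    while len(picks) < n and remaining:
        picks.append((pick, scores[pick]))
        remaining.discard(pick)
        # Update each remaining candidate with its similarity to the new pick.
        for i in remaining:
            scores[i] = max(scores[i], tanimoto(candidates[i], candidates[pick]))
        if remaining:
            # Next pick: smallest maximum similarity (ties: smallest index).
            pick = min(remaining, key=lambda i: (scores[i], i))
    return picks


fps = [0b00001111, 0b00011110, 0b11110000, 0b00111100]
picks = maxmin_picks(fps, first_pick=0, n=3)
print([index for index, score in picks])  # [0, 2, 3]
```

With no references the first pick has a score of 0.0, matching the definition above. The sketch is quadratic in the worst case; it is meant to show the rule, not to be fast.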
HeapSweep picker¶
The HeapSweepPicker
implements a sweep-based algorithm to
pick fingerprints based on their maximum Tanimoto similarity to any
other fingerprint in the arena, from least maximum similarity to
most. This method uses a heap to track the current highest-known score
for each fingerprint. Each sweep compares a fingerprint with the
smallest score to all other fingerprints, while also updating the
highest-known score for each other fingerprint.
The heapsweep algorithm is used to find the initial pick for the MaxMin picker if reference fingerprints or an initial pick are not specified. This algorithm is significantly slower than MaxMin (over 100-fold!), and is mostly here to find all initial picks with the same minimum maximum score. While it can be used to find the diversity score for all fingerprints, a k=1 NxN nearest-neighbor search will be faster and can make use of multiple cores.
The heapsweep diversity score for a given pick is the maximum Tanimoto score between that pick and all other fingerprints in the arena.
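As a rough illustration of the sweep idea described above, the following sketch keeps a heap of the highest-known score per fingerprint. It pops the entry with the smallest score, sweeps it against all other fingerprints if that has not been done yet, and reports it once its score is known to be exact. It is a simplified reading of the description, not chemfp's code: fingerprints are Python integers, and the names (heapsweep_order and its helpers) are made up for this example.

```python
import heapq


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def heapsweep_order(fps):
    """Return (index, score) pairs from smallest to largest maximum similarity."""
    n = len(fps)
    known = [0.0] * n      # highest similarity seen so far (a lower bound)
    swept = [False] * n    # True once known[i] is the exact maximum
    done = [False] * n     # True once fingerprint i has been reported
    heap = [(0.0, i) for i in range(n)]
    heapq.heapify(heap)
    result = []
    while heap:
        score, i = heapq.heappop(heap)
        if done[i] or score != known[i]:
            continue  # stale heap entry
        if swept[i]:
            # The smallest lower bound is exact, so no other remaining
            # fingerprint can have a smaller maximum similarity.
            done[i] = True
            result.append((i, score))
            continue
        # Sweep: compare fingerprint i with every other fingerprint,
        # raising the highest-known score on both sides.
        for j in range(n):
            if j == i:
                continue
            s = tanimoto(fps[i], fps[j])
            if s > known[i]:
                known[i] = s
            if s > known[j]:
                known[j] = s
                heapq.heappush(heap, (s, j))
        swept[i] = True
        heapq.heappush(heap, (known[i], i))
    return result


fps = [0b00001111, 0b00011110, 0b11110000, 0b00111100]
print(heapsweep_order(fps)[0][0])  # 2
```

The benefit of the heap is that a fingerprint whose lower bound is already large may never need its own sweep before the first few picks are reported; in the worst case every fingerprint is still swept once.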
The heapsweep algorithm appears to be novel to chemfp. It is strongly influenced by the “Sweep” family of algorithms. See the SumSweep paper [3] for a description of many of those heuristics.
Sphere exclusion picker¶
The SphereExclusionPicker
implements the sphere exclusion
algorithm[4] with optional ranking for directed sphere exclusion[5].
This method iteratively picks fingerprints from a set of candidates
such that the fingerprint is not within a given threshold of similarity
to any previously selected fingerprint.
By default it picks fingerprints with the smallest number of set bits. It can also be configured to pick a fingerprint at random, or to pick a fingerprint by the smallest associated rank (again, with ties broken either by the smallest number of set bits or randomly).
The DISERanker class implements the Gobbi and Lee[5] ranking algorithm to generate ranks that can be passed to the SphereExclusionPicker.
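A minimal sketch of sphere exclusion, under the same simplifying assumptions as any toy version: fingerprints are Python integers, candidates are visited in a fixed order (by popcount, or by rank when ranks are given), and anything at or above the similarity threshold to a pick is removed from further consideration. The function names are hypothetical and this is not chemfp's implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def sphere_exclusion(fps, threshold, ranks=None):
    """Return a list of (pick index, neighbor indices) pairs.

    Without ranks, candidates are tried by increasing popcount.
    With ranks, candidates are tried by increasing rank (directed
    sphere exclusion). Ties are broken by the smallest index.
    """
    if ranks is None:
        order = sorted(range(len(fps)), key=lambda i: (popcount(fps[i]), i))
    else:
        order = sorted(range(len(fps)), key=lambda i: (ranks[i], i))
    excluded = set()
    picks = []
    for i in order:
        if i in excluded:
            continue
        # Neighbors are the still-available candidates inside the sphere.
        # A non-empty pick is its own neighbor (similarity 1.0).
        neighbors = [
            j for j in order
            if j not in excluded and tanimoto(fps[i], fps[j]) >= threshold
        ]
        excluded.add(i)
        excluded.update(neighbors)
        picks.append((i, neighbors))
    return picks


fps = [0b00001111, 0b00011110, 0b11110000, 0b00111100]
print(sphere_exclusion(fps, threshold=0.5))
# [(0, [0, 1]), (2, [2]), (3, [3])]
```

Lowering the threshold makes each sphere larger, so fewer picks are made; with threshold=0.3 the same input gives two picks instead of three.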
[1] Ashton M., Barnard J., Casset F., Charlton M., Downs G., Gorse D., Holliday J., Lahana R., Willett P. (2002). Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships 21 (6) 598-604. https://doi.org/10.1002/qsar.200290002
[2] Sayle, R. (2017). Recent Improvements to the RDKit. https://github.com/rdkit/UGM_2017/blob/master/Presentations/Sayle_RDKitDiversity_Berlin17.pdf
[3] Borassi, M., Crescenzi, P., Habib, M., Kosters, W. A., Marino, A., and Takes, F. W. (2015). Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs: With an application to the six degrees of separation games. Theoretical Computer Science 586 (2015) 59-80. http://dx.doi.org/10.1016/j.tcs.2015.02.033
[4] Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J., Osman, J. (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure-Activity Relationships, 15(4), 285-289. https://doi.org/10.1002/qsar.19960150402
[5] Gobbi, A., Lee, M. L. (2003). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317-323. https://doi.org/10.1021/ci025554v
- class chemfp.diversity.BaseMaxMinPicker¶
Bases:
object
The base class for the MaxMin and HeapSweep pickers
Its candidate_arena attribute is the FingerprintArena used for picking.
- candidate_arena¶
- candidates¶
Get access to the remaining candidates as a
chemfp.diversity.MaxMinCandidates
NOTE: This is not part of the public API.
- iter_ids(max_similarity=1.0)¶
Iteratively make a pick, yielding the candidate id each time
- iter_ids_and_scores(max_similarity=1.0)¶
Iteratively make a pick, yielding (candidate id, diversity score) each time
- iter_indices(max_similarity=1.0)¶
Iteratively make a pick, yielding the candidate index each time
- iter_indices_and_scores(max_similarity=1.0)¶
Iteratively make a pick, yielding (candidate index, diversity score) each time
- picks¶
Get access to all of the picks so far (including initial picks) as a
chemfp.diversity.Picks
- class chemfp.diversity.BasePicks¶
Bases:
object
Information about the picks (ids and indices).
Do not modify its values.
- as_ctypes()¶
Return a ctypes view of the underlying pick data
The view is a
Pick
array with attributes named “candidate_idx” and “popcount”.
- as_numpy()¶
Return a NumPy view of the underlying pick data
The view has a structured dtype with fields named “candidate_idx” and “popcount”.
- get_ids()¶
Return a list of ids for each pick
- get_indices()¶
Return a list of indices into the candidates arena for each pick
- to_pandas(*, column='pick_id')¶
Return the pick ids as a pandas DataFrame
The default column header is “pick_id”. Use column to specify an alternate header.
- Parameters:
column (a string) – the column header for the pick ids
- Returns:
a pandas DataFrame
- class chemfp.diversity.Candidate¶
Bases:
Structure
A view of a candidate fingerprint in the picker.
Do not modify its values.
- c¶
Structure/Union member
- candidate_idx¶
Structure/Union member
- d¶
Structure/Union member
- depth¶
Structure/Union member
- popcount¶
Structure/Union member
- reference_popcount¶
Structure/Union member
- class chemfp.diversity.DISERanker(dise_arena)¶
Bases:
object
Generate a fingerprint ranking based on the method in the DISE paper.
The next pick is the candidate fingerprint closest to the first fingerprint in the input dise_arena, with ties broken by the similarity to the second fingerprint, etc.
This class can be used to generate values passed to SphereExclusionPicker’s ranks parameter.
The class variable DISE_SMILES_LIST contains the SMILES strings for the three reference compounds used in the DISE paper by Gobbi and Lee.
The public attributes are:
- dise_arena¶
The reference
FingerprintArena
used for ranking.
- DISE_SMILES_LIST = ['CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C.O=C(O)CC(O)(C(O)=O)CC(O)=O', 'O=C(OC)\\C3=C(\\N\\C(=C(\\C(=O)OC(C)C)C3c1cccc2nonc12)C)C', 'O=C(OCC)[C@@H](N[C@@H]2C(=O)N(c1ccccc1CC2)CC(=O)O)CCc3ccccc3']¶
- static from_dise_paper(fptype, reader_args=None)¶
Use the structures from the DISE paper to create a DISERanker for a given fingerprint type
The structures are the SMILES strings in DISE_SMILES_LIST.
- Parameters:
fptype (a string or a chemfp.types.FingerprintType) – the fingerprint type used to process the SMILES strings
reader_args (None, or a dictionary) – optional reader arguments for SMILES processing
- Returns:
a DISERanker
- static from_fingerprints(fingerprints, metadata=None)¶
Use a list of fingerprints to create a DISERanker
This is a short-hand for:
    arena = load_fingerprints(fingerprints, metadata=metadata, reorder=False)
    return DISERanker(arena)
See chemfp.load_fingerprints() for full details.
- Parameters:
fingerprints – the fingerprints to use
metadata (a chemfp.Metadata) – the metadata used if fingerprints is an (id, fp) iterator
- Returns:
a DISERanker
- static from_smiles_list(fptype, smiles_list, reader_args=None)¶
Use a list of SMILES strings to create a DISERanker for a given fingerprint type
- Parameters:
fptype (a string or a chemfp.types.FingerprintType) – the fingerprint type used to process the SMILES strings
smiles_list (a list of strings) – the list of SMILES strings
reader_args (None, or a dictionary) – optional reader arguments for SMILES processing
- Returns:
a DISERanker
- rank_arena(arena, rng=None)¶
Return an array of ranks, one for each fingerprint.
The algorithm starts by ranking each arena fingerprint to the first reference fingerprint. Fingerprints with a low rank value are more similar to the reference fingerprint than fingerprints with a high rank value.
Ties are broken by similarity to each successive reference fingerprint (in self.dise_arena).
If rng is None then any final ties are left as-is, otherwise ties are broken by the passed-in rng using its rng.shuffle() method.
If rng is an integer then use Python’s random.Random(rng) to create the rng.
- Parameters:
arena (a chemfp.arena.FingerprintArena) – a fingerprint arena
rng (None, an integer, or an object with a shuffle method) – an RNG used to break any final ties
- Returns:
an array.array of ranks, one for each arena fingerprint
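The ranking rule can be sketched as a lexicographic sort on similarity to each reference fingerprint in turn. The sketch below is an interpretation, not chemfp's code: fingerprints are Python integers, the function name dise_ranks is hypothetical, and it assumes that ties "left as-is" means tied fingerprints share one rank value, while ties broken by an rng get distinct consecutive ranks in shuffled order.

```python
import random


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def dise_ranks(fps, reference_fps, rng=None):
    """Return one rank per fingerprint; a low rank means more similar
    to the first reference, with ties broken by later references."""
    if isinstance(rng, int):
        rng = random.Random(rng)
    # Negate the similarities so that the most similar sorts first.
    keys = [tuple(-tanimoto(fp, ref) for ref in reference_fps) for fp in fps]
    groups = {}
    for i, key in enumerate(keys):
        groups.setdefault(key, []).append(i)
    ranks = [0] * len(fps)
    next_rank = 0
    for key in sorted(groups):
        members = groups[key]
        if rng is None:
            # Assumption: remaining ties share the same rank value.
            for i in members:
                ranks[i] = next_rank
            next_rank += 1
        else:
            # Break remaining ties with the rng's shuffle() method.
            rng.shuffle(members)
            for i in members:
                ranks[i] = next_rank
                next_rank += 1
    return ranks


fps = [0b00001111, 0b00011110, 0b11110000, 0b00111100]
print(dise_ranks(fps, reference_fps=[0b00001111]))  # [0, 1, 3, 2]
```

A list like this has the shape expected by a ranks parameter (one unsigned integer per candidate), though the real method returns an array.array rather than a list.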
- class chemfp.diversity.HeapSweepPicker¶
Bases:
BaseMaxMinPicker
An implementation of the heapsweep picker algorithm
The constructor must not be called directly. Instead, use HeapSweepPicker.from_candidates().
Once you have a picker, use HeapSweepPicker.pick_n() or HeapSweepPicker.pick_n_with_scores() to pick the next n candidates, optionally also with its heapsweep score.
Alternatively, use iter_indices() or iter_ids() to pick the next candidate, yielding either the pick index or pick id; or use iter_indices_and_scores() or iter_ids_and_scores() to also include the heapsweep diversity score.
- static from_candidates(candidate_arena, *, randomize=True, seed=-1)¶
Use heapsweep to pick diverse fingerprints from the candidate arena
The heapsweep diversity score for a fingerprint is the maximum Tanimoto score between that fingerprint and all other fingerprints in the candidate_arena. The heapsweep method iteratively picks fingerprints from most diverse (smallest maximum Tanimoto) to least.
If randomize is True (the default), the candidates are shuffled before the heapsweep algorithm starts. Shuffling should only affect the ordering of fingerprints with identical diversity scores. It is True by default so the first picked fingerprint is the same as MaxMinPicker.from_candidates(). Setting it to False should generally be slightly faster.
The shuffle and heapsweep methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
- Parameters:
candidate_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
randomize (True to shuffle, False to leave as-is) – shuffle the candidates before picking?
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
- Returns:
a HeapSweepPicker
- pick_n(n, max_similarity=1.0, timeout=None)¶
Pick up to n candidates with a globally maximum similarity of no more than max_similarity
The picks are appended to the HeapSweepPicker’s self.picks and the pick information (picked candidate fingerprint indices and corresponding ids) is returned in a Picks instance.
Use HeapSweepPicker.pick_n_with_scores() if you also need the maximum similarity score.
n may be zero, in which case an empty Picks instance is returned.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (an integer) – the maximum number of remaining candidates to pick
max_similarity (a float) – the maximum allowed pick similarity
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
a Picks
- pick_n_with_scores(n, max_similarity=1.0, result=None, timeout=None)¶
Pick up to n candidates
The picks are appended to the HeapSweepPicker’s self.picks and the pick information (picked candidate fingerprint indices, maximum score, and corresponding ids) is returned in a PicksAndScores instance.
Use HeapSweepPicker.pick_n() if you do not need the maximum similarity score.
n may be zero, in which case an empty PicksAndScores is returned. This may be useful in combination with the result parameter to accumulate successive picks.
If result is a PicksAndScores returned from a previous HeapSweepPicker.pick_n_with_scores() call then the pick information will be stored in that instance instead of creating a new one.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (an integer) – the maximum number of remaining candidates to pick
max_similarity (a float) – the maximum allowed pick similarity
result – store picks in the given object instead of creating a new result object
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
a PicksAndScores
- class chemfp.diversity.MaxMinCandidates¶
Bases:
object
Get access to the remaining MaxMin or HeapSweep candidates.
NOTE: This is an internal API used for testing and not part of the public API. Do not modify any values.
If you find it useful, let me know.
- as_ctypes()¶
Get a ctypes view of the underlying Candidate data
- as_numpy()¶
Get a numpy view of the underlying Candidate data
- get_indices()¶
Return a list of indices into the candidates arena
- class chemfp.diversity.MaxMinPicker¶
Bases:
BaseMaxMinPicker
An implementation of the MaxMin picker algorithm (Ashton, et al.)
The constructor must not be called directly. Instead, use one of MaxMinPicker.from_candidates(), MaxMinPicker.from_candidates_and_initial_pick(), or MaxMinPicker.from_candidates_and_references().
Once you have a picker, use MaxMinPicker.pick_n() or MaxMinPicker.pick_n_with_scores() to pick the next n candidates, optionally also with its MaxMin diversity score.
Alternatively, use iter_indices() or iter_ids() to pick the next candidate, yielding either the pick index or pick id; or use iter_indices_and_scores() or iter_ids_and_scores() to also include the MaxMin score.
- static from_candidates(candidate_arena, *, randomize=True, seed=-1)¶
Use MaxMin to pick diverse fingerprints from the candidate arena
The initial pick is determined by the heapsweep algorithm, which selects a fingerprint with the globally smallest maximum Tanimoto score to any other fingerprint. This may take a few seconds so use MaxMinPicker.from_candidates_and_initial_pick() if you know the initial pick.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.
The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
- Parameters:
candidate_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
randomize (True to shuffle, False to leave as-is) – shuffle the candidates before picking?
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
- Returns:
a MaxMinPicker
- static from_candidates_and_initial_pick(candidate_arena, initial_pick, *, randomize=True, seed=-1)¶
Use MaxMin to pick diverse fingerprints from the candidate arena, starting with an initial pick
This method lets you specify the initial pick as an initial_pick index into the candidate arena.
There are several strategies for the initial MaxMin pick: use the “middle” fingerprint, use a randomly selected fingerprint, or, if heapsweep identifies that multiple fingerprints have the same smallest maximum Tanimoto score, try each of those as a starting point.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.
Shuffling depends on a RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
- Parameters:
candidate_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
initial_pick (an integer) – the index of the initial pick, which must be a non-empty fingerprint
randomize (True to shuffle, False to leave as-is) – shuffle the candidates before picking?
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
- Returns:
a MaxMinPicker
- static from_candidates_and_references(candidate_arena, reference_arena, *, randomize=True, seed=-1)¶
Use MaxMin to pick diverse fingerprints from the candidate arena, which are also diverse from the reference arena
The fingerprints in candidate_arena are ranked according to their most similar fingerprint in reference_arena. A fingerprint with the smallest maximum score is used as the initial pick when applying the MaxMin algorithm to the remaining fingerprints in the candidate_arena.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.
Shuffling depends on a RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
- Parameters:
candidate_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
reference_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing reference fingerprints
randomize (True to shuffle, False to leave as-is) – shuffle the candidates before picking?
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
- Returns:
a MaxMinPicker
- pick_n(n, max_similarity=1.0, timeout=None)¶
Pick up to n candidates with a maximum similarity of max_similarity to any previous pick
The picks are appended to the MaxMinPicker’s self.picks data and the pick information (picked candidate fingerprint indices and corresponding ids) is returned in a Picks instance.
Use MaxMinPicker.pick_n_with_scores() if you also need the maximum similarity score.
n may be zero, in which case an empty Picks instance is returned.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (an integer) – the maximum number of remaining candidates to pick
max_similarity (a float) – the maximum allowed pick similarity
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
a Picks
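The three stopping conditions of pick_n (the count n, the max_similarity limit, and the timeout) can be illustrated with a small stand-alone loop. This is a hedged sketch and not chemfp's code: the function pick_up_to_n, the pending list of precomputed (index, score) pairs, and the injectable clock are all hypothetical. It assumes the deadline is checked before each pick, which may differ from how the real picker handles very short timeouts.

```python
import time


def pick_up_to_n(pending, n, max_similarity=1.0, timeout=None,
                 clock=time.monotonic):
    """Pop up to n (index, score) pairs from the front of pending.

    Stops early when the next score exceeds max_similarity, when the
    timeout (in seconds) has elapsed, or when nothing is left.
    """
    deadline = None if timeout is None else clock() + timeout
    picked = []
    while len(picked) < n and pending:
        index, score = pending[0]
        if score > max_similarity:
            break  # the next pick would be too similar
        if deadline is not None and clock() >= deadline:
            break  # out of time; the caller may call again to continue
        picked.append(pending.pop(0))
    return picked


pending = [(4, 0.0), (9, 0.1), (2, 0.25), (7, 0.6)]
print(pick_up_to_n(pending, n=10, max_similarity=0.3))
# [(4, 0.0), (9, 0.1), (2, 0.25)]
```

Because the unpicked entries stay in pending, a caller can invoke the function repeatedly with a short timeout to update a progress bar between calls, which is the interactive use the timeout parameter is described as serving.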
- pick_n_with_scores(n, max_similarity=1.0, result=None, timeout=None)¶
Pick up to n candidates with a maximum similarity of max_similarity to any previous pick
The picks are appended to the MaxMinPicker’s self.picks data and the pick information (picked candidate fingerprint indices, maximum score, and corresponding ids) is returned in a PicksAndScores instance.
Use MaxMinPicker.pick_n() if you do not need the maximum similarity score.
n may be zero, in which case an empty PicksAndScores is returned. This may be useful in combination with the result parameter to accumulate successive picks.
If result is a PicksAndScores returned from a previous MaxMinPicker.pick_n_with_scores() call then the pick information will be stored in that instance instead of creating a new one.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (an integer) – the maximum number of remaining candidates to pick
max_similarity (a float) – the maximum allowed pick similarity
result (a chemfp.diversity.PicksAndScores) – store picks in the given object instead of creating a new result object
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
a PicksAndScores
- class chemfp.diversity.Neighbors¶
Bases:
PicksAndScores
Access the sphere exclusion neighbor indices, score, and ids
- as_ctypes()¶
Return a ctypes view of the underlying neighbor data
The view is a
PickAndScore
array with attributes named candidate_idx and score.
- as_numpy()¶
Return a numpy view of the underlying neighbor data
The view has a structured dtype with fields named “candidate_idx” and “score”.
- get_ids()¶
Return a list of neighbor ids for the exclusion sphere
- get_ids_and_scores()¶
Return a tuple of (id, score) for the neighbors in the exclusion sphere
- get_indices()¶
Return a list of indices into the candidate arena for the neighbors
- get_indices_and_scores()¶
Return a tuple of (arena indices, score) for the neighbors
- get_scores()¶
Return a list of scores for the neighbors in the exclusion sphere
- reorder(ordering='decreasing-score-plus')¶
Reorder the neighbors based on the requested ordering.
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing index
decreasing-index - sort by decreasing index
move-closest-first - move the neighbor with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
ordering (string) – the name of the ordering to use
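To make the ordering names concrete, here is a small sketch that applies each one to a plain list of (index, score) pairs. It is an illustration only: the function name reorder_pairs is hypothetical, the sort for the plain score orderings happens to be stable here (which the real method may not guarantee), and "move-closest-first" is assumed to shift the other entries down rather than swap positions.

```python
def reorder_pairs(pairs, ordering="decreasing-score-plus"):
    """Reorder a list of (index, score) pairs in place."""
    if ordering == "increasing-score":
        pairs.sort(key=lambda p: p[1])
    elif ordering == "decreasing-score":
        pairs.sort(key=lambda p: -p[1])
    elif ordering == "increasing-score-plus":
        pairs.sort(key=lambda p: (p[1], p[0]))
    elif ordering == "decreasing-score-plus":
        # Decreasing score, ties broken by increasing index.
        pairs.sort(key=lambda p: (-p[1], p[0]))
    elif ordering == "increasing-index":
        pairs.sort(key=lambda p: p[0])
    elif ordering == "decreasing-index":
        pairs.sort(key=lambda p: -p[0])
    elif ordering == "move-closest-first":
        if pairs:
            # Position of the first entry with the highest score.
            best = max(range(len(pairs)), key=lambda k: pairs[k][1])
            pairs.insert(0, pairs.pop(best))
    elif ordering == "reverse":
        pairs.reverse()
    else:
        raise ValueError("unknown ordering: %r" % (ordering,))


pairs = [(5, 0.4), (2, 0.9), (7, 0.4), (1, 0.9)]
reorder_pairs(pairs)
print(pairs)  # [(1, 0.9), (2, 0.9), (5, 0.4), (7, 0.4)]
```

The "-plus" variants give a fully deterministic result when scores are tied, which is why one of them is a sensible default for reproducible output.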
- to_pandas(*, columns=['neighbor_id', 'score'])¶
Return a pandas DataFrame with the sphere neighbor ids and scores
The first column contains the ids, the second column contains the scores. The default column headers are “neighbor_id” and “score”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
- class chemfp.diversity.Pick¶
Bases:
Structure
A view of a picked fingerprint in the picker.
Do not modify its values.
- candidate_idx¶
Structure/Union member
- popcount¶
Structure/Union member
- class chemfp.diversity.PickAndScore¶
Bases:
Structure
A picked fingerprint index and score.
Do not modify its values.
- candidate_idx¶
Structure/Union member
- score¶
Structure/Union member
- class chemfp.diversity.PicksAndCounts¶
Bases:
BasePicks
Information about the sphere exclusion picks (ids and indices) and counts. Do not modify its values.
- get_counts()¶
Return the array of counts for the picks
- get_ids()¶
Return a list of pick ids for each pick
- get_ids_and_counts()¶
Return a list of (pick id, count) for each pick
- get_indices_and_counts()¶
Return a list of (arena index, count) for each pick
- to_pandas(*, columns=['pick_id', 'count'])¶
Return a pandas DataFrame with the pick ids and sphere exclusion counts.
The first column contains the ids, the second column contains the sphere exclusion counts. The default column headers are “pick_id” and “count”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
- class chemfp.diversity.PicksAndNeighbors¶
Bases:
BasePicks
Information about the sphere exclusion picks (ids and indices) and their neighbors.
Do not modify its values.
A “neighbor” is a candidate index within the pick’s sphere similarity threshold, and may include the pick.
- get_all_neighbors()¶
Return the list of all neighbors for each pick
- get_counts()¶
Return the array of counts for the picks
- get_ids_and_counts()¶
Return a list of (pick id, count) for each pick
- get_ids_and_neighbors()¶
Return a tuple of (pick id, neighbors) for each pick
- get_indices_and_counts()¶
Return a list of (pick index, count) for each pick
- get_indices_and_neighbors()¶
Return a tuple of (candidate arena index, neighbors) for each pick
- to_pandas(*, columns=['pick_id', 'neighbor_id', 'score'], empty=('*', None))¶
Return a pandas DataFrame with pick id and its sphere neighbor ids and scores
Each pick has zero or more neighbors. Each neighbor becomes a row in the output table, with the pick id in the first column, the neighbor id in the second, and the hit score in the third.
The default column headers are “pick_id”, “neighbor_id” and “score”. Use columns to specify different headers.
If a pick has no neighbors then by default a row is added with the pick id, ‘*’ as the neighbor id, and None as the score (which pandas will treat as a NA value).
Use empty to specify different behavior for picks with no neighbors. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the neighbor id and the second is used as the score.
- Parameters:
columns (a list of three strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
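The row layout described above, including the handling of the empty parameter, can be sketched without pandas as a function that flattens picks into tuples. The function name neighbor_rows and the input shape (a list of pick ids, each with a list of (neighbor id, score) pairs) are assumptions made for this example; the real method builds a DataFrame from the picker's own data.

```python
def neighbor_rows(picks, empty=("*", None)):
    """Flatten picks into (pick_id, neighbor_id, score) rows.

    picks is a list of (pick_id, [(neighbor_id, score), ...]) pairs.
    A pick without neighbors produces one placeholder row built from
    empty, or no row at all when empty is None.
    """
    rows = []
    for pick_id, neighbors in picks:
        if neighbors:
            for neighbor_id, score in neighbors:
                rows.append((pick_id, neighbor_id, score))
        elif empty is not None:
            placeholder_id, placeholder_score = empty
            rows.append((pick_id, placeholder_id, placeholder_score))
    return rows


picks = [
    ("mol1", [("mol1", 1.0), ("mol7", 0.62)]),
    ("mol3", []),
]
print(neighbor_rows(picks))
# [('mol1', 'mol1', 1.0), ('mol1', 'mol7', 0.62), ('mol3', '*', None)]
```

Rows in this shape could be handed to a DataFrame constructor with three column names; keeping the placeholder row means every pick id appears at least once in the table.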
- class chemfp.diversity.PicksAndScores¶
Bases:
object
Access the pick indices, scores, and ids
- as_ctypes()¶
Return a ctypes view of the underlying hit data
The view is a PickAndScore array with attributes named candidate_idx and score.
- as_numpy()¶
Return a numpy view of the underlying hit data
The view has a structured dtype with fields named “candidate_idx” and “score”.
- get_ids()¶
Return a list of identifiers for the picks
- get_ids_and_scores()¶
Return a tuple of (id, score) for the picks
- get_indices()¶
Return a list of indices into the candidate arena for the picks
- get_indices_and_scores()¶
Return a tuple of (arena indices, score) for the picks
- get_scores()¶
Return a list of scores for the picks
- move_pick_index_to_first(pick_index)¶
Move the pick with the given index to the first position in the list
Raises IndexError if pick_index does not exist.
This lets spherex output always have the center as the first member.
- reorder(ordering='decreasing-score-plus')¶
Reorder the picks based on the requested ordering.
The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing pick index
decreasing-index - sort by decreasing pick index
move-closest-first - move the pick with the highest score to the first position
reverse - reverse the current ordering
- Parameters:
ordering (string) – the name of the ordering to use
- to_pandas(*, columns=['pick_id', 'score'])¶
Return a pandas DataFrame with the pick ids and scores
The first column contains the ids, the second column contains the scores. The default column headers are “pick_id” and “score”. Use columns to specify different headers.
- Parameters:
columns (a list of two strings) – column names for the returned DataFrame
- Returns:
a pandas DataFrame
- class chemfp.diversity.SphereExclusionCandidates¶
Bases:
object
Get access to the remaining sphere exclusion candidates.
NOTE: This is an internal API used for testing and not part of the public API. Do not modify any of its values.
If you find it useful, let me know.
- get_indices()¶
Return the candidate indices as an array.array of integers
- get_ranks()¶
Return the candidate ranks as an array.array of integers
- class chemfp.diversity.SphereExclusionPicker¶
Bases:
object
An implementation of the sphere picker algorithm, optionally directed
The constructor must not be called directly. Instead, use one of its from_* static methods, such as SphereExclusionPicker.from_candidates().
Once you have a picker, use pick_n(), pick_n_with_counts() or pick_n_with_neighbors() to pick the next n candidates, optionally also with the number of fingerprints within its sphere, or with the information about those fingerprints stored in a Neighbors object.
Alternatively, use SphereExclusionPicker.iter_indices() or iter_ids() to pick the next candidate, yielding either the pick index or pick id; or use iter_indices_and_counts() or iter_ids_and_counts() to also include the counts; or use iter_indices_and_neighbors() or iter_ids_and_neighbors() to also include the Neighbors for each sphere.
The sphere picker uses OpenMP to parallelize neighbor identification with one thread per popcount bin. I’ve found that the default number of threads is likely too small, and something like 30 or more threads can be faster.
- candidates¶
Get access to the remaining candidates as a
chemfp.diversity.SphereExclusionCandidates
NOTE: This is not part of the public API.
- static from_candidates(candidate_arena, *, threshold=0.4, randomize=None, seed=-1, ranks=None, num_threads=-1)¶
Use sphere exclusion to pick diverse fingerprints from the candidate arena
Each new pick from candidate_arena will be less than threshold similar to any previous pick. The effective sphere radius is 1 - threshold
By default randomize is None because the appropriate default value depends on whether ranks is specified. If ranks is None then randomize = None is interpreted as randomize = True. If ranks is not None then randomize = None is interpreted as randomize = False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct, and takes O(number of duplicates) time if there are duplicates, which may take quadratic time if there are only a few distinct ranks.
The random methods require an initial seed for the RNG. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of chemfp.get_num_threads(), otherwise it must be a positive integer.
- Parameters:
candidate_arena (a chemfp.arena.FingerprintArena with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
ranks (None, or an array of unsigned 32-bit integers) – rank values for each candidate (optional)
num_threads (int) – the number of threads to use
- Returns:
- static from_candidates_and_initial_pick(candidate_arena, initial_pick, *, threshold=0.4, randomize=None, seed=-1, ranks=None, num_threads=-1)¶
Use sphere exclusion to pick diverse fingerprints from the candidate arena, starting with an initial pick
This is a short-cut for:
from_candidates_and_initial_picks(candidate_arena, [initial_pick], ...)
See
SphereExclusionPicker.from_candidates_and_initial_picks()
for full details.
- Parameters:
candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
initial_pick (an integer) – the initial pick, as an index into the candidate arena
threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
ranks (None, or an array of unsigned 32-bit integers) – rank values for each candidate (optional)
num_threads (int) – the number of threads to use
- Returns:
- static from_candidates_and_initial_picks(candidate_arena, initial_picks, *, threshold=0.4, randomize=None, seed=-1, ranks=None, num_threads=-1)¶
Use sphere exclusion to pick diverse fingerprints from the candidate arena, starting with an initial pick list
Each new pick from candidate_arena will be less than threshold similar to any previous pick. The effective sphere radius = 1 - threshold.
Use initial_picks to specify the initial picks. If a specified candidate index already falls within the sphere of an earlier initial pick then the pick will still occur, but that candidate index will not be included in the count or the neighbors.
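To make the exclusion rule concrete, here is a minimal pure-Python sketch. It is not chemfp's implementation: it assumes fingerprints are Python sets of on-bit positions, the names `tanimoto` and `sphere_exclusion` are hypothetical, and after the initial picks it always takes the lowest remaining index (the ranks = None, randomize = False case).

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def sphere_exclusion(fingerprints, threshold, initial_picks=()):
    """Return a list of (pick index, sphere member indices) pairs."""
    available = set(range(len(fingerprints)))
    results = []

    def make_pick(index):
        center = fingerprints[index]
        # Every still-available candidate that is at least `threshold`
        # similar to the pick is in its sphere and gets excluded.
        # A pick that was already excluded is not in `available`,
        # so it is left out of its own member list.
        members = sorted(
            i for i in available
            if i == index or tanimoto(center, fingerprints[i]) >= threshold
        )
        available.difference_update(members)
        results.append((index, members))

    # dict.fromkeys() drops duplicate initial picks but keeps their order.
    for index in dict.fromkeys(initial_picks):
        make_pick(index)
    while available:
        make_pick(min(available))
    return results
```

With the initial picks `[1, 0]`, if index 0 lies in the sphere of index 1 then index 0 is still reported as a pick but with an empty member list, which mirrors the behaviour described above.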
By default randomize = None because the appropriate default value depends on whether ranks is specified. If ranks is None then randomize = None is interpreted as randomize = True. If ranks is not None then randomize = None is interpreted as randomize = False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct. Each such pick takes O(number of duplicates) time, so picking may take quadratic time overall if there are only a few distinct ranks.
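The four combinations of ranks and randomize described above can be summarized in a short sketch. This is a hypothetical helper, `choose_next`, written in plain Python over a collection of available candidate indices; it only illustrates the selection rules and says nothing about how chemfp implements them.

```python
import random


def choose_next(available, ranks=None, randomize=None, rng=random):
    """Choose the next pick from a non-empty collection of candidate indices."""
    if randomize is None:
        # The default depends on whether ranks were given.
        randomize = ranks is None
    if ranks is None:
        if randomize:
            # Undirected: any remaining candidate, chosen at random.
            return rng.choice(sorted(available))
        # Smallest index, which is the smallest popcount in a sorted arena.
        return min(available)
    # Directed: only candidates sharing the lowest rank are eligible.
    lowest = min(ranks[i] for i in available)
    tied = sorted(i for i in available if ranks[i] == lowest)
    if randomize:
        return rng.choice(tied)
    return tied[0]  # ties broken by the smallest index
```

The scan over `available` for the lowest rank is deliberately naive. It is here only to make the tie-breaking rule explicit.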
The random methods require an initial seed for the RNG. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of
chemfp.get_num_threads()
, otherwise it must be a positive integer.
- Parameters:
candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
initial_picks (a list or array of integers) – the initial picks, as indices into the candidate arena (duplicates are ignored)
threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
ranks (None, or an array of unsigned 32-bit integers) – rank values for each candidate (optional)
num_threads (int) – the number of threads to use
- Returns:
- static from_candidates_and_references(candidate_arena, reference_arena, *, threshold=0.4, randomize=None, seed=-1, ranks=None, num_threads=-1)¶
Use sphere exclusion to pick diverse fingerprints from the candidate arena which are also diverse from the reference arena
Each new pick from candidate_arena will be less than threshold similar to any previous pick and to any fingerprint in reference_arena. The effective sphere radius = 1 - threshold.
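One way to picture the effect of the reference arena is as a filter applied before any picking starts. The sketch below is an illustration only, not chemfp's implementation: fingerprints are assumed to be Python sets of on-bit positions, and `tanimoto` and `eligible_candidates` are hypothetical names.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def eligible_candidates(candidates, references, threshold):
    """Indices of candidates that are less than `threshold` similar to every reference."""
    return [
        i for i, fp in enumerate(candidates)
        if all(tanimoto(fp, ref) < threshold for ref in references)
    ]
```

Only the surviving candidates can be picked. Sphere exclusion then proceeds among them as it would without references.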
By default randomize = None because the appropriate default value depends on whether ranks is specified. If ranks is None then randomize = None is interpreted as randomize = True. If ranks is not None then randomize = None is interpreted as randomize = False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct. Each such pick takes O(number of duplicates) time, so picking may take quadratic time overall if there are only a few distinct ranks.
The random methods require an initial seed for the RNG. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of
chemfp.get_num_threads()
, otherwise it must be a positive integer.
- Parameters:
candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one non-empty fingerprint) – an arena containing the candidate fingerprints to pick from
reference_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one non-empty fingerprint) – an arena containing reference fingerprints
threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
randomize (True for random selection, False for deterministic) – select the next candidate at random from the possible candidates
seed (a value between 0 and 2**64-1, or -1) – initial RNG seed, or -1 (the default) to seed from Python’s RNG
ranks (None, or an array of unsigned 32-bit integers) – rank values for each candidate (optional)
num_threads (int) – the number of threads to use
- Returns:
- iter_ids()¶
Iteratively make a pick, yielding the candidate id each time
- iter_ids_and_counts()¶
Iteratively make a pick, yielding (candidate id, sphere membership count) each time
- iter_ids_and_neighbors()¶
Iteratively make a pick, yielding (candidate id, sphere neighbors) each time
The neighbors are a
Neighbors
instance describing the (excluded) fingerprints within the given sphere.
- iter_indices()¶
Iteratively make a pick, yielding the candidate index each time
- iter_indices_and_counts()¶
Iteratively make a pick, yielding (candidate index, sphere membership count) each time
- iter_indices_and_neighbors()¶
Iteratively make a pick, yielding (candidate index, sphere neighbors) each time
The neighbors are a
Neighbors
instance describing the (excluded) fingerprints within the given sphere.
- pick_n(n, timeout=None)¶
Pick up to n candidate fingerprints
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
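The intended use of timeout for status updates can be sketched as a loop that asks for the outstanding picks in short, time-limited batches. Everything here is hypothetical: `SlowPicker` is a stand-in class, not a chemfp picker, and the sketch assumes `pick_n()` returns just the newly made picks as a list.

```python
import time


class SlowPicker:
    """Hypothetical stand-in for a picker; not a chemfp class."""

    def __init__(self, num_candidates, delay=0.01):
        self._remaining = list(range(num_candidates))
        self._delay = delay

    def pick_n(self, n, timeout=None):
        start = time.monotonic()
        picks = []
        while self._remaining and len(picks) < n:
            if timeout is not None and time.monotonic() - start >= timeout:
                break  # out of time: return what we have so far
            time.sleep(self._delay)  # pretend each pick is expensive
            picks.append(self._remaining.pop(0))
        return picks


def pick_with_progress(picker, total, timeout=0.05, report=print):
    """Pick up to `total` items, reporting progress after each timed batch."""
    all_picks = []
    while len(all_picks) < total:
        batch = picker.pick_n(total - len(all_picks), timeout=timeout)
        all_picks.extend(batch)
        report(f"{len(all_picks)}/{total} picked")
        if not batch:
            break  # nothing more could be picked
    return all_picks
```

A shorter timeout gives more frequent updates at the cost of more calls.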
- Parameters:
n (a non-negative integer) – the number of candidates to pick
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
- pick_n_with_counts(n, timeout=None)¶
Pick up to n candidate fingerprints, along with the number of fingerprints in each pick's sphere
The count includes the candidate fingerprint.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (a non-negative integer) – the number of candidates to pick
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
- pick_n_with_neighbors(n, timeout=None)¶
Pick up to n candidate fingerprints, along with the neighbor fingerprints in each pick's sphere
The fingerprints in the sphere will include the candidate fingerprint unless it was an initial pick that fell within the sphere of an earlier initial pick.
Use timeout to stop picking after the given number of seconds has elapsed. This is primarily meant for interactive use like progress bars and status updates.
- Parameters:
n (a non-negative integer) – the number of candidates to pick
timeout (None for no maximum time, or a non-negative float) – stop picking after the given number of seconds
- Returns:
- picks¶
Get access to all of the picks so far (including initial picks) as a
chemfp.diversity.Picks
- threshold¶
Return the specified threshold value
- chemfp.diversity.get_dise_ranker(*, dise_arena=None, smiles_list=None, fptype=None, reader_args=None)¶
Create a
DISERanker
If dise_arena is not None then it must be a fingerprint arena containing the reference fingerprints for DISE ranking.
If smiles_list is not None then it must be a list of SMILES strings used to generate the DISE arena, with the given fptype fingerprint type and optional reader_args.
Otherwise, use fptype to generate a DISE arena with the three SMILES strings in Gobbi, A., Lee, M. L. (2003). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317-323. https://doi.org/10.1021/ci025554v
- Parameters:
dise_arena (a
FingerprintArena
) – the reference fingerprints for DISE ranking
smiles_list (a list of SMILES strings) – SMILES strings used for DISE ranking
fptype (if required, a string or
FingerprintType
) – the fingerprint type used to process the SMILES strings
reader_args (None, or a dictionary) – reader arguments for parsing the SMILES strings
- Returns:
- chemfp.diversity.get_dise_ranks(candidates_arena, *, dise_arena=None, smiles_list=None, fptype=None, reader_args=None, rng=None)¶
Rank the candidate fingerprints based on the DISE method
The candidate fingerprints in candidates_arena are ranked by the similarity to the first DISE reference fingerprint. A fingerprint with a higher similarity has a lower rank value. Ties are broken by similarity to the second reference fingerprint, etc. The lowest rank value is 0.
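The ranking rule can be sketched in plain Python as a lexicographic sort on similarity to each reference in turn. This is an illustration under stated assumptions, not chemfp's implementation: fingerprints are Python sets of on-bit positions, `tanimoto` and `dise_ranks` are hypothetical names, and candidates that tie on every reference are ordered by index here, which the text above does not specify.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def dise_ranks(candidates, references):
    """Return one rank per candidate; rank 0 is most similar to the first reference."""

    def sort_key(index):
        # Negate so higher similarity sorts first. Later references
        # only matter when all earlier ones are tied.
        return tuple(-tanimoto(candidates[index], ref) for ref in references)

    # sorted() is stable, so full ties keep index order (an assumption).
    order = sorted(range(len(candidates)), key=sort_key)
    ranks = [0] * len(candidates)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks
```

The resulting list has the shape expected of a ranks argument: one non-negative integer per candidate, with lower values picked first.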
This is based on the method described in Gobbi, A., Lee, M. L. (2003). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317-323. https://doi.org/10.1021/ci025554v
If dise_arena is not None then it is used as the DISE reference fingerprints.
If smiles_list is not None then it must be a list of SMILES strings used to generate the DISE reference fingerprints. If smiles_list is None then the three SMILES from the Gobbi and Lee paper are used.
If fptype is specified, it is used to generate the fingerprints from the SMILES strings, otherwise the fingerprint type from candidates_arena is used. The reader_args is passed to the appropriate SMILES parser.
- Parameters:
candidates_arena (a
FingerprintArena
) – the fingerprints to rank
dise_arena (None or a
FingerprintArena
) – the reference fingerprints for DISE ranking
smiles_list (a list of SMILES strings) – SMILES strings used for DISE ranking
fptype (if required, a string or
FingerprintType
) – the fingerprint type used to process the SMILES strings
reader_args (None, or a dictionary) – reader arguments for parsing the SMILES strings
- Returns:
an array of integers
- chemfp.diversity.get_heapsweep_picker(candidate_arena, *, randomize=True, seed=-1)¶
Create a
HeapSweepPicker
to pick from candidate_arena
If randomize is True (the default), the candidates are shuffled before the heapsweep algorithm starts. Shuffling should only affect the ordering of fingerprints with identical diversity scores. It is True by default so the first picked fingerprint is the same as
MaxMinPicker.from_candidates()
. Setting it to False should generally be slightly faster.
The shuffle and heapsweep methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
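For intuition about what the heapsweep ordering produces, the following brute-force sketch computes each fingerprint's maximum Tanimoto similarity to any other fingerprint and sorts from least to most. It compares all pairs and does not use a heap or sweeps, so it shows the result rather than the algorithm. Fingerprints are assumed to be Python sets of on-bit positions, and `tanimoto` and `heapsweep_order` are hypothetical names.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def heapsweep_order(fingerprints):
    """Return (index, score) pairs ordered from least to most maximum similarity."""
    scored = []
    for i, fp in enumerate(fingerprints):
        # The score is the highest similarity to any *other* fingerprint.
        score = max(
            (tanimoto(fp, other)
             for j, other in enumerate(fingerprints) if j != i),
            default=0.0,
        )
        scored.append((score, i))
    scored.sort()  # ties on score fall back to index order in this sketch
    return [(i, score) for score, i in scored]
```

In this sketch ties are resolved by index. That is the kind of ordering shuffling would change.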
- Parameters:
candidate_arena (a
FingerprintArena
with popcount indices and at least one non-empty fingerprint) – the candidates to pick from
randomize (bool) – True to shuffle the initial order, else False
seed (an integer between -1 and 2**64-1) – the RNG seed, or -1 to seed from Python’s RNG
- Returns:
- chemfp.diversity.get_maxmin_picker(candidate_arena, *, reference_arena=None, initial_pick=None, randomize=True, seed=-1)¶
Create a
MaxMinPicker
to pick from candidate_arena
If initial_pick and reference_arena are not specified then the initial pick is selected using the heapsweep algorithm, which finds a fingerprint with the smallest maximum Tanimoto similarity to any other fingerprint. Use initial_pick to specify the initial pick, either as a string (which is treated as a candidate id) or as an integer (which is treated as a fingerprint index).
If reference_arena is not None then any picked candidate fingerprint must also be dissimilar from all of the fingerprints in the reference arena. The model behind the terms is that you want to pick diverse fingerprints from a vendor catalog which are also diverse from your in-house reference compounds.
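The MaxMin selection rule, including the role of the references, can be sketched in a few lines of plain Python. This is a naive illustration, not chemfp's implementation: fingerprints are assumed to be Python sets of on-bit positions, `tanimoto` and `maxmin_picks` are hypothetical names, and when neither an initial pick nor references are given the sketch simply starts at index 0 instead of running heapsweep.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def maxmin_picks(candidates, n, initial_pick=None, references=()):
    """Pick up to n candidate indices using the MaxMin rule."""
    # score[i] is the highest similarity between candidate i and any
    # reference or earlier pick. Lower means more diverse.
    score = {
        i: max((tanimoto(fp, ref) for ref in references), default=0.0)
        for i, fp in enumerate(candidates)
    }
    picks = []
    next_pick = initial_pick
    while score and len(picks) < n:
        if next_pick is None:
            # Smallest maximum similarity wins, ties broken by index.
            next_pick = min(score, key=lambda i: (score[i], i))
        picks.append(next_pick)
        del score[next_pick]
        picked_fp = candidates[next_pick]
        for i in score:
            score[i] = max(score[i], tanimoto(picked_fp, candidates[i]))
        next_pick = None
    return picks
```

Seeding the scores from the references is what keeps picks away from the in-house compounds: a candidate close to any reference starts with a high score and is chosen late, if at all.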
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.
The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
- Parameters:
candidate_arena (a
FingerprintArena
with popcount indices and at least one non-empty fingerprint) – the candidates to pick from
reference_arena (None, or a fingerprint arena) – avoid candidates which are close to the references
initial_pick (None, a string, or an integer) – the initial pick, as a candidate id (string) or a candidate index (integer)
randomize (bool) – True to randomize the initial order, else False
seed (an integer between -1 and 2**64-1) – the RNG seed, or -1 to seed from Python’s RNG
- Returns:
- chemfp.diversity.get_sphere_exclusion_picker(candidate_arena, *, reference_arena=None, initial_pick=None, initial_picks=None, threshold=0.4, ranks=None, randomize=None, seed=-1, num_threads=-1)¶
Create a
SphereExclusionPicker
to pick from candidate_arena
Each picked fingerprint removes all candidate fingerprints which are at least threshold similar to the picked fingerprint from future consideration (the sphere radius = 1 - threshold).
At most one of initial_pick, initial_picks, or reference_arena may be specified. The initial_pick is the index of the first pick, initial_picks is a list of indices, and reference_arena is a set of fingerprints to avoid (picked fingerprints will be less than threshold similar to every fingerprint in reference_arena).
If ranks is not specified and randomize is None (the default) or True then the picked fingerprint is chosen at random from the remaining candidates. If randomize is False then the fingerprint with the lowest index is selected. (Because of chemfp arena ordering, this will have the smallest number of bits set.)
If ranks is specified then the fingerprint is picked from the remaining fingerprints with the lowest rank. If randomize is None (the default) or False then the lowest-ranked fingerprint with the lowest index is selected. If randomize is True then a random fingerprint with the lowest rank is picked. (Note: the implementation is O(n) in the number of duplicate ranks, on the assumption that nearly all ranks are different.)
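As a usage sketch, directed sphere exclusion might be combined with DISE ranks roughly as follows. The wrapper name `pick_directed_ids` and the filename passed to it are hypothetical, and `chemfp.load_fingerprints()` is assumed to return an arena in the default popcount order. The call to `get_dise_ranks()` with no fptype relies on the arena carrying a fingerprint type that chemfp can use to process the default SMILES strings.

```python
from itertools import islice


def pick_directed_ids(fps_filename, n, threshold=0.4):
    """Return up to n candidate ids picked by DISE-directed sphere exclusion."""
    # Imported here so the sketch can be read without chemfp installed.
    import chemfp
    import chemfp.diversity

    arena = chemfp.load_fingerprints(fps_filename)
    # One rank per candidate. Lower ranks are picked first.
    ranks = chemfp.diversity.get_dise_ranks(arena)
    picker = chemfp.diversity.get_sphere_exclusion_picker(
        arena, threshold=threshold, ranks=ranks)
    # iter_ids() yields one candidate id per pick. Stop after n.
    return list(islice(picker.iter_ids(), n))
```

Because ranks is given and randomize is left at None, ties in rank are broken by the lowest index, so the result does not depend on the seed.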
The randomization methods depend on an RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of
chemfp.get_num_threads()
, otherwise it must be a positive integer.
- Parameters:
candidate_arena (a fingerprint arena) – the candidates to pick from
reference_arena (None, or a fingerprint arena) – avoid candidates which are close to the references
initial_pick (None, or an integer) – the initial candidate index to pick
initial_picks (None, or a list of candidate indices) – the initial candidate indices to pick
threshold (a float between 0.0 and 1.0) – similarity threshold to exclude picks
ranks (None, or an array of unsigned integers with one rank value per candidate) – initial ranks for directed sphere picking (smallest numbers picked first)
randomize (bool or None) – None for the default, True to pick at random, False to pick the lowest index
seed (an integer between -1 and 2**64-1) – the RNG seed, or -1 to seed from Python’s RNG
num_threads (int) – the number of threads to use
- Returns: