chemfp.highlevel.arena_tools module
This module should not be imported directly.
It contains internal implementation details of the high-level API available from the top-level chemfp module.
This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.
- class chemfp.highlevel.arena_tools.SimHistogramResult(processor, *, times: dict, queries_close=None, targets_close=None)
Bases: object
The result from simhistogram() histogram generation.
This object acts like a two-element tuple which matches the NumPy histogram API:
>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_35.fpb")
>>> bins, edges = chemfp.simhistogram(targets=a)
>>>
>>> import matplotlib.pyplot as plt
>>> plt.stairs(bins, edges)
>>> plt.show()
This object is a context manager which closes any open files when the queries or targets were specified as a filename.
The public attributes are:
- NxN: bool
True for a symmetric histogram, otherwise False.
- closed: bool
True if close() has been called, otherwise False.
- edges: array.array('d')
The edge locations for the bins. If there are B bins then there are B+1 edges, with values [0.0, 1/B, 2/B, .. 1.0].
- bins: array.array('Q')
The bin counts.
- num_identical: int
The number of evaluated pairs with a score of 1.0.
- num_processed: int
The number of elements processed. It will be num_samples for a sample histogram and total_size for a full histogram.
- num_samples: int
The number of samples for a sample histogram, or 0.
- sampled: bool
True if this is a sample histogram, otherwise False.
- seed: int
The seed used for a sample histogram, otherwise 0. If the simhistogram seed was -1 then this attribute contains the value from Python’s random.randrange(2**32).
- times: dict[str, float | None]
A dictionary with elapsed times for different parts of the histogram generation, mapping string labels to elapsed time in seconds, or None if not relevant. The labels are:
load_queries - the time to load the queries
load_targets - the time to load the targets
init - the time to initialize the underlying processor
process - the time to compute the histogram counts
total - the total elapsed time
- total_size: int
The total number of possible pairs. For symmetric search this is the size of the upper triangle (without the diagonal), which is N*(N-1)/2. For NxM search this is N*M, the product of the query and target sizes.
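The edge layout and the total_size formulas described above can be checked with a small standalone sketch. It does not use chemfp; the helper names bin_edges and total_pairs are hypothetical and exist only for this illustration.

```python
from array import array


def bin_edges(num_bins):
    """Return the B+1 edge locations 0.0, 1/B, 2/B, ..., 1.0 as doubles."""
    # Hypothetical helper: mirrors the documented layout, not chemfp's code.
    return array("d", (i / num_bins for i in range(num_bins + 1)))


def total_pairs(num_targets, num_queries=None):
    """Return the number of possible pairs for an NxN or NxM comparison."""
    if num_queries is None:
        # Symmetric (NxN) case: upper triangle without the diagonal.
        return num_targets * (num_targets - 1) // 2
    # NxM case: every query is compared with every target.
    return num_queries * num_targets


print(list(bin_edges(4)))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(total_pairs(1000))       # 499500
print(total_pairs(1000, 20))   # 20000
```

For a full (non-sampled) histogram, num_processed should equal the value computed by total_pairs for the corresponding arena sizes.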
- close() None
Close any files which may be open and set the processor to None.
If queries or targets is a memory-mapped FPB file then the respective arena keeps an open file handle so fingerprint and identifier lookups continue to work.
Call this close() to close them explicitly, or use this object as a context manager to close them when exiting the context.
The close() method also sets the processor to None because its queries and targets arena may refer to those open files.
The close() method may be called multiple times.
- get_description(include_times: bool = True) str
Return a human-readable description of the histogram generation.
- Parameters:
include_times (bool) – if True (the default), include the histogram generation time and the full time.
- Returns:
str
- get_times_description() str
Return a string containing a human-readable description of the timing details.
- property queries
The query arena (if present)
Returns None if the SimHistogramResult is closed.
- stairs(*, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, as_percent: bool = False, cumulative: bool = False, orientation: Literal['vertical', 'horizontal'] = 'vertical', fill: bool = False, label: str | None = None, color: Any = None, include_identical: bool = True, identical_marker: Any = 'o', identical_color: Any = None, identical_label: str | None = None, ax: matplotlib.axes.Axes | None = None, **kwargs)
Generate a matplotlib stairs plot from the histogram data.
By default it shows a vertically oriented stair plot with the Tanimoto scores on the x-axis and counts on the y-axis, along with a marker showing the number of Tanimoto scores which are exactly 1.0. There is one step for each bin, centered on the middle of the bin. The default title and axis labels can be configured or disabled by passing the empty string “”.
Use as_percent to show the count as a percentage relative to num_processed, from 0% to 100%. Use cumulative to show the cumulative counts or percentages.
To not show the identical marker use include_identical=False. Its marker, color, and label are configurable.
The fill, color, and **kwargs are passed to the underlying matplotlib stairs function. For details see https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.stairs.html
- Parameters:
title (None or a string) – The plot title. If None, create a title using the histogram parameters. Use the empty string for no title.
xlabel (None or a string) – The label for the x-axis. If None, create a default label. Use the empty string for no label.
ylabel (None or a string) – The label for the y-axis. If None, create a default label. Use the empty string for no label.
as_percent (bool) – if True, the y-axis shows the percentage relative to num_processed (from 0% to 100%) instead of the count.
cumulative (bool) – if True, use the cumulative bin values instead of the individual bin values.
orientation ("vertical" or "horizontal") – If ‘horizontal’ draw the bars horizontally instead of the ‘vertical’ default.
fill (bool) – If True, use facecolor or color to fill in the stairs instead of using only a line.
label (None or a string) – A label for the stairs
color (None or a matplotlib color) – A matplotlib color, used as the default color if the facecolor or edgecolor StepPatch kwargs is not present.
include_identical (bool) – If False, do not include information about the number or percentage of scores which are 1.0.
identical_marker (string) – The matplotlib marker to use for the identical indicator.
identical_color (None or a matplotlib color) – A matplotlib color for the identical indicator.
identical_label (None or a string) – The label for the legend for the identical indicator. If None, create one from the histogram parameters.
ax (None or a matplotlib Axes) – The matplotlib Axes to use instead of making a new subplot from matplotlib.pyplot.figure().
**kwargs – The kwargs passed to matplotlib’s stairs, which passes them in turn to StepPatch. These include hatch and edgecolor.
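As a rough illustration of what as_percent and cumulative mean for the plotted values, the following standalone sketch transforms a list of bin counts. The function stairs_values is hypothetical and is not part of chemfp. The order of operations and the handling of num_processed == 0 are assumptions, not a description of the actual implementation.

```python
from itertools import accumulate


def stairs_values(bins, num_processed, *, as_percent=False, cumulative=False):
    """Return the per-bin values a stairs plot might show."""
    values = list(bins)
    if cumulative:
        # Running total of the counts, from the lowest bin upward.
        values = list(accumulate(values))
    if as_percent:
        # Percentage relative to num_processed, from 0% to 100%.
        # Assumption: report 0.0 when nothing was processed.
        values = [100.0 * v / num_processed if num_processed else 0.0
                  for v in values]
    return values


bins = [5, 3, 1, 1]
print(stairs_values(bins, 10))                                    # [5, 3, 1, 1]
print(stairs_values(bins, 10, cumulative=True))                   # [5, 8, 9, 10]
print(stairs_values(bins, 10, as_percent=True, cumulative=True))  # [50.0, 80.0, 90.0, 100.0]
```

With both options enabled the final value reaches 100% whenever the bin counts sum to num_processed.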
- property targets
The target arena
This is also the arena used in NxN generation.
Returns None if the SimHistogramResult is closed.
- to_pandas(*, columns=['start', 'end', 'count', 'percent'], identity_bin=False) pandas.DataFrame
Return the histogram as a pandas DataFrame.
The DataFrame will contain one row per bin, and four columns:
“start” - the similarity value for the start of the bin;
“end” - the similarity value for the end of the bin;
“count” - the number of scores found for that bin;
“percent” - the value of “count” / num_processed, or 0.0 if nothing was processed.
If identity_bin is False (the default), then the last bin includes the count of scores which are exactly 1.0. If it is True then that count is not included as part of the last bin. Instead, an output row is added with start = end = 1.0 which only includes the identity count.
Use columns to change the default output column titles.
- Parameters:
columns – The list of DataFrame column titles to use.
identity_bin (bool) – Should the count of scores which are 1.0 be included in the final bin (False) or their own row (True)?
- Returns:
a pandas DataFrame
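The effect of identity_bin on the row layout can be sketched in plain Python, without chemfp or pandas. The function histogram_rows is hypothetical. It assumes the last bin count already includes the scores which are exactly 1.0, as described above for the default output.

```python
def histogram_rows(edges, bins, num_identical, num_processed, identity_bin=False):
    """Return (start, end, count, percent) tuples, one per output row."""
    counts = list(bins)
    if identity_bin:
        # Move the 1.0 scores out of the last bin and into their own row.
        counts[-1] -= num_identical

    rows = list(zip(edges, edges[1:], counts))
    if identity_bin:
        rows.append((1.0, 1.0, num_identical))

    def percent(count):
        # "percent" is count / num_processed, or 0.0 if nothing was processed.
        return count / num_processed if num_processed else 0.0

    return [(start, end, count, percent(count)) for start, end, count in rows]


edges = [0.0, 0.5, 1.0]
bins = [6, 4]          # the last bin includes one score of exactly 1.0
for row in histogram_rows(edges, bins, num_identical=1, num_processed=10,
                          identity_bin=True):
    print(row)
# (0.0, 0.5, 6, 0.6)
# (0.5, 1.0, 3, 0.3)
# (1.0, 1.0, 1, 0.1)
```

Rows in this shape could be passed to pandas.DataFrame(rows, columns=[...]) to get a table with the same four columns.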