chemfp.highlevel.arena_tools module

This module should not be imported directly.

It contains internal implementation details of the high-level API available from the top-level chemfp module.

This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.

class chemfp.highlevel.arena_tools.SimHistogramResult(processor, *, times: dict, queries_close=None, targets_close=None)

Bases: object

The result from simhistogram() histogram generation.

This object acts like a two-element tuple which matches the NumPy histogram API:

>>> import chemfp
>>> a = chemfp.load_fingerprints("chembl_35.fpb")
>>> bins, edges = chemfp.simhistogram(targets=a)
>>> 
>>> import matplotlib.pyplot as plt
>>> plt.stairs(bins, edges)
>>> plt.show()

This object is a context manager which closes any files that were opened because queries or targets was given as a filename.

The public attributes are:

NxN: bool

True for a symmetric histogram, otherwise False.

closed: bool

True if close() has been called, otherwise False.

edges: array.array('d')

The edge locations for the bins. If there are B bins then there are B+1 edges, with values [0.0, 1/B, 2/B, ..., 1.0].

bins: array.array('Q')

The bin counts.
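As a minimal sketch of the edge layout described above (not chemfp's implementation), the B+1 edges for B equal-width bins can be computed as follows. The helper names make_edges and bin_index are hypothetical, and the half-open binning rule in bin_index is an assumption; the only detail taken from this documentation is that scores of exactly 1.0 are counted in the last bin.

```python
from array import array


def make_edges(num_bins):
    """Return the num_bins + 1 edges [0.0, 1/B, 2/B, ..., 1.0] as array('d')."""
    return array("d", (i / num_bins for i in range(num_bins + 1)))


def bin_index(score, num_bins):
    """Assumed rule: half-open bins, with a score of 1.0 folded into the last bin."""
    return min(int(score * num_bins), num_bins - 1)


edges = make_edges(4)
bins = array("Q", [0] * 4)  # unsigned 64-bit counters, one per bin
for score in (0.1, 0.25, 0.6, 1.0):
    bins[bin_index(score, 4)] += 1
```

With four bins there are five edges, and each of the four example scores lands in a different bin.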

num_identical: int

The number of evaluated pairs with a score of 1.0.

num_processed: int

The number of elements processed. It will be num_samples for a sample histogram and total_size for a full histogram.

num_samples: int

The number of samples for a sample histogram, or 0.

sampled: bool

True if this is a sample histogram, otherwise False.

seed: int

The seed used for a sample histogram, otherwise 0. If the simhistogram seed was -1 then this attribute contains the value from Python’s random.randrange(2**32).
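The seed handling described above can be sketched as follows. The helper resolve_seed is hypothetical and only illustrates the documented rule that -1 is replaced by a value from random.randrange(2**32). Presumably the resolved value is stored so that a sample can be repeated by passing it back as the seed.

```python
import random


def resolve_seed(seed):
    """Return the seed to record (hypothetical helper, not part of chemfp)."""
    if seed == -1:
        # Pick a fresh 32-bit value, as the seed attribute documentation describes.
        return random.randrange(2**32)
    return seed


recorded = resolve_seed(-1)
# Two generators built from the same recorded seed produce the same sequence.
first = random.Random(recorded).random()
second = random.Random(recorded).random()
```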

times: dict[str, float | None]

A dictionary with elapsed times for different parts of the histogram generation, mapping string labels to elapsed time in seconds, or None if not relevant. The labels are:

  • load_queries - the time to load the queries

  • load_targets - the time to load the targets

  • init - the time to initialize the underlying processor

  • process - the time to compute the histogram counts

  • total - the total elapsed time

total_size: int

The total number of possible pairs. For a symmetric (NxN) search this is the size of the upper triangle without the diagonal, which is N*(N-1)/2. For an NxM search this is N*M, the product of the query and target sizes.
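The two formulas above can be checked with a short sketch. The function name total_pairs is hypothetical and is not part of chemfp.

```python
def total_pairs(num_targets, num_queries=None):
    """Number of possible pairs for an NxN (no queries) or NxM comparison."""
    if num_queries is None:
        # Symmetric case: upper triangle without the diagonal.
        return num_targets * (num_targets - 1) // 2
    # NxM case: every query is compared with every target.
    return num_queries * num_targets


nxn = total_pairs(4)      # pairs among 4 fingerprints
nxm = total_pairs(5, 3)   # 3 queries against 5 targets
```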

close() → None

Close any files which may be open and set the processor to None.

If queries or targets is a memory-mapped FPB file then the respective arena keeps an open file handle so fingerprint and identifier lookups continue to work.

Call this close() to close them explicitly, or use this object as a context manager to close them when exiting the context.

The close() method also sets the processor to None because its query and target arenas may refer to those open files.

The close() method may be called multiple times.
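The close() behavior described above (an idempotent close, a closed flag, the processor set to None, and context-manager support) can be sketched with a stand-in class. ResultSketch and its closers argument are hypothetical and are not chemfp's implementation.

```python
class ResultSketch:
    """Hypothetical stand-in showing the documented close() behavior."""

    def __init__(self, processor, closers=()):
        self.processor = processor
        self._closers = list(closers)  # callables which release open files
        self.closed = False

    def close(self):
        # Safe to call repeatedly: the list is emptied on the first call.
        while self._closers:
            self._closers.pop()()
        self.processor = None
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions


calls = []
with ResultSketch(object(), closers=[lambda: calls.append("targets")]) as result:
    open_inside = not result.closed
result.close()  # a second call does nothing further
```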

get_description(include_times: bool = True) → str

Return a human-readable description of the histogram generation.

Parameters:

include_times (bool) – If True (the default), include the histogram generation time and the full time.

Returns:

str

get_times_description() → str

Return a string containing a human-readable description of the timing details.

property queries

The query arena (if present)

Returns None if the SimHistogramResult is closed.

stairs(*, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, as_percent: bool = False, cumulative: bool = False, orientation: Literal['vertical', 'horizontal'] = 'vertical', fill: bool = False, label: str | None = None, color: Any = None, include_identical: bool = True, identical_marker: Any = 'o', identical_color: Any = None, identical_label: str | None = None, ax: matplotlib.axes.Axes | None = None, **kwargs)

Generate a matplotlib stairs plot from the histogram data.

By default it shows a vertically oriented stairs plot with the Tanimoto scores on the x-axis and counts on the y-axis, along with a marker showing the number of Tanimoto scores which are exactly 1.0. There is one step for each bin, centered on the middle of the bin. The default title and axis labels can be replaced with your own strings, or disabled by passing the empty string “”.

Use as_percent to show the count as a percentage relative to num_processed, from 0% to 100%. Use cumulative to show the cumulative counts or percentages.
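A minimal sketch of how these two options transform the bin counts, assuming that percentages are 100 * count / num_processed and that an empty histogram yields 0.0. The function stairs_values is hypothetical and does not show how chemfp computes the plotted values.

```python
from itertools import accumulate


def stairs_values(bins, num_processed, as_percent=False, cumulative=False):
    """Return the per-bin values to plot (hypothetical helper)."""
    values = list(accumulate(bins)) if cumulative else list(bins)
    if as_percent:
        # Assumed convention: report 0.0 when nothing was processed.
        values = [100.0 * v / num_processed if num_processed else 0.0
                  for v in values]
    return values


counts = stairs_values([2, 3, 5], 10)
running = stairs_values([2, 3, 5], 10, cumulative=True)
percent = stairs_values([2, 3, 5], 10, as_percent=True)
running_percent = stairs_values([2, 3, 5], 10, as_percent=True, cumulative=True)
```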

To not show the identical marker use include_identical=False. Its marker, color, and label are configurable.

The fill, color, and **kwargs are passed to the underlying matplotlib stairs function. For details see https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.stairs.html

Parameters:
  • title (None or a string) – The plot title. If None, create a title using the histogram parameters. Use the empty string for no title.

  • xlabel (None or a string) – The label for the x-axis. If None, create a default label. Use the empty string for no label.

  • ylabel (None or a string) – The label for the y-axis. If None, create a default label. Use the empty string for no label.

  • as_percent (bool) – if True, the y-axis shows the percentage relative to num_processed (from 0% to 100%) instead of the count.

  • cumulative (bool) – if True, use the cumulative bin values instead of the individual bin values.

  • orientation ("vertical" or "horizontal") – If "horizontal", draw the bars horizontally instead of the "vertical" default.

  • fill (bool) – If True, use facecolor or color to fill in the stairs instead of using only a line.

  • label (None or a string) – A label for the stairs.

  • color (None or a matplotlib color) – A matplotlib color, used as the default color if the facecolor or edgecolor StepPatch kwargs are not present.

  • include_identical (bool) – If False, do not include information about the number or percentage of scores which are 1.0.

  • identical_marker (string) – The matplotlib marker to use for the identical indicator.

  • identical_color (None or a matplotlib color) – A matplotlib color for the identical indicator.

  • identical_label (None or a string) – The label for the legend for the identical indicator. If None, create one from the histogram parameters.

  • ax (None or a matplotlib Axes) – The matplotlib Axes to use instead of making a new subplot from matplotlib.pyplot.figure().

  • **kwargs – The kwargs passed to matplotlib’s stairs, which passes them in turn to StepPatch. These include hatch and edgecolor.

property targets

The target arena

This is also the arena used in NxN generation.

Returns None if the SimHistogramResult is closed.

to_pandas(*, columns=['start', 'end', 'count', 'percent'], identity_bin=False) → pandas.DataFrame

Return the histogram as a pandas DataFrame.

The DataFrame will contain one row per bin, and four columns:

  • “start” - the similarity value for the start of the bin;

  • “end” - the similarity value for the end of the bin;

  • “count” - the number of scores found for that bin;

  • “percent” - the value of “count” / num_processed, or 0.0 if nothing was processed.

If identity_bin is False (the default), then the last bin includes the count of scores which are exactly 1.0. If it is True then that count is not included as part of the last bin. Instead, an output row is added with start = end = 1.0 which only includes the identity count.

Use columns to change the default output column titles.

Parameters:
  • columns – The list of DataFrame column titles to use.

  • identity_bin (bool) – Should the count of scores which are 1.0 be included in the final bin (False) or their own row (True)?

Returns:

a pandas DataFrame
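The row layout that to_pandas() is documented to produce can be sketched in plain Python. The function histogram_rows is hypothetical, and it assumes that the last entry of bins already includes the num_identical scores, as in the identity_bin=False case described above. It is not chemfp's implementation.

```python
def histogram_rows(edges, bins, num_identical, num_processed, *,
                   columns=("start", "end", "count", "percent"),
                   identity_bin=False):
    """Build one dict per output row (hypothetical helper)."""
    counts = list(bins)
    if identity_bin:
        # Move the scores of exactly 1.0 out of the last bin.
        counts[-1] -= num_identical
    spans = [(edges[i], edges[i + 1], c) for i, c in enumerate(counts)]
    if identity_bin:
        spans.append((1.0, 1.0, num_identical))  # the extra identity row
    rows = []
    for start, end, count in spans:
        percent = count / num_processed if num_processed else 0.0
        rows.append(dict(zip(columns, (start, end, count, percent))))
    return rows


merged = histogram_rows([0.0, 0.5, 1.0], [4, 6], 2, 10)
split = histogram_rows([0.0, 0.5, 1.0], [4, 6], 2, 10, identity_bin=True)
```

A list of dictionaries like this can be passed to pandas.DataFrame() to get a table with the same columns.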