chemfp.simarray_io module

Convert a simarray result to/from an npy or bin+npy file

This is an internal module and should not be imported directly. Portions of this module are returned as part of the public API.

“npy” file

An “npy” file contains up to four NumPy arrays, stored sequentially. It is used when the generated matrix is small enough to fit into memory.

The first array contains the simarray values. When the file is used as a metadata npy file, it is instead an equivalent (0,)- or (0,0)-shaped array with the same dtype.

The second array contains the simarray metadata, encoded as a JSON string and stored as a single NumPy string.

The file can contain comparisons between:

1) a query fingerprint against a set of target fingerprints, as a 1-D vector, in which case the “format” is “single” and the “matrix_type” is “N”:

{'format': 'single',
 'matrix_type': 'N',
 'method': 'Hamming',
 'metric': {'as_distance': False,
            'is_distance': True,
            'is_similarity': False,
            'name': 'Hamming'},
 'metric_description': 'Hamming distance',
 'num_bits': 166,
 'shape': [50001]}

2) a set of fingerprints against itself, as a 2-D array, including the lower triangle, in which case the “format” is “multiple” and the “matrix_type” is “NxN”:

{'format': 'multiple',
 'matrix_type': 'NxN',
 'method': 'Tanimoto',
 'metric': {'as_distance': False,
            'is_distance': False,
            'is_similarity': True,
            'name': 'Tanimoto'},
 'metric_description': 'Tanimoto similarity',
 'num_bits': 2048,
 'shape': [2500, 2500]}

3) a set of fingerprints against itself, as a 2-D array but leaving the lower triangle uncomputed (so it keeps the default zero value), in which case the “format” is “multiple” and the “matrix_type” is “upper-triangular”:

{'format': 'multiple',
 'matrix_type': 'upper-triangular',
 'method': 'Dice as_distance=1',
 'metric': {'as_distance': True,
            'is_distance': True,
            'is_similarity': False,
            'name': 'Dice'},
 'metric_description': '1-Dice distance',
 'num_bits': 166,
 'shape': [50001, 50001]}

4) query fingerprints and target fingerprints, as a 2-D array, in which case the “format” is “multiple” and the “matrix_type” is “NxM”:

{'format': 'multiple',
 'matrix_type': 'NxM',
 'method': 'cosine',
 'metric': {'as_distance': False,
            'is_distance': False,
            'is_similarity': True,
            'name': 'cosine'},
 'metric_description': 'cosine similarity',
 'num_bits': 1024,
 'shape': [10000, 15000]}

The third array contains the target identifiers. These are the identifiers for the targets in “N” and “NxM” matrix types, and the identifiers for the fingerprints in “NxN” and “upper-triangular” matrix types.

The fourth array contains the query identifiers, and is only present for the “NxM” matrix type.
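The sequential layout described above can be read with plain NumPy, because numpy.load on an open file handle consumes one array at a time. The following is a minimal sketch, not chemfp's own loader: the helper name read_sequential_arrays, the file name, and the sample values are all made up for illustration, and it assumes the JSON metadata was saved as a 0-d NumPy string.

```python
import json
import os
import tempfile

import numpy as np


def read_sequential_arrays(path):
    """Return every array stored back-to-back in one file, in order."""
    arrays = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while f.tell() < size:
            # Each numpy.load call consumes exactly one array and leaves the
            # file positioned at the start of the next one.
            arrays.append(np.load(f, allow_pickle=False))
    return arrays


# Build a small stand-in file (made-up values, not real chemfp output).
metadata = {"format": "multiple", "matrix_type": "NxM", "shape": [2, 3]}
path = os.path.join(tempfile.mkdtemp(), "example.npy")
with open(path, "wb") as f:
    np.save(f, np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))  # values
    np.save(f, np.array(json.dumps(metadata)))                 # JSON metadata
    np.save(f, np.array(["T1", "T2", "T3"]))                   # target ids
    np.save(f, np.array(["Q1", "Q2"]))                         # query ids

values, metadata_str, target_ids, query_ids = read_sequential_arrays(path)
loaded_metadata = json.loads(metadata_str.item())
```

For an “N”, “NxN”, or “upper-triangular” file the same loop would return three arrays instead of four, since the query identifiers are only present for “NxM”.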

“bin” format and “npy” metadata

The “bin” file is used when the generated matrix is too large to easily fit into memory. It contains the raw bytes for the comparison matrix, without the extra metadata (like dtype and shape) included in an npy file. The elements are stored in row-major order, so entry[0,0] goes first, followed by entry[0,1], etc.
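Row-major order means the byte offset of entry[i,j] is (i * num_columns + j) * itemsize, so a single value can be read without loading the rest of the file. A minimal sketch, assuming a plain float64 matrix; the helper read_entry and the file name are hypothetical and not part of chemfp:

```python
import os
import tempfile

import numpy as np


def read_entry(path, i, j, num_columns, dtype=np.float64):
    """Read entry[i, j] from a raw row-major file by seeking to its offset."""
    itemsize = np.dtype(dtype).itemsize
    offset = (i * num_columns + j) * itemsize
    with open(path, "rb") as f:
        f.seek(offset)
        return np.frombuffer(f.read(itemsize), dtype=dtype)[0]


# Write a tiny 3x4 matrix as raw bytes (made-up data for the demonstration).
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
bin_path = os.path.join(tempfile.mkdtemp(), "tiny.bin")
matrix.tofile(bin_path)

entry = read_entry(bin_path, 1, 2, num_columns=4)
```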

If the dtype and shape are known, then the “bin” file can be loaded into NumPy either using a memory-mapped file, like:

>>> import numpy
>>> arr = numpy.memmap("comparisons.bin", shape=(10000, 15000), dtype=numpy.float64)
>>> arr[:3,:3]
memmap([[0.4265617 , 0.53512955, 0.33881546],
        [0.23424607, 0.63770927, 0.3313645 ],
        [0.3642464 , 0.52223297, 0.27003086]])

or loaded into memory, like:

>>> with open("comparisons.bin", "rb") as f:
...   content = f.read()
...
>>> import numpy
>>> arr = numpy.ndarray(buffer=content, shape=(10000, 15000), dtype=numpy.float64)
>>> arr[:3,:3]
array([[0.4265617 , 0.53512955, 0.33881546],
       [0.23424607, 0.63770927, 0.3313645 ],
       [0.3642464 , 0.52223297, 0.27003086]])

Alternatively, use an auxiliary “metadata” npy file to store this metadata and the associated identifiers. It is formatted the same as an “npy” file, except that the first array, which normally contains the full set of comparisons, is a (0,)- or (0,0)-shaped array used to recover the correct NumPy dtype.
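With such a metadata file, the dtype can be taken from the empty first array and the shape from the “shape” entry of the JSON metadata. The following sketch shows one way to combine the two with numpy.memmap. It is an illustration under stated assumptions rather than chemfp's loader: the helper open_bin and the file names are hypothetical, the stand-in files are written here with plain NumPy, and it assumes a plain numeric dtype such as float32.

```python
import json
import os
import tempfile

import numpy as np


def open_bin(bin_path, metadata_path):
    """Memory-map a raw matrix using dtype and shape from a metadata npy file."""
    with open(metadata_path, "rb") as f:
        placeholder = np.load(f, allow_pickle=False)  # empty array, dtype only
        metadata = json.loads(np.load(f, allow_pickle=False).item())
    shape = tuple(metadata["shape"])
    arr = np.memmap(bin_path, mode="r", dtype=placeholder.dtype, shape=shape)
    return arr, metadata


# Stand-in files with made-up contents.
tmpdir = tempfile.mkdtemp()
bin_path = os.path.join(tmpdir, "comparisons.bin")
meta_path = os.path.join(tmpdir, "comparisons.npy")
np.arange(6, dtype=np.float32).reshape(2, 3).tofile(bin_path)
with open(meta_path, "wb") as f:
    np.save(f, np.zeros((0, 0), dtype=np.float32))
    np.save(f, np.array(json.dumps({"matrix_type": "NxM", "shape": [2, 3]})))

arr, meta = open_bin(bin_path, meta_path)
```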

The metadata npy file can be generated in the API using:

simarray_result.save("output.npy", include_values=False)

class chemfp.simarray_io.SimarrayFileContent(out, metadata, query_ids, target_ids, close=None)

Bases: SimarrayContent

close()

Close any associated files and clear any field which may use a file.

If this SimarrayFileContent instance was loaded from a memory-mapped file, close it explicitly. Otherwise the garbage collector might try to close the memory-mapped file before closing any array views.

This also sets the out, query_ids, and target_ids attributes to None.

This instance implements a context manager which calls close on exit.

Returns:

None
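The close-on-exit behavior described above follows the usual context manager pattern. The class below is a hypothetical stand-in written for illustration, not chemfp's SimarrayFileContent; in particular, clearing the fields before calling the close callback is an assumption about ordering.

```python
import numpy as np


class ContentSketch:
    """Hypothetical stand-in, not chemfp's SimarrayFileContent."""

    def __init__(self, out, query_ids, target_ids, close=None):
        self.out = out
        self.query_ids = query_ids
        self.target_ids = target_ids
        self._close = close

    def close(self):
        # Drop the array references first so this object holds no view
        # onto the underlying file when the file itself is closed.
        self.out = self.query_ids = self.target_ids = None
        if self._close is not None:
            close, self._close = self._close, None
            close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


closed = []
with ContentSketch(np.zeros((2, 2)), None, ["a", "b"],
                   close=lambda: closed.append(True)) as content:
    total = float(content.out.sum())
```

After the with block, content.out, content.query_ids, and content.target_ids are all None, so any values needed later should be copied out while the block is still active.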

property dtype_str: Literal['float64', 'float32', 'rational64', 'rational32', 'uint16', 'abcd']

A string giving the name of the chemfp simarray dtype.

property matrix_type: Literal['N', 'NxM', 'NxN', 'upper-triangular']

One of the strings “N”, “NxM”, “NxN”, or “upper-triangular”, describing the contents of self.out.

property metric: SimarrayMetric

Information about the metric used, as a SimarrayMetric.

property num_bits: int

The number of bits in the fingerprint.