Top-level API

The following functions and classes are in the top-level chemfp module. See Getting started with the API for examples.

chemfp.cdk

This is a special object which forwards any use to the chemfp.cdk_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.cdk, like the following:

import chemfp
fp = chemfp.cdk.pubchem.from_smiles("CCO")

Please do not import “cdk” directly into your module as you are likely to get confused with CDK’s own “cdk” module. Instead, use one of the following:

from chemfp import cdk_toolkit
from chemfp import cdk_toolkit as T
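
The forwarding behavior described above can be pictured with a small stand-alone sketch. This is not chemfp's implementation: LazyModuleProxy is a hypothetical name, and the standard-library json module stands in for a toolkit module so that the example runs without any chemistry toolkit installed.

```python
import importlib


class LazyModuleProxy:
    """Hypothetical stand-in: import a module only when it is first used."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. forwarded ones.
        if self._module is None:
            # May raise ImportError if the underlying module is missing.
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


# "json" stands in for a toolkit module such as chemfp.cdk_toolkit.
proxy = LazyModuleProxy("json")
encoded = proxy.dumps({"smiles": "CCO"})

# A proxy for a module that is not installed fails on first use, not at creation.
missing = LazyModuleProxy("no_such_toolkit_module_xyz")
try:
    missing.anything
    import_failed = False
except ImportError:
    import_failed = True
```

The import is deferred until the first attribute access, which is why an ImportError can appear at the point of use rather than at "import chemfp".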

chemfp.openeye

This is a special object which forwards any use to the chemfp.openeye_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.openeye, like the following:

import chemfp
fp = chemfp.openeye.circular.from_smiles("CCO")

Please do not import “openeye” directly into your module as you are likely to get confused with OpenEye’s own “openeye” module. Instead, use one of the following:

from chemfp import openeye_toolkit
from chemfp import openeye_toolkit as T

chemfp.openbabel

This is a special object which forwards any use to the chemfp.openbabel_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.openbabel, like the following:

import chemfp
fp = chemfp.openbabel.fp2.from_smiles("CCO")

Please do not import “openbabel” directly into your module as you are likely to get confused with Open Babel’s own “openbabel” modules. Instead, use one of the following:

from chemfp import openbabel_toolkit
from chemfp import openbabel_toolkit as T

chemfp.rdkit

This is a special object which forwards any use to the chemfp.rdkit_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.rdkit, like the following:

import chemfp
fp = chemfp.rdkit.morgan(fpSize=128).from_smiles("CCO")

Please do not import “rdkit” directly into your module as you are likely to get confused with RDKit’s own “rdkit” module. Instead, use one of the following:

from chemfp import rdkit_toolkit
from chemfp import rdkit_toolkit as T

chemfp.__version__

A string describing this version of chemfp. For example, “4.2”.

chemfp.__version_info__

A 3-element tuple of integers containing the (major version, minor version, micro version) of this version of chemfp. For example, (4, 2, 0).

chemfp.SOFTWARE

The value of the string used in output file metadata to describe this version of chemfp. For example, “chemfp/4.2 (base license)”.

exception chemfp.ChemFPError

Bases: Exception

Base class for all of the chemfp exceptions

exception chemfp.ChemFPProblem(severity: Literal['info', 'warning', 'error'], category: str, description: str)

Bases: ChemFPError

Information about a compatibility problem between a query and target.

Instances are generated by chemfp.check_fingerprint_problems() and chemfp.check_metadata_problems().

The public attributes are:

severity: str

One of “info”, “warning”, or “error”.

error_level: int

5 for “info”, 10 for “warning”, and 20 for “error”

category: str

A category name. This string will not change over time.

The current category names are:
  • “num_bits mismatch” (error)

  • “num_bytes_mismatch” (error)

  • “type mismatch” (warning)

  • “aromaticity mismatch” (info)

  • “software mismatch” (info)

description: str

A more detailed description of the error, including details of the mismatch. The description depends on query_name and target_name and may change over time.
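
As a sketch of how the severity and error_level attributes might be used together, the following filters a list of problems by level. The Problem class and worst_problems function are hypothetical stand-ins so that the example runs without chemfp; only the attribute names and the level values come from the description above.

```python
from dataclasses import dataclass

# Severity-to-level mapping as listed above.
ERROR_LEVELS = {"info": 5, "warning": 10, "error": 20}


@dataclass
class Problem:
    """Hypothetical stand-in with the same public attributes as ChemFPProblem."""
    severity: str
    category: str
    description: str

    @property
    def error_level(self):
        return ERROR_LEVELS[self.severity]


def worst_problems(problems, min_level=10):
    """Keep only problems at or above min_level, most severe first."""
    kept = [p for p in problems if p.error_level >= min_level]
    return sorted(kept, key=lambda p: p.error_level, reverse=True)


problems = [
    Problem("info", "software mismatch", "different software versions"),
    Problem("warning", "type mismatch", "different fingerprint types"),
    Problem("error", "num_bits mismatch", "different fingerprint lengths"),
]
serious = worst_problems(problems)
```

Comparing on the integer error_level avoids string comparisons such as severity in ("warning", "error").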

exception chemfp.EncodingError

Bases: ChemFPError, ValueError

Exception raised when the encoding or the encoding_error is unsupported or unknown

class chemfp.FingerprintIterator(metadata: Metadata, id_fp_iterator: _typing.IdAndFingerprintIter, location: _typing.OptionalLocation = None, close: _Optional[_typing.CloseType] = None)

Bases: FingerprintReader

A chemfp.FingerprintReader for an iterator of (id, fingerprint) pairs

This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.

The attributes are:

  • metadata - a Metadata describing the fingerprints

  • location - a Location describing file processing

  • closed - False if the underlying file is open, otherwise True

A FingerprintIterator is a context manager which will close the underlying iterator if it’s given a close handler.

Like all iterators you can use next() to get the next (id, fingerprint) pair.

close() None

Close the iterator.

The call will be forwarded to the close callable passed to the constructor. If that close is None then this does nothing.
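
The adapter pattern described here (an iterator plus an optional close callable, usable as a context manager) can be sketched in plain Python. IdFpIterator is a hypothetical stand-in and not the chemfp class; it leaves out the metadata and location attributes.

```python
class IdFpIterator:
    """Hypothetical stand-in: wrap an (id, fingerprint) iterator with a close hook."""

    def __init__(self, id_fp_iterator, close=None):
        self._iter = iter(id_fp_iterator)
        self._close = close
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iter)

    def close(self):
        # Forward to the close callable, if any, and only once.
        if not self.closed:
            self.closed = True
            if self._close is not None:
                self._close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions


events = []
pairs = [("ethanol", b"\x01\x02"), ("methane", b"\x00\x01")]
with IdFpIterator(pairs, close=lambda: events.append("closed")) as reader:
    first_id, first_fp = next(reader)
    remaining = list(reader)
```

Leaving the with block calls close(), which in turn calls the handler passed to the constructor.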

class chemfp.FingerprintReader(metadata: Metadata)

Bases: object

Base class for all chemfp objects holding fingerprint records

All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.

get_fingerprint_type() _typing.FingerprintType

Get the fingerprint type object based on the metadata’s type field

This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.

This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.

Returns:

a chemfp.types.FingerprintType

iter_arenas(arena_size: _OptionalInt = 1000) _typing.FingerprintArenaIterator

Iterate through arena_size fingerprints at a time, as subarenas

Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.

This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.

If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.

Parameters:

arena_size (positive integer, or None) – The number of fingerprints to put into each arena.

Returns:

an iterator of chemfp.arena.FingerprintArena instances
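
The batching idea behind iter_arenas can be illustrated without chemfp. In this sketch iter_batches is a hypothetical helper that yields plain lists rather than FingerprintArena instances; it only shows how arena_size and arena_size=None divide the input.

```python
from itertools import islice


def iter_batches(id_fp_pairs, arena_size=1000):
    """Yield lists of up to arena_size records, in input order."""
    it = iter(id_fp_pairs)
    if arena_size is None:
        # A single batch containing everything.
        yield list(it)
        return
    while True:
        batch = list(islice(it, arena_size))
        if not batch:
            return
        yield batch


records = [(f"id{i}", bytes([i])) for i in range(5)]
batch_sizes = [len(batch) for batch in iter_batches(records, arena_size=2)]
single = list(iter_batches(records, arena_size=None))
```

Only one batch is held in memory at a time, which is the memory-versus-speed trade-off described above.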

load(*, reorder: bool = True, alignment: _OptionalInt = None, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena

Load all of the fingerprints into an arena and return the arena

Parameters:
  • reorder (True or False) – Specify if fingerprints should be reordered for better performance

  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a chemfp.arena.FingerprintArena instance

save(destination: str | bytes | Path | None | BinaryIO, format: str | None = None, level: None | int | Literal['min', 'default', 'max'] = None) None

Save the fingerprints to a given destination and format

The output format is given by format. If format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.

Parameters:
  • destination (a filename, file object, or None) – the output destination

  • format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format

  • level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files

Returns:

None
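
The format-selection rule above can be written out as a short sketch. The function infer_output_format is hypothetical and is not part of chemfp; it assumes a case-sensitive match on the four extensions named in this section and does not model how chemfp treats other extensions.

```python
def infer_output_format(destination, format=None):
    """Return the format name chosen by the rule described above (a sketch)."""
    if format is not None:
        # An explicit format always wins.
        return format
    if isinstance(destination, str):
        for ext in ("fps.gz", "fps.zst", "fps", "fpb"):
            if destination.endswith("." + ext):
                return ext
    # Unrecognized extension, file object, or None: fall back to "fps".
    return "fps"


chosen = [
    infer_output_format("out.fpb"),
    infer_output_format("out.fps.gz"),
    infer_output_format("out.txt"),
    infer_output_format(None),
    infer_output_format("out.fpb", format="fps"),
]
```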

class chemfp.FingerprintWriter

Bases: object

Base class for the fingerprint writers

The three fingerprint writer classes are:

If the chemfp_converters package is available then its FlushFingerprintWriter will be used to write fingerprints in flush format.

Use chemfp.open_fingerprint_writer() to create a fingerprint writer; do not instantiate the writer classes directly.

Fingerprint writers are their own context manager, and close the writer on context exit, or you can call close explicitly.

All classes have the following attributes:

metadata: Metadata

A chemfp.Metadata instance or None.

format: str

A string describing the base format type (without compression); either ‘fps’ or ‘fpb’ for chemfp’s writers.

closed: bool

False when the file is open, else True

close() None

Close the writer

This will set self.closed to True.

write_fingerprint(id: str, fp: bytes) None

Write a single fingerprint record with the given id and fp to the destination

Parameters:
  • id (string) – the record identifier

  • fp (byte string) – the fingerprint

write_fingerprints(id_fp_pairs: Iterator[Tuple[str, bytes]]) None

Write a sequence of (id, fingerprint) pairs to the destination

Parameters:

id_fp_pairs – An iterable of (id, fingerprint) pairs. id is a string and fingerprint is a byte string.

class chemfp.Fingerprints(metadata: Metadata, id_fp_pairs: Tuple[Tuple[str, bytes]])

Bases: FingerprintReader

A chemfp.FingerprintReader containing a metadata and a list of (id, fingerprint) pairs.

This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.

This implements a simple list-like collection of fingerprints. It supports:

  • iteration: for (id, fingerprint) in fingerprints: …

  • indexing: id, fingerprint = fingerprints[1]

  • length: len(fingerprints)

More features, like slicing, will be added as needed or when requested.

class chemfp.Metadata(num_bits: _OptionalInt = None, num_bytes: _OptionalInt = None, type: _OptionalStr = None, aromaticity: _OptionalStr = None, software: _OptionalStr = None, sources: _typing.Optional[_typing.FilenameOrNames] = None, date: _typing.MetadataDateType = None)

Bases: object

Store information about a set of fingerprints

The public attributes are:

num_bits: int or None

The number of bits in the fingerprint.

num_bytes: int or None

The number of bytes in the fingerprint.

type: str or None

The fingerprint type string.

aromaticity: str or None

The aromaticity model (only used with OEChem, and now deprecated).

software: str or None

A description of the software used to make the fingerprints.

sources: list of strings

List of sources used to make the fingerprint.

date: a datetime

A datetime timestamp of when the fingerprints were made.

copy(num_bits: _OptionalInt = None, num_bytes: _OptionalInt = None, type: _OptionalStr = None, aromaticity: _OptionalStr = None, software: _OptionalStr = None, sources: _typing.Optional[_typing.FilenameOrNames] = None, date: _typing.MetadataDateType = None) Metadata

Return a new Metadata instance based on the current attributes and optional new values

When called with no parameter, make a new Metadata instance with the same attributes as the current instance.

If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you created it.

Parameters:
  • num_bits (an integer, or None) – the number of bits in the fingerprint

  • num_bytes (an integer, or None) – the number of bytes in the fingerprint

  • type (string or None) – the fingerprint type description

  • aromaticity (None) – obsolete

  • software (string or None) – a description of the software

  • sources (list of strings, a string (interpreted as a list with one string), or None) – source filenames

  • date (a datetime instance, or None) – creation or processing date for the contents

Returns:

a new Metadata instance
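
The "None means keep the current value" rule of copy() can be sketched with a small dataclass. SimpleMetadata is a hypothetical stand-in with only three of the Metadata fields; it is meant to show the semantics, not chemfp's implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SimpleMetadata:
    """Hypothetical stand-in with a few of the Metadata fields."""
    num_bits: Optional[int] = None
    num_bytes: Optional[int] = None
    type: Optional[str] = None

    def copy(self, **changes):
        # None means "keep the current value", so drop those entries.
        updates = {k: v for k, v in changes.items() if v is not None}
        return replace(self, **updates)


original = SimpleMetadata(num_bits=166, num_bytes=21, type="Example/1")
changed = original.copy(type="Other/1", num_bits=None)

# To set a field to None, modify the copy afterwards.
cleared = original.copy()
cleared.type = None
```

Because None cannot be passed through copy(), clearing a field takes the two-step form shown in the last lines.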

exception chemfp.ParseError(msg: str, location: _typing.OptionalLocation = None)

Bases: ChemFPError, ValueError

Exception raised by the molecule and fingerprint parsers and writers

The public attributes are:

msg: str, Exception

A string or object describing the exception.

location: chemfp.io.Location or None

The current chemfp.io.Location instance, if available.

chemfp.butina(fingerprints: _Optional[_typing.SourceOrArena] = None, *, fingerprints_format: _OptionalStr = None, matrix: _Optional[_typing.SearchResults] = None, matrix_format: _OptionalStr = None, NxN_threshold: float = 0.7, butina_threshold: float = 0.0, seed: int = -1, tiebreaker: _typing.TiebreakerNames = 'randomize', false_singletons: _typing.FalseSingletonNames = 'follow-neighbor', num_clusters: _OptionalInt = None, rescore: bool = True, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1, debug: _typing.Literal[0, 1, 2] = 0) _typing.ButinaClusters

Use the Butina algorithm[1] to cluster fingerprints and/or a similarity matrix.

At least one of fingerprints or matrix must be specified.

fingerprints may be an arena or filename (use fingerprints_format if the format cannot be inferred by the filename extension). matrix may be the results of a chemfp NxN symmetric search or an npz filename containing a saved NxN search (the only supported matrix_format is “npz”).

If matrix is None then butina will compute the NxN similarity matrix of the fingerprints with threshold NxN_threshold. Otherwise it will use the pre-computed matrix in matrix.

The butina_threshold specifies the threshold for the Butina algorithm. It is 0.0 by default, which makes clustering depend on the NxN_threshold. This is useful when testing different Butina threshold values because the NxN matrix can be computed once, at the lowest reasonable value, and then reused with different, higher butina_threshold values.

If tiebreaker is “randomize” (the default) then the next picked center will be chosen at random from the available picks. (These are ranked by the total number of neighbors.) If “first” or “last” then the first or last neighbor, in arena index order, is picked.

Use seed to initialize the random number generator. If -1 (the default), butina will use Python’s RNG to get the initial seed. Otherwise this must be an integer between 0 and 2**64-1.

A “false singleton” is a fingerprint with neighbors within butina_threshold similarity but where all of its neighbors were assigned to another centroid. There are three options for how to handle false_singletons. The default, “follow-neighbor”, assigns the false singleton to the same centroid as its first nearest neighbor. (If there are ties, the first neighbor in the chemfp search is used. A future version of butina may switch to a randomly selected neighbor.) Use “keep” to keep the false singleton as its own centroid. If fingerprints are available then use “nearest-center” to assign false singletons to the nearest cluster centroid. [2]

Use num_clusters to reduce the number of clusters to the specified number. The method takes the smallest cluster and assigns all of its members, one-by-one, to one of the remaining clusters. The fingerprint is assigned to the same cluster as one of its nearest neighbors, so long as that fingerprint isn’t part of the smallest cluster. The process iterates until enough clusters are pruned. This option requires fingerprints.

By default, if a fingerprint is reassigned to a new cluster then its similarity score is re-computed relative to the new cluster center. If rescore is False then the original score will be preserved.

Use progress to enable progress bars. By default it is True.

Use num_threads to specify the number of threads to use. The default of -1 means to use the value of chemfp.get_num_threads().

The debug option writes debug information to stderr. The three settings are 0, 1, and 2. This will likely be removed after the Butina implementation is better validated.

[1] Butina, JCICS 39.4, pp 747-750 (1999) doi:10.1021/ci9803381 (While Taylor, JCICS 35.1 pp59-67 (1995) doi:10.1021/ci00023a009 describes a similar algorithm, it is not applied to clustering.)

[2] Blomberg, Cosgrove, and Kenny, JCAMD 23, pp 513-525 (2009) doi:10.1007/s10822-009-9264-5 though chemfp’s implementation does not yet support a minimum required center threshold.

Parameters:
  • fingerprints (filename or FingerprintArena) – the fingerprints to cluster

  • fingerprints_format (str or None) – fingerprint file format

  • matrix (None, filename, or a SearchResults) – a pre-computed NxN search result

  • matrix_format (str or None) – the format of the specified matrix filename

  • NxN_threshold (float) – the threshold to use to generate the NxN SearchResults matrix from the input fingerprints

  • butina_threshold (float) – the threshold to use to process the matrix

  • seed (int) – the RNG seed, or -1 to have Python generate the seed

  • tiebreaker ("randomize", "first", or "last") – method to select the next cluster center in case of ties

  • false_singletons ("follow-neighbor", "keep", "nearest-center") – method used to handle false singletons

  • num_clusters (int or None) – prune the clusters until there are no more than the given number

  • rescore (bool) – if True, rescore reassigned fingerprints

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • num_threads (int) – the number of threads to use, or -1 for the default

  • debug (0, 1, or 2) – an internal debug level for debugging

Returns:

chemfp.highlevel.clustering.ButinaClusters
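
To make the center-picking step and the notion of a false singleton concrete, here is a minimal pure-Python sketch of Butina-style clustering over a hand-made neighbor list. It is not chemfp's implementation: butina_sketch is a hypothetical function, the neighbor sets are assumed to be already thresholded, ties are broken by lowest index (like tiebreaker "first"), and false singletons are kept as their own clusters (like false_singletons "keep").

```python
def butina_sketch(neighbors):
    """Cluster indices given neighbors[i] = set of indices similar to i.

    Returns a list of (center, members), with members excluding the center.
    """
    n = len(neighbors)
    # Rank by total neighbor count, largest first; ties go to the lower index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # Only neighbors not yet claimed by an earlier center join this cluster.
        members = sorted(j for j in neighbors[center]
                         if j != center and j not in assigned)
        assigned.add(center)
        assigned.update(members)
        clusters.append((center, members))
    return clusters


# A hand-made symmetric neighbor list for six fingerprints.
neighbors = [
    {1, 2, 3},  # 0
    {0, 2},     # 1
    {0, 1},     # 2
    {0, 4},     # 3
    {3},        # 4: its only neighbor (3) ends up in cluster 0
    set(),      # 5: no neighbors at all
]
clusters = butina_sketch(neighbors)
```

In this example fingerprint 4 becomes a false singleton, because its only neighbor was already assigned to the cluster centered on 0, while fingerprint 5 is a true singleton.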

chemfp.cdk2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'CDK-Daylight', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, hashPseudoAtoms: _Optional0or1 = None, pathLimit: _Optional[int] = None, perceiveStereochemistry: _Optional0or1 = None, searchDepth: _Optional[int] = None, size: _Optional[int] = None, implementation: _Optional[_typing.Literal['cdk', 'chemfp']] = None)

Use the CDK to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to the CDK). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in CDK- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “daylight”, a chemfp type string like “CDK-Daylight”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
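
As a rough illustration of the cleanup and template rules, the sketch below applies them with ordinary string methods. Both cleanup_id and make_id are hypothetical helpers; the sketch assumes the template behaves like Python's str.format() and covers only the {i}, {i0}, {id}, and {clean_id} substitutions.

```python
def cleanup_id(raw_id):
    """Keep the id up to the first newline, drop tab and NUL, strip spaces."""
    first_line = raw_id.split("\n", 1)[0]
    for ch in ("\t", "\0"):
        first_line = first_line.replace(ch, "")
    return first_line.strip(" ")


def make_id(template, index0, raw_id):
    """Fill in a template using a subset of the documented substitutions."""
    return template.format(
        i=index0 + 1,          # index starting from 1
        i0=index0,             # index starting from 0
        id=raw_id,             # the original id
        clean_id=cleanup_id(raw_id),
    )


raw_ids = [" ethanol\tA\nextra", "methane"]
new_ids = [make_id("CMPD-{i}-{clean_id}", n, raw) for n, raw in enumerate(raw_ids)]
```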

Handle structure processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • type (a FingerprintType or string) – the fingerprint type to use

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • reader_args (a dictionary) – the reader arguments

  • id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • hashPseudoAtoms (bool or None) – if True, include pseudo-atoms in the hash calculation

  • pathLimit (int or None) – maximum number of paths in path enumeration

  • perceiveStereochemistry (bool or None) – if True, re-perceive stereochemistry

  • searchDepth (int or None) – maximum path length

  • size (int or None) – the number of bits in the fingerprint

  • implementation (None, "cdk", or "chemfp") – if "chemfp", use chemfp's SMILES and SDF record reader instead of cdk's built-in reader

Returns:

a ConversionInfo

chemfp.check_fingerprint_problems(query_fp: bytes, target_metadata: Metadata, query_name: str = 'query', target_name: str = 'target')

Return a list of compatibility problems between a fingerprint and a metadata

If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the query_fp byte string and the target_metadata then it will return a list containing a ChemFPProblem instance, with a severity level “error” and category “num_bytes mismatch”.

This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like:

>>> import chemfp
>>> problems = chemfp.check_fingerprint_problems(b"A"*64, chemfp.Metadata(num_bytes=128))
>>> problems[0].description
'query contains 64 bytes but target has 128 byte fingerprints'

You can change the error message with the query_name and target_name parameters:

>>> import chemfp
>>> problems = chemfp.check_fingerprint_problems(b"z"*64, chemfp.Metadata(num_bytes=128),
...      query_name="input", target_name="database")
>>> problems[0].description
'input contains 64 bytes but database has 128 byte fingerprints'

Parameters:
  • query_fp (byte string) – a fingerprint (usually the query fingerprint)

  • target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)

  • query_name (string) – the text used to describe the fingerprint, in case of problem

  • target_name (string) – the text used to describe the metadata, in case of problem

Returns:

a list of ChemFPProblem instances

chemfp.check_metadata_problems(query_metadata: Metadata, target_metadata: Metadata, query_name: str = 'query', target_name: str = 'target')

Return a list of compatibility problems between two metadata instances.

If there are no problems then this returns an empty list. Otherwise it returns a list of ChemFPProblem instances, with a severity level ranging from “info” to “error”.

Bit length and byte length mismatches produce an “error”. Fingerprint type and aromaticity mismatches produce a “warning”. Software version mismatches produce an “info”.

This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like:

>>> import chemfp
>>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
>>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
>>> problems = chemfp.check_metadata_problems(m1, m2)
>>> len(problems)
2
>>> print(problems[1].description)
query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'

You can change the error message with the query_name and target_name parameters:

>>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
>>> print(problems[1].description)
input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'

Parameters:
  • query_metadata (a Metadata instance) – the query metadata

  • target_metadata (a Metadata instance) – the target metadata to check against

  • query_name (string) – the text used to describe the query metadata, in case of problem

  • target_name (string) – the text used to describe the target metadata, in case of problem

Returns:

a list of ChemFPProblem instances

chemfp.convert2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr, input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', fingerprint_kwargs: _typing.OptionalFingerprintKwargs = None, id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True)

Convert a structure file or files to a fingerprint file.

This is the generic conversion function without the toolkit-specific keyword arguments of rdkit2fps(), cdk2fps(), oe2fps() or ob2fps().

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in toolkit- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This can be a chemfp fingerprint type string or fingerprint type object. If it is a string then it is combined with fingerprint_kwargs to get the fingerprint type object.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).

Handle structure processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

By default this will display progress bars while loading files and generating the array. Use progress=False to disable them, or pass a floating point value to delay showing a progress bar until that many seconds have elapsed.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • type (a FingerprintType or string) – the fingerprint type to use

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • reader_args (a dictionary) – the reader arguments

  • id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a ConversionInfo

chemfp.count_tanimoto_hits(queries, targets, threshold: float = 0.7, arena_size: int = 100) Iterator[Tuple[str, int]]

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
  print(f"{query_id} has {count} neighbors with at least 0.9 similarity")

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. If there are only a few queries, loading the arena from an FPS file may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.count_tanimoto_hits_fp() or chemfp.search.count_tanimoto_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (a positive integer, or None) – The number of queries to process in a batch

Returns:

iterator of the (query_id, count) pairs, one for each query
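To make the meaning of the count concrete, here is a minimal pure-Python sketch that does not use chemfp. It models each fingerprint as a Python integer bitset and uses a brute-force loop in place of chemfp's arena search. The helper names tanimoto and count_hits are hypothetical, and scoring two empty fingerprints as 0.0 is an assumption of the sketch.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return bin(a & b).count("1") / union


def count_hits(queries, targets, threshold=0.7):
    """Yield (query_id, count) for each (id, fingerprint) pair in queries."""
    for query_id, query_fp in queries:
        count = sum(
            1 for _, target_fp in targets
            if tanimoto(query_fp, target_fp) >= threshold
        )
        yield query_id, count


queries = [("q1", 0b1111), ("q2", 0b0001)]
targets = [("t1", 0b1111), ("t2", 0b0111), ("t3", 0b1000)]
result = list(count_hits(queries, targets, threshold=0.7))
```

Here q1 matches t1 (score 1.0) and t2 (score 0.75), while q2 has no target at or above 0.7.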

chemfp.count_tanimoto_hits_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7) _typing.IdAndCountIter

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
    print(f"{fp_id} has {count} neighbors with at least 0.6 similarity")

You may also be interested in chemfp.search.count_tanimoto_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint
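The symmetric case can be sketched in plain Python as follows. This is only an illustration of the counting rule, not chemfp's implementation: fingerprints are modeled as integer bitsets, ids are assumed to be unique, and the helper names are hypothetical. Each unordered pair is compared once, and a fingerprint is never compared with itself.

```python
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return bin(a & b).count("1") / union


def count_hits_symmetric(fingerprints, threshold=0.7):
    """Return (fp_id, count) pairs; assumes the ids are unique."""
    counts = {fp_id: 0 for fp_id, _ in fingerprints}
    # combinations() visits each unordered pair once and skips self-pairs.
    for (id_a, fp_a), (id_b, fp_b) in combinations(fingerprints, 2):
        if tanimoto(fp_a, fp_b) >= threshold:
            counts[id_a] += 1
            counts[id_b] += 1
    return list(counts.items())


arena = [("a", 0b1111), ("b", 0b0111), ("c", 0b1000)]
result = count_hits_symmetric(arena, threshold=0.6)
```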

chemfp.count_tversky_hits(queries, targets, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, arena_size: int = 100) Iterator[Tuple[str, int]]

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tversky_hits(
          queries, targets, threshold=0.9, alpha=0.5, beta=0.5):
  print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity")

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. If there are only a few queries, loading the arena from an FPS file may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.count_tversky_hits_fp() or chemfp.search.count_tversky_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (a positive integer, or None) – The number of queries to process in a batch

Returns:

iterator of the (query_id, count) pairs, one for each query
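The example above describes alpha=0.5, beta=0.5 as Dice similarity. The following sketch shows why, assuming the standard Tversky index definition with alpha weighting the bits found only in the query and beta weighting the bits found only in the target. Fingerprints are modeled as integer bitsets, and the helper names are hypothetical rather than part of chemfp.

```python
def popcount(x: int) -> int:
    """Number of on-bits in a non-negative integer bitset."""
    return bin(x).count("1")


def tversky(query: int, target: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Standard Tversky index; assumes alpha applies to query-only bits."""
    common = popcount(query & target)
    only_query = popcount(query & ~target)
    only_target = popcount(target & ~query)
    denominator = common + alpha * only_query + beta * only_target
    return common / denominator if denominator else 0.0


def tanimoto(a: int, b: int) -> float:
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def dice(a: int, b: int) -> float:
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


query, target = 0b1110, 0b0111
as_tanimoto = tversky(query, target, alpha=1.0, beta=1.0)
as_dice = tversky(query, target, alpha=0.5, beta=0.5)
```

With alpha = beta = 1.0 the denominator is the size of the union, which gives Tanimoto. With alpha = beta = 0.5 it is the mean of the two popcounts, which gives Dice.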

chemfp.count_tversky_hits_symmetric(fingerprints, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) Iterator[Tuple[str, int]]

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tversky_hits_symmetric(
        arena, threshold=0.6, alpha=0.5, beta=0.5):
    print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity")

You may also be interested in chemfp.search.count_tversky_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint

chemfp.get_default_progressbar() None | Callable

Return the current default progress bar, or None for the default behavior

chemfp.get_fingerprint_families(toolkit_name=None) list[_typing.FingerprintFamily]

Return a list of available fingerprint families

Parameters:

toolkit_name (string) – restrict fingerprints to the named toolkit

Returns:

a list of chemfp.types.FingerprintFamily instances

chemfp.get_fingerprint_family(family_name: str) _typing.FingerprintFamily

Return the named fingerprint family, or raise a ValueError if not available

Given a family_name like OpenBabel-FP2 or OpenEye-MACCS166 return the corresponding chemfp.types.FingerprintFamily.

Parameters:

family_name (string) – the family name

Returns:

a chemfp.types.FingerprintFamily instance

chemfp.get_fingerprint_family_names(include_unavailable: bool = False, toolkit_name: str | None = None) list[str]

Return a set of fingerprint family name strings

The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings.

If include_unavailable is True then this will return a set of all of the fingerprint family names, including those which could not be loaded.

The set contains both the versioned and unversioned family names, so both OpenBabel-FP2/1 and OpenBabel-FP2 may be returned.

Parameters:

include_unavailable (True or False) – Should unavailable family names be included in the result set?

Returns:

a set of strings

chemfp.get_fingerprint_type(type: str, fingerprint_kwargs: _typing.OptionalFingerprintKwargs = None) _typing.FingerprintType

Get the fingerprint type based on its type string and optional keyword arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the fingerprint_kwargs dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the fingerprint_kwargs takes precedence.

For example:

>>> fptype = get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

Use get_fingerprint_type_from_text_settings() if your fingerprint parameter values are all string-encoded, eg, from the command-line or a configuration file.

Parameters:
  • type (string) – a fingerprint type string

  • fingerprint_kwargs (a dictionary of string names and Python types for values) – fingerprint type parameters

Returns:

a chemfp.types.FingerprintType

chemfp.get_fingerprint_type_from_text_settings(type: str, settings: _Optional[_typing.TextSettingsType]) _typing.FingerprintType

Get the fingerprint type based on its type string and optional settings arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the settings dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the type string and the settings dictionary then the settings take precedence.

For example:

>>> fptype = get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3",
...                                                  {"fpSize": "4096"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

This function is for string settings from a configuration file or command-line. Use get_fingerprint_type() if your fingerprint parameters are Python values.

Parameters:
  • type (string) – a fingerprint type string

  • settings (a dictionary of string names and string-encoded values) – fingerprint type parameters

Returns:

a chemfp.types.FingerprintType
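The precedence rule (values in the settings dictionary override values in the type string) can be illustrated with a small stand-alone sketch. This is not chemfp's parser: the helpers parse_type_string and merge_settings are hypothetical, and the sketch does no validation, version handling, or default filling.

```python
def parse_type_string(type_string: str):
    """Split 'Name key=value ...' into the name and a dict of string values."""
    name, *terms = type_string.split()
    params = dict(term.split("=", 1) for term in terms)
    return name, params


def merge_settings(type_string: str, settings=None):
    """Combine type-string parameters with settings; settings win on conflict."""
    name, params = parse_type_string(type_string)
    params.update(settings or {})
    return name, params


name, params = merge_settings(
    "RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": "4096"}
)
```

After the merge, fpSize holds the "4096" from the settings dictionary and minPath keeps the "3" from the type string.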

chemfp.get_num_threads() int

Return the default number of OpenMP threads to use when num_threads is -1

Several chemfp functions are parallelized using OpenMP, and support a num_threads parameter to specify the number of OpenMP threads to use. If num_threads is -1 (the default) then chemfp uses the value of chemfp.get_num_threads() to get the actual number to use.

This value can be set with set_num_threads(). If it has not been set, it defaults to the value of OpenMP’s omp_get_max_threads() (available in chemfp using get_omp_num_threads()).

The default value can be specified by the OMP_NUM_THREADS environment variable, and if that is also not set then the default value depends on the OpenMP implementation, and is likely based on the number of available cores.

Use chemfp’s set_num_threads() to set chemfp’s default value.

The value returned is always a positive integer.

If OpenMP is not available then the number of threads is always 1.

Returns:

the default number of OpenMP threads to use
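The lookup order described above can be summarized with a short sketch. It is an approximation written for illustration only: resolve_num_threads and its configured argument are hypothetical names, os.cpu_count() stands in for whatever the OpenMP implementation would choose, and only a plain integer OMP_NUM_THREADS value is handled.

```python
import os


def resolve_num_threads(num_threads=-1, configured=None, environ=os.environ):
    """Sketch of the fallback chain; not chemfp's actual code."""
    if num_threads != -1:
        return num_threads  # an explicit per-call value wins
    if configured is not None:
        return configured  # stand-in for a value given to set_num_threads()
    env_value = environ.get("OMP_NUM_THREADS", "")
    if env_value.isdigit() and int(env_value) > 0:
        return int(env_value)  # assumption: a single positive integer
    return os.cpu_count() or 1  # stand-in for the OpenMP default


explicit = resolve_num_threads(4)
from_setting = resolve_num_threads(-1, configured=2)
from_environment = resolve_num_threads(-1, environ={"OMP_NUM_THREADS": "3"})
fallback = resolve_num_threads(-1, environ={})
```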

chemfp.get_omp_num_threads() int

Return the number of threads OpenMP uses to create a team.

This function creates a new OpenMP team (with no num_threads clause) and reports the number of threads actually used.

Returns 1 if OpenMP is not available.

chemfp.get_toolkit(toolkit_name: str) _typing.ToolkitType

Return the named toolkit, if available, or raise a ValueError

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” and the named toolkit is available, then it will return chemfp.openbabel_toolkit, chemfp.openeye_toolkit, or chemfp.rdkit_toolkit, respectively:

>>> import chemfp
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
     ...
ValueError: Unable to get toolkit 'rdkit': No module named rdkit

Parameters:

toolkit_name (string) – the toolkit name

Returns:

the chemfp toolkit

Raises:

ValueError if toolkit_name is unknown or the toolkit does not exist

chemfp.get_toolkit_names() set[str]

Return a set of available toolkit names

The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:

>>> import chemfp
>>> chemfp.get_toolkit_names()
{'openeye', 'rdkit', 'openbabel'}

Returns:

a set of toolkit names, as strings

chemfp.has_fingerprint_family(family_name: str) bool

Test if the fingerprint family is available

Return True if the fingerprint family_name is available, otherwise False. The family_name may be versioned or unversioned, like “OpenBabel-FP2/1” or “OpenEye-MACCS166”.

Parameters:

family_name (string) – the family name

Returns:

True or False

chemfp.has_toolkit(toolkit_name: str) bool

Return True if the named toolkit is available, otherwise False

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False.

>>> import chemfp
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False

The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached.

Parameters:

toolkit_name (string) – the toolkit name

Returns:

True or False
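The "test once, then cache" behavior can be sketched with the standard library. This is an illustration of the general pattern and not chemfp's code: module_is_available is a hypothetical helper, and ordinary module names stand in for the toolkit modules.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def module_is_available(module_name: str) -> bool:
    """Try the import once per name; later calls reuse the cached answer."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


has_json = module_is_available("json")
has_missing = module_is_available("no_such_module_for_this_sketch")
has_json_again = module_is_available("json")  # served from the cache
```

The first call for a name pays the import cost, which may be large for a toolkit that loads many shared libraries. Repeat calls return immediately.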

chemfp.heapsweep(candidates: _typing.SourceOrArena, *, candidates_format: _OptionalStr = None, num_picks: int = 1, threshold: float = 1.0, all_equal: bool = False, randomize: bool = True, seed: int = -1, include_scores: bool = True, progress: _typing.ProgressbarOrBool = True)

Use the heapsweep algorithm to pick diverse fingerprints from candidates

The heapsweep algorithm picks fingerprints ordered by their respective maximum Tanimoto score to the rest of the arena, from smallest to largest. It uses a heap to keep track of the current score for each fingerprint (a lower bound to the global maximum score), and a flag specifying if the score is also the upper bound.

For each sweep, if the smallest heap entry is an upper bound, then pick it. Otherwise, find the similarity between the corresponding fingerprint and all other fingerprints in the arena. This sets the global maximum score for the heap entry, and may update the minimum score for the rest of the fingerprints. Update the heap and try again.

This process is repeated until num_picks fingerprints have been picked, or until the maximum score for the remaining candidates is greater than threshold, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates.

If all_equal is True then additional fingerprints will be picked if they have the same score as the final pick (pick number num_picks).

The default num_picks = 1 and all_equal = False selects a fingerprint with the smallest maximum similarity. This is used as the initial pick for MaxMinPicker.from_candidates(). Use num_picks = 1 and all_equal = True to select all fingerprints with the smallest maximum similarity.

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If randomize is True (the default), the candidates are shuffled before the heapsweep algorithm starts. Shuffling should only affect the ordering of fingerprints with identical diversity scores. It is True by default so the first picked fingerprint is the same as the one from MaxMinPicker.from_candidates(). Setting it to False should generally be slightly faster.

The shuffle and heapsweep methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

The function returns a HeapSweepInfo object with information about what happened. Its picker attribute contains the HeapSweepPicker used.

If include_scores is true then its result attribute is a PicksAndScores instance, otherwise it is a Picks.

If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

Parameters:
  • candidates (filename or FingerprintArena) – the candidate fingerprints for heapsweep picking

  • candidates_format (str or None) – format for the candidates filename

  • num_picks (int) – the number of picks to do

  • threshold (float) – the maximum allowed Tanimoto similarity value

  • all_equal (bool) – if True, continue picking after num_picks if the pick similarity is the same

  • randomize (bool) – if True, shuffle before processing

  • seed (int) – the RNG seed, or -1 to have Python generate the seed

  • include_scores (bool) – if True, include the pick scores

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a HeapSweepSearch or HeapSweepScoreSearch
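The sweep described above can be sketched in pure Python. This is a simplified illustration under several assumptions, not chemfp's implementation: fingerprints are integer bitsets, the function name heapsweep_sketch is hypothetical, and the all_equal, randomize, and seed options are left out. Outdated heap entries are skipped lazily instead of being updated in place.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def heapsweep_sketch(fps, num_picks=1, threshold=1.0):
    """Return (index, max_similarity) picks, smallest max similarity first."""
    n = len(fps)
    lower = [0.0] * n    # best known lower bound of each max similarity
    exact = [False] * n  # True once the bound is the true maximum
    heap = [(0.0, i) for i in range(n)]
    heapq.heapify(heap)
    picks = []
    while heap and len(picks) < num_picks:
        score, i = heapq.heappop(heap)
        if score != lower[i]:
            continue  # stale entry; a newer one with a higher bound exists
        if score > threshold:
            break  # every remaining candidate is at least this similar
        if exact[i]:
            picks.append((i, score))
            continue
        # Sweep: compare fingerprint i against all others.
        for j in range(n):
            if j == i:
                continue
            s = tanimoto(fps[i], fps[j])
            lower[i] = max(lower[i], s)
            if s > lower[j]:
                lower[j] = s
                heapq.heappush(heap, (s, j))
        exact[i] = True
        heapq.heappush(heap, (lower[i], i))
    return picks


fps = [0b1111, 0b0111, 0b1000, 0b0011]
first_pick = heapsweep_sketch(fps)
all_picks = heapsweep_sketch(fps, num_picks=4)
```

In this data set fingerprint 2 has the smallest maximum similarity to the others (0.25), so it is picked first.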

Find the k-nearest targets within threshold of each query term

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.

Example:

# Use the first 5 fingerprints as the queries 
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.id_knearest_tanimoto_search(queries, targets, k=3, threshold=0.8):
    print(f"{query_id} has {len(hits)} neighbors with at least 0.8 similarity")
    if hits:
        target_id, score = hits[-1]
        print(f"    The least similar is {target_id} with score {score}")

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. If there are only a few queries, loading the arena from an FPS file may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tanimoto_search_fp() or chemfp.search.knearest_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • k (positive integer) – The maximum number of nearest neighbors to find.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (positive integer, or None) – The number of queries to process in a batch

Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
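As a rough model of the result structure, the following pure-Python sketch filters by threshold and then keeps the k highest scores. It is not how chemfp searches an arena: fingerprints are integer bitsets, knearest is a hypothetical helper, and a brute-force scan replaces the real search.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def knearest(query_fp, targets, k=3, threshold=0.7):
    """Return up to k (target_id, score) pairs, highest score first."""
    scored = ((target_id, tanimoto(query_fp, fp)) for target_id, fp in targets)
    hits = [(target_id, score) for target_id, score in scored if score >= threshold]
    # nlargest() returns the hits sorted from the highest score down.
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [("t1", 0b1111), ("t2", 0b0111), ("t3", 0b1000), ("t4", 0b0011)]
hits = knearest(0b1111, targets, k=2, threshold=0.3)
least_similar_id, least_similar_score = hits[-1]
```

Three targets pass the 0.3 threshold, but only the best two are kept, so the last element of hits is the least similar of the reported neighbors.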

chemfp.knearest_tanimoto_search_symmetric(fingerprints: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0) _typing.IdAndSearchResultIter

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5):
    print(f"{fp_id} has {len(hits)} neighbors, with scores ", end="")
    print(", ".join(f"{x:.2f}" for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • k (positive integer) – The maximum number of nearest neighbors to find.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

Find the k-nearest targets within threshold of each query term

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.

Example:

# Use the first 5 fingerprints as the queries 
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.id_knearest_tversky_search(
          queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    if hits:
        target_id, score = hits[-1]
        print("    The least similar is", target_id, "with score", score)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. If there are only a few queries, loading the arena from an FPS file may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tversky_search_fp() or chemfp.search.knearest_tversky_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • k (positive integer) – The maximum number of nearest neighbors to find.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (positive integer, or None) – The number of queries to process in a batch

Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
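The batching behavior and the one-batch limitation of a streaming target can both be seen in a small stand-alone sketch. The helper iter_batches is hypothetical and only mimics the idea of arena_size. A plain Python iterator stands in for a one-pass reader such as an FPSReader.

```python
from itertools import islice


def iter_batches(items, arena_size=100):
    """Yield lists of up to arena_size items; None means one single batch."""
    iterator = iter(items)
    while True:
        if arena_size is None:
            batch = list(iterator)
        else:
            batch = list(islice(iterator, arena_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(5), arena_size=2))
single_batch = list(iter_batches(range(5), arena_size=None))

# A one-pass target source is used up while searching the first batch,
# so later batches of queries find nothing left to compare against.
one_pass_targets = iter(["t1", "t2", "t3"])
targets_seen_per_batch = [len(list(one_pass_targets)) for _ in batches]
```

This is why a streaming target is only suitable when all of the queries fit in one batch, while an in-memory arena can be searched repeatedly.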

chemfp.knearest_tversky_search_symmetric(fingerprints: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0) _typing.IdAndSearchResultIter

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric(
        arena, k=5, threshold=0.5, alpha=0.5, beta=0.5):
    print(f"{fp_id} has {len(hits)} neighbors, with Dice scores ", end="")
    print(", ".join(f"{x:.2f}" for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • k (positive integer) – The maximum number of nearest neighbors to find.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.load_fingerprints(reader: _ReaderType, metadata: _typing.Optional[Metadata] = None, reorder: bool = True, alignment: _OptionalInt = None, format: _OptionalStr = None, allow_mmap: bool = True, *, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena

Load all of the fingerprints into an in-memory FingerprintArena data structure

The function reads all of the fingerprints and identifiers from reader and stores them into an in-memory chemfp.arena.FingerprintArena data structure which supports fast similarity searches.

If reader is a string, the None object, or has a read attribute then it, the format, and allow_mmap will be passed to the chemfp.open() function and the result used as the reader. If that returns a FingerprintArena then the reorder and alignment parameters are ignored and the arena returned.

If reader is a FingerprintArena then the reorder and alignment parameters are ignored. If metadata is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is metadata.

Otherwise the reader or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.

metadata specifies the metadata for all returned arenas. If not given the default comes from the source file or from reader.metadata.

The loader may reorder the fingerprints for better search performance. To prevent ordering, use reorder=False. The reorder parameter is ignored if the reader is an arena or FPB file.

The alignment option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte alignment and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.

The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.

Parameters:
  • reader (a string, file object, or (id, fingerprint) iterator) – An iterator over (id, fingerprint) pairs

  • metadata (Metadata) – The metadata for the arena, if other than reader.metadata

  • reorder (True or False) – Specify if fingerprints should be reordered for better performance

  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.

  • format (None, "fps", "fps.gz", "fps.zst", "fpb", "fpb.gz" or "fpb.zst") – The file format name if the reader is a string

  • allow_mmap (True or False) – Allow chemfp to use mmap on FPB files, instead of reading the file’s contents into memory

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

chemfp.arena.FingerprintArena
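The padding effect of the alignment parameter is simple arithmetic, shown here as a small sketch. The helper padded_size is hypothetical and only illustrates rounding up to a multiple of the alignment. It does not model how chemfp chooses the default alignment.

```python
def padded_size(num_bytes: int, alignment: int) -> int:
    """Round num_bytes up to the next multiple of alignment."""
    return -(-num_bytes // alignment) * alignment


# A 166-bit fingerprint needs 21 bytes; with alignment=8 it occupies 24 bytes.
maccs_bytes = (166 + 7) // 8
maccs_storage = padded_size(maccs_bytes, 8)

# A 2048-bit fingerprint is already a multiple of 8 bytes, so nothing is added.
large_storage = padded_size(2048 // 8, 8)
```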

chemfp.load_fingerprints_from_string(content: _typing.Content, format: str = 'fps', *, reorder: bool = True, alignment: _OptionalInt = None, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena

Load the fingerprints from the content string, in the given format

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format

  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format

If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.

If the content is not in FPB format then by default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.

If the content is not in FPB format then alignment specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte alignment and use storage space which is a multiple of 8 bytes long. The default value of None determines the best alignment based on the fingerprint size and available popcount methods.

The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.

Parameters:
  • content (byte or text string) – The fingerprint data as a string.

  • format (string) – The file format and optional compression. Unicode strings may not be compressed.

  • reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order

  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

chemfp.arena.FingerprintArena
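To show what reordering by popcount means, here is a minimal sketch that parses a tiny uncompressed FPS text string by hand and sorts the records. It does not use chemfp and is far less strict than a real FPS reader. The helpers parse_fps, popcount, and popcount_range are hypothetical, and popcount_range assumes a threshold greater than 0.

```python
import math


def parse_fps(content: str):
    """Return (id, fingerprint bytes) pairs from FPS text, skipping '#' headers."""
    records = []
    for line in content.splitlines():
        if not line or line.startswith("#"):
            continue
        hex_fp, fp_id = line.split("\t")[:2]
        records.append((fp_id, bytes.fromhex(hex_fp)))
    return records


def popcount(fp: bytes) -> int:
    return sum(bin(byte).count("1") for byte in fp)


def popcount_range(query_popcount: int, threshold: float):
    """Target popcounts that can reach a Tanimoto threshold (threshold > 0)."""
    low = math.ceil(query_popcount * threshold)
    high = math.floor(query_popcount / threshold)
    return low, high


content = "#FPS1\n#num_bits=16\nffff\tfull\n0100\tone_bit\n0f0f\thalf\n"
records = parse_fps(content)
reordered = sorted(records, key=lambda record: popcount(record[1]))
bounds = popcount_range(8, 0.5)
```

Because the Tanimoto score of two fingerprints cannot exceed the ratio of the smaller popcount to the larger one, a search over popcount-sorted fingerprints can skip whole ranges. For a query with 8 bits set and a threshold of 0.5, only targets with 4 to 16 bits set need to be examined.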

chemfp.load_simarray(source: _typing.Source, *, format: _typing.OptionalSimarrayFormat = None, metadata_source: _typing.Optional[_typing.ExplicitSource], metadata_format: _typing.Literal['npy', None] = None, mmap_mode: _typing.Literal['default', None, 'r', 'r+', 'c'] = 'default') chemfp.simarray_io.SimarrayFileContent

Load the simarray “npy” file or the “bin”+”npy” files

Read the array data from source with possible metadata in metadata_source. Use format and metadata_format to specify the respective formats rather than use the file extension or default value.

A “npy” file must contain three or four arrays, depending on the analysis type. The first array contains the comparison vector or array, the second array contains a JSON-encoded string describing the analysis type, the third contains the target ids (which are the array ids for an NxN symmetric analysis), and the fourth, if it exists, contains the query identifiers.

When using the “npy” format, if mmap_mode is “default” or None then the array will be loaded into memory. If “r”, “r+”, or “c” then it will be memory-mapped in read-only, read-write, or copy-on-write mode, respectively.

A “bin” file must contain the raw bytes for the comparison matrix. This requires a metadata source in “npy” format where the first matrix is used only to get its NumPy dtype. The resulting SimarrayContent combines the “bin” array with the metadata and ids from the metadata file.

When using the “bin” format, if mmap_mode is None then the array will be loaded into memory. If “r”, “r+”, or “c” then it will be memory-mapped in read-only, read-write, or copy-on-write mode, respectively. The default value of “default” uses “r”.

Parameters:
  • source (None, a filename, or a file object) – the source containing the array values

  • format (str or None) – the source file format

  • metadata_source (None, a filename, or a file object) – the source of the array metadata

  • metadata_format (str or None) – the format of the array metadata source

  • mmap_mode (str or None) – read the data into memory or use a given memory map mode

Returns:

SimarrayResult

chemfp.maxmin(candidates: _typing.SourceOrArena, *, references: _Optional[_typing.SourceOrArena] = None, initial_pick: _typing.Union[None, int, str] = None, candidates_format: _OptionalStr = None, references_format: _OptionalStr = None, num_picks: int = 1000, threshold: float = 1.0, all_equal: bool = False, randomize: bool = True, seed: int = -1, include_scores: bool = True, progress: _typing.ProgressbarOrBool = True)

Use the MaxMin algorithm to pick diverse fingerprints from candidates

The MaxMin algorithm iteratively picks fingerprints from a set of candidates such that the newly picked fingerprint has the smallest Tanimoto similarity compared to any previously picked fingerprint, and optionally also the smallest Tanimoto similarity to the reference fingerprints.

This process is repeated until num_picks fingerprints have been picked, or until every remaining candidate is more than threshold similar to the picked fingerprints, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates, from most dissimilar to least. For example, to select all fingerprints with a maximum Tanimoto score of 0.2, use num_picks = None and threshold = 0.2.
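The picking loop and its stopping rules can be sketched in pure Python. This is only an illustration of the idea, not chemfp's implementation: fingerprints are modelled as Python integers, the initial pick is a fixed index rather than the heapsweep result, and there is no reference set or shuffling. The names tanimoto and maxmin_sketch are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union

def maxmin_sketch(fps, num_picks, threshold=1.0, initial_pick=0):
    """Return a list of (index, score) picks; the first score is None."""
    if not fps or num_picks <= 0:
        return []
    picks = [(initial_pick, None)]
    # max_sim[i] is the largest similarity between candidate i and any pick.
    max_sim = {
        i: tanimoto(fp, fps[initial_pick])
        for i, fp in enumerate(fps)
        if i != initial_pick
    }
    while max_sim and len(picks) < num_picks:
        # Pick the candidate whose closest pick is as far away as possible.
        best = min(max_sim, key=max_sim.get)
        score = max_sim.pop(best)
        if score > threshold:
            break  # every remaining candidate is too similar to the picks
        picks.append((best, score))
        # Update the remaining candidates against the new pick.
        for i in max_sim:
            sim = tanimoto(fps[i], fps[best])
            if sim > max_sim[i]:
                max_sim[i] = sim
    return picks

example_fps = [0b1111, 0b1110, 0b0001, 0b110000]
all_picks = maxmin_sketch(example_fps, num_picks=10)
limited_picks = maxmin_sketch(example_fps, num_picks=10, threshold=0.5)
```

With the threshold of 0.5, the candidate at index 1 is never picked because its similarity to the first pick is 0.75.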

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If initial_pick and references are not specified then the initial pick is selected using the heapsweep algorithm, which finds a fingerprint with the smallest maximum Tanimoto to any other fingerprint. Use initial_pick to specify the initial pick, either as a string (which is treated as a candidate id) or as an integer (which is treated as a fingerprint index).

If references is not None then any picked candidate fingerprint must also be dissimilar from all of the fingerprints in the reference fingerprints. The model behind the terms is that you want to pick diverse fingerprints from a vendor catalog which are also diverse from your in-house reference compounds. If references is not a FingerprintArena then it is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.

If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.

The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

The function returns a BaseMaxMinSearch object with information about what happened. Its out attribute contains the MaxMinPicker used. If include_scores is true then its out attribute is a PicksAndScores instance, otherwise it is a Picks.

If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

Parameters:
  • candidates (filename or FingerprintArena) – the candidate fingerprints for MaxMin picking

  • references (None, filename or FingerprintArena) – candidates must also avoid these reference fingerprints

  • initial_pick – the initial pick, as an index or id

  • candidates_format (str or None) – format for the candidates filename

  • references_format (str or None) – format for the references filename

  • num_picks (int) – the number of fingerprints to pick

  • threshold (float) – the maximum similarity threshold

  • all_equal (bool) – if True, continue picking after num_picks if the pick similarity is the same

  • randomize (bool) – if True, shuffle before processing

  • seed (int) – the RNG seed, or -1 to have Python generate the seed

  • include_scores (bool) – if True, include pick scores

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a MaxMinScoreSearch or MaxMinSearch

chemfp.ob2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'OpenBabel-FP2', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, nBits: _OptionalInt = None)

Use Open Babel to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to Open Babel). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in Open Babel- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “FP2”, a chemfp type string like “OpenBabel-FP2”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
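The cleanup and template rules can be sketched with str.format. This is one reading of the description, not chemfp's code: the helper names clean_id and make_id are hypothetical, it assumes the template uses str.format-style substitution, that “first line” refers to the first line of the original id, and that the prefix is applied to the cleaned id.

```python
def clean_id(raw_id):
    """Keep the id up to the first newline, drop CR/tab/NUL, strip spaces."""
    first_line = raw_id.split("\n", 1)[0]
    # Removing "\r" as well as tab and NUL is an assumption of this sketch.
    return first_line.translate({ord(c): None for c in "\r\t\0"}).strip(" ")

def make_id(raw_id, index, recno, id_prefix=None, id_template=None):
    """Build a new identifier from a template or a prefix (hypothetical helper)."""
    first_line = raw_id.split("\n", 1)[0]
    words = first_line.split()
    cleaned = clean_id(raw_id)
    if id_template is not None:
        return id_template.format(
            i=index + 1,        # index starting from 1
            i0=index,           # index starting from 0
            recno=recno,        # the current record number
            id=raw_id,          # the original id
            clean_id=cleaned,   # the id after cleanup
            first_word=words[0] if words else "",
            first_line=first_line,
        )
    if id_prefix is not None:
        return id_prefix + cleaned
    return cleaned

templated = make_id("CHEBI:90 extra\nsecond line", 0, 1,
                    id_template="mol-{i}-{first_word}")
prefixed = make_id("  abc\tdef\nxyz ", 4, 5, id_prefix="vendor:")
```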

Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • type (a FingerprintType or string) – the fingerprint type to use

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • reader_args (a dictionary) – the reader arguments

  • id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • nBits (int or None) – number of bits in the fingerprint

Returns:

a ConversionInfo

chemfp.oe2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'OpenEye-Path', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, atype: _Optional[_typing.Union[int, str]] = None, btype: _Optional[_typing.Union[int, str]] = None, maxbonds: _OptionalInt = None, maxradius: _OptionalInt = None, minbonds: _OptionalInt = None, minradius: _OptionalInt = None, numbits: _OptionalInt = None)

Use OEChem and OEGraphSim to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in OEChem- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “circular”, a chemfp type string like “OpenEye-Circular”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).

Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
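The three named policies behave like the following sketch. It uses a toy record parser (int) in place of a real structure reader, and the function name convert_all is hypothetical. It does not cover the ErrorHandler case.

```python
import sys

def convert_all(records, errors="ignore"):
    """Convert each record, handling failures according to the errors policy."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    converted = []
    for recno, text in enumerate(records, 1):
        try:
            converted.append(int(text))  # stand-in for parsing a structure
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                print(f"record {recno}: {err}", file=sys.stderr)
            # "report" and "ignore" both continue with the next record
    return converted

ignored = convert_all(["1", "not-a-number", "3"], errors="ignore")
reported = convert_all(["1", "not-a-number", "3"], errors="report")
```

Note that the conversion functions default to “ignore”, so a bad record is skipped silently unless you ask for “report” or “strict”.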

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • type (a FingerprintType or string) – the fingerprint type to use

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • reader_args (a dictionary) – the reader arguments

  • id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • atype (integer or string) – the atom type invariants, as bitflags

  • btype (integer or string) – the bond type invariants, as bitflags

  • maxbonds (int or None) – maximum number of bonds during path enumeration

  • maxradius (int or None) – maximum circular radius

  • minbonds (int or None) – minimum number of bonds during path enumeration

  • minradius (int or None) – minimum circular radius

  • numbits (int or None) – number of bits in the fingerprint

Returns:

a ConversionInfo

chemfp.open(source: _typing.Source, format: _typing.Optional[str] = None, location: _typing.Optional[_typing.Location] = None, allow_mmap: bool = True) FingerprintReader

Read fingerprints from a fingerprint file

Read fingerprints from source, using the given format. If source is a string then it is treated as a filename. If source is None then fingerprints are read from stdin. Otherwise, source must be a Python file object supporting the read and readline methods.

If format is None then the fingerprint file format and compression type are derived from the source filename, or from the name attribute of the source file object. If the source is None then the stdin is assumed to be uncompressed data in “fps” format.

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format

  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format
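Format detection from a filename can be sketched as follows. The helper guess_format is hypothetical and only approximates the rules described above. In particular, falling back to “fps” for an unrecognized extension is an assumption of this sketch.

```python
def guess_format(filename):
    """Return a format string such as "fps.gz" based on the filename extension."""
    if filename is None:
        return "fps"  # stdin is assumed to be uncompressed FPS
    name = filename
    compression = ""
    for ext in (".gz", ".zst"):
        if name.endswith(ext):
            compression = ext
            name = name[: -len(ext)]
            break
    for base in ("fps", "fpb"):
        if name.endswith("." + base):
            return base + compression
    return "fps" + compression  # assumed fallback for unknown extensions

detected = [guess_format(name) for name in
            ("example.fps.gz", "library.fpb", "targets.fps.zst", None)]
```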

The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.

If the source is in FPS format then open will return a chemfp.fps_io.FPSReader, which will use the location if specified.

If the source is in FPB format then open will return a chemfp.arena.FingerprintArena and the location will not be used. If allow_mmap is True then chemfp may use mmap to read uncompressed FPB files. If False then chemfp will read the file’s contents into memory, which may give better performance if the FPB file is on a networked file system, at the expense of higher memory use.

Here’s an example of printing the contents of the file:

import chemfp
from chemfp.bitops import hex_encode
reader = chemfp.open("example.fps.gz")
for id, fp in reader:
    print(id, hex_encode(fp))
Parameters:
  • source (A filename string, a file object, or None) – The fingerprint source.

  • format (string, or None) – The file format and optional compression.

  • location (a Location instance, or None) – a location object used to access parser state information

  • allow_mmap (boolean) – if True, use mmap to open uncompressed FPB files, otherwise read the contents

Returns:

a chemfp.fps_io.FPSReader or chemfp.arena.FingerprintArena

chemfp.open_fingerprint_writer(destination: _typing.Destination, metadata: _typing.Optional[Metadata] = None, format: _OptionalStr = None, *, alignment: int = 8, reorder: bool = True, level: _typing.CompressionLevel = None, include_metadata: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None) FingerprintWriter

Create a fingerprint writer for the given destination

The fingerprint writer is an object with methods to write fingerprints to the given destination. The output format is based on the format. If that’s None then the format depends on the destination, or is “fps” if the attempts at format detection fail.

The metadata, if given, is a Metadata instance, and used to fill the header of an FPS file or META block of an FPB file.

If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None for stdout. If the output format is “fpb” then destination must be a filename or seekable file object. A fingerprint writer with compressed FPB output is not supported; use arena.save() instead, or post-process the file.

Use level to change the compression level. The default is 9 for gzip and 3 for zstd. Use “min”, “default”, or “max” as aliases for the minimum, default, and maximum values for each range.

By default the metadata is included in the FPS output. Set include_metadata to False to disable writing the metadata.

Some options only apply to FPB output. The alignment specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.
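Why popcount ordering helps can be shown with the standard Tanimoto bound: two fingerprints with q and t bits set cannot score higher than min(q, t) / max(q, t). The sketch below is a pure-Python illustration of that general idea, not chemfp's search code. Fingerprints are modelled as integers, and build_index and threshold_search are hypothetical names.

```python
import math
from bisect import bisect_left, bisect_right

def popcount(fp):
    return bin(fp).count("1")

def tanimoto(a, b):
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def build_index(fps):
    """Sort the fingerprints by popcount and keep the sorted counts."""
    ordered = sorted(fps, key=popcount)
    return ordered, [popcount(fp) for fp in ordered]

def threshold_search(query, ordered, counts, threshold):
    """Find targets with Tanimoto >= threshold, skipping impossible popcounts."""
    q = popcount(query)
    if threshold > 0:
        # Only popcounts between q*threshold and q/threshold can reach the
        # threshold. Round outward so float error never drops a valid target.
        lo = bisect_left(counts, math.floor(q * threshold))
        hi = bisect_right(counts, math.ceil(q / threshold))
    else:
        lo, hi = 0, len(counts)
    return [fp for fp in ordered[lo:hi] if tanimoto(query, fp) >= threshold]

targets = [0b11111111, 0b1, 0b1111, 0b11, 0b111]
ordered, counts = build_index(targets)
hits = threshold_search(0b111, ordered, counts, threshold=0.6)
brute_force = [fp for fp in ordered if tanimoto(0b111, fp) >= 0.6]
```

Here the targets with 1 and 8 bits set are outside the popcount window for the 3-bit query, so only part of the sorted list needs to be scored.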

The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn’t enough available free memory. In that case, set max_spool_size to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)

Use tmpdir to specify where to write the temporary spool files if you don’t want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.
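The spooling idea is the same one used by Python's tempfile.SpooledTemporaryFile: data stays in memory until a size limit is exceeded and is then moved to a temporary file, whose default directory follows the TMPDIR, TEMP and TMP environment variables. The sketch below illustrates that behaviour only. It is not chemfp's FPB writer, spool_demo is a hypothetical helper, and it peeks at the private CPython attribute _rolled purely to show when the rollover happens.

```python
import tempfile

def spool_demo(chunks, max_spool_size, tmpdir=None):
    """Write chunks to a spool and report whether it rolled over to disk."""
    with tempfile.SpooledTemporaryFile(max_size=max_spool_size,
                                       dir=tmpdir) as spool:
        for chunk in chunks:
            spool.write(chunk)
        rolled_to_disk = spool._rolled  # private attribute, for illustration
        spool.seek(0)
        return spool.read(), rolled_to_disk

chunks = [b"x" * 8] * 4  # 32 bytes in total
small_limit = spool_demo(chunks, max_spool_size=16)
large_limit = spool_demo(chunks, max_spool_size=1024)
default_tmpdir = tempfile.gettempdir()  # where spool files go when tmpdir is None
```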

Some options only apply to FPS output. errors specifies how to handle recoverable write errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record. If include_metadata is false then the FPS metadata (the initial lines starting with ‘#’) are not included.

The location is a Location instance. It lets the caller access state information such as the number of records that have been written.

Parameters:
  • destination (a filename, file object, or None) – the output destination

  • metadata (a Metadata instance, or None) – the fingerprint metadata

  • format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format

  • alignment (positive integer) – arena byte alignment for FPB files

  • reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order

  • level (an integer, the strings "min", "default" or "max", or None for default) – the compression level for gzip or zstd output

  • include_metadata (a boolean) – if True, include the header metadata in the FPS output

  • tmpdir (string or None) – the directory to use for temporary files, when max_spool_size is specified

  • max_spool_size (integer, or None) – number of bytes to store in memory before using a temporary file. If None, use memory for everything.

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle recoverable write errors

  • location (a Location instance, or None) – a location object used to access output state information

Returns:

a chemfp.FingerprintWriter

chemfp.open_from_string(content: _typing.Content, format: _OptionalStr = 'fps', *, location: _typing.OptionalLocation = None) FingerprintReader

Read fingerprints from a content string containing fingerprints in the given format

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format

  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format

If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.
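The text-versus-bytes rule can be illustrated with a simplified FPS parser. This sketch assumes the basic FPS layout (header lines starting with ‘#’, then one record per line with a hex-encoded fingerprint, a tab, and the id). It handles only “fps” and “fps.gz”, and read_fps_content is a hypothetical helper, not the chemfp reader.

```python
import gzip

def read_fps_content(content, format="fps"):
    """Return a list of (id, fingerprint bytes) from FPS content."""
    if format == "fps.gz":
        if isinstance(content, str):
            raise TypeError("compressed content must be a byte string")
        content = gzip.decompress(content)
    elif format != "fps":
        raise ValueError(f"unsupported format in this sketch: {format!r}")
    if isinstance(content, bytes):
        content = content.decode("utf-8")
    records = []
    for line in content.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header metadata
        hex_fp, _, rest = line.partition("\t")
        records.append((rest.split("\t")[0], bytes.fromhex(hex_fp)))
    return records

fps_text = "#FPS1\n#num_bits=16\n0f00\tid1\nffff\tid2\n"
from_text = read_fps_content(fps_text)
from_gzip = read_fps_content(gzip.compress(fps_text.encode("utf-8")), "fps.gz")
```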

The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.

Parameters:
  • content (byte or text string) – The fingerprint data as a string.

  • format (string) – The file format and optional compression. Unicode strings may not be compressed.

  • location (a Location instance, or None) – a location object used to access parser state information

Returns:

a chemfp.fps_io.FPSReader or chemfp.arena.FingerprintArena

chemfp.rdkit2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'RDKit-Morgan', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, bitFlags: _OptionalInt = None, branchedPaths: _Optional0or1 = None, countBounds: _Optional[list[int]] = None, countSimulation: _Optional0or1 = None, fpSize: _OptionalInt = None, fromAtoms: _Optional[list[int]] = None, includeChirality: _Optional0or1 = None, includeRedundantEnvironments: _Optional0or1 = None, includeRingMembership: _Optional0or1 = None, isQuery: _Optional0or1 = None, isomeric: _Optional0or1 = None, kekulize: _Optional0or1 = None, maxDistance: _OptionalInt = None, maxLength: _OptionalInt = None, maxPath: _OptionalInt = None, minDistance: _OptionalInt = None, minLength: _OptionalInt = None, minPath: _OptionalInt = None, min_radius: _OptionalInt = None, nBitsPerEntry: _OptionalInt = None, nBitsPerHash: _OptionalInt = None, numBitsPerFeature: _OptionalInt = None, onlyShortestPaths: _Optional0or1 = None, radius: _OptionalInt = None, rings: _Optional0or1 = None, targetSize: _OptionalInt = None, torsionAtomCount: _OptionalInt = None, use2D: _Optional0or1 = None, useBondOrder: _Optional0or1 = None, useBondTypes: _Optional0or1 = None, useChirality: _Optional0or1 = None, useFeatures: _Optional0or1 = None, useHs: _Optional0or1 = None)

Use RDKit to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in RDKit- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “morgan”, a chemfp type string like “RDKit-Morgan”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).

Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.
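The overwrite rule amounts to a small guard before any work is done. A minimal sketch, where should_convert is a hypothetical helper and not part of chemfp:

```python
import os
import tempfile

def should_convert(destination, overwrite=True):
    """Return False only for an existing filename when overwrite is false."""
    if isinstance(destination, str) and not overwrite:
        return not os.path.exists(destination)
    return True  # stdout (None), file objects, or overwrite=True

fd, existing = tempfile.mkstemp(suffix=".fps")
os.close(fd)
skip_existing = should_convert(existing, overwrite=False)
replace_existing = should_convert(existing, overwrite=True)
write_stdout = should_convert(None, overwrite=False)
os.remove(existing)
write_new = should_convert(existing, overwrite=False)  # file no longer exists
```

This makes it cheap to re-run a batch of conversions and only generate the fingerprint files that are missing.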

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • type (a FingerprintType or string) – the fingerprint type to use

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • reader_args (a dictionary) – the reader arguments

  • id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • bitFlags (int or None) – the bitFlags (Avalon)

  • branchedPaths (bool or None) – if True, allow branched paths (RDKit-Fingerprint)

  • countBounds (list[int] or None) – a list of count bounds (AtomPair/3, RDKit-Fingerprint/3, RDKit-Morgan/2, RDKit-Torsion/4)

  • countSimulation (bool or None) – if True, use count simulation (AtomPair/3, RDKit-Fingerprint/3, RDKit-Morgan/2, RDKit-Torsion/4)

  • fpSize (int or None) – the number of bits in the fingerprint

  • fromAtoms (list[int] or None) – list of starting atom indices (AtomPair, RDKit-Fingerprint, Morgan, Torsion)

  • includeChirality (bool or None) – if True, include chirality (AtomPair, Morgan/2, Torsion)

  • includeRedundantEnvironments (bool or None) – if True, include the redundant environments (Morgan)

  • includeRingMembership (bool or None) – if True, include ring membership (Morgan/2)

  • isQuery (bool or None) – if True, treat as a query (Avalon)

  • isomeric (bool or None) – if True, use isomeric SMILES (SECFP)

  • kekulize (bool or None) – if True, use the Kekule SMILES (SECFP)

  • maxDistance (int or None) – the maximum distance between pairs (AtomPair/3)

  • maxLength (int or None) – the maximum distance between pairs (AtomPair/2)

  • maxPath (int or None) – the maximum path length (RDKit-Fingerprint)

  • minDistance (int or None) – the minimum distance between pairs (AtomPair/3)

  • minLength (int or None) – the minimum distance between pairs (AtomPair/2)

  • minPath (int or None) – the minimum path length (RDKit-Fingerprint)

  • min_radius (int or None) – the minimum radius (SECFP)

  • nBitsPerEntry (int or None) – the number of bits to set (AtomPair/2, Torsion/3)

  • nBitsPerHash (int or None) – the number of bits to set (RDKit-Fingerprint/2)

  • numBitsPerFeature (int or None) – the number of bits to set (RDKit-Fingerprint/3)

  • onlyShortestPaths (bool or None) – if True, only use shortest possible paths (Torsion/4)

  • radius (int or None) – circular radius (Morgan, SECFP)

  • rings (bool or None) – include ring information (SECFP)

  • targetSize (int or None) – number of atoms to use in the torsion (Torsion/3)

  • torsionAtomCount (int or None) – number of atoms to use in the torsion (Torsion/4)

  • use2D (bool or None) – if True, use 2D distance matrix, if False use first conformer (AtomPair)

  • useBondOrder (bool or None) – include bond order invariants (RDKit-Fingerprint)

  • useBondTypes (bool or None) – include bond type invariants (Morgan)

  • useChirality (bool or None) – include chirality invariants (Morgan/1)

  • useFeatures (bool or None) – if True, use chemical-feature invariants (Morgan)

  • useHs (bool or None) – if True, include information about the number of hydrogens (RDKit-Fingerprint)

Returns:

a ConversionInfo

chemfp.read_molecule_fingerprints(type: str | Metadata, source: str | bytes | Path | None | BinaryIO = None, format: str | None = None, id_tag: str | None = None, reader_args: Dict[str, Any] | None = None, errors: Literal['strict', 'report', 'ignore'] = 'strict') FingerprintReader

Read structures from source and return the corresponding ids and fingerprints

This returns a chemfp.FingerprintReader which can be iterated over to get the id and fingerprint for each read structure record. The fingerprint generated depends on the value of type. Structures are read from source, which can either be the structure filename, or None to read from stdin.

type contains the information about how to turn a structure into a fingerprint. It can be a string or a metadata instance. String values look like OpenBabel-FP2/1, OpenEye-Path, and OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond. Default values are used for unspecified parameters. Use a Metadata instance with type and aromaticity values set in order to pass aromaticity information to OpenEye.
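The layout of a type string (a name with an optional “/version” suffix, followed by space-separated key=value terms) can be taken apart as in the sketch below. The helper parse_type_string is hypothetical. It only splits the string and does not validate names or fill in default parameter values the way chemfp does.

```python
def parse_type_string(type_string):
    """Split a type string into (name, version or None, parameter dict)."""
    terms = type_string.split()
    if not terms:
        raise ValueError("empty type string")
    name, _, version = terms[0].partition("/")
    params = {}
    for term in terms[1:]:
        key, sep, value = term.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {term!r}")
        params[key] = value  # values are kept as strings in this sketch
    return name, (version or None), params

full = parse_type_string(
    "OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond")
short = parse_type_string("OpenEye-Path")
```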

If format is None then the structure file format and compression are determined by the filename’s extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise format may be “smi” or “sdf” optionally followed by “.gz” or “.bz2” to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.

If id_tag is None, then the record id is based on the title field for the given format. If the input format is “sdf” then id_tag specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the “> <ChEBI ID>” line. In that case, use id_tag = "ChEBI ID".
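How an id comes from either the title line or a data tag can be sketched for a single SD record. The record below is made up and only mimics the ChEBI layout described above. get_sdf_id is a hypothetical helper, not how the toolkits read SD files.

```python
def get_sdf_id(record, id_tag=None):
    """Return the title line, or the first line of the named tag's value."""
    lines = record.splitlines()
    if id_tag is None:
        return lines[0].strip() if lines else ""
    header = f"<{id_tag}>"
    for i, line in enumerate(lines):
        if line.startswith(">") and header in line:
            # Only the first line of a multi-line value is used.
            return lines[i + 1].strip() if i + 1 < len(lines) else ""
    return None  # tag not present in this record

record = "\n".join([
    "",                       # empty title line, as in the ChEBI files
    "  made-up program line",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <ChEBI ID>",
    "CHEBI:X1",               # placeholder id, not a real ChEBI entry
    "second line of the value",
    "",
    "$$$$",
])
title_id = get_sdf_id(record)
tag_id = get_sdf_id(record, id_tag="ChEBI ID")
```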

The reader_args is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.

errors specifies how to handle errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.

Here is an example of using fingerprints generated from a structure file:

import chemfp
from chemfp.bitops import hex_encode
fp_reader = chemfp.read_molecule_fingerprints(
    "OpenBabel-FP4/1", "example.sdf.gz")
print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
for (id, fp) in fp_reader:
    print(id, hex_encode(fp))

See also chemfp.read_molecule_fingerprints_from_string().

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint

  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.

  • format (string, or None to autodetect based on the source) – The file format and optional compression. Examples: “smi” and “sdf.gz”

  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Only valid for SD files. Example: “ChEBI ID”.

  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

Returns:

a chemfp.FingerprintReader

chemfp.read_molecule_fingerprints_from_string(type: str | Metadata, content: str | bytes, format: str, *, id_tag: str | None = None, reader_args: Dict[str, Any] | None = None, errors: Literal['strict', 'report', 'ignore'] = 'strict') FingerprintReader

Read structures from the content string and return the corresponding ids and fingerprints

The parameters are identical to chemfp.read_molecule_fingerprints() except that the entire content is passed through as a content string, rather than as a source filename. See that function for details.

You must specify the format! As there is no source filename, it’s not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint

  • content (string) – The structure data as a string.

  • format (string) – The file format and optional compression. Examples: “smi” and “sdf.gz”

  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.

  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

Returns:

a chemfp.FingerprintReader

chemfp.sdf2fps(source: str | bytes | Path | None | BinaryIO, destination: str | bytes | Path | None | BinaryIO, *, id_tag: str | None = None, fp_tag: str | None = None, input_format: str | None = None, output_format: str | None = None, metadata: Metadata | None = None, pubchem: bool = False, decoder: None | str | Callable[[str], tuple[int, bytes]] = None, errors: Literal['strict', 'report', 'ignore'] = 'report', id_prefix: str | None = None, id_template: str | None = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: str | None = None, max_spool_size: int | None = None, progress: bool | float | int | None | Callable = True)

Extract and save fingerprints from tag data in an SD file

Use source to specify the input, which may be None for stdin, a file-like object, a filename, or a list of filenames. If input_format is not specified then the filename extension (if available) is used to determine the compression type, defaulting to uncompressed. Possible values for input_format include “sdf”, “sdf.gz”, and “sdf.zst”.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

The id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier. The fp_tag specifies the tag containing the encoded fingerprint. The decoder describes how to decode the fingerprints. It may be one of “binary”, “binary-msb”, “hex”, “hex-lsb”, “hex-msb”, “base64”, “cactvs”, or “daylight”, or a callable object which takes the fingerprint string and returns the (number of bits, fingerprint byte string), or raises a ValueError on failures.

If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.

There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
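The cleanup rule and the template substitutions can be approximated with Python's `str.format`. This is only a sketch: `clean_id` and `make_id` are hypothetical helpers, the removal of carriage returns is an assumption, and how chemfp distinguishes {i} from {recno} is not shown here, so both are passed in explicitly.

```python
def clean_id(raw_id):
    """Keep the id up to the first newline, drop tab/NUL characters, strip spaces."""
    first_line = raw_id.split("\n", 1)[0]
    # Removing "\r" is an assumption made for this sketch.
    for ch in ("\r", "\t", "\0"):
        first_line = first_line.replace(ch, "")
    return first_line.strip(" ")

def make_id(template, raw_id, index, recno):
    """Fill in an id template using the substitution names listed above."""
    first_line = raw_id.split("\n", 1)[0]
    words = first_line.split()
    return template.format(
        i=index + 1,          # index starting from 1
        i0=index,             # index starting from 0
        recno=recno,          # the current record number
        id=raw_id,            # the original id
        clean_id=clean_id(raw_id),
        first_word=words[0] if words else "",
        first_line=first_line,
    )

print(make_id("CMPD-{i}-{first_word}", "abc def\nsecond line", 0, 1))
```

An id_prefix is the simpler case: it amounts to `id_prefix + id`.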

Handle record processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.

If metadata is not None then it is used to generate the metadata output in the output file.

If pubchem is true and metadata is None, then a new Metadata will be used, with software as “CACTVS/unknown”, type as “CACTVS-E_SCREEN/1.0 extended=2”, num_bits as 881, and sources containing any source terms which are filenames.

The pubchem option also sets fp_tag to “PUBCHEM_CACTVS_SUBSKEYS” and decoder to “cactvs”, but only if those values aren’t otherwise specified.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the SDF processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

Parameters:
  • source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the SDF structures

  • destination (a filename, file object, or None for stdout) – the output for the fingerprints

  • id_tag (str or None) – use the named data item for the id, otherwise use the title

  • fp_tag (str or None) – use the named data item for the fingerprint, otherwise use the title

  • input_format (a string or None) – if specified, the source file format

  • output_format (a string or None) – if specified, the destination file format

  • metadata (a Metadata) – the metadata to use for the output

  • pubchem (bool) – if True, configure for processing a PubChem file

  • decoder (None, str, or Callable[bytes]->(int, bytes)) – a decoder name or callable to convert the fingerprint to a (num_bits, binary_fp) tuple

  • errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors

  • id_prefix (a string or None) – a string prepended to each id to create a new id

  • id_template (a string or None) – a string template used to create a new id

  • id_cleanup (bool) – if True, post-process the id to handle special characters

  • overwrite (bool) – if False do not process if the output file exists

  • reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount

  • tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files

  • max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a ConversionInfo

chemfp.set_default_progressbar(progressbar: bool | Callable | None)

Configure the default progress bar

This must be an object implementing the tqdm class behavior or one of the following values:

  • False - do not use a progress bar

  • None or True - use the default progress bar

(False is mapped to the internal “disabled_tqdm” object.)

chemfp.set_num_threads(num_threads: int)

Specify the default number of OpenMP threads that chemfp should use

Several chemfp functions are parallelized using OpenMP, and support a num_threads parameter to specify the number of OpenMP threads to use. If num_threads is -1 (the default) then chemfp uses the value of get_num_threads() to get the actual number to use.

The set_num_threads function changes the default chemfp value to the specified value, if positive.

Use -1 to set the default number to the value returned by get_omp_num_threads(), which is also chemfp’s initial value.

Otherwise, if the value is 1 or smaller then chemfp’s default number of threads is set to 1.

Parameters:

num_threads (int) – the new number of OpenMP threads to use
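One way to read the rules above is as a small mapping from the requested value to the new default. The function below is a hypothetical stand-in, not chemfp code, and `omp_default` represents whatever OpenMP reports on the current machine.

```python
def new_default_num_threads(num_threads, omp_default):
    """Return the default thread count that would result from the request."""
    if num_threads == -1:
        # -1 restores the OpenMP-derived default (chemfp's initial value).
        return omp_default
    if num_threads >= 1:
        # A positive value is used as given.
        return num_threads
    # Anything else (0, -2, ...) falls back to a single thread.
    return 1

for requested in (4, -1, 0, -5):
    print(requested, "->", new_default_num_threads(requested, omp_default=8))
```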

chemfp.simarray(*, query_fp: _Optional[bytes] = None, queries: _Optional[_typing.ExplicitSourceOrArena] = None, query_format: _OptionalStr = None, targets: _typing.ExplicitSourceOrArena, target_format: _OptionalStr = None, metric: _typing.MetricNames = 'Tanimoto', as_distance: bool = False, include_lower_triangle: bool = True, out: _typing.OptionalNumPyArray = None, dtype: _typing.SimarrayDType = None, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1) chemfp.highlevel.simarray.SimarrayResult

High-level API to generate a NumPy array containing the all-by-all comparisons

If targets is specified (and query_fp and queries are not) then generate the full NxN comparison matrix for all the fingerprints in targets. Set include_lower_triangle to False to leave the lower triangle as zeros (this is slightly faster than computing the full matrix).

If queries and targets are specified then generate the full NxM comparison matrix between all N fingerprints in the queries with the M fingerprints in the targets.

If query_fp and targets are specified then generate a vector of length N containing the comparison values between the query fingerprint (a byte string) and all N target fingerprints.

The number of fingerprint bits must be 2**15 or smaller. The Dice similarity does not support 2**15 fingerprint bits.

If queries or targets is a filename or file-like object then this function will use load_fingerprints() with the given query_format or target_format to read the file into a chemfp fingerprint arena. The fingerprint order will be preserved.

The standard metrics (specified by metric), and their supported data types (specified by dtype with the default dtype listed first) are:

“Tanimoto” = popcount(fp1 & fp2) / popcount(fp1 | fp2)

dtypes = [float64, float32, rational64, rational32, uint16]

“Dice” = 2 * popcount(fp1 & fp2) / (popcount(fp1) + popcount(fp2))

dtypes = [float64, float32, rational64, rational32, uint16]

“cosine” = popcount(fp1 & fp2) / sqrt(popcount(fp1) * popcount(fp2))

dtypes = [float64, float32, uint16]

“Hamming” = popcount(fp1 ^ fp2)

dtypes = [uint16]

The rational64 and rational32 dtypes are structured NumPy dtypes which store the numerator and denominator terms as two uint32 fields and two uint16 fields, respectively. These are not necessarily in reduced form (e.g., it may store (2, 4) instead of (1, 2)).

The Tanimoto, Dice, and cosine “uint16” similarity scores are computed as floor(65535 * double_score), so 65535 means identity.

If as_distance is True then the Tanimoto, Dice, and cosine similarity scores are turned into a distance by computing 1-score.
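For readers who want to check individual values by hand, the metrics can be written out in plain Python over byte strings. This is a sketch for small inputs, not how chemfp computes them. The helper names are made up, cosine uses the conventional definition with a square root in the denominator, and returning 0.0 when the denominator is zero is an assumption of this sketch.

```python
import math

def popcount(fp):
    """Number of on-bits in a fingerprint given as a byte string."""
    return bin(int.from_bytes(fp, "little")).count("1")

def _and(fp1, fp2):
    return bytes(a & b for a, b in zip(fp1, fp2))

def _or(fp1, fp2):
    return bytes(a | b for a, b in zip(fp1, fp2))

def _xor(fp1, fp2):
    return bytes(a ^ b for a, b in zip(fp1, fp2))

def tanimoto(fp1, fp2):
    union = popcount(_or(fp1, fp2))
    return popcount(_and(fp1, fp2)) / union if union else 0.0

def dice(fp1, fp2):
    total = popcount(fp1) + popcount(fp2)
    return 2 * popcount(_and(fp1, fp2)) / total if total else 0.0

def cosine(fp1, fp2):
    product = popcount(fp1) * popcount(fp2)
    return popcount(_and(fp1, fp2)) / math.sqrt(product) if product else 0.0

def hamming(fp1, fp2):
    return popcount(_xor(fp1, fp2))

def to_uint16(score):
    """Scale a similarity in [0, 1] to the uint16 range, as floor(65535 * score)."""
    return math.floor(65535 * score)

def as_distance(score):
    return 1.0 - score

fp1 = bytes([0b00001111])
fp2 = bytes([0b00000011])
print(tanimoto(fp1, fp2), dice(fp1, fp2), cosine(fp1, fp2), hamming(fp1, fp2))
print(to_uint16(tanimoto(fp1, fp2)), as_distance(tanimoto(fp1, fp2)))
```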

The “Sheffield”, “Willett”, and “Daylight” metrics store their results in the “abcd” dtype, which is a 4-element structured NumPy dtype with uint16 fields “a”, “b”, “c”, and “d”. The metric name specifies which convention to use:

Sheffield:
  • “a” = popcount(fp1 & fp2) = the number of on-bits in common

  • “b” = the number of on-bits in fp1 which are off-bits in fp2

  • “c” = the number of on-bits in fp2 which are off-bits in fp1

  • “d” = the number of off-bits in fp1 which are also off-bits in fp2

Willett:
  • “a” = popcount(fp1) = the number of on-bits in the first fingerprint

  • “b” = popcount(fp2) = the number of on-bits in the second fingerprint

  • “c” = popcount(fp1 & fp2) = the number of on-bits in common

  • “d” = the number of off-bits in fp1 which are also off-bits in fp2

Daylight (same as Sheffield with “a” and “c” swapped):
  • “a” = the number of on-bits in fp2 which are off-bits in fp1

  • “b” = the number of on-bits in fp1 which are off-bits in fp2

  • “c” = popcount(fp1 & fp2) = the number of on-bits in common

  • “d” = the number of off-bits in fp1 which are also off-bits in fp2
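The three conventions differ only in how the same four counts are arranged. The sketch below computes them for byte-string fingerprints; the function `abcd` and its `num_bits` argument are illustrative names, not chemfp API.

```python
def popcount_int(value):
    return bin(value).count("1")

def abcd(fp1, fp2, num_bits, convention="Sheffield"):
    """Return the (a, b, c, d) tuple for the named convention."""
    x = int.from_bytes(fp1, "little")
    y = int.from_bytes(fp2, "little")
    common = popcount_int(x & y)           # on in both
    only1 = popcount_int(x & ~y)           # on in fp1, off in fp2
    only2 = popcount_int(y & ~x)           # on in fp2, off in fp1
    neither = num_bits - popcount_int(x | y)  # off in both
    if convention == "Sheffield":
        return (common, only1, only2, neither)
    if convention == "Willett":
        return (popcount_int(x), popcount_int(y), common, neither)
    if convention == "Daylight":
        return (only2, only1, common, neither)
    raise ValueError(f"unknown convention: {convention!r}")

fp1 = bytes([0b00001111])
fp2 = bytes([0b00000011])
for name in ("Sheffield", "Willett", "Daylight"):
    print(name, abcd(fp1, fp2, num_bits=8, convention=name))
```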

If out is None then this function creates a zeroed NumPy array to store the scores using the specified metric and dtype.

Otherwise, out must be a NumPy array (or view), with a dtype which is appropriate to the specified metric. (Note: only the field types must match, not the field names.) The number of rows and columns must be large enough for the number of query and target fingerprints.

By default this will display progress bars while loading files and generating the array. Use progress=False to disable them, or a floating point value to not display a progress bar until the specified number of seconds.

Use num_threads to specify the number of threads to use. The default value of -1 means to use the value returned by get_num_threads().

This returns a SimarrayResult instance which can be used to access the query and target arenas, the output array, and any metadata, or to save the result in “npy” format.

Parameters:
  • query_fp (a byte string or None) – a query fingerprint for 1xN search

  • queries (None, a filename, file object, or a FingerprintArena) – the query fingerprints

  • query_format (str or None) – the file format for the queries file

  • targets (None, a filename, file object, or a FingerprintArena) – the target fingerprints

  • target_format (str or None) – the file format for the target file

  • metric (str) – the name of the metric to use

  • as_distance (bool) – if True, use a distance instead of a similarity

  • include_lower_triangle (bool) – if True, also compute the lower triangle

  • out (None or a NumPy array) – the NumPy array or view in which to save the results

  • dtype (None, str, or a NumPy dtype) – the specific data type to compute

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • num_threads (int) – the number of threads to use, or -1 for the default

Returns:

a SimarrayResult

chemfp.simsearch(*, query: _OptionalStr = None, query_fp: _Optional[bytes] = None, query_id: _OptionalStr = None, queries: _Optional[_typing.ExplicitSourceOrArena] = None, query_format: _OptionalStr = None, type: _Optional[_typing.FingerprintTypeOrStr] = None, targets: _typing.Source, target_format: _OptionalStr = None, NxN: bool = False, k: _OptionalInt = None, threshold: _Optional[float] = None, alpha: _Optional[float] = None, beta: _Optional[float] = None, include_lower_triangle: bool = True, ordering: _Optional[_typing.OrderingNames] = None, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1)

High-level API for similarity searches in targets.

Several different search types are supported:

The function returns a BaseSimsearch instance with information about what happened. Its out attribute stores the SearchResult or SearchResults.

If queries or targets is not a fingerprint arena then use load_fingerprints() to load the arena. Use query_format or target_format to specify the format type.

If k is not None then do a k-nearest search, otherwise do a threshold search. If threshold is None then the threshold is 0.0. If both are None then the defaults are k=3, threshold=0.0.

If alpha = beta = None or 1.0 then use a Tanimoto search, otherwise do a Tversky search with the given values of alpha and beta. If beta is None then beta is set to alpha.
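The defaulting rules for k, threshold, alpha, and beta can be summarized as a small decision function. This is one reading of the rules, not chemfp's code: it assumes that an unspecified threshold falls back to 0.0, that an unspecified beta falls back to alpha, and (a further assumption) that an unspecified alpha falls back to 1.0. The name `resolve_search_parameters` is hypothetical.

```python
def resolve_search_parameters(k=None, threshold=None, alpha=None, beta=None):
    """Return the effective search settings as a dictionary."""
    if k is None and threshold is None:
        # Nothing specified: default to a 3-nearest search.
        k, threshold = 3, 0.0
    elif threshold is None:
        threshold = 0.0
    search_type = "k-nearest" if k is not None else "threshold"

    if alpha is None:
        alpha = 1.0  # assumption for this sketch
    if beta is None:
        beta = alpha
    metric = "Tanimoto" if alpha == beta == 1.0 else "Tversky"
    return {"search_type": search_type, "k": k, "threshold": threshold,
            "metric": metric, "alpha": alpha, "beta": beta}

print(resolve_search_parameters())
print(resolve_search_parameters(threshold=0.7, alpha=0.5))
```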

For NxN threshold search, if include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When False, only compute the upper triangle.
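The upper-triangle strategy works because the Tanimoto score is symmetric. A naive pure-Python sketch, with fingerprints represented as Python integers and hypothetical function names, might look like this:

```python
def tanimoto_int(x, y):
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0

def nxn_threshold_search(fps, threshold, include_lower_triangle=True):
    """Return, for each fingerprint index, a list of (other_index, score) hits."""
    n = len(fps)
    hits = [[] for _ in range(n)]
    for i in range(n):
        # Only compare each pair once (j > i); a fingerprint never matches itself.
        for j in range(i + 1, n):
            score = tanimoto_int(fps[i], fps[j])
            if score >= threshold:
                hits[i].append((j, score))
                if include_lower_triangle:
                    # Mirror the hit to get the full symmetric result.
                    hits[j].append((i, score))
    return hits

fps = [0b1111, 0b0111, 0b1000]
print(nxn_threshold_search(fps, 0.7))
print(nxn_threshold_search(fps, 0.7, include_lower_triangle=False))
```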

If ordering is not None then the hits will be reordered as specified. The available orderings are:

  • increasing-score - sort by increasing score

  • decreasing-score - sort by decreasing score

  • increasing-score-plus - sort by increasing score, break ties by increasing index

  • decreasing-score-plus - sort by decreasing score, break ties by increasing index

  • increasing-index - sort by increasing target index

  • decreasing-index - sort by decreasing target index

  • move-closest-first - move the hit with the highest score to the first position

  • reverse - reverse the current ordering
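The orderings listed above can be mimicked on a plain list of (target index, score) pairs. The function `reorder_hits` is a hypothetical illustration; in particular, treating move-closest-first as "pull the best hit to the front and leave the rest in place" is an assumption about what happens to the other hits.

```python
def reorder_hits(hits, ordering):
    """Return a reordered copy of a list of (index, score) pairs."""
    hits = list(hits)
    if ordering == "increasing-score":
        hits.sort(key=lambda h: h[1])
    elif ordering == "decreasing-score":
        hits.sort(key=lambda h: h[1], reverse=True)
    elif ordering == "increasing-score-plus":
        hits.sort(key=lambda h: (h[1], h[0]))
    elif ordering == "decreasing-score-plus":
        # Negate the score so ties are still broken by increasing index.
        hits.sort(key=lambda h: (-h[1], h[0]))
    elif ordering == "increasing-index":
        hits.sort(key=lambda h: h[0])
    elif ordering == "decreasing-index":
        hits.sort(key=lambda h: h[0], reverse=True)
    elif ordering == "move-closest-first":
        if hits:
            best = max(range(len(hits)), key=lambda i: hits[i][1])
            hits.insert(0, hits.pop(best))
    elif ordering == "reverse":
        hits.reverse()
    else:
        raise ValueError(f"unknown ordering: {ordering!r}")
    return hits

hits = [(5, 0.8), (2, 0.9), (7, 0.8)]
print(reorder_hits(hits, "decreasing-score-plus"))
print(reorder_hits(hits, "move-closest-first"))
```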

If progress is True then use a progress bar to show FPS load progress, and NxN and NxM search progress. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.

Use num_threads to specify the number of threads to use. The default of -1 means to use the value of chemfp.get_num_threads(), otherwise it must be a positive integer.

Parameters:
  • query (a string or None) – a query structure record

  • query_fp (a byte string or None) – a query fingerprint

  • query_id (str or None) – use the corresponding targets fingerprint as a query fingerprint

  • queries (a filename, file object, or FingerprintArena) – the query fingerprints

  • query_format (str or None) – the file format for the query file

  • type (a string, FingerprintType, or None) – the fingerprint type used to convert a query to a fingerprint

  • targets (a filename, file object, or FingerprintArena, or None) – the target fingerprints

  • target_format (str or None) – the file format for the target file

  • NxN (bool) – if True, use the targets to search itself

  • k (int or None) – the number of nearest neighbors to find

  • threshold (float or None) – the minimum similarity threshold

  • alpha (float) – the Tversky alpha value

  • beta (float) – the Tversky beta value

  • include_lower_triangle (bool) – if True and an NxN search, also include the lower triangle

  • ordering (str or None) – the expected output ordering

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

  • num_threads (int) – the number of threads to use, or -1 for the default

Returns:

a SingleQuerySimsearch, MultiQuerySimsearch, or NxNSimsearch

chemfp.spherex(candidates: _typing.SourceOrArena, *, references: _Optional[_typing.ExplicitSourceOrArena] = None, initial_picks: _Optional[_typing.Union[int, str, list[int], list[str]]] = None, candidates_format: _OptionalStr = None, references_format: _OptionalStr = None, num_picks: int = 1000, threshold: float = 0.4, ranks: _Optional[int] = None, dise: bool = False, dise_type: _Optional[_typing.FingerprintTypeOrStr] = None, dise_references: _Optional[_typing.ExplicitSource] = None, dise_references_format: _OptionalStr = None, randomize: _typing.Literal[None, True, False] = None, seed: int = -1, num_threads: _typing.NumThreadsType = -1, include_counts: bool = False, include_neighbors: bool = False, progress: _typing.ProgressbarOrBool = True)

Use sphere picking to select diverse fingerprints from candidates

Sphere picking iteratively picks a fingerprint from a set of candidates such that its similarity to every previously picked fingerprint is below threshold. The process is repeated until num_picks fingerprints are selected or no pickable fingerprints are available.

Several variations of “picks a fingerprint” are supported. If directed sphere exclusion is NOT used, then:

1) The default (randomize = None), or if randomize = True, select the next available candidate at random.

2) If randomize = False, select the next candidate which has the smallest index in the arena. This biases the picks towards fingerprints with fewer bits set, which are likely fingerprints with lower complexity. It doesn’t appear to be that useful.
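The basic (non-directed) picking loop can be sketched in a few lines of plain Python. This is a naive illustration with fingerprints as Python integers, not chemfp's implementation; the function names are made up, and unlike an arena the input list here is not sorted by popcount.

```python
import random

def tanimoto_int(x, y):
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0

def sphere_exclusion(fps, threshold=0.4, num_picks=1000, randomize=True, seed=None):
    """Return the indices of the picked fingerprints, in pick order."""
    rng = random.Random(seed)
    available = list(range(len(fps)))
    picks = []
    while available and len(picks) < num_picks:
        # Either pick at random or take the smallest remaining index.
        pick = rng.choice(available) if randomize else available[0]
        picks.append(pick)
        # Drop the pick and everything inside its sphere (similarity >= threshold).
        available = [i for i in available
                     if i != pick and tanimoto_int(fps[pick], fps[i]) < threshold]
    return picks

fps = [0b1111, 0b0111, 0b110000, 0b100000]
print(sphere_exclusion(fps, threshold=0.4, randomize=False))
```

With randomize=False the first fingerprint is picked, which excludes the second (similarity 0.75); the third is picked next, which excludes the fourth (similarity 0.5).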

Directed sphere exclusion (see the DISE paper by Gobbi and Lee), requires a rank for each fingerprint. The next pick is chosen from one of the fingerprints with the smallest rank. There are three ways to specify the ranks:

A) They can be passed in directly as the ranks array, which must be a list of integers between 0 and 2**64-1.

B) If dise is True then the structures from the DISE paper are used. This requires a chemistry toolkit to generate the reference fingerprints. Use dise_type to specify the fingerprint type to use instead of the one from the candidates.

C) The reference fingerprints for the DISE algorithm may be passed as dise_references. This may be an arena or a fingerprint filename. Use dise_references_format to specify the file format instead of using the extension.

If initial ranks are specified, then there are two additional ways to pick a fingerprint:

3) The default (randomize = None), or if randomize = False, selects the candidate with the smallest rank, breaking ties by selecting the candidate with the smallest index in the arena.

4) If randomize = True, select randomly from all of the candidates with the smallest rank. NOTE: this method uses a linear search, which may cause quadratic behavior if many fingerprints have the same rank.
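The two rank-based selection rules can be illustrated with a small helper. `next_pick` is a hypothetical function, not part of chemfp; the linear scan over the tied candidates mirrors the caveat about quadratic behavior.

```python
import random

def next_pick(available, ranks, randomize=False, rng=random):
    """Choose the next candidate index from `available` using its rank."""
    best_rank = min(ranks[i] for i in available)
    # Linear scan to collect every candidate tied for the smallest rank.
    tied = [i for i in available if ranks[i] == best_rank]
    if randomize:
        return rng.choice(tied)
    # Deterministic: break ties with the smallest index.
    return min(tied)

ranks = [5, 2, 2, 9]
print(next_pick([0, 1, 2, 3], ranks))
print(next_pick([0, 1, 2, 3], ranks, randomize=True, rng=random.Random(0)))
```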

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If references is not None then any candidate fingerprints which are at least threshold similar to the reference fingerprints are removed before picking starts. If references is not a FingerprintArena then the value is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.

If references is not specified then optionally use initial_picks to specify the initial picks. This may be a candidate id string or integer index into the candidate array, or a list of id strings or integer indices. The list may be in any order and may contain duplicates. (The neighbor sphere will be empty for any duplicates.)

Initial picks are not necessary. If initial_picks is None then the specified picking method is used.

Some of the pick methods use a random number generator, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

Sphere picking in the candidates may be multi-threaded. The default num_threads of -1 uses chemfp.get_num_threads() threads, which depends on the number of CPU cores in your system and is likely too small. My tests suggest 30 threads or higher is more effective. The values of 0 and 1 both mean single-threaded.

The function returns a BaseSpherexSearch object with processing information. The picker attribute is the SphereExclusionPicker used. By default the result element is a SpherexSearch instance. If include_counts is true then it is the SpherexCountSearch returned from calling the picker’s pick_n_with_counts(). If include_neighbors is True then the result is the SpherexNeighborSearch returned from calling pick_n_with_neighbors(). include_counts and include_neighbors cannot both be true.

If progress is True then a progress bar will be used to show any FPS file load progress. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar. The sphere picker search does not currently support progress bars.

Parameters:
  • candidates (filename or FingerprintArena) – the candidate fingerprints for sphere picking

  • references (None, filename or FingerprintArena) – candidates must not be near these reference fingerprints

  • initial_picks (None, str, int, list[str], or list[int]) – the initial sphere centers, as indices or ids

  • candidates_format (str or None) – format for the candidates filename

  • references_format (str or None) – format for the references filename

  • num_picks (int) – the number of picks to pick

  • threshold (float) – the maximum sphere exclusion similarity threshold

  • ranks (list[int] or None) – ranking values for the candidates, lowest ranks picked first

  • dise (bool) – if True, use directed sphere exclusion

  • dise_type (None, a string, or a FingerprintType) – specify the fingerprint type to convert the DISE reference structures to fingerprints

  • dise_references (a filename or file object) – the DISE structure or fingerprint source

  • dise_references_format (str) – the structure or fingerprint format

  • randomize (None or bool) – specify how to select the next candidate

  • seed (int) – specify the initial RNG seed, or -1 to have Python generate the seed

  • num_threads (int) – the number of threads to use, or -1 for the default

  • include_counts (bool) – if True, return a SpherexCountSearch with sphere counts

  • include_neighbors (bool) – if True, return a SpherexNeighborSearch with sphere neighbors

  • progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor

Returns:

a SpherexSearch, SpherexCountSearch, or SpherexNeighborSearch

Find all targets within threshold of each query term

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.8):
    print(f"{query_id} has {len(hits)} neighbors with at least 0.8 similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
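The batching idea can be shown with `itertools.islice`. The generator `batches` below is a hypothetical helper that only demonstrates the chunking; it does not build chemfp arenas.

```python
from itertools import islice

def batches(iterable, arena_size):
    """Yield lists of up to arena_size items; None means one single batch."""
    it = iter(iterable)
    if arena_size is None:
        batch = list(it)
        if batch:
            yield batch
        return
    while True:
        batch = list(islice(it, arena_size))
        if not batch:
            return
        yield batch

print(list(batches(range(5), 2)))
print(list(batches(range(5), None)))
```

Smaller batches start producing results sooner and hold fewer queries in memory at once, at the cost of more per-batch overhead.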

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tanimoto_search_fp() or chemfp.search.threshold_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (positive integer, or None) – The number of queries to process in a batch

Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

chemfp.threshold_tanimoto_search_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7) _typing.IdAndSearchResultIter

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The SearchResult hit order is arbitrary.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75):
    print(f"{fp_id} has {len(hits)} neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print(f"   {other_id}  {score:.2f}")

You may also be interested in the chemfp.search.threshold_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

Find all targets within threshold of each query term

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tversky_search(
           queries, targets, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tversky_search_fp() or chemfp.search.threshold_tversky_search_arena().
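The example above passes alpha = beta = 0.5 and describes the result as Dice similarity. The sketch below uses the standard Tversky index on Python integers to show why: alpha = beta = 1.0 reproduces Tanimoto and alpha = beta = 0.5 reproduces Dice. The function is illustrative only, and returning 0.0 for a zero denominator is an assumption of this sketch.

```python
def popcount_int(value):
    return bin(value).count("1")

def tversky(x, y, alpha=1.0, beta=1.0):
    """Standard Tversky index for two fingerprints given as integers."""
    common = popcount_int(x & y)
    only_x = popcount_int(x) - common   # on in x but not in y
    only_y = popcount_int(y) - common   # on in y but not in x
    denominator = alpha * only_x + beta * only_y + common
    return common / denominator if denominator else 0.0

x, y = 0b1111, 0b0011
tanimoto_score = popcount_int(x & y) / popcount_int(x | y)
dice_score = 2 * popcount_int(x & y) / (popcount_int(x) + popcount_int(y))
print(tversky(x, y, 1.0, 1.0), tanimoto_score)
print(tversky(x, y, 0.5, 0.5), dice_score)
```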

Parameters:
  • queries (any fingerprint container) – The query fingerprints.

  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

  • arena_size (positive integer, or None) – The number of queries to process in a batch

Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

chemfp.threshold_tversky_search_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) _typing.IdAndSearchResultIter

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The SearchResult hit order is arbitrary.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric(
           arena, threshold=0.75, alpha=0.5, beta=0.5):
    print(f"{fp_id} has {len(hits)} Dice neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print(f"   {other_id}  {score:.2f}")

You may also be interested in the chemfp.search.threshold_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.

  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.

Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint