Top-level API¶
The following functions and classes are in the top-level chemfp module. See Getting started with the API for examples.
- chemfp.cdk¶
This is a special object which forwards any use to chemfp.cdk_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.cdk, like the following:
    import chemfp
    fp = chemfp.cdk.pubchem.from_smiles("CCO")
Please do not import “cdk” directly into your module as you are likely to get confused with CDK’s own “cdk” module. Instead, use one of the following:
    from chemfp import cdk_toolkit
    from chemfp import cdk_toolkit as T
- chemfp.openeye¶
This is a special object which forwards any use to chemfp.openeye_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.openeye, like the following:
    import chemfp
    fp = chemfp.openeye.circular.from_smiles("CCO")
Please do not import “openeye” directly into your module as you are likely to get confused with OpenEye’s own “openeye” module. Instead, use one of the following:
    from chemfp import openeye_toolkit
    from chemfp import openeye_toolkit as T
- chemfp.openbabel¶
This is a special object which forwards any use to chemfp.openbabel_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.openbabel, like the following:
    import chemfp
    fp = chemfp.openbabel.fp2.from_smiles("CCO")
Please do not import “openbabel” directly into your module as you are likely to get confused with Open Babel’s own “openbabel” modules. Instead, use one of the following:
    from chemfp import openbabel_toolkit
    from chemfp import openbabel_toolkit as T
- chemfp.rdkit¶
This is a special object which forwards any use to chemfp.rdkit_toolkit. It imports the underlying toolkit module as needed, so it may raise an ImportError. It is designed to be used as chemfp.rdkit, like the following:
    import chemfp
    fp = chemfp.rdkit.morgan(fpSize=128).from_smiles("CCO")
Please do not import “rdkit” directly into your module as you are likely to get confused with RDKit’s own “rdkit” module. Instead, use one of the following:
    from chemfp import rdkit_toolkit
    from chemfp import rdkit_toolkit as T
- chemfp.__version__¶
A string describing this version of chemfp. For example, “4.2”.
- chemfp.__version_info__¶
A 3-element tuple of integers containing the (major version, minor version, micro version) of this version of chemfp. For example, (4, 2, 0).
- chemfp.SOFTWARE¶
The value of the string used in output file metadata to describe this version of chemfp. For example, “chemfp/4.2 (base license)”.
- exception chemfp.ChemFPError¶
Bases:
Exception
Base class for all of the chemfp exceptions
- exception chemfp.ChemFPProblem(severity: Literal['info', 'warning', 'error'], category: str, description: str)¶
Bases:
ChemFPError
Information about a compatibility problem between a query and target.
Instances are generated by chemfp.check_fingerprint_problems() and chemfp.check_metadata_problems().
The public attributes are:
- severity: str¶
One of “info”, “warning”, or “error”.
- error_level: int¶
5 for “info”, 10 for “warning”, and 20 for “error”
- category: str¶
A category name. This string will not change over time.
- The current category names are:
“num_bits mismatch” (error)
“num_bytes mismatch” (error)
“type mismatch” (warning)
“aromaticity mismatch” (info)
“software mismatch” (info)
- description: str¶
A more detailed description of the error, including details of the mismatch. The description depends on query_name and target_name and may change over time.
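The numeric error_level makes it easy to keep only the problems at or above a chosen severity. The following is a minimal sketch of that idea. It does not use chemfp: `Problem` and `at_least` are hypothetical stand-ins that carry the same public attributes as described above, and the descriptions are placeholder text.

```python
from dataclasses import dataclass

# Levels as documented above: 5 for "info", 10 for "warning", 20 for "error".
ERROR_LEVELS = {"info": 5, "warning": 10, "error": 20}


@dataclass(frozen=True)
class Problem:
    """Hypothetical stand-in with the same public attributes as ChemFPProblem."""
    severity: str
    category: str
    description: str

    @property
    def error_level(self) -> int:
        return ERROR_LEVELS[self.severity]


def at_least(problems, severity):
    """Return the problems whose error_level is at or above the given severity."""
    cutoff = ERROR_LEVELS[severity]
    return [p for p in problems if p.error_level >= cutoff]


problems = [
    Problem("info", "software mismatch", "placeholder text"),
    Problem("warning", "type mismatch", "placeholder text"),
    Problem("error", "num_bits mismatch", "placeholder text"),
]
serious = at_least(problems, "warning")
print([p.category for p in serious])
```

Filtering on category rather than description is the safer choice, since the category strings are documented as stable while the description may change.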
- exception chemfp.EncodingError¶
Bases:
ChemFPError, ValueError
Exception raised when the encoding or the encoding_error is unsupported or unknown
- class chemfp.FingerprintIterator(metadata: Metadata, id_fp_iterator: _typing.IdAndFingerprintIter, location: _typing.OptionalLocation = None, close: _Optional[_typing.CloseType] = None)¶
Bases:
FingerprintReader
A chemfp.FingerprintReader for an iterator of (id, fingerprint) pairs
This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.
The attributes are:
metadata - a Metadata describing the fingerprints
location - a Location describing file processing
closed - False if the underlying file is open, otherwise True
A FingerprintIterator is a context manager which will close the underlying iterator if it’s given a close handler.
Like all iterators you can use next() to get the next (id, fingerprint) pair.
- close() None ¶
Close the iterator.
The call will be forwarded to the close callable passed to the constructor. If that close is None then this does nothing.
- class chemfp.FingerprintReader(metadata: Metadata)¶
Bases:
object
Base class for all chemfp objects holding fingerprint records
All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.
- get_fingerprint_type() _typing.FingerprintType ¶
Get the fingerprint type object based on the metadata’s type field
This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.
This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.
- Returns:
- iter_arenas(arena_size: _OptionalInt = 1000) _typing.FingerprintArenaIterator ¶
Iterate through arena_size fingerprints at a time, as subarenas
Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.
This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.
If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.
- Parameters:
arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
- Returns:
an iterator of
chemfp.arena.FingerprintArena
instances
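The batching behaviour can be pictured with plain Python lists in place of arenas. This is only a sketch of the chunking logic, assuming nothing about chemfp's internals; `iter_batches` is a hypothetical helper name, and each batch here is a list of (id, fingerprint) pairs rather than a FingerprintArena.

```python
from itertools import islice


def iter_batches(id_fp_pairs, arena_size=1000):
    """Yield lists of up to arena_size (id, fingerprint) pairs, in input order.

    If arena_size is None, yield a single batch holding everything.
    """
    it = iter(id_fp_pairs)
    if arena_size is None:
        yield list(it)
        return
    while True:
        batch = list(islice(it, arena_size))
        if not batch:
            return
        yield batch


records = [(f"id{i}", bytes([i])) for i in range(5)]
batch_sizes = [len(batch) for batch in iter_batches(records, arena_size=2)]
print(batch_sizes)
```

The last batch may be smaller than arena_size, and only one batch is held in memory at a time, which is the memory/performance trade-off described above.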
- load(*, reorder: bool = True, alignment: _OptionalInt = None, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena ¶
Load all of the fingerprints into an arena and return the arena
- Parameters:
reorder (True or False) – Specify if fingerprints should be reordered for better performance
alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
a
chemfp.arena.FingerprintArena
instance
- save(destination: str | bytes | Path | None | BinaryIO, format: str | None = None, level: None | int | Literal['min', 'default', 'max'] = None) None ¶
Save the fingerprints to a given destination and format
The output format is based on the format. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.
- Parameters:
destination (a filename, file object, or None) – the output destination
format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files
- Returns:
None
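The format selection rules for save can be summarised in a few lines. The sketch below is an illustration of the rules as stated above, not chemfp's code: `guess_format` is a hypothetical helper, and matching the extension case-sensitively is an assumption.

```python
import os

# Extensions named in the text above; the compressed forms are checked first.
KNOWN_EXTENSIONS = ("fps.gz", "fps.zst", "fps", "fpb")


def guess_format(destination, format=None):
    """Hypothetical helper: choose an output format name.

    An explicit format wins. Otherwise use the filename extension, and
    fall back to "fps" when there is no recognized extension.
    """
    if format is not None:
        return format
    if isinstance(destination, (str, os.PathLike)):
        name = os.fspath(destination)
        for ext in KNOWN_EXTENSIONS:
            if name.endswith("." + ext):
                return ext
    return "fps"


print(guess_format("library.fpb"))
print(guess_format("library.fps.gz"))
print(guess_format("library.txt"))
print(guess_format(None))
```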
- class chemfp.FingerprintWriter¶
Bases:
object
Base class for the fingerprint writers
The three fingerprint writer classes are:
chemfp.fps_io.FPSWriter - write an FPS file
chemfp.fpb_io.OrderedFPBWriter - write an FPB file, sorted by popcount
chemfp.fpb_io.InputOrderFPBWriter - write an FPB file, preserving input order
If the chemfp_converters package is available then its FlushFingerprintWriter will be used to write fingerprints in flush format.
Use chemfp.open_fingerprint_writer() to create a fingerprint writer; do not create them directly.
Fingerprint writers are their own context manager, and close the writer on context exit, or you can call close explicitly.
All classes have the following attributes:
- metadata: Metadata¶
A
chemfp.Metadata
instance or None.
- format: str¶
A string describing the base format type (without compression); either ‘fps’ or ‘fpb’ for chemfp’s writers.
- closed: bool¶
False when the file is open, else True
- close() None ¶
Close the writer
This will set self.closed to True.
- write_fingerprint(id: str, fp: bytes) None ¶
Write a single fingerprint record with the given id and fp to the destination
- Parameters:
id (string) – the record identifier
fp (byte string) – the fingerprint
- write_fingerprints(id_fp_pairs: Iterator[Tuple[str, bytes]]) None ¶
Write a sequence of (id, fingerprint) pairs to the destination
- Parameters:
id_fp_pairs – An iterable of (id, fingerprint) pairs. id is a string and fingerprint is a byte string.
- class chemfp.Fingerprints(metadata: Metadata, id_fp_pairs: Tuple[Tuple[str, bytes]])¶
Bases:
FingerprintReader
A chemfp.FingerprintReader containing a metadata and a list of (id, fingerprint) pairs.
This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.
This implements a simple list-like collection of fingerprints. It supports:
iteration: for (id, fingerprint) in fingerprints: …
indexing: id, fingerprint = fingerprints[1]
length: len(fingerprints)
More features, like slicing, will be added as needed or when requested.
- class chemfp.Metadata(num_bits: _OptionalInt = None, num_bytes: _OptionalInt = None, type: _OptionalStr = None, aromaticity: _OptionalStr = None, software: _OptionalStr = None, sources: _typing.Optional[_typing.FilenameOrNames] = None, date: _typing.MetadataDateType = None)¶
Bases:
object
Store information about a set of fingerprints
The public attributes are:
- num_bits: int or None¶
The number of bits in the fingerprint.
- num_bytes: int or None¶
The number of bytes in the fingerprint.
- type: str or None¶
The fingerprint type string.
- aromaticity: str or None¶
The aromaticity model (only used with OEChem, and now deprecated).
- software: str or None¶
A description of the software used to make the fingerprints.
- sources: list of strings¶
List of sources used to make the fingerprint.
- copy(num_bits: _OptionalInt = None, num_bytes: _OptionalInt = None, type: _OptionalStr = None, aromaticity: _OptionalStr = None, software: _OptionalStr = None, sources: _typing.Optional[_typing.FilenameOrNames] = None, date: _typing.MetadataDateType = None) Metadata ¶
Return a new Metadata instance based on the current attributes and optional new values
When called with no parameter, make a new Metadata instance with the same attributes as the current instance.
If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you created it.
- Parameters:
num_bits (an integer, or None) – the number of bits in the fingerprint
num_bytes (an integer, or None) – the number of bytes in the fingerprint
type (string or None) – the fingerprint type description
aromaticity (None) – obsolete
software (string or None) – a description of the software
sources (list of strings, a string (interpreted as a list with one string), or None) – source filenames
date (a datetime instance, or None) – creation or processing date for the contents
- Returns:
a new
Metadata
instance
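The "None means keep the current value" rule of copy can be shown with a small dataclass. This is a sketch under stated assumptions: `Meta` is a hypothetical stand-in holding only three of the Metadata attributes, and it is not how chemfp implements Metadata.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Meta:
    """Hypothetical stand-in for a few of the Metadata attributes."""
    num_bits: Optional[int] = None
    type: Optional[str] = None
    software: Optional[str] = None

    def copy(self, **changes):
        # Drop None values so the current attribute is kept, as described above.
        updates = {name: value for name, value in changes.items() if value is not None}
        return replace(self, **updates)


original = Meta(num_bits=166, type="Example/1", software="demo/1.0")
changed = original.copy(type="Example/2", software=None)  # software=None is ignored
kept_software = changed.software
print(changed)

# To actually clear a value, modify the new instance after creating it.
changed.software = None
```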
- exception chemfp.ParseError(msg: str, location: _typing.OptionalLocation = None)¶
Bases:
ChemFPError, ValueError
Exception raised by the molecule and fingerprint parsers and writers
The public attributes are:
- msg: str, Exception¶
A string or object describing the exception.
- location: chemfp.io.Location or None¶
The current
chemfp.io.Location
instance, if available.
- chemfp.butina(fingerprints: _Optional[_typing.SourceOrArena] = None, *, fingerprints_format: _OptionalStr = None, matrix: _Optional[_typing.SearchResults] = None, matrix_format: _OptionalStr = None, NxN_threshold: float = 0.7, butina_threshold: float = 0.0, seed: int = -1, tiebreaker: _typing.TiebreakerNames = 'randomize', false_singletons: _typing.FalseSingletonNames = 'follow-neighbor', num_clusters: _OptionalInt = None, rescore: bool = True, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1, debug: _typing.Literal[0, 1, 2] = 0) _typing.ButinaClusters ¶
Use the Butina algorithm[1] to cluster fingerprints and/or a similarity matrix.
At least one of fingerprints or matrix must be specified.
fingerprints may be an arena or filename (use fingerprints_format if the format cannot be inferred by the filename extension). matrix may be the results of a chemfp NxN symmetric search or an npz filename containing a saved NxN search (the only supported matrix_format is “npz”).
If matrix is None then butina will compute the NxN similarity matrix of the fingerprints with threshold NxN_threshold. Otherwise it will use the pre-computed matrix in matrix.
The butina_threshold specifies the threshold for the Butina algorithm. It is 0.0 by default, which makes clustering depend on the NxN_threshold. This is useful when testing different Butina threshold values because the NxN matrix can be computed once, at the lowest reasonable threshold, and then reused with different, higher butina_threshold values.
If tiebreaker is “randomize” (the default) then the next picked center will be chosen at random from the available picks. (These are ranked by the total number of neighbors.) If “first” or “last” then the first or last neighbor, in arena index order, is picked.
Use seed to initialize the random number generator. If -1 (the default), butina will use Python’s RNG to get the initial seed. Otherwise this must be an integer between 0 and 2**64-1.
A “false singleton” is a fingerprint with neighbors within butina_threshold similarity but where all of its neighbors were assigned to another centroid. There are three options for how to handle false_singletons. The default, “follow-neighbor”, assigns the false singleton to the same centroid as its first nearest neighbor. (If there are ties, the first neighbor in the chemfp search is used. A future version of butina may switch to a randomly selected neighbor.) Use “keep” to keep the false singleton as its own centroid. If fingerprints are available then use “nearest-center” to assign false singletons to the nearest cluster centroid. [2]
Use num_clusters to reduce the number of clusters to the specified number. The method takes the smallest cluster and assigns all of its members, one-by-one, to the one of the remaining clusters. The fingerprint is assigned to the same cluster as one of its nearest neighbors, so long as that fingerprint isn’t part of the smallest cluster. The process iterates until enough clusters are pruned. This option requires fingerprints.
By default, if a fingerprint is reassigned to a new cluster then its similarity score is re-computed relative to the new cluster center. If rescore is False then the original score will be preserved.
Use progress to enable progress bars. By default it is True.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of chemfp.get_num_threads().
The debug option writes debug information to stderr. The three settings are 0, 1, and 2. This will likely be removed after the Butina implementation is better validated.
[1] Butina, JCICS 39.4, pp 747-750 (1999) doi:10.1021/ci9803381 (While Taylor, JCICS 35.1 pp59-67 (1995) doi:10.1021/ci00023a009 describes a similar algorithm, it is not applied to clustering.)
[2] Blomberg, Cosgrove, and Kenny, JCAMD 23, pp 513-525 (2009) doi:10.1007/s10822-009-9264-5 though chemfp’s implementation does not yet support a minimum required center threshold.
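For readers unfamiliar with the method, the core of Butina clustering fits in a short pure-Python sketch. This is an illustration of the general algorithm, not chemfp's implementation: it uses a single threshold, ranks candidates once by their total neighbor count, breaks ties in index order (comparable to tiebreaker "first"), and leaves false singletons as their own clusters (comparable to "keep"). The function names are hypothetical.

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are scored as 0.0 in this sketch
    return bin(a & b).count("1") / union


def butina_sketch(fps, threshold):
    """Return a list of (center_index, member_indices) clusters."""
    n = len(fps)
    # Neighbor lists: every other fingerprint at or above the threshold.
    neighbors = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in range(n)
    ]
    # Rank by neighbor count, largest first; the stable sort keeps index order on ties.
    order = sorted(range(n), key=lambda i: -len(neighbors[i]))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # The center claims all of its neighbors that are still unassigned.
        members = [j for j in neighbors[center] if j not in assigned]
        assigned.add(center)
        assigned.update(members)
        clusters.append((center, members))
    return clusters


fps = [b"\x0f", b"\x1f", b"\x0e", b"\xf0", b"\xe0"]
clusters = butina_sketch(fps, threshold=0.7)
print(clusters)
```

Here fingerprint 0 has two neighbors and is picked first, taking 1 and 2 with it; fingerprints 3 and 4 then form the second cluster.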
- Parameters:
fingerprints (filename or FingerprintArena) – the fingerprints to cluster
fingerprints_format (str or None) – fingerprint file format
matrix (None, filename, or a SearchResults) – a pre-computed NxN search result
matrix_format (str or None) – the format of the specified matrix filename
NxN_threshold (float) – the threshold to use to generate the NxN SearchResults matrix from the input fingerprints
butina_threshold (float) – the threshold to use to process the matrix
seed (int) – the RNG seed, or -1 to have Python generate the seed
tiebreaker ("randomize", "first", or "last") – method to select the next cluster center in case of ties
false_singletons ("follow-neighbor", "keep", "nearest-center") – method used to handle false singletons
num_clusters (int or None) – prune to no more than the given number of clusters
rescore (bool) – if True, rescore reassigned fingerprints
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
num_threads (int) – the number of threads to use, or -1 for the default
debug (0, 1, or 2) – an internal debug level for debugging
- Returns:
- chemfp.cdk2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'CDK-Daylight', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, hashPseudoAtoms: _Optional0or1 = None, pathLimit: _Optional[int] = None, perceiveStereochemistry: _Optional0or1 = None, searchDepth: _Optional[int] = None, size: _Optional[int] = None, implementation: _Optional[_typing.Literal['cdk', 'chemfp']] = None)¶
Use the CDK to convert a structure file or files to a fingerprint file
Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to the CDK). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in CDK- and format-specific configuration.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
Use type to specify the fingerprint type. This may be a short-hand name like “daylight”, a chemfp type string like “CDK-Daylight”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.
If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle structure processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.
If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
type (a FingerprintType or string) – the fingerprint type to use
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
reader_args (a dictionary) – the reader arguments
id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
hashPseudoAtoms (bool or None) – if True, include pseudo-atoms in the hash calculation
pathLimit (int or None) – maximum number of paths in path enumeration
perceiveStereochemistry (bool or None) – if True, re-perceive stereochemistry
searchDepth (int or None) – maximum path length
size (int or None) – the number of bits in the fingerprint
implementation (None, “cdk”, or “chemfp”) – if "chemfp", use chemfp's SMILES and SDF record reader instead of cdk's built-in reader
- Returns:
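The id_cleanup and id_template rules described for this function can be pictured with a small stand-alone sketch. It is not chemfp's code: `cleanup_id` and `make_id` are hypothetical helpers, using str.format for the substitutions is an assumption based on the brace syntax, and so is treating a carriage return as one of the characters to drop.

```python
def cleanup_id(raw_id: str) -> str:
    """Keep the id up to the first newline, drop CR, tab and NUL, strip spaces."""
    first = raw_id.split("\n", 1)[0]
    for ch in "\r\t\0":
        first = first.replace(ch, "")
    return first.strip(" ")


def make_id(template: str, raw_id: str, index0: int, recno: int, first_line: str) -> str:
    """Expand an id template using the substitution names listed above."""
    words = first_line.split()
    fields = {
        "i": index0 + 1,            # index starting from 1
        "i0": index0,               # index starting from 0
        "recno": recno,             # the current record number
        "id": raw_id,               # the original id
        "clean_id": cleanup_id(raw_id),
        "first_word": words[0] if words else "",
        "first_line": first_line,
    }
    return template.format(**fields)


new_id = make_id("CMPD-{i}-{clean_id}", " ethanol\tx\nextra",
                 index0=0, recno=1, first_line="CCO ethanol")
print(new_id)
```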
- chemfp.check_fingerprint_problems(query_fp: bytes, target_metadata: Metadata, query_name: str = 'query', target_name: str = 'target')¶
Return a list of compatibility problems between a fingerprint and a metadata
If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the query_fp byte string and the target_metadata then it will return a list containing a ChemFPProblem instance, with a severity level “error” and category “num_bytes mismatch”.
This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like:
    >>> from chemfp import check_fingerprint_problems, Metadata
    >>> problems = check_fingerprint_problems(b"A"*64, Metadata(num_bytes=128))
    >>> problems[0].description
    'query contains 64 bytes but target has 128 byte fingerprints'
You can change the error message with the query_name and target_name parameters:
    >>> import chemfp
    >>> problems = chemfp.check_fingerprint_problems(b"z"*64, chemfp.Metadata(num_bytes=128),
    ...     query_name="input", target_name="database")
    >>> problems[0].description
    'input contains 64 bytes but database has 128 byte fingerprints'
- Parameters:
query_fp (byte string) – a fingerprint (usually the query fingerprint)
target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
query_name (string) – the text used to describe the fingerprint, in case of problem
target_name (string) – the text used to describe the metadata, in case of problem
- Returns:
a list of
ChemFPProblem
instances
- chemfp.check_metadata_problems(query_metadata: Metadata, target_metadata: Metadata, query_name: str = 'query', target_name: str = 'target')¶
Return a list of compatibility problems between two metadata instances.
If there are no problems then this returns an empty list. Otherwise it returns a list of ChemFPProblem instances, with a severity level ranging from “info” to “error”.
Bit length and byte length mismatches produce an “error”. Fingerprint type and aromaticity mismatches produce a “warning”. Software version mismatches produce an “info”.
This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like:
    >>> import chemfp
    >>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
    >>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
    >>> problems = chemfp.check_metadata_problems(m1, m2)
    >>> len(problems)
    2
    >>> print(problems[1].description)
    query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'
You can change the error message with the query_name and target_name parameters:
    >>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
    >>> print(problems[1].description)
    input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'
- Parameters:
query_metadata (a Metadata instance) – the query metadata
target_metadata (a Metadata instance) – the target metadata to check against
query_name (string) – the text used to describe the query metadata, in case of problem
target_name (string) – the text used to describe the target metadata, in case of problem
- Returns:
a list of
ChemFPProblem
instances
- chemfp.convert2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr, input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', fingerprint_kwargs: _typing.OptionalFingerprintKwargs = None, id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True)¶
Convert a structure file or files to a fingerprint file.
This is the generic conversion function without the toolkit-specific keyword arguments of rdkit2fps(), cdk2fps(), oe2fps() or ob2fps().
Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in toolkit- and format-specific configuration.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
Use type to specify the fingerprint type. This can be a chemfp fingerprint type string or fingerprint type object. If it is a string then it is combined with fingerprint_kwargs to get the fingerprint type object.
If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle structure processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.
By default this will display progress bars while loading files and generating the array. Use progress=False to disable them, or a floating point value to not display a progress bar until the specified number of seconds.
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
type (a FingerprintType or string) – the fingerprint type to use
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
reader_args (a dictionary) – the reader arguments
id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
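The three string values of errors described for this function follow a common pattern: raise, report and continue, or silently continue. The sketch below illustrates that pattern only. `process_records` and `parse_int` are hypothetical names, integer parsing stands in for structure parsing, and nothing here reflects how chemfp.io.ErrorHandler is implemented.

```python
import sys


def process_records(records, parse, errors="ignore"):
    """Parse each record, handling failures according to the errors setting."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    results = []
    for recno, record in enumerate(records, 1):
        try:
            results.append(parse(record))
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                print(f"record {recno}: {err}", file=sys.stderr)
            # "report" and "ignore" both skip the record and keep going
    return results


def parse_int(text):
    # Stand-in for a structure parser: raises ValueError on bad input.
    return int(text)


values = process_records(["1", "x", "3"], parse_int, errors="ignore")
print(values)
```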
- chemfp.count_tanimoto_hits(queries, targets, threshold: float = 0.7, arena_size: int = 100) Iterator[Tuple[str, int]] ¶
Count the number of targets within threshold of each query term
For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.
Example:
    import chemfp
    queries = chemfp.open("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps.gz")
    for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
        print(f"{query_id} has {count} neighbors with at least 0.9 similarity")
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching directly with an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.count_tanimoto_hits_fp() or chemfp.search.count_tanimoto_hits_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (
chemfp.arena.FingerprintArena
or the slowerchemfp.fps_io.FPSReader
) – The target fingerprints.threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (a positive integer, or None) – The number of queries to process in a batch
- Returns:
iterator of the (query_id, count) pairs, one for each query
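As a rough illustration of the counting and batching described above, here is a minimal pure-Python sketch. It is not chemfp's implementation: fingerprints are modeled as Python integers, tanimoto() and count_hits_sketch() are hypothetical helper names, and the sketch assumes that two empty fingerprints have a similarity of 0.0.

```python
from itertools import islice


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return bin(a & b).count("1") / union


def count_hits_sketch(queries, targets, threshold=0.7, arena_size=100):
    """Yield (query_id, count) pairs, reading the queries in batches."""
    query_iter = iter(queries)
    while True:
        # islice(..., None) reads everything, mirroring arena_size=None
        batch = list(islice(query_iter, arena_size))
        if not batch:
            break
        for query_id, query_fp in batch:
            count = sum(1 for _, target_fp in targets
                        if tanimoto(query_fp, target_fp) >= threshold)
            yield query_id, count


queries = [("q1", 0b1111), ("q2", 0b0001)]
targets = [("t1", 0b1111), ("t2", 0b0111), ("t3", 0b1000)]
results = list(count_hits_sketch(queries, targets, threshold=0.7, arena_size=1))
```

The targets are held in a list here so they can be scanned once per query. A one-pass reader cannot be rewound this way, which is the limitation noted above for an FPSReader target.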
- chemfp.count_tanimoto_hits_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7) _typing.IdAndCountIter ¶
Find the number of other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, count) pairs.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
    print(f"{fp_id} has {count} neighbors with at least 0.6 similarity")
You may also be interested in chemfp.search.count_tanimoto_hits_symmetric().
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id, count) pairs, one for each fingerprint
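The symmetric case can be pictured with a small pure-Python sketch. This is only an illustration under simplifying assumptions (fingerprints as Python integers, hypothetical helper names), not how chemfp computes it. Each pair is compared once and credited to both fingerprints, and a fingerprint is never compared with itself.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def count_hits_symmetric_sketch(fingerprints, threshold=0.7):
    """Return (fp_id, count) pairs for a list of (fp_id, fingerprint) pairs."""
    counts = [0] * len(fingerprints)
    for i in range(len(fingerprints)):
        # Start at i + 1: skips self-matches and visits each pair only once
        for j in range(i + 1, len(fingerprints)):
            if tanimoto(fingerprints[i][1], fingerprints[j][1]) >= threshold:
                counts[i] += 1
                counts[j] += 1
    return [(fp_id, count) for (fp_id, _), count in zip(fingerprints, counts)]


arena_like = [("a", 0b1111), ("b", 0b0111), ("c", 0b1000)]
symmetric_counts = count_hits_symmetric_sketch(arena_like, threshold=0.7)
```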
- chemfp.count_tversky_hits(queries, targets, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, arena_size: int = 100) Iterator[Tuple[str, int]] ¶
Count the number of targets within threshold of each query term
For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.
Example:
import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tversky_hits(
        queries, targets, threshold=0.9, alpha=0.5, beta=0.5):
    print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity")
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.count_tversky_hits_fp() or chemfp.search.count_tversky_hits_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (a positive integer, or None) – The number of queries to process in a batch
- Returns:
iterator of the (query_id, count) pairs, one for each query
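The example above uses alpha=0.5 and beta=0.5 to get Dice similarity. The following sketch shows why, using the standard Tversky index on fingerprints modeled as Python integers. The function names are hypothetical and the zero-denominator convention is an assumption of this sketch, not a statement about chemfp.

```python
def popcount(x: int) -> int:
    """Number of on-bits in a non-negative integer."""
    return bin(x).count("1")


def tversky(query: int, target: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index: c / (c + alpha*|query only| + beta*|target only|)."""
    common = popcount(query & target)
    only_query = popcount(query & ~target)
    only_target = popcount(target & ~query)
    denominator = common + alpha * only_query + beta * only_target
    # assumption: return 0.0 when the denominator is zero
    return common / denominator if denominator else 0.0


a, b = 0b1111, 0b0111
tanimoto_score = tversky(a, b, alpha=1.0, beta=1.0)  # alpha = beta = 1 is Tanimoto
dice_score = tversky(a, b, alpha=0.5, beta=0.5)      # alpha = beta = 0.5 is Dice
```

When alpha and beta differ the index is no longer symmetric, so swapping the query and target can change the score.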
- chemfp.count_tversky_hits_symmetric(fingerprints, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) Iterator[Tuple[str, int]] ¶
Find the number of other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, count) pairs.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tversky_hits_symmetric(
        arena, threshold=0.6, alpha=0.5, beta=0.5):
    print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity")
You may also be interested in chemfp.search.count_tversky_hits_symmetric().
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id, count) pairs, one for each fingerprint
- chemfp.get_default_progressbar() None | Callable ¶
Return the current default progress bar, or None for the default behavior
- chemfp.get_fingerprint_families(toolkit_name=None) list[_typing.FingerprintFamily] ¶
Return a list of available fingerprint families
- Parameters:
toolkit_name (string) – restrict fingerprints to the named toolkit
- Returns:
a list of chemfp.types.FingerprintFamily instances
- chemfp.get_fingerprint_family(family_name: str) _typing.FingerprintFamily ¶
Return the named fingerprint family, or raise a ValueError if not available
Given a family_name like OpenBabel-FP2 or OpenEye-MACCS166, return the corresponding chemfp.types.FingerprintFamily.
- Parameters:
family_name (string) – the family name
- Returns:
a chemfp.types.FingerprintFamily instance
- chemfp.get_fingerprint_family_names(include_unavailable: bool = False, toolkit_name: str | None = None) list[str] ¶
Return a set of fingerprint family name strings
The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings.
If include_unavailable is True then this will return a set of all of the fingerprint family names, including those which could not be loaded.
The set contains both the versioned and unversioned family names, so both OpenBabel-FP2/1 and OpenBabel-FP2 may be returned.
- Parameters:
include_unavailable (True or False) – Should unavailable family names be included in the result set?
- Returns:
a set of strings
- chemfp.get_fingerprint_type(type: str, fingerprint_kwargs: _typing.OptionalFingerprintKwargs = None) _typing.FingerprintType ¶
Get the fingerprint type based on its type string and optional keyword arguments
Given a fingerprint type string like OpenBabel-FP2 or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.
The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the fingerprint_kwargs dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the value in fingerprint_kwargs takes precedence.
For example:
>>> fptype = get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'
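The precedence rule can be pictured with a small stand-alone sketch. The parser below is hypothetical and much simpler than chemfp's: it assumes every parameter in the type string is an integer written as name=value, and it does not know about default parameters or version numbers.

```python
def merge_type_parameters(type_string, fingerprint_kwargs=None):
    """Split a type string into (name, params), then apply kwargs on top."""
    name, *terms = type_string.split()
    params = {}
    for term in terms:
        key, _, value = term.partition("=")
        params[key] = int(value)  # assumption: all parameters are integers
    # Values passed as kwargs override values from the type string
    params.update(fingerprint_kwargs or {})
    return name, params


merged = merge_type_parameters(
    "RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
```

Here fpSize ends up as 4096 because the dictionary value overrides the 1024 from the type string, while minPath=3 is kept.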
Use get_fingerprint_type_from_text_settings() if your fingerprint parameter values are all string-encoded, e.g., from the command-line or a configuration file.
- Parameters:
type (string) – a fingerprint type string
fingerprint_kwargs (a dictionary of string names and Python types for values) – fingerprint type parameters
- Returns:
- chemfp.get_fingerprint_type_from_text_settings(type: str, settings: _Optional[_typing.TextSettingsType]) _typing.FingerprintType ¶
Get the fingerprint type based on its type string and optional settings arguments
Given a fingerprint type string like OpenBabel-FP2 or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.
The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the settings dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the type string and the settings dictionary then the settings take precedence.
For example:
>>> fptype = get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3",
...     {"fpSize": "4096"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'
This function is for string settings from a configuration file or command-line. Use get_fingerprint_type() if your fingerprint parameters are Python values.
- Parameters:
type (string) – a fingerprint type string
settings (a dictionary of string names and string-encoded values) – fingerprint type parameters
- Returns:
- chemfp.get_num_threads() int ¶
Return the default number of OpenMP threads to use when num_threads is -1
Several chemfp functions are parallelized using OpenMP, and support a num_threads parameter to specify the number of OpenMP threads to use. If num_threads is -1 (the default) then chemfp uses the value of chemfp.get_num_threads() to get the actual number to use.
This value can be set with set_num_threads(). If it has not been set, it defaults to the value of OpenMP’s omp_get_max_threads() (available in chemfp using get_omp_num_threads()).
The default value can be specified by the OMP_NUM_THREADS environment variable, and if that is also not set then the default value depends on the OpenMP implementation, and is likely based on the number of available cores.
Use chemfp’s set_num_threads() to set chemfp’s default value.
The value returned is always a positive integer.
If OpenMP is not available then the number of threads is always 1.
- Returns:
the default number of OpenMP threads to use
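The fallback order described above can be sketched in plain Python. This is only a model of the documented behavior, not chemfp or OpenMP code: default_num_threads() is a hypothetical helper, os.cpu_count() stands in for the implementation-dependent OpenMP default, and only a plain positive integer in OMP_NUM_THREADS is recognized.

```python
import os


def default_num_threads(configured=None, environ=None):
    """Resolve the thread count: explicit setting, then environment, then cores."""
    if environ is None:
        environ = os.environ
    if configured is not None:
        # A value set by the program takes priority; it is always at least 1
        return max(1, configured)
    env_value = environ.get("OMP_NUM_THREADS", "")
    if env_value.isdigit() and int(env_value) > 0:
        return int(env_value)
    # Stand-in for the OpenMP implementation's own default
    return os.cpu_count() or 1


explicit = default_num_threads(configured=4, environ={})
from_env = default_num_threads(environ={"OMP_NUM_THREADS": "2"})
fallback = default_num_threads(environ={})
```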
- chemfp.get_omp_num_threads() int ¶
Return the number of threads OpenMP uses to create a team.
This function creates a new OpenMP team (with no num_threads clause) and reports the number of threads actually used.
Returns 1 if OpenMP is not available.
- chemfp.get_toolkit(toolkit_name: str) _typing.ToolkitType ¶
Return the named toolkit, if available, or raise a ValueError
If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” and the named toolkit is available, then it will return chemfp.openbabel_toolkit, chemfp.openeye_toolkit, or chemfp.rdkit_toolkit, respectively:
>>> import chemfp
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
  ...
ValueError: Unable to get toolkit 'rdkit': No module named rdkit
- Parameters:
toolkit_name (string) – the toolkit name
- Returns:
the chemfp toolkit
- Raises:
ValueError if toolkit_name is unknown or the toolkit does not exist
- chemfp.get_toolkit_names() set[str] ¶
Return a set of available toolkit names
The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:
>>> import chemfp
>>> chemfp.get_toolkit_names()
{'openeye', 'rdkit', 'openbabel'}
- Returns:
a set of toolkit names, as strings
- chemfp.has_fingerprint_family(family_name: str) bool ¶
Test if the fingerprint family is available
Return True if the fingerprint family_name is available, otherwise False. The family_name may be versioned or unversioned, like “OpenBabel-FP2/1” or “OpenEye-MACCS166”.
- Parameters:
family_name (string) – the family name
- Returns:
True or False
- chemfp.has_toolkit(toolkit_name: str) bool ¶
Return True if the named toolkit is available, otherwise False
If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False.
>>> import chemfp
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False
The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached.
- Parameters:
toolkit_name (string) – the toolkit name
- Returns:
True or False
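A test-once-and-cache availability check of the kind described above might look like the following sketch. The has_module() helper is hypothetical and works on ordinary module names. It is not chemfp's implementation, which may test for more than a successful import.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def has_module(module_name: str) -> bool:
    """Return True if the module can be imported; the answer is cached."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


first = has_module("json")   # performs the import test
second = has_module("json")  # answered from the cache
missing = has_module("no_such_toolkit_module_xyz")
```

Caching the negative result as well means a missing toolkit is not probed again on every call.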
- chemfp.heapsweep(candidates: _typing.SourceOrArena, *, candidates_format: _OptionalStr = None, num_picks: int = 1, threshold: float = 1.0, all_equal: bool = False, randomize: bool = True, seed: int = -1, include_scores: bool = True, progress: _typing.ProgressbarOrBool = True)¶
Use the heapsweep algorithm to pick diverse fingerprints from candidates
The heapsweep algorithm picks fingerprints ordered by their respective maximum Tanimoto score to the rest of the arena, from smallest to largest. It uses a heap to keep track of the current score for each fingerprint (a lower bound to the global maximum score), and a flag specifying if the score is also the upper bound.
For each sweep, if the smallest heap entry is an upper bound, then pick it. Otherwise, find the similarity between the corresponding fingerprint and all other fingerprints in the arena. This sets the global maximum score for the heap entry, and may update the minimum score for the rest of the fingerprints. Update the heap and try again.
This process is repeated until num_picks fingerprints have been picked, or until the maximum score for the remaining candidates is greater than threshold, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates.
If all_equal is True then additional fingerprints will be picked if they have the same score as pick number num_picks.
The default num_picks = 1 and all_equal = False selects a fingerprint with the smallest maximum similarity. This is used as the initial pick for MaxMinPicker.from_candidates(). Use num_picks = 1 and all_equal = True to select all fingerprints with the smallest maximum similarity.
The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.
If randomize is True (the default), the candidates are shuffled before the heapsweep algorithm starts. Shuffling should only affect the ordering of fingerprints with identical diversity scores. It is True by default so the first picked fingerprint is the same as the one from MaxMinPicker.from_candidates(). Setting it to False should generally be slightly faster.
The shuffle and heapsweep methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
The function returns a HeapSweepInfo object with information about what happened. Its picker attribute contains the HeapSweepPicker used.
If include_scores is true then its result attribute is a PicksAndScores instance, otherwise it is a Picks.
If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
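The heap-based sweep described above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not chemfp's implementation: fingerprints are Python integers, heapsweep_sketch() is a hypothetical name, stale heap entries are skipped lazily, and the randomize, seed, and all_equal options are left out.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def heapsweep_sketch(fps, num_picks=1, threshold=1.0):
    """Return (index, max_score) picks, smallest maximum similarity first."""
    n = len(fps)
    best = [0.0] * n     # lower bound on each fingerprint's maximum score
    exact = [False] * n  # True once best[i] is the true maximum
    heap = [(0.0, i) for i in range(n)]
    picks = []
    while heap and len(picks) < num_picks:
        score, i = heapq.heappop(heap)
        if score != best[i]:
            continue  # stale entry: a better bound was pushed later
        if exact[i]:
            if score > threshold:
                break
            picks.append((i, score))
            continue
        # Sweep: compare fingerprint i with all of the others
        for j in range(n):
            if j == i:
                continue
            s = tanimoto(fps[i], fps[j])
            best[i] = max(best[i], s)
            if not exact[j] and s > best[j]:
                best[j] = s
                heapq.heappush(heap, (s, j))
        exact[i] = True
        heapq.heappush(heap, (best[i], i))
    return picks


fps = [0b1111, 0b0111, 0b1000, 0b0011]
first_pick = heapsweep_sketch(fps, num_picks=1)
two_picks = heapsweep_sketch(fps, num_picks=2)
```

An entry is only picked when it is both exact and the smallest in the heap. Every other entry is a lower bound, so no remaining fingerprint can have a smaller maximum similarity.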
- Parameters:
candidates (filename or FingerprintArena) – the candidate fingerprints for heapsweep picking
candidates_format (str or None) – format for the candidates filename
num_picks (int) – the number of picks to do
threshold (float) – the maximum allowed Tanimoto similarity value
all_equal (bool) – if True, continue picking after num_picks if the pick similarity is the same
randomize (bool) – if True, shuffle before processing
seed (int) – the RNG seed, or -1 to have Python generate the seed
include_scores (bool) – if True, include the pick scores
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
- chemfp.knearest_tanimoto_search(queries, targets, k: int = 3, threshold: float = 0.0, arena_size: int = 100) Iterator[Tuple[str, List[Tuple[str, float]]]] ¶
Find the k-nearest targets within threshold of each query term
For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.
This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.
Example:
import chemfp
# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")
# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.8):
    print(f"{query_id} has {len(hits)} neighbors with at least 0.8 similarity")
    if hits:
        target_id, score = hits[-1]
        print(f" The least similar is {target_id} with score {score}")
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.knearest_tanimoto_search_fp() or chemfp.search.knearest_tanimoto_search_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
k (positive integer) – The maximum number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (positive integer, or None) – The number of queries to process in a batch
- Returns:
An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
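For a single query, the k-nearest selection with a threshold can be sketched as follows. This is a brute-force illustration with hypothetical names and fingerprints modeled as Python integers. chemfp's own search is optimized and does not work this way internally.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def knearest_sketch(query_fp, targets, k=3, threshold=0.0):
    """Return up to k (target_id, score) hits, highest score first."""
    eligible = []
    for target_id, target_fp in targets:
        score = tanimoto(query_fp, target_fp)
        if score >= threshold:
            eligible.append((target_id, score))
    # nlargest returns the hits sorted from highest to lowest score
    return heapq.nlargest(k, eligible, key=lambda hit: hit[1])


targets = [("t1", 0b1111), ("t2", 0b0111), ("t3", 0b1000), ("t4", 0b0011)]
top_two = knearest_sketch(0b1111, targets, k=2, threshold=0.4)
above_cutoff = knearest_sketch(0b1111, targets, k=3, threshold=0.6)
```

Because the hits are sorted with the highest score first, hits[-1] is the least similar hit that was kept, as in the example above.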
- chemfp.knearest_tanimoto_search_symmetric(fingerprints: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0) _typing.IdAndSearchResultIter ¶
Find the k-nearest fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5):
    print(f"{fp_id} has {len(hits)} neighbors, with scores ", end="")
    print(", ".join(f"{x:.2f}" for x in hits.get_scores()))
You may also be interested in the chemfp.search.knearest_tanimoto_search_symmetric() function.
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
k (positive integer) – The maximum number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
- chemfp.knearest_tversky_search(queries, targets, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0, arena_size: int = 100) Iterator[Tuple[str, List[Tuple[str, float]]]] ¶
Find the k-nearest targets within threshold of each query term
For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.
This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.
Example:
import chemfp
# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")
# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tversky_search(
        queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    if hits:
        target_id, score = hits[-1]
        print(" The least similar is", target_id, "with score", score)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.knearest_tversky_search_fp() or chemfp.search.knearest_tversky_search_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
k (positive integer) – The maximum number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (positive integer, or None) – The number of queries to process in a batch
- Returns:
An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
- chemfp.knearest_tversky_search_symmetric(fingerprints: _typing.FingerprintArena, k: int = 3, threshold: float = 0.0, alpha: float = 1.0, beta: float = 1.0) _typing.IdAndSearchResultIter ¶
Find the k-nearest fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint id, SearchResult) pairs. The SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric(
        arena, k=5, threshold=0.5, alpha=0.5, beta=0.5):
    print(f"{fp_id} has {len(hits)} neighbors, with Dice scores ", end="")
    print(", ".join(f"{x:.2f}" for x in hits.get_scores()))
You may also be interested in the chemfp.search.knearest_tversky_search_symmetric() function.
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
k (positive integer) – The maximum number of nearest neighbors to find.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
- chemfp.load_fingerprints(reader: _ReaderType, metadata: _typing.Optional[Metadata] = None, reorder: bool = True, alignment: _OptionalInt = None, format: _OptionalStr = None, allow_mmap: bool = True, *, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena ¶
Load all of the fingerprints into an in-memory FingerprintArena data structure
The function reads all of the fingerprints and identifiers from reader and stores them into an in-memory chemfp.arena.FingerprintArena data structure which supports fast similarity searches.
If reader is a string, the None object, or has a read attribute then it, the format, and allow_mmap will be passed to the chemfp.open() function and the result used as the reader. If that returns a FingerprintArena then the reorder and alignment parameters are ignored and the arena returned.
If reader is a FingerprintArena then the reorder and alignment parameters are ignored. If metadata is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is metadata.
Otherwise the reader or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.
metadata specifies the metadata for all returned arenas. If not given the default comes from the source file or from reader.metadata.
The loader may reorder the fingerprints for better search performance. To prevent reordering, use reorder=False. The reorder parameter is ignored if the reader is an arena or FPB file.
The alignment option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte alignment, and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.
The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.
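The effect of the alignment option on storage can be worked out with a little arithmetic. The sketch below uses hypothetical helper names and assumes the fingerprints are stored back to back from an aligned starting address. It illustrates the padding rule described above, not chemfp's actual memory layout code.

```python
def padded_storage_size(num_bytes: int, alignment: int) -> int:
    """Round num_bytes up to the next multiple of alignment."""
    return ((num_bytes + alignment - 1) // alignment) * alignment


def fingerprint_offsets(num_fingerprints: int, num_bytes: int, alignment: int):
    """Byte offset of each fingerprint, relative to an aligned base address."""
    size = padded_storage_size(num_bytes, alignment)
    return [i * size for i in range(num_fingerprints)]


# A 166-bit fingerprint needs 21 bytes; with alignment=8 it is padded to 24
maccs_size = padded_storage_size(21, 8)
offsets = fingerprint_offsets(3, 21, 8)
```

Padding trades a little memory for fingerprints that all start on the same boundary, which is what lets word-at-a-time popcount methods read them directly.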
- Parameters:
reader (a string, file object, or (id, fingerprint) iterator) – An iterator over (id, fingerprint) pairs
metadata (Metadata) – The metadata for the arena, if other than reader.metadata
reorder (True or False) – Specify if fingerprints should be reordered for better performance
alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
format (None, "fps", "fps.gz", "fps.zst", "fpb", "fpb.gz" or "fpb.zst") – The file format name if the reader is a string
allow_mmap (True or False) – Allow chemfp to use mmap on FPB files, instead of reading the file’s contents into memory
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
- chemfp.load_fingerprints_from_string(content: _typing.Content, format: str = 'fps', *, reorder: bool = True, alignment: _OptionalInt = None, progress: _typing.ProgressbarOrBool = False) _typing.FingerprintArena ¶
Load the fingerprints from the content string, in the given format
The supported format strings are:
“fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
“fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format
If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.
If the content is not in FPB format then by default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.
If the content is not in FPB format then alignment specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte alignment, and use storage space which is a multiple of 8 bytes long. The default value of None determines the best alignment based on the fingerprint size and available popcount methods.
The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.
- Parameters:
content (byte or text string) – The fingerprint data as a string.
format (string) – The file format and optional compression. Unicode strings may not be compressed.
reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
- chemfp.load_simarray(source: _typing.Source, *, format: _typing.OptionalSimarrayFormat = None, metadata_source: _typing.Optional[_typing.ExplicitSource], metadata_format: _typing.Literal['npy', None] = None, mmap_mode: _typing.Literal['default', None, 'r', 'r+', 'c'] = 'default') chemfp.simarray_io.SimarrayFileContent ¶
Load the simarray “npy” file or the “bin”+”npy” files
Read the array data from source with possible metadata in metadata_source. Use format and metadata_format to specify the respective formats rather than use the file extension or default value.
A “npy” file must contain three or four arrays, depending on the analysis type. The first array contains the comparison vector or array, the second array contains a JSON-encoded string describing the analysis type, the third contains the target ids (which are the array ids for an NxN symmetric analysis), and the fourth, if it exists, contains the query identifiers.
When using the “npy” format, if mmap_mode is “default” or None then the array will be loaded into memory. If “r”, “r+”, or “c” then it will be memory-mapped in read-only, read-write, or copy-on-write mode, respectively.
A “bin” file must contain the raw bytes for the comparison matrix. This requires a metadata source in “npy” format where the first array is used only to get its NumPy dtype. The resulting SimarrayFileContent combines the “bin” array with the metadata and ids from the metadata file.
When using the “bin” format, if mmap_mode is None then the array will be loaded into memory. If “r”, “r+”, or “c” then it will be memory-mapped in read-only, read-write, or copy-on-write mode, respectively. The default value of “default” uses “r”.
- Parameters:
source (None, a filename, or a file object) – the source containing the array values
format (str or None) – the source file format
metadata_source (None, a filename, or a file object) – the source of the array metadata
metadata_format (str or None) – the format of the array metadata source
mmap_mode (str or None) – read the data into memory or use a given memory map mode
- Returns:
- chemfp.maxmin(candidates: _typing.SourceOrArena, *, references: _Optional[_typing.SourceOrArena] = None, initial_pick: _typing.Union[None, int, str] = None, candidates_format: _OptionalStr = None, references_format: _OptionalStr = None, num_picks: int = 1000, threshold: float = 1.0, all_equal: bool = False, randomize: bool = True, seed: int = -1, include_scores: bool = True, progress: _typing.ProgressbarOrBool = True)¶
Use the MaxMin algorithm to pick diverse fingerprints from candidates
The MaxMin algorithm iteratively picks fingerprints from a set of candidates such that the newly picked fingerprint has the smallest Tanimoto similarity compared to any previously picked fingerprint, and optionally also the smallest Tanimoto similarity to the reference fingerprints.
This process is repeated until num_picks fingerprints have been picked, or until the remaining candidates are greater than threshold similar to the picked fingerprints, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates, from most dissimilar to least. For example, to select all fingerprints with a maximum Tanimoto score of 0.2, use num_picks = None and threshold = 0.2.
The fingerprints are selected from candidates. If it is not a
FingerprintArena
then the value is passed toload_fingerprints()
, along with values of candidates_format and progress to load the arena.If initial_pick and references are not specified then the initial pick is selected using the heapsweep algorithm, which finds a fingerprint with the smallest maximum Tanimoto to any other fingerprint. Use initial_pick to specify the initial pick, either as a string (which is treated as a candidate id) or as an integer (which is treated as a fingerprint index).
If references is not None then any picked candidate fingerprint must also be dissimilar from all of the fingerprints in the reference fingerprints. The model behind the terms is that you want to pick diverse fingerprints from a vendor catalog which are also diverse from your in-house reference compounds. If references is not a FingerprintArena then it is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.
The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
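The seed convention can be illustrated with the standard library. This is only a sketch of the idea, assuming a 32-bit seed range (the text does not give one); make_rng and shuffled_order are hypothetical names and do not exist in chemfp.

```python
import random


def make_rng(seed: int = -1):
    """Return (seed_used, rng); a seed of -1 means "draw a fresh seed"."""
    if seed == -1:
        # Assumption: a 32-bit seed drawn from Python's global RNG.
        seed = random.randrange(2**32)
    return seed, random.Random(seed)


def shuffled_order(n: int, seed: int = -1):
    """Shuffle the indices 0..n-1 and report the seed that was used."""
    seed_used, rng = make_rng(seed)
    order = list(range(n))
    rng.shuffle(order)
    return seed_used, order


seed_used, order = shuffled_order(10)
# Passing the reported seed back in reproduces the same shuffle.
assert shuffled_order(10, seed=seed_used) == (seed_used, order)
```

Recording the seed that was actually used is what makes a randomized run repeatable later.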
The function returns a BaseMaxMinSearch object with information about what happened. Its out attribute contains the MaxMinPicker used. If include_scores is true then its out attribute is a PicksAndScores instance, otherwise it is a Picks.
If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
- Parameters:
candidates (filename or FingerprintArena) – the candidate fingerprints for MaxMin picking
references (None, filename or FingerprintArena) – candidates must also avoid these reference fingerprints
initial_pick (None, int, or str) – the initial pick, as an index or id
candidates_format (str or None) – format for the candidates filename
references_format (str or None) – format for the references filename
num_picks (int) – the number of fingerprints to pick
threshold (float) – the maximum similarity threshold
all_equal (bool) – if True, continue picking after num_picks if the pick similarity is the same
randomize (bool) – if True, shuffle before processing
seed (int) – the RNG seed, or -1 to have Python generate the seed
include_scores (bool) – if True, include pick scores
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
- chemfp.ob2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'OpenBabel-FP2', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, nBits: _OptionalInt = None)¶
Use Open Babel to convert a structure file or files to a fingerprint file
Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to Open Babel). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in Open Babel- and format-specific configuration.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
Use type to specify the fingerprint type. This may be a short-hand name like “FP2”, a chemfp type string like “OpenBabel-FP2”, or a chemfp type name. Additional fingerprint-specific values may be passed as function call arguments.
If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
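As a rough illustration of the cleanup rule, here is a small standard-library sketch. It is not chemfp's code: cleanup_id is a hypothetical name, and the sketch assumes that "newline" means "\n" and that carriage returns should also be dropped, which the text does not state explicitly.

```python
def cleanup_id(raw_id: str) -> str:
    """Apply the id cleanup steps described above to a raw identifier."""
    # Keep only the text before the first newline.
    first_line = raw_id.split("\n", 1)[0]
    # Drop tab and NUL characters; dropping "\r" is an assumption.
    for ch in ("\r", "\t", "\0"):
        first_line = first_line.replace(ch, "")
    # Remove leading and trailing spaces.
    return first_line.strip(" ")


print(repr(cleanup_id("  CHEMBL25\taspirin \nsecond line")))
```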
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If destination is a string and overwrite is false then do not generate fingerprints if the destination file already exists.
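The id_template substitutions listed earlier use the same {name} syntax as Python's str.format, so their effect can be sketched with it. Whether chemfp uses str.format internally is not stated here; make_id is a hypothetical helper, and the sketch assumes that "the first line" refers to the first line of the original id text.

```python
def make_id(template: str, *, index0: int, recno: int,
            raw_id: str, clean_id: str) -> str:
    """Fill an id template using the substitution names listed above."""
    # Assumption: first_line and first_word come from the original id text.
    first_line = raw_id.split("\n", 1)[0]
    words = first_line.split()
    return template.format(
        i=index0 + 1,        # index starting from 1
        i0=index0,           # index starting from 0
        recno=recno,         # the current record number
        id=raw_id,           # the original id
        clean_id=clean_id,   # the id after cleanup
        first_line=first_line,
        first_word=words[0] if words else "",
    )


print(make_id("VENDOR-{i}-{first_word}", index0=0, recno=1,
              raw_id="abc def\nxyz", clean_id="abc def"))
```

An id_prefix is the simpler case: the new id is just the prefix followed by the existing id.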
If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
type (a FingerprintType or string) – the fingerprint type to use
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
reader_args (a dictionary) – the reader arguments
id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
nBits (int or None) – number of bits in the fingerprint
- Returns:
- chemfp.oe2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'OpenEye-Path', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, atype: _Optional[_typing.Union[int, str]] = None, btype: _Optional[_typing.Union[int, str]] = None, maxbonds: _OptionalInt = None, maxradius: _OptionalInt = None, minbonds: _OptionalInt = None, minradius: _OptionalInt = None, numbits: _OptionalInt = None)¶
Use OEChem and OEGraphSim to convert a structure file or files to a fingerprint file
Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in OEChem- and format-specific configuration.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
Use type to specify the fingerprint type. This may be a short-hand name like “circular”, a chemfp type string like “OpenEye-Circular”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.
If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If destination is a string and overwrite is false then do not generate fingerprints if the destination file already exists.
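The three named error modes differ only in what happens when a record fails. The sketch below shows that control flow with a generic parse function; it is an illustration under stated assumptions, not chemfp's ErrorHandler, and process_records and the message format are hypothetical.

```python
import sys


def process_records(records, parse, errors="ignore"):
    """Parse each record, handling failures according to the errors mode."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    results = []
    for recno, record in enumerate(records, 1):
        try:
            results.append(parse(record))
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first failure
            if errors == "report":
                print(f"ERROR: record {recno}: {err}", file=sys.stderr)
            # "report" and "ignore" both skip the record and continue.
    return results


print(process_records(["1", "not a number", "3"], int, errors="report"))
```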
If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
type (a FingerprintType or string) – the fingerprint type to use
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
reader_args (a dictionary) – the reader arguments
id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
atype (integer or string) – specify the atom type invariants as bitflags
btype (integer or string) – specify the bond type invariants as bitflags
maxbonds (int or None) – maximum number of bonds during path enumeration
maxradius (int or None) – maximum circular radius
minbonds (int or None) – minimum number of bonds during path enumeration
minradius (int or None) – minimum circular radius
numbits (int or None) – number of bits in the fingerprint
- Returns:
- chemfp.open(source: _typing.Source, format: _typing.Optional[str] = None, location: _typing.Optional[_typing.Location] = None, allow_mmap: bool = True) FingerprintReader ¶
Read fingerprints from a fingerprint file
Read fingerprints from source, using the given format. If source is a string then it is treated as a filename. If source is None then fingerprints are read from stdin. Otherwise, source must be a Python file object supporting the read and readline methods.
If format is None then the fingerprint file format and compression type are derived from the source filename, or from the name attribute of the source file object. If the source is None then stdin is assumed to be uncompressed data in “fps” format.
The supported format strings are:
“fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
“fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format
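To make the filename-based detection concrete, here is a small sketch that maps a filename to one of the format strings above. It is a guess at the general approach rather than chemfp's actual logic; guess_fingerprint_format is a hypothetical name, and falling back to “fps” for unrecognized extensions is an assumption.

```python
import os


def guess_fingerprint_format(filename: str, default: str = "fps") -> str:
    """Derive a format string such as "fps.gz" from a filename."""
    name = os.path.basename(filename).lower()
    compression = ""
    # Peel off a compression suffix first, if there is one.
    for ext in (".gz", ".zst"):
        if name.endswith(ext):
            compression = ext
            name = name[: -len(ext)]
            break
    # Then look for the fingerprint format extension.
    for fmt in ("fps", "fpb"):
        if name.endswith("." + fmt):
            return fmt + compression
    return default + compression  # assumption: unknown names are FPS


for example in ("example.fps.gz", "targets.fpb", "queries.fps.zst"):
    print(example, "->", guess_fingerprint_format(example))
```

Passing an explicit format avoids any dependence on how the file happens to be named.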
The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.
If the source is in FPS format then open will return a chemfp.fps_io.FPSReader, which will use the location if specified.
If the source is in FPB format then open will return a chemfp.arena.FingerprintArena and the location will not be used. If allow_mmap is True then chemfp may use mmap to read uncompressed FPB files. If False then chemfp will read the file’s contents into memory, which may give better performance if the FPB file is on a networked file system, at the expense of higher memory use.
Here’s an example of printing the contents of the file:
    import chemfp
    from chemfp.bitops import hex_encode
    reader = chemfp.open("example.fps.gz")
    for id, fp in reader:
        print(id, hex_encode(fp))
- Parameters:
source (A filename string, a file object, or None) – The fingerprint source.
format (string, or None) – The file format and optional compression.
location (a Location instance, or None) – a location object used to access parser state information
allow_mmap (boolean) – if True, use mmap to open uncompressed FPB files, otherwise read the contents
- Returns:
- chemfp.open_fingerprint_writer(destination: _typing.Destination, metadata: _typing.Optional[Metadata] = None, format: _OptionalStr = None, *, alignment: int = 8, reorder: bool = True, level: _typing.CompressionLevel = None, include_metadata: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None) FingerprintWriter ¶
Create a fingerprint writer for the given destination
The fingerprint writer is an object with methods to write fingerprints to the given destination. The output format is based on the format. If that’s None then the format depends on the destination, or is “fps” if the attempts at format detection fail.
The metadata, if given, is a Metadata instance, and used to fill the header of an FPS file or META block of an FPB file.
If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None for stdout. If the output format is “fpb” then destination must be a filename or seekable file object. A fingerprint writer with compressed FPB output is not supported; use arena.save() instead, or post-process the file.
Use level to change the compression level. The default is 9 for gzip and 3 for zstd. Use “min”, “default”, or “max” as aliases for the minimum, default, and maximum values for each range.
By default the metadata is included in the FPS output. Set include_metadata to False to disable writing the metadata.
Some options only apply to FPB output. The alignment specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.
The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn’t enough available free memory. In that case, set max_spool_size to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)
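On the popcount reordering mentioned above: one reason it helps is that the Tanimoto similarity of two fingerprints can never exceed the ratio of the smaller popcount to the larger one, so a search can skip popcount ranges that cannot reach the threshold. The sketch below illustrates that bound with byte-string fingerprints; it is not chemfp's search code, and popcount, reorder_by_popcount, and may_reach are hypothetical helpers.

```python
def popcount(fp: bytes) -> int:
    """Number of bits set in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def reorder_by_popcount(records):
    """Sort (id, fingerprint) pairs by increasing popcount (stable sort)."""
    return sorted(records, key=lambda rec: popcount(rec[1]))


def may_reach(query_popcount: int, target_popcount: int, threshold: float) -> bool:
    """True if a target with this popcount could still meet the threshold.

    Uses the bound Tanimoto(A, B) <= min(|A|, |B|) / max(|A|, |B|).
    """
    small, large = sorted((query_popcount, target_popcount))
    if large == 0:
        return threshold <= 0.0
    return small / large >= threshold


records = [("a", b"\xff"), ("b", b"\x01"), ("c", b"\x0f")]
print([rec_id for rec_id, _ in reorder_by_popcount(records)])
print(may_reach(10, 4, 0.5), may_reach(10, 5, 0.5))
```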
Use tmpdir to specify where to write the temporary spool files if you don’t want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.
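Python's standard library has a similar memory-then-disk mechanism, which may help in picturing what max_spool_size and tmpdir control. This sketch uses tempfile.SpooledTemporaryFile purely as an analogy; it makes no claim about how chemfp implements its spools, and collect_with_spool is a hypothetical name.

```python
import tempfile


def collect_with_spool(chunks, max_spool_size, tmpdir=None):
    """Buffer chunks in memory, rolling over to a temporary file when large."""
    # With dir=None, tempfile picks the directory itself, consulting the
    # TMPDIR, TEMP and TMP environment variables.
    with tempfile.SpooledTemporaryFile(max_size=max_spool_size, dir=tmpdir) as spool:
        for chunk in chunks:
            spool.write(chunk)  # switches to disk once max_size is exceeded
        spool.seek(0)
        return spool.read()


data = collect_with_spool([b"ab", b"cd", b"ef"], max_spool_size=3)
print(len(data), "bytes")
```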
Some options only apply to FPS output. errors specifies how to handle recoverable write errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record. If include_metadata is false then the FPS metadata (the initial lines starting with ‘#’) are not included.
The location is a Location instance. It lets the caller access state information such as the number of records that have been written.
- Parameters:
destination (a filename, file object, or None) – the output destination
metadata (a Metadata instance, or None) – the fingerprint metadata
format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
alignment (positive integer) – arena byte alignment for FPB files
reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
level (an integer, the strings "min", "default" or "max", or None for default) – the compression level to use for compressed output
include_metadata (a boolean) – if True, include the header metadata in the FPS output
tmpdir (string or None) – the directory to use for temporary files, when max_spool_size is specified
max_spool_size (integer, or None) – number of bytes to store in memory before using a temporary file. If None, use memory for everything.
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle recoverable write errors
location (a Location instance, or None) – a location object used to access output state information
- Returns:
- chemfp.open_from_string(content: _typing.Content, format: _OptionalStr = 'fps', *, location: _typing.OptionalLocation = None) FingerprintReader ¶
Read fingerprints from a content string containing fingerprints in the given format
The supported format strings are:
“fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
“fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format
If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.
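Uncompressed FPS content can be a text string because FPS is a line-oriented text format: header lines start with '#', and each record line holds a hex-encoded fingerprint, a tab, and the id. The parser below is a deliberately minimal sketch of that layout for illustration only; parse_fps_text is a hypothetical helper, and it skips the validation that a real reader performs.

```python
def parse_fps_text(content: str):
    """Split FPS text into a header dict and a list of (id, fingerprint)."""
    header = {}
    records = []
    for line in content.splitlines():
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; "#FPS1" has no "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            continue
        hex_fp, _, rest = line.partition("\t")
        record_id = rest.split("\t", 1)[0]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return header, records


content = "#FPS1\n#num_bits=16\n00ff\tfirst\nabcd\tsecond\n"
header, records = parse_fps_text(content)
print(header, records)
```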
The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.
- Parameters:
content (byte or text string) – The fingerprint data as a string.
format (string) – The file format and optional compression. Unicode strings may not be compressed.
location (a Location instance, or None) – a location object used to access parser state information
- Returns:
- chemfp.rdkit2fps(source: _typing.Source, destination: _typing.Destination, *, type: _typing.FingerprintTypeOrStr = 'RDKit-Morgan', input_format: _OptionalStr = None, output_format: _OptionalStr = None, reader_args: _typing.OptionalReaderArgs = None, id_tag: _OptionalStr = None, errors: _typing.ErrorsNames = 'ignore', id_prefix: _OptionalStr = None, id_template: _OptionalStr = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: _OptionalStr = None, max_spool_size: _OptionalInt = None, progress: _typing.ProgressbarOrBool = True, bitFlags: _OptionalInt = None, branchedPaths: _Optional0or1 = None, countBounds: _Optional[list[int]] = None, countSimulation: _Optional0or1 = None, fpSize: _OptionalInt = None, fromAtoms: _Optional[list[int]] = None, includeChirality: _Optional0or1 = None, includeRedundantEnvironments: _Optional0or1 = None, includeRingMembership: _Optional0or1 = None, isQuery: _Optional0or1 = None, isomeric: _Optional0or1 = None, kekulize: _Optional0or1 = None, maxDistance: _OptionalInt = None, maxLength: _OptionalInt = None, maxPath: _OptionalInt = None, minDistance: _OptionalInt = None, minLength: _OptionalInt = None, minPath: _OptionalInt = None, min_radius: _OptionalInt = None, nBitsPerEntry: _OptionalInt = None, nBitsPerHash: _OptionalInt = None, numBitsPerFeature: _OptionalInt = None, onlyShortestPaths: _Optional0or1 = None, radius: _OptionalInt = None, rings: _Optional0or1 = None, targetSize: _OptionalInt = None, torsionAtomCount: _OptionalInt = None, use2D: _Optional0or1 = None, useBondOrder: _Optional0or1 = None, useBondTypes: _Optional0or1 = None, useChirality: _Optional0or1 = None, useFeatures: _Optional0or1 = None, useHs: _Optional0or1 = None)¶
Use RDKit to convert a structure file or files to a fingerprint file
Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in RDKit- and format-specific configuration.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
Use type to specify the fingerprint type. This may be a short-hand name like “morgan”, a chemfp type string like “RDKit-Morgan”, or a FingerprintType. Additional fingerprint-specific values may be passed as function call arguments.
If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle structure processing errors based on the value of errors, which may be one of “strict” (raise an exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If destination is a string and overwrite is false then do not generate fingerprints if the destination file already exists.
If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
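Because progress accepts a bool, a number, or a callable, any code that dispatches on it has to test for bool before int (in Python, bool is a subclass of int). The following sketch shows one way such a dispatch could be written; interpret_progress and its return values are hypothetical and are not part of chemfp.

```python
def interpret_progress(progress):
    """Classify a progress argument as (kind, detail)."""
    # Test the bool values first: isinstance(True, int) is True in Python.
    if progress is True:
        return ("default", 0.0)        # default bar, shown immediately
    if progress is False:
        return ("off", None)           # no progress bar
    if isinstance(progress, (int, float)):
        return ("default", float(progress))  # delay in seconds
    if callable(progress):
        return ("custom", progress)    # caller-supplied bar constructor
    raise TypeError(f"unsupported progress value: {progress!r}")


for value in (True, False, 2, 0.5):
    print(repr(value), "->", interpret_progress(value))
```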
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
type (a FingerprintType or string) – the fingerprint type to use
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
reader_args (a dictionary) – the reader arguments
id_tag (a string, or None to use the title) – if specified, get the id from the named SDF data tag
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
bitFlags (int or None) – the bitFlags (Avalon)
branchedPaths (bool or None) – if True, allow branched paths (RDKit-Fingerprint)
countBounds (list[int] or None) – a list of count bounds (AtomPair/3, RDKit-Fingerprint/3, RDKit-Morgan/2, RDKit-Torsion/4)
countSimulation (bool or None) – if True, use count simulation (AtomPair/3, RDKit-Fingerprint/3, RDKit-Morgan/2, RDKit-Torsion/4)
fpSize (int or None) – the number of bits in the fingerprint
fromAtoms (list[int] or None) – list of starting atom indices (AtomPair, RDKit-Fingerprint, Morgan, Torsion)
includeChirality (bool or None) – if True, include chirality (AtomPair, Morgan/2, Torsion)
includeRedundantEnvironments (bool or None) – if True, include the redundant environments (Morgan)
includeRingMembership (bool or None) – if True, include ring membership (Morgan/2)
isQuery (bool or None) – if True, treat as a query (Avalon)
isomeric (bool or None) – if True, use isomeric SMILES (SECFP)
kekulize (bool or None) – if True, use the Kekule SMILES (SECFP)
maxDistance (int or None) – the maximum distance between pairs (AtomPair/3)
maxLength (int or None) – the maximum distance between pairs (AtomPair/2)
maxPath (int or None) – the maximum path length (RDKit-Fingerprint)
minDistance (int or None) – the minimum distance between pairs (AtomPair/3)
minLength (int or None) – the minimum distance between pairs (AtomPair/2)
minPath (int or None) – the minimum path length (RDKit-Fingerprint)
min_radius (int or None) – the minimum radius (SECFP)
nBitsPerEntry (int or None) – the number of bits to set (AtomPair/2, Torsion/3)
nBitsPerHash (int or None) – the number of bits to set (RDKit-Fingerprint/2)
numBitsPerFeature (int or None) – the number of bits to set (RDKit-Fingerprint/3)
onlyShortestPaths (bool or None) – if True, only use shortest possible paths (Torsion/4)
radius (int or None) – circular radius (Morgan, SECFP)
rings (bool or None) – include ring information (SECFP)
targetSize (int or None) – number of atoms to use in the torsion (Torsion/3)
torsionAtomCount (int or None) – number of atoms to use in the torsion (Torsion/4)
use2D (bool or None) – if True, use 2D distance matrix, if False use first conformer (AtomPair)
useBondOrder (bool or None) – include bond order invariants (RDKit-Fingerprint)
useBondTypes (bool or None) – include bond type invariants (Morgan)
useChirality (bool or None) – include chirality invariants (Morgan/1)
useFeatures (bool or None) – if True, use chemical-feature invariants (Morgan)
useHs (bool or None) – if True, include information about the number of hydrogens (RDKit-Fingerprint)
- Returns:
- chemfp.read_molecule_fingerprints(type: str | Metadata, source: str | bytes | Path | None | BinaryIO = None, format: str | None = None, id_tag: str | None = None, reader_args: Dict[str, Any] | None = None, errors: Literal['strict', 'report', 'ignore'] = 'strict') FingerprintReader ¶
Read structures from source and return the corresponding ids and fingerprints
This returns a chemfp.fps_io.FPSReader which can be iterated over to get the id and fingerprint for each structure record read. The fingerprint generated depends on the value of type. Structures are read from source, which can either be the structure filename, or None to read from stdin.
type contains the information about how to turn a structure into a fingerprint. It can be a string or a metadata instance. String values look like OpenBabel-FP2/1, OpenEye-Path, and OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond. Default values are used for unspecified parameters. Use a Metadata instance with type and aromaticity values set in order to pass aromaticity information to OpenEye.
If format is None then the structure file format and compression are determined by the filename’s extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise format may be “smi” or “sdf” optionally followed by “.gz” or “.bz2” to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.
If id_tag is None, then the record id is based on the title field for the given format. If the input format is “sdf” then id_tag specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the “> <ChEBI ID>” line. In that case, use id_tag = "ChEBI ID".
The reader_args is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.
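To show what an id_tag lookup involves, here is a small sketch that pulls the first line of a tag value out of one SD record held as a string. It is a simplified illustration, not the parser chemfp or the toolkits use; get_sd_tag is a hypothetical helper and the record content is made up.

```python
def get_sd_tag(record: str, tag: str):
    """Return the first line of the named SD tag value, or None if absent."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        # A data header line looks like "> <ChEBI ID>".
        if line.startswith(">") and f"<{tag}>" in line:
            if i + 1 < len(lines):
                return lines[i + 1]  # only the first line of the value
            return None
    return None


record = (
    "\n  example\n\n"
    "M  END\n"
    "> <ChEBI ID>\nCHEBI:776\n\n"
    "> <Comment>\nline one\nline two\n\n"
    "$$$$\n"
)
print(get_sd_tag(record, "ChEBI ID"))
```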
errors specifies how to handle errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.
Here is an example of using fingerprints generated from a structure file:
    import chemfp
    from chemfp.bitops import hex_encode
    fp_reader = chemfp.read_molecule_fingerprints("OpenBabel-FP4/1", "example.sdf.gz")
    print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
    for (id, fp) in fp_reader:
        print(id, hex_encode(fp))
See also chemfp.read_molecule_fingerprints_from_string().
- Parameters:
type (string or Metadata) – information about how to convert the input structure into a fingerprint
source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
format (string, or None to autodetect based on the source) – The file format and optional compression. Examples: “smi” and “sdf.gz”
id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Only valid for SD files. Example: “ChEBI ID”.
reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
- Returns:
- chemfp.read_molecule_fingerprints_from_string(type: str | Metadata, content: str | bytes, format: str, *, id_tag: str | None = None, reader_args: Dict[str, Any] | None = None, errors: Literal['strict', 'report', 'ignore'] = 'strict') FingerprintReader ¶
Read structures from the content string and return the corresponding ids and fingerprints
The parameters are identical to chemfp.read_molecule_fingerprints() except that the entire content is passed through as a content string, rather than as a source filename. See that function for details.
You must specify the format! As there is no source filename, it’s not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.
- Parameters:
type (string or Metadata) – information about how to convert the input structure into a fingerprint
content (string) – The structure data as a string.
format (string) – The file format and optional compression. Examples: “smi” and “sdf.gz”
id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
- Returns:
- chemfp.sdf2fps(source: str | bytes | Path | None | BinaryIO, destination: str | bytes | Path | None | BinaryIO, *, id_tag: str | None = None, fp_tag: str | None = None, input_format: str | None = None, output_format: str | None = None, metadata: Metadata | None = None, pubchem: bool = False, decoder: None | str | Callable[[str], tuple[int, bytes]] = None, errors: Literal['strict', 'report', 'ignore'] = 'report', id_prefix: str | None = None, id_template: str | None = None, id_cleanup: bool = True, overwrite: bool = True, reorder: bool = True, tmpdir: str | None = None, max_spool_size: int | None = None, progress: bool | float | int | None | Callable = True)¶
Extract and save fingerprints from tag data in an SD file
Use source to specify the input, which may be None for stdin, a file-like object, a filename, or a list of filenames. If input_format is not specified then the filename extension (if available) is used to determine the compression type, defaulting to uncompressed. Possible values for input_format include “sdf”, “sdf.gz”, and “sdf.zst”.
Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.
The id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier. The fp_tag specifies the tag containing the encoded fingerprint. The decoder describes how to decode the fingerprints. It may be one of “binary”, “binary-msb”, “hex”, “hex-lsb”, “hex-msb”, “base64”, “cactvs”, or “daylight”, or a callable object which takes the fingerprint string and returns the (number of bits, fingerprint byte string), or raises a ValueError on failures.
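As a minimal sketch of the callable form of the decoder, the hypothetical function below turns a hex-encoded fingerprint string into the (number of bits, fingerprint byte string) pair. The name decode_hex_fingerprint is an illustration and not part of chemfp, and it assumes the tag value is plain hexadecimal text with two characters per byte.

```python
from typing import Tuple


def decode_hex_fingerprint(text: str) -> Tuple[int, bytes]:
    """Hypothetical decoder: hex text -> (number of bits, fingerprint bytes)."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("empty fingerprint")
    # bytes.fromhex raises ValueError for odd lengths or non-hex characters,
    # which matches the "raise a ValueError on failures" contract.
    fp = bytes.fromhex(cleaned)
    return 8 * len(fp), fp
```

A callable with this shape could then be given as the decoder argument instead of one of the named decoders.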
If id_cleanup is True then use the id up to any newline and remove any linefeed, tab, or NUL characters, as well as any leading or trailing spaces.
There are two options to synthesize a new identifier. Use id_prefix to specify a string prepended to the id, or use id_template to specify a string used as a template. The template substitutions are: {i} (index starting from 1), {i0} (index starting from 0), {recno} (the current record number), {id} (the original id), {clean_id} (the id after cleanup), {first_word} (the first word of the first line), and {first_line} (the first line).
Handle record processing errors based on the value of errors, which may be one of “strict” (raise exception), “report” (send a message to stderr and continue processing), or “ignore” (continue processing), or a chemfp.io.ErrorHandler.
If metadata is not None then it is used to generate the metadata output in the output file.
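The id cleanup and id_template rules can be pictured with a small pure-Python sketch. The helper names cleanup_id and make_id are hypothetical, the treatment of carriage returns is an assumption, and the code only mimics the documented substitutions with str.format; it is not how chemfp implements them.

```python
def cleanup_id(raw_id: str) -> str:
    """Keep the text before the first newline, then drop CR, tab, and NUL
    characters and any leading or trailing spaces (assumed interpretation)."""
    first_line = raw_id.split("\n", 1)[0]
    for ch in ("\r", "\t", "\0"):
        first_line = first_line.replace(ch, "")
    return first_line.strip(" ")


def make_id(template: str, raw_id: str, index0: int, recno: int) -> str:
    """Fill in the documented template fields for one record."""
    first_line = raw_id.split("\n", 1)[0]
    words = first_line.split()
    return template.format(
        i=index0 + 1,            # index starting from 1
        i0=index0,               # index starting from 0
        recno=recno,             # current record number
        id=raw_id,               # the original id
        clean_id=cleanup_id(raw_id),
        first_word=words[0] if words else "",
        first_line=first_line,
    )
```

For example, make_id("CMPD-{i}-{clean_id}", " abc\t123 \nsecond line", 0, 1) gives "CMPD-1-abc123" under these assumptions.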
If pubchem is true and metadata is None, then a new Metadata will be used, with software as “CACTVS/unknown”, type as “CACTVS-E_SCREEN/1.0 extended=2”, num_bits as 881, and sources containing any source terms which are filenames.
The pubchem option also sets fp_tag to “PUBCHEM_CACTVS_SUBSKEYS” and decoder to “cactvs”, but only if those values aren’t otherwise specified.
If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.
If progress is True then use a progress bar to show the SDF processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
The values of reorder, tmpdir, and max_spool_size are passed to open_fingerprint_writer().
This function returns a ConversionInfo instance with information about the conversion.
- Parameters:
source (a filename, list of filenames, file object, or None for stdin) – the input source or sources for the SDF structures
destination (a filename, file object, or None for stdout) – the output for the fingerprints
id_tag (str or None) – use the named data item for the id, otherwise use the title
fp_tag (str or None) – use the named data item for the fingerprint, otherwise use the title
input_format (a string or None) – if specified, the source file format
output_format (a string or None) – if specified, the destination file format
metadata (a Metadata) – the metadata to use for the output
pubchem (bool) – if True, configure for processing a PubChem file
decoder (None, str, or Callable[bytes]->(int, bytes)) – a decoder name or callable to convert the fingerprint to a (num_bits, binary_fp) tuple
errors (one of “strict”, “report” or “ignore”, or an ErrorHandler) – specify how to handle parse errors
id_prefix (a string or None) – a string prepended to each id to create a new id
id_template (a string or None) – a string template used to create a new id
id_cleanup (bool) – if True, post-process the id to handle special characters
overwrite (bool) – if False do not process if the output file exists
reorder (bool) – if True and FPB output format, reorder the fingerprints by popcount
tmpdir (a string, or None to use Python's default) – the directory to use for temporary spool files
max_spool_size (integer number of bytes, or None) – if not None, the amount of memory to use before spooling
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
- chemfp.set_default_progressbar(progressbar: bool | Callable | None)¶
Configure the default progress bar
This must be an object implementing the tqdm class behavior or one of the following values:
False - do not use a progress bar
None or True - use the default progress bar
(False is mapped to the internal “disabled_tqdm” object.)
- chemfp.set_num_threads(num_threads: int)¶
Specify the default number of OpenMP threads that chemfp should use
Several chemfp functions are parallelized using OpenMP, and support a num_threads parameter to specify the number of OpenMP threads to use. If num_threads is -1 (the default) then chemfp uses the value of get_num_threads() to get the actual number to use.
The set_num_threads function changes the default chemfp value to the specified value, if positive.
Use -1 to set the default number to the value returned by get_omp_num_threads(), which is also chemfp’s initial value.
Otherwise, if the value is 1 or smaller then chemfp’s default number of threads is set to 1.
- Parameters:
num_threads (int) – the new number of OpenMP threads to use
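The rules for the new default can be summarized in a few lines. This is only a sketch of the documented behavior: new_default_num_threads is a hypothetical helper, and omp_num_threads stands in for whatever get_omp_num_threads() would return.

```python
def new_default_num_threads(num_threads: int, omp_num_threads: int) -> int:
    """Return the default thread count that would result from the request."""
    if num_threads == -1:
        # -1 resets to the OpenMP value, which is also the initial value.
        return omp_num_threads
    if num_threads <= 1:
        # Zero, one, and other negative values all mean single-threaded.
        return 1
    return num_threads
```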
- chemfp.simarray(*, query_fp: _Optional[bytes] = None, queries: _Optional[_typing.ExplicitSourceOrArena] = None, query_format: _OptionalStr = None, targets: _typing.ExplicitSourceOrArena, target_format: _OptionalStr = None, metric: _typing.MetricNames = 'Tanimoto', as_distance: bool = False, include_lower_triangle: bool = True, out: _typing.OptionalNumPyArray = None, dtype: _typing.SimarrayDType = None, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1) chemfp.highlevel.simarray.SimarrayResult ¶
High-level API to generate a NumPy array containing the all-by-all comparisons
If targets is specified (and query_fp and queries are not) then generate the full NxN comparison matrix for all the fingerprints in targets. Set include_lower_triangle to False to leave the lower triangle as zeros (this is slightly faster than computing the full matrix).
If queries and targets are specified then generate the full NxM comparison matrix between all N fingerprints in the queries with the M fingerprints in the targets.
If query_fp and targets are specified then generate a vector of length N containing the comparison values between the query fingerprint (a byte string) and all N target fingerprints.
The number of fingerprint bits must be 2**15 or smaller. The Dice similarity does not support 2**15 fingerprint bits.
If queries or targets is a filename or file-like object then this function will use load_fingerprints() with the given query_format or target_format to read the file into a chemfp fingerprint arena. The fingerprint order will be preserved.
The standard metrics (specified by metric), and their supported data types (specified by dtype with the default dtype listed first) are:
- “Tanimoto” = popcount(fp1 & fp2) / popcount(fp1 | fp2)
dtypes = [float64, float32, rational64, rational32, uint16]
- “Dice” = 2 * popcount(fp1 & fp2) / (popcount(fp1) + popcount(fp2))
dtypes = [float64, float32, rational64, rational32, uint16]
- “cosine” = popcount(fp1 & fp2) / sqrt(popcount(fp1) * popcount(fp2))
dtypes = [float64, float32, uint16]
- “Hamming” = popcount(fp1 ^ fp2)
dtypes = [uint16]
The rational64 and rational32 dtypes are two structured NumPy dtypes containing the numerator and denominator terms, as two uint32 and uint16 fields, respectively. These are not necessarily in reduced form (eg, it may store (2, 4) instead of (1, 2)).
The Tanimoto, Dice, and cosine “uint16” similarity scores are computed as floor(65535 * double_score), so 65535 means identity.
If as_distance is True then the Tanimoto, Dice, and cosine similarity scores are turned into a distance by computing 1-score.
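The following pure-Python sketch evaluates the four standard metrics on fingerprints stored as byte strings, together with the uint16 scaling and the distance conversion. It assumes the standard cosine definition, which divides by the square root of the product of the popcounts, and it assumes a score of 0.0 when the denominator is zero. All function names are illustrative and none of this is chemfp's implementation.

```python
import math


def _bits(fp: bytes) -> int:
    return int.from_bytes(fp, "little")


def _popcount(value: int) -> int:
    return bin(value).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    a, b = _bits(fp1), _bits(fp2)
    union = _popcount(a | b)
    return _popcount(a & b) / union if union else 0.0


def dice(fp1: bytes, fp2: bytes) -> float:
    a, b = _bits(fp1), _bits(fp2)
    total = _popcount(a) + _popcount(b)
    return 2 * _popcount(a & b) / total if total else 0.0


def cosine(fp1: bytes, fp2: bytes) -> float:
    a, b = _bits(fp1), _bits(fp2)
    product = _popcount(a) * _popcount(b)
    return _popcount(a & b) / math.sqrt(product) if product else 0.0


def hamming(fp1: bytes, fp2: bytes) -> int:
    return _popcount(_bits(fp1) ^ _bits(fp2))


def to_uint16(score: float) -> int:
    """Scale a similarity in [0, 1] so that 65535 means identity."""
    return math.floor(65535 * score)


def to_distance(score: float) -> float:
    """The as_distance conversion: 1 - score."""
    return 1.0 - score
```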
The “Sheffield”, “Willett”, and “Daylight” metrics store their results in the “abcd” dtype, which is a 4-element structured NumPy dtype with uint16 fields “a”, “b”, “c”, and “d”. The metric name specifies which convention to use:
- Sheffield:
“a” = popcount(fp1 & fp2) = the number of on-bits in common
“b” = the number of on-bits in fp1 which are off-bits in fp2
“c” = the number of on-bits in fp2 which are off-bits in fp1
“d” = the number of off-bits in fp1 which are also off-bits in fp2
- Willett:
“a” = popcount(fp1) = the number of on-bits in the first fingerprint
“b” = popcount(fp2) = the number of on-bits in the second fingerprint
“c” = popcount(fp1 & fp2) = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2
- Daylight (same as Sheffield with “a” and “c” swapped):
“a” = the number of on-bits in fp2 which are off-bits in fp1
“b” = the number of on-bits in fp1 which are off-bits in fp2
“c” = popcount(fp1 & fp2) = the number of on-bits in common
“d” = the number of off-bits in fp1 which are also off-bits in fp2
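To make the three conventions concrete, here is a small sketch which computes the (a, b, c, d) values for a pair of equal-length byte-string fingerprints. The function name abcd_counts is hypothetical, and the number of bits is assumed to be 8 times the byte length.

```python
from typing import Tuple


def _popcount(value: int) -> int:
    return bin(value).count("1")


def abcd_counts(fp1: bytes, fp2: bytes, convention: str) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) using the named convention."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    x = int.from_bytes(fp1, "little")
    y = int.from_bytes(fp2, "little")
    common = _popcount(x & y)            # on in both
    only1 = _popcount(x) - common        # on in fp1, off in fp2
    only2 = _popcount(y) - common        # on in fp2, off in fp1
    both_off = 8 * len(fp1) - _popcount(x | y)
    if convention == "Sheffield":
        return (common, only1, only2, both_off)
    if convention == "Willett":
        return (_popcount(x), _popcount(y), common, both_off)
    if convention == "Daylight":
        return (only2, only1, common, both_off)
    raise ValueError(f"unknown convention: {convention!r}")
```

With fp1 = 0b00001111 and fp2 = 0b01111100 (one byte each) there are 2 bits in common, 2 bits only in fp1, 3 bits only in fp2, and 1 bit off in both.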
If out is None then this function creates a zeroed NumPy array to store the scores using the specified metric and dtype.
Otherwise, out must be a NumPy array (or view), with a dtype which is appropriate to the specified metric. (Note: only the field types must match, not the field names.) The number of rows and columns must be large enough for the number of query and target fingerprints.
By default this will display progress bars while loading files and generating the array. Use progress=False to disable them, or a floating point value to not display a progress bar until the specified number of seconds.
Use num_threads to specify the number of threads to use. The default value of -1 means to use the value returned by get_num_threads().
This returns a SimarrayResult instance which can be used to access the query and target arenas, the output array, and any metadata, or to save the result in “npy” format.
- Parameters:
query_fp (a byte string or None) – a query fingerprint for 1xN search
queries (None, a filename, file object, or a FingerprintArena) – the query fingerprints
query_format (str or None) – the file format for the queries file
targets (None, a filename, file object, or a FingerprintArena) – the target fingerprints
target_format (str or None) – the file format for the target file
metric (str) – the name of the metric to use
as_distance (bool) – if True, use a distance instead of a similarity
include_lower_triangle (bool) – if True, also compute the lower triangle
out (None or a NumPy array) – the NumPy array or view in which to save the results
dtype (None, str, or a NumPy dtype) – the specific data type to compute
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
num_threads (int) – the number of threads to use, or -1 for the default
- Returns:
- chemfp.simsearch(*, query: _OptionalStr = None, query_fp: _Optional[bytes] = None, query_id: _OptionalStr = None, queries: _Optional[_typing.ExplicitSourceOrArena] = None, query_format: _OptionalStr = None, type: _Optional[_typing.FingerprintTypeOrStr] = None, targets: _typing.Source, target_format: _OptionalStr = None, NxN: bool = False, k: _OptionalInt = None, threshold: _Optional[float] = None, alpha: _Optional[float] = None, beta: _Optional[float] = None, include_lower_triangle: bool = True, ordering: _Optional[_typing.OrderingNames] = None, progress: _typing.ProgressbarOrBool = True, num_threads: _typing.NumThreadsType = -1)¶
High-level API for similarity searches in targets.
Several different search types are supported:
If query_fp is a byte string then use it as the query fingerprint to search targets, create a SearchResult, and return a SingleQuerySimsearch.
If query_id is not None then get the corresponding fingerprint in targets (or raise a KeyError) and use it to search targets, create a SearchResult, and return a SingleQuerySimsearch.
If query is not None then parse it as a molecule record in query_format format (default: ‘smi’), create a SearchResult, and return a SingleQuerySimsearch.
If queries is not None, use it as queries for an NxM search of targets, create a SearchResults, and return a MultiQuerySimsearch.
If NxN is true then do an NxN search of the targets, create a SearchResults, and return an NxNSimsearch.
.
The function returns a BaseSimsearch instance with information about what happened. Its out attribute stores the SearchResult or SearchResults.
If queries or targets is not a fingerprint arena then use load_fingerprints() to load the arena. Use query_format or target_format to specify the format type.
If k is not None then do a k-nearest search, otherwise do a threshold search. If threshold is None then the threshold is 0.0. If both are None then the defaults are k=3, threshold=0.0.
If alpha = beta = None or 1.0 then use a Tanimoto search, otherwise do a Tversky search with the given values of alpha and beta. If beta is None then beta is set to alpha.
For NxN threshold search, if include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When False, only compute the upper triangle.
If ordering is not None then the hits will be reordered as specified. The available orderings are:
increasing-score - sort by increasing score
decreasing-score - sort by decreasing score
increasing-score-plus - sort by increasing score, break ties by increasing index
decreasing-score-plus - sort by decreasing score, break ties by increasing index
increasing-index - sort by increasing target index
decreasing-index - sort by decreasing target index
move-closest-first - move the hit with the highest score to the first position
reverse - reverse the current ordering
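The orderings listed above can be sketched on a plain list of (target index, score) pairs. This is an illustration only: the reorder function is hypothetical, ties in the plain score orderings keep their input order because Python's sort is stable, and move-closest-first is assumed to shift the other hits down by one rather than swap positions.

```python
from typing import List, Tuple

Hit = Tuple[int, float]  # (target index, score)


def reorder(hits: List[Hit], ordering: str) -> List[Hit]:
    """Return a new list of hits arranged according to the ordering name."""
    if ordering == "increasing-score":
        return sorted(hits, key=lambda h: h[1])
    if ordering == "decreasing-score":
        return sorted(hits, key=lambda h: -h[1])
    if ordering == "increasing-score-plus":
        return sorted(hits, key=lambda h: (h[1], h[0]))
    if ordering == "decreasing-score-plus":
        return sorted(hits, key=lambda h: (-h[1], h[0]))
    if ordering == "increasing-index":
        return sorted(hits, key=lambda h: h[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda h: -h[0])
    if ordering == "move-closest-first":
        if not hits:
            return []
        # Position of the first hit with the highest score.
        best = max(range(len(hits)), key=lambda i: hits[i][1])
        return [hits[best]] + hits[:best] + hits[best + 1:]
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError(f"unknown ordering: {ordering!r}")
```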
If progress is True then use a progress bar to show FPS load progress, and NxN and NxM search progress. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar.
Use num_threads to specify the number of threads to use. The default of -1 means to use the value of chemfp.get_num_threads(), otherwise it must be a positive integer.
- Parameters:
query (a string or None) – a query structure record
query_fp (a byte string or None) – a query fingerprint
query_id (str or None) – use the corresponding targets fingerprint as a query fingerprint
queries (a filename, file object, or FingerprintArena) – the query fingerprints
query_format (str or None) – the file format for the query file
type (a string, FingerprintType, or None) – the fingerprint type used to convert a query to a fingerprint
targets (a filename, file object, FingerprintArena, or None) – the target fingerprints
target_format (str or None) – the file format for the target file
NxN (bool) – if True, use the targets to search itself
k (int or None) – the number of nearest neighbors to find
threshold (float or None) – the minimum similarity threshold
alpha (float) – the Tversky alpha value
beta (float) – the Tversky beta value
include_lower_triangle (bool) – if True and an NxN search, also include the lower triangle
ordering (str or None) – the expected output ordering
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
num_threads (int) – the number of threads to use, or -1 for the default
- Returns:
a SingleQuerySimsearch, MultiQuerySimsearch, or NxNSimsearch
- chemfp.spherex(candidates: _typing.SourceOrArena, *, references: _Optional[_typing.ExplicitSourceOrArena] = None, initial_picks: _Optional[_typing.Union[int, str, list[int], list[str]]] = None, candidates_format: _OptionalStr = None, references_format: _OptionalStr = None, num_picks: int = 1000, threshold: float = 0.4, ranks: _Optional[int] = None, dise: bool = False, dise_type: _Optional[_typing.FingerprintTypeOrStr] = None, dise_references: _Optional[_typing.ExplicitSource] = None, dise_references_format: _OptionalStr = None, randomize: _typing.Literal[None, True, False] = None, seed: int = -1, num_threads: _typing.NumThreadsType = -1, include_counts: bool = False, include_neighbors: bool = False, progress: _typing.ProgressbarOrBool = True)¶
Use sphere picking to select diverse fingerprints from candidates
Sphere picking iteratively picks a fingerprint from a set of candidates such that the fingerprint is not at least threshold similar to any previously picked fingerprint. The process is repeated until num_picks fingerprints are selected or no pickable fingerprints are available.
Several variations of “picks a fingerprint” are supported. If directed sphere exclusion is NOT used, then:
1) The default (randomize = None), or if randomize = True, select the next available candidate at random.
2) If randomize = False, select the next candidate which has the smallest index in the arena. This biases the picks towards fingerprints with fewer bits set, which are likely fingerprints with lower complexity. It doesn’t appear to be that useful.
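As a rough illustration of the basic algorithm, the sketch below implements variant 2) (always pick the available candidate with the smallest index) in pure Python, using byte strings as fingerprints. The names tanimoto and sphere_pick are hypothetical, and the code is a simple quadratic-time sketch, not chemfp's multi-threaded implementation.

```python
from typing import List, Sequence


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def sphere_pick(fingerprints: Sequence[bytes], threshold: float = 0.4,
                num_picks: int = 1000) -> List[int]:
    """Return the indices of the picked sphere centers."""
    available = list(range(len(fingerprints)))
    picks: List[int] = []
    while available and len(picks) < num_picks:
        center = available[0]  # smallest remaining index
        picks.append(center)
        # Drop the center and every candidate inside its sphere, that is,
        # every candidate at least `threshold` similar to the center.
        available = [
            i for i in available[1:]
            if tanimoto(fingerprints[center], fingerprints[i]) < threshold
        ]
    return picks
```

Replacing the `available[0]` choice with a random element of `available` would give the randomized variant 1).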
Directed sphere exclusion (see the DISE paper by Gobbi and Lee), requires a rank for each fingerprint. The next pick is chosen from one of the fingerprints with the smallest rank. There are three ways to specify the ranks:
A) They can be passed in directly as the ranks array, which must be a list of integers between 0 and 2**64-1.
B) If dise is True then the structures from the DISE paper are used. This requires a chemistry toolkit to generate the reference fingerprints. Use dise_type to specify the fingerprint type to use instead of the one from the candidates.
C) The reference fingerprints for the DISE algorithm may be passed as dise_references. This may be an arena or a fingerprint filename. Use dise_references_format to specify the file format instead of using the extension.
If initial ranks are specified, then there are two additional ways to pick a fingerprint:
3) The default (randomize = None), or if randomize = False, selects the candidate with the smallest rank, breaking ties by selecting the candidate with the smallest index in the arena.
4) If randomize = True, select randomly from all of the candidates with the smallest rank. NOTE: this method uses a linear search, which may cause quadratic behavior if many fingerprints have the same rank.
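The ranked selection in variant 3) can be sketched the same way: the next center is the remaining candidate with the smallest (rank, index) pair. The function names are hypothetical, the ranks are supplied directly as a list of integers, and the linear scan with min() is for clarity only.

```python
from typing import List, Sequence


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def directed_sphere_pick(fingerprints: Sequence[bytes], ranks: Sequence[int],
                         threshold: float = 0.4, num_picks: int = 1000) -> List[int]:
    """Pick sphere centers in order of smallest rank, then smallest index."""
    available = set(range(len(fingerprints)))
    picks: List[int] = []
    while available and len(picks) < num_picks:
        center = min(available, key=lambda i: (ranks[i], i))
        picks.append(center)
        # Remove the center and everything inside its sphere.
        available = {
            i for i in available
            if i != center
            and tanimoto(fingerprints[center], fingerprints[i]) < threshold
        }
    return picks
```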
The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with the values of candidates_format and progress to load the arena.
If references is not None then any candidate fingerprints which are at least threshold similar to the reference fingerprints are removed before picking starts. If references is not a FingerprintArena then the value is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.
If references is not specified then optionally use initial_picks to specify the initial picks. This may be a candidate id string or integer index into the candidate array, or a list of id strings or integer indices. The list may be in any order and may contain duplicates. (The neighbor sphere will be empty for any duplicates.)
Initial picks are not necessary. If initial_picks is None then the specified picking method is used.
Some of the pick methods use a random number generator, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Sphere picking in the candidates may be multi-threaded. The default num_threads of -1 uses chemfp.get_num_threads() threads, which depends on the number of CPU cores in your system and is likely too small. My tests suggest 30 threads or higher is more effective. The values of 0 and 1 both mean single-threaded.
The function returns a BaseSpherexSearch object with processing information. The picker attribute is the SphereExclusionPicker used. By default the result element is a SpherexSearch instance. If include_counts is true then it is the SpherexCountSearch returned from calling the picker’s pick_n_with_counts(). If include_neighbors is True then the result is the SpherexNeighborSearch returned from calling pick_n_with_neighbors(). include_counts and include_neighbors cannot both be true.
If progress is True then a progress bar will be used to show any FPS file load progress. If False then no progress bar is used. If a float or int then the number of seconds to delay before showing a progress bar. It may also be a callable used to create the progress bar. The sphere picker search does not currently support progress bars.
- Parameters:
candidates (filename or FingerprintArena) – the candidate fingerprints for sphere picking
references (None, filename or FingerprintArena) – candidates must not be near these reference fingerprints
initial_picks (None, str, int, list[str], or list[int]) – the initial sphere centers, as indices or ids
candidates_format (str or None) – format for the candidates filename
references_format (str or None) – format for the references filename
num_picks (int) – the number of fingerprints to pick
threshold (float) – the maximum sphere exclusion similarity threshold
ranks (list[int] or None) – ranking values for the candidates, lowest ranks picked first
dise (bool) – if True, use directed sphere exclusion
dise_type (None, a string, or a FingerprintType) – specify the fingerprint type to convert the DISE reference structures to fingerprints
dise_references (a filename or file object) – the DISE structure or fingerprint source
dise_references_format (str) – the structure or fingerprint format
randomize (None or bool) – specify how to select the next candidate
seed (int) – specify the initial RNG seed, or -1 to have Python generate the seed
num_threads (int) – the number of threads to use, or -1 for the default
include_counts (bool) – if True, return a SpherexCountSearch with sphere counts
include_neighbors (bool) – if True, return a SpherexNeighborSearch with sphere neighbors
progress (bool, int, float, or callable) – True/False to have a progress bar, delay time before showing a progress bar, or a tqdm-like progress bar constructor
- Returns:
a SpherexSearch, SpherexCountSearch, or SpherexNeighborSearch
- chemfp.threshold_tanimoto_search(queries, targets, threshold: float = 0.7, arena_size: int = 100) Iterator[Tuple[str, List[Tuple[str, float]]]] ¶
Find all targets within threshold of each query term
For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.
Example:
    import chemfp

    queries = chemfp.open("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps.gz")
    for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.8):
        print(f"{query_id} has {len(hits)} neighbors with at least 0.8 similarity")
        non_identical = [target_id for (target_id, score) in hits if score != 1.0]
        print(" The non-identical hits are:", non_identical)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.threshold_tanimoto_search_fp() or chemfp.search.threshold_tanimoto_search_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (positive integer, or None) – The number of queries to process in a batch
- Returns:
An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.
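To show the shape of the result and the effect of arena_size, here is a brute-force pure-Python sketch which reads the queries in batches and yields one (query_id, hits) pair per query. It works on plain (id, byte string) pairs rather than chemfp readers or arenas, and the names tanimoto and threshold_search are illustrative only.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Record = Tuple[str, bytes]  # (id, fingerprint)


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def threshold_search(
    queries: Iterable[Record],
    targets: Sequence[Record],
    threshold: float = 0.7,
    arena_size: Optional[int] = 100,
) -> Iterator[Tuple[str, List[Tuple[str, float]]]]:
    query_iter = iter(queries)
    while True:
        # Read the next batch of queries; None means one single batch.
        if arena_size is None:
            batch = list(query_iter)
        else:
            batch = list(islice(query_iter, arena_size))
        if not batch:
            return
        for query_id, query_fp in batch:
            hits = []
            for target_id, target_fp in targets:
                score = tanimoto(query_fp, target_fp)
                if score >= threshold:
                    hits.append((target_id, score))
            yield query_id, hits
```

The batch size changes only how many queries are held in memory at once; the yielded pairs are the same for any arena_size.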
- chemfp.threshold_tanimoto_search_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7) _typing.IdAndSearchResultIter ¶
Find the other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint id, SearchResult) pairs. The SearchResult hit order is arbitrary.
Example:
    import chemfp

    arena = chemfp.load_fingerprints("targets.fps.gz")
    for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75):
        print(f"{fp_id} has {len(hits)} neighbors:")
        for (other_id, score) in hits.get_ids_and_scores():
            print(f" {other_id} {score:.2f}")
You may also be interested in the chemfp.search.threshold_tanimoto_search_symmetric() function.
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id,
SearchResult
) pairs, one for each fingerprint
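A symmetric search only needs to score each pair once. The sketch below computes the upper triangle and mirrors each hit to the other fingerprint, and it never compares a fingerprint with itself. It uses plain (id, byte string) pairs and hypothetical function names, and it returns lists of (other_id, score) pairs rather than SearchResult objects.

```python
from typing import List, Sequence, Tuple


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def threshold_search_symmetric(
    fingerprints: Sequence[Tuple[str, bytes]], threshold: float = 0.7
) -> List[Tuple[str, List[Tuple[str, float]]]]:
    n = len(fingerprints)
    all_hits: List[List[Tuple[str, float]]] = [[] for _ in range(n)]
    for i in range(n):
        id_i, fp_i = fingerprints[i]
        # Start at i + 1: skip the diagonal and the lower triangle.
        for j in range(i + 1, n):
            id_j, fp_j = fingerprints[j]
            score = tanimoto(fp_i, fp_j)
            if score >= threshold:
                all_hits[i].append((id_j, score))
                all_hits[j].append((id_i, score))  # mirror the hit
    return [(fingerprints[i][0], all_hits[i]) for i in range(n)]
```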
- chemfp.threshold_tversky_search(queries, targets, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0, arena_size: int = 100) Iterator[Tuple[str, List[Tuple[str, float]]]] ¶
Find all targets within threshold of each query term
For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.
Example:
    import chemfp

    queries = chemfp.open("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps.gz")
    for (query_id, hits) in chemfp.threshold_tversky_search(
            queries, targets, threshold=0.8, alpha=0.5, beta=0.5):
        print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
        non_identical = [target_id for (target_id, score) in hits if score != 1.0]
        print(" The non-identical hits are:", non_identical)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.threshold_tversky_search_fp() or chemfp.search.threshold_tversky_search_arena().
- Parameters:
queries (any fingerprint container) – The query fingerprints.
targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
arena_size (positive integer, or None) – The number of queries to process in a batch
- Returns:
An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.
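The example for this function uses alpha = beta = 0.5 to get Dice scores. A minimal sketch of the standard Tversky index shows why: with alpha = beta = 1.0 it reduces to Tanimoto, and with alpha = beta = 0.5 it reduces to Dice. The function name is illustrative, a score of 0.0 for a zero denominator is an assumption, and treating fp1 as the side weighted by alpha is also an assumption.

```python
def _popcount(value: int) -> int:
    return bin(value).count("1")


def tversky(fp1: bytes, fp2: bytes, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index: c / (alpha * only1 + beta * only2 + c)."""
    x = int.from_bytes(fp1, "little")
    y = int.from_bytes(fp2, "little")
    common = _popcount(x & y)
    only1 = _popcount(x) - common   # bits set only in fp1
    only2 = _popcount(y) - common   # bits set only in fp2
    denominator = alpha * only1 + beta * only2 + common
    return common / denominator if denominator else 0.0
```

With unequal alpha and beta the score is no longer symmetric in the two fingerprints.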
- chemfp.threshold_tversky_search_symmetric(fingerprints: _typing.FingerprintArena, threshold: float = 0.7, alpha: float = 1.0, beta: float = 1.0) _typing.IdAndSearchResultIter ¶
Find the other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint id, SearchResult) pairs. The SearchResult hit order is arbitrary.
Example:
    import chemfp

    arena = chemfp.load_fingerprints("targets.fps.gz")
    for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric(
            arena, threshold=0.75, alpha=0.5, beta=0.5):
        print(f"{fp_id} has {len(hits)} Dice neighbors:")
        for (other_id, score) in hits.get_ids_and_scores():
            print(f" {other_id} {score:.2f}")
You may also be interested in the chemfp.search.threshold_tversky_search_symmetric() function.
- Parameters:
fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- Returns:
An iterator of (fp_id,
SearchResult
) pairs, one for each fingerprint