I'm pleased to announce the release of chemfp 4.0.
The two themes of this release are diversity selection and improving the user-interface for interactive use.
Diversity selection
Chemfp 4.0 adds implementations of the MaxMin, heapsweep, and sphere exclusion algorithms for diversity selection.
MaxMin is an approximate iterative method to identify the most dissimilar fingerprints in a dataset without first computing all pairs of similarities. It can also be used for reference-based selection, where the picks must also be dissimilar to a set of reference fingerprints. For example, it can pick diverse fingerprints from a vendor catalog which are also dissimilar to an in-house collection.
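To make the idea concrete, here is a minimal pure-Python sketch of greedy MaxMin picking. It is not chemfp's implementation and does not use the chemfp API. The function names, the use of Python integers as bit-set fingerprints, and the tie-breaking by lowest index are all assumptions made for illustration.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def maxmin_picks(fingerprints, num_picks, first_pick=None, references=()):
    """Greedy MaxMin sketch (hypothetical helper, not the chemfp API).

    Each step picks the candidate whose highest similarity to anything
    already picked, or to any reference fingerprint, is the lowest.
    """
    # nearest[i] is the highest similarity seen so far between candidate i
    # and any reference or previously picked fingerprint.
    nearest = [
        max((tanimoto(fp, ref) for ref in references), default=0.0)
        for fp in fingerprints
    ]
    remaining = set(range(len(fingerprints)))
    picks = []
    if not remaining or num_picks <= 0:
        return picks

    def by_nearest(i):
        # Ties are broken by the lowest index (an arbitrary choice).
        return (nearest[i], i)

    # Assumption: with no initial pick, start from the candidate least
    # similar to the references. This is only a stand-in for a proper seed.
    current = first_pick if first_pick is not None else min(remaining, key=by_nearest)
    while True:
        picks.append(current)
        remaining.discard(current)
        if len(picks) >= num_picks or not remaining:
            return picks
        # Only similarities to the newest pick are computed, so the full
        # all-pairs similarity matrix is never needed.
        for i in remaining:
            sim = tanimoto(fingerprints[i], fingerprints[current])
            if sim > nearest[i]:
                nearest[i] = sim
        current = min(remaining, key=by_nearest)


fps = [0b1111, 0b1110, 0b0001, 0b110000]
print(maxmin_picks(fps, 3, first_pick=0))
print(maxmin_picks(fps, 2, references=[0b110000]))
```

The second call shows reference-based selection: the fingerprint identical to the reference is never chosen while less similar candidates remain.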
Heapsweep is an exact iterative method to identify the most dissimilar fingerprints in a dataset without first computing all pairs of similarities. It is about 50x slower than MaxMin. It is primarily used to seed MaxMin with the globally most diverse fingerprint, if no initial pick is given.
Sphere exclusion modifies random sampling so that it does not pick fingerprints which are similar to previous picks. Chemfp also supports directed sphere exclusion (DISE) using ranked picks. Fingerprints might be ranked by similarity to a set of reference compounds or by a user-defined list of values.
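As a rough illustration, sphere exclusion can be sketched in a few lines of pure Python. This is not chemfp's implementation and does not use the chemfp API. The function names, the integer bit-set fingerprints, and the convention that a smaller rank value is picked first are assumptions for the example.

```python
import random


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def sphere_exclusion(fingerprints, threshold, ranks=None, seed=None):
    """Sphere exclusion sketch (hypothetical helper, not the chemfp API).

    Candidates are visited in random order, or in rank order for the
    directed variant. A candidate is picked only if its similarity to
    every earlier pick is below `threshold`.
    """
    order = list(range(len(fingerprints)))
    if ranks is None:
        # Plain sphere exclusion: random sampling order.
        random.Random(seed).shuffle(order)
    else:
        # Directed sphere exclusion: assume a smaller rank means "pick first".
        order.sort(key=lambda i: ranks[i])

    picks = []
    for i in order:
        # Skip anything that falls inside the "sphere" of an earlier pick.
        if all(tanimoto(fingerprints[i], fingerprints[j]) < threshold for j in picks):
            picks.append(i)
    return picks


fps = [0b1111, 0b1110, 0b0001, 0b110000]
print(sphere_exclusion(fps, 0.5, ranks=[0, 1, 2, 3]))
print(sphere_exclusion(fps, 0.5, ranks=[3, 0, 1, 2]))
```

Changing the ranks changes which member of a cluster of similar fingerprints gets picked, which is the point of the directed variant.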
These algorithms are available using the new chemfp command, using the subcommands chemfp maxmin, chemfp heapsweep, and chemfp spherex, and through the chemfp API as chemfp.maxmin, chemfp.heapsweep, and chemfp.spherex.
Improved user-interface
Chemfp's command-line tools build on the chemfp Python API, so in
principle people could generate fingerprint files or do similarity
searches directly with the API. In practice, the API was too
low-level, and even in the notebook it was easier to use "!" to
shell out to chemfp's command-line tools.
There are new "high-level" functions which are similar to the
command-line tools. Rather than using !simsearch, use
chemfp.simsearch(). Rather than using !rdkit2fps, use
chemfp.rdkit2fps(), and so on.
Chemfp 4.0 adds pandas integration. The similarity search and diversity
selection results have a to_pandas method which exports the data to
a pandas DataFrame.
One calming addition is progress bars, based on tqdm, which are used in most places which are likely to take a long time. (The default progress bar can be disabled through the API or by setting the CHEMFP_PROGRESSBAR environment variable to "0".)
There are new helper methods to make it easier to work with
toolkit-specific formats and fingerprint types, like using
chemfp.rdkit.morgan.from_smiles("CO") to convert a SMILES string
into an RDKit Morgan fingerprint.
Finally, the simsearch and diversity selection command-line tools
support CSV and TSV output, specified through the --out option or
based on the output filename extension.