I'm pleased to announce the release of chemfp 4.0.

The two themes of this release are diversity selection and an improved user interface for interactive use.

Diversity selection

Chemfp 4.0 adds implementations of the MaxMin, heapsweep, and sphere exclusion algorithms for diversity selection.

MaxMin is an approximate iterative method to identify the most dissimilar fingerprints in a dataset without first computing all pairs of similarities. It can also be used for reference-based selection, where the picks must also be dissimilar to a set of reference fingerprints. For example, it can pick diverse fingerprints from a vendor catalog which are also diverse from an in-house collection.
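To make the idea concrete, here is a minimal pure-Python sketch of greedy MaxMin picking. It is not chemfp's implementation. The function names, the use of Python integers as bit-set fingerprints, the Tanimoto measure, and the tie-breaking by index are all assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def maxmin_picks(fingerprints, num_picks, first=None, references=()):
    """Greedy MaxMin sketch (hypothetical helper, not the chemfp API).

    Each new pick is the candidate whose most similar pick-or-reference
    fingerprint has the lowest similarity. Ties go to the lowest index.
    """
    n = len(fingerprints)
    if n == 0 or num_picks <= 0:
        return []
    # nearest[i] = highest similarity from candidate i to any reference or pick so far
    nearest = [max((tanimoto(fp, ref) for ref in references), default=0.0)
               for fp in fingerprints]
    remaining = set(range(n))
    if first is None:
        # Assumption: without an initial pick, start from the candidate
        # least similar to the references (index 0 if there are none).
        first = min(remaining, key=lambda i: (nearest[i], i))
    picks = []
    current = first
    while len(picks) < num_picks:
        picks.append(current)
        remaining.discard(current)
        if not remaining:
            break
        # Only similarities to the newest pick are computed, never all pairs.
        for i in remaining:
            score = tanimoto(fingerprints[i], fingerprints[current])
            if score > nearest[i]:
                nearest[i] = score
        current = min(remaining, key=lambda i: (nearest[i], i))
    return picks


fps = [0b11110000, 0b11100000, 0b00001111, 0b00000111, 0b00111100]
print(maxmin_picks(fps, 3))                           # [0, 2, 4]
print(maxmin_picks(fps, 2, references=[0b00001111]))  # [0, 4]
```

The second call shows the reference-based case: fingerprints 2 and 3 are never picked early because they are close to the reference.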

Heapsweep is an exact iterative method to identify the most dissimilar fingerprints in a dataset without first computing all pairs of similarities. It is about 50x slower than MaxMin. It is primarily used to seed MaxMin with the globally most diverse fingerprint, if no initial pick is given.
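One way an exact search can avoid the full similarity matrix is to keep a heap of lower bounds and refine only the most promising candidate. The sketch below follows that idea. It is my own illustration, not necessarily how chemfp's heapsweep works, and the function name, the chunk size, and the integer bit-set fingerprints are assumptions.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def least_similar_fingerprint(fingerprints, chunk=2):
    """Return (index, score) for the fingerprint whose nearest neighbor
    has the lowest similarity (hypothetical helper, not the chemfp API).

    Each heap entry is (bound, index, next_to_compare), where bound is the
    highest similarity seen so far for that fingerprint. The true
    nearest-neighbor similarity can only be equal or higher, so a fully
    evaluated entry at the top of the heap is the exact answer.
    """
    n = len(fingerprints)
    if n < 2:
        raise ValueError("need at least two fingerprints")
    heap = [(0.0, i, 0) for i in range(n)]
    heapq.heapify(heap)
    while True:
        bound, i, j = heapq.heappop(heap)
        if j >= n:
            # Compared against everything, and no other bound is lower.
            return i, bound
        stop = min(j + chunk, n)
        for k in range(j, stop):
            if k != i:
                bound = max(bound, tanimoto(fingerprints[i], fingerprints[k]))
        heapq.heappush(heap, (bound, i, stop))


fps = [0b11110000, 0b11100000, 0b00001111, 0b00000111, 0b00111100]
index, score = least_similar_fingerprint(fps)
print(index, round(score, 3))  # 4 0.333
```

Fingerprints which find a close neighbor early sink in the heap and are rarely refined further, which is where the savings over an all-pairs computation come from.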

Sphere exclusion modifies random sampling to prevent picking fingerprints which are similar to previous picks. Chemfp also supports directed sphere exclusion (DISE) using ranked picks. Fingerprints might be ranked by similarity to a set of reference compounds or by a user-defined list of values.
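The difference between the random and directed variants comes down to the order in which candidates are visited, as this small sketch shows. It is an illustration only, not chemfp's implementation. The function name, the integer bit-set fingerprints, and the choice to exclude candidates at or above the threshold are assumptions.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def sphere_exclusion(fingerprints, threshold, order=None, seed=None):
    """Sphere exclusion sketch (hypothetical helper, not the chemfp API).

    Visit candidates in the given order and keep a candidate only if its
    similarity to every earlier pick is below the threshold. With no
    order, candidates are visited in random order. Passing a ranked order
    gives the directed variant.
    """
    if order is None:
        order = list(range(len(fingerprints)))
        random.Random(seed).shuffle(order)
    picks = []
    for i in order:
        if all(tanimoto(fingerprints[i], fingerprints[p]) < threshold for p in picks):
            picks.append(i)
    return picks


fps = [0b11110000, 0b11100000, 0b00001111, 0b00000111, 0b00111100]

# Directed: visit in a user-defined ranking.
print(sphere_exclusion(fps, 0.7, order=[1, 0, 4, 3, 2]))  # [1, 4, 3]

# Random: the picks depend on the shuffle, so no output is shown here.
print(sphere_exclusion(fps, 0.7, seed=42))
```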

These algorithms are available using the new chemfp command, through the subcommands chemfp maxmin, chemfp heapsweep, and chemfp spherex, and through the chemfp API as chemfp.maxmin, chemfp.heapsweep, and chemfp.spherex.

Improved user interface

Chemfp's command-line tools build on the chemfp Python API, so in principle people could generate fingerprint files or do similarity searches directly with the API. In practice, the API was too low-level, and even in a notebook it was easier to use ! to shell out to chemfp's command-line tools.

There are new "high-level" functions which are similar to the command-line tools. Rather than using !simsearch, use chemfp.simsearch(). Rather than using !rdkit2fps, use chemfp.rdkit2fps(), and so on.

Chemfp 4.0 adds pandas integration. The similarity search and diversity selection results have a to_pandas method which exports the data to a pandas DataFrame.

One calming addition is progress bars, based on tqdm, which are used in most places which are likely to take a long time. (The default progress bar can be disabled through the API or by setting the CHEMFP_PROGRESSBAR environment variable to "0".)
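For a script or notebook, the environment variable can also be set from Python. This is a small sketch. Setting it before chemfp is imported is a conservative assumption on my part, since this note does not say when the variable is read.

```python
import os

# Disable chemfp's default progress bars for this process.
# Assumption: set this before importing chemfp, in case the value is read early.
os.environ["CHEMFP_PROGRESSBAR"] = "0"
```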

There are new helper methods to make it easier to work with toolkit-specific formats and fingerprint types, like using chemfp.rdkit.morgan.from_smiles("CO") to convert a SMILES string into an RDKit Morgan fingerprint.

Finally, the simsearch and diversity selection command-line tools support CSV and TSV output, specified through the --out option or based on the output filename extension.