chemfp 5.0b1 available for beta testing

Chemfp 5.0b1 is available for beta testing. Source licensees should have received a download link by email. Binary licensees and those who want to evaluate this release can install it with:

python -m pip install chemfp==5.0b1 -i https://chemfp.com/packages/

Below is a short summary of the changes. What remains is final testing, feedback from beta users, and documentation.

shard search

Instead of searching a single large file, a shard search splits the search across several smaller files, then merges the results. This is useful if the dataset is too large to fit into memory, or if file bandwidth is low.

The new "chemfp shardsearch" command does a similarity search across multiple shards. For example, a dataset with 1 billion 1024-bit fingerprints is too large to fit into 32 GB of RAM. Chemfp will memory-map the file, but the data must still get from the disk into memory.
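To see why this dataset does not fit, here is the back-of-the-envelope arithmetic as a short sketch. It counts only the raw fingerprint bytes and ignores identifiers and file-format overhead, so the real file is somewhat larger.

```python
# Raw fingerprint payload only; identifiers and format overhead are ignored.
NUM_FINGERPRINTS = 1_000_000_000
BITS_PER_FINGERPRINT = 1024

bytes_per_fingerprint = BITS_PER_FINGERPRINT // 8       # 128 bytes each
total_bytes = NUM_FINGERPRINTS * bytes_per_fingerprint  # all fingerprints
total_gb = total_bytes / 10**9                          # decimal gigabytes
ram_gb = 32

print(f"{total_gb:.0f} GB of fingerprints vs {ram_gb} GB of RAM")
```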

If the data were in RAM, the search would finish in seconds. On my 32 GB desktop with a slow hard disk a simple wc -l takes 20 minutes and a similarity search takes 10 minutes. By using Zstandard compression of multiple shards, at maximum normal compression, the search takes about 3 minutes. This is because compression decreases the amount of data transferred, Zstandard is very fast to decompress, and by using shards only part of the dataset is in memory at any one time.
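The search-then-merge idea can be sketched in pure Python. This is only an illustration of the principle, not chemfp's implementation or API: the function names are made up for the example, fingerprints are stored as Python integers, and each "shard" is an in-memory list where the real command would read a separate (possibly compressed) file.

```python
import heapq
import random

def popcount(x):
    """Number of on-bits in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(fp1 | fp2)
    return popcount(fp1 & fp2) / union if union else 0.0

def search_shard(query, shard, k):
    """Return the k best (score, id) pairs from one shard."""
    scored = ((tanimoto(query, fp), fp_id) for fp_id, fp in shard)
    return heapq.nlargest(k, scored)

def shard_search(query, shards, k):
    """Search the shards one at a time, keeping only the k best hits so far."""
    best = []
    for shard in shards:  # only one shard needs to be in memory at a time
        best = heapq.nlargest(k, best + search_shard(query, shard, k))
    return best

# Hypothetical test data: 1000 random 64-bit fingerprints in 4 shards.
random.seed(0)
fingerprints = [(f"id{i}", random.getrandbits(64)) for i in range(1000)]
shards = [fingerprints[i:i + 250] for i in range(0, 1000, 250)]
query = random.getrandbits(64)

hits = shard_search(query, shards, k=5)
```

Because every global top-k hit is also a top-k hit within its own shard, merging the per-shard results gives the same answer as searching everything at once.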

similarity histogram

The new chemfp simhistogram command (also available as chemfp simhist) creates a histogram of similarity scores, either between pairs of fingerprints within one dataset or between pairs drawn from two different datasets.

The histogram can be generated from all possible pairs, or from a randomly sampled subset.

This is also available from the API as chemfp.simhistogram().
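As a rough illustration of what such a histogram contains, here is a pure-Python sketch for the single-dataset case. It does not use chemfp and is not how chemfp.simhistogram() is implemented or called; the function name, parameters, and binning are assumptions made for the example, with fingerprints stored as Python integers and Tanimoto as the similarity measure.

```python
import random
from itertools import combinations

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = bin(fp1 | fp2).count("1")
    return bin(fp1 & fp2).count("1") / union if union else 0.0

def similarity_histogram(fingerprints, num_bins=10, num_samples=None, seed=None):
    """Count pairwise Tanimoto scores in equal-width bins over [0, 1].

    With num_samples=None every distinct pair is used. Otherwise that
    many pairs are drawn at random (the same pair may be drawn twice).
    """
    if num_samples is None:
        pairs = combinations(fingerprints, 2)
    else:
        rng = random.Random(seed)
        pairs = (rng.sample(fingerprints, 2) for _ in range(num_samples))

    counts = [0] * num_bins
    for fp1, fp2 in pairs:
        score = tanimoto(fp1, fp2)
        # A score of exactly 1.0 is placed in the last bin.
        counts[min(int(score * num_bins), num_bins - 1)] += 1
    return counts

fps = [0b1111, 0b0011, 0b1100, 0b1111]
all_pairs = similarity_histogram(fps, num_bins=2)
sampled = similarity_histogram(fps, num_bins=2, num_samples=100, seed=42)
```

Sampling trades exactness for speed: the number of distinct pairs grows quadratically with the dataset size, while the cost of the sampled version depends only on the number of samples.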

large FPB files

Previously the FPB format could only support about 250 million fingerprints. This release handles at least 1 billion, and should support up to 2 billion in a single file.

Klekota-Roth fingerprint implementations

Klekota and Roth, in "Chemical substructures that enrich for biological activity" (doi:10.1093/bioinformatics/btn479), list 4860 substructures related to biological activity.

I have implemented these fingerprints for both the RDKit and OpenEye toolkits. They are available on the command-line using rdkit2fps --KlekotaRoth and oe2fps --KlekotaRoth, or through the fingerprint type names KlekotaRoth-RDKit and KlekotaRoth-OpenEye, respectively.

CDK changes

Added support for CDK 2.10 and 2.11. Several fingerprint types now have a new "hashExplicitHydrogens" flag.

By default the structure readers now "prepare" the input structures by perceiving rings and (Daylight) aromaticity. Use the reader parameter prepare=0 to disable this option.

Python and NumPy support

Added support for NumPy 2.0+ and Python 3.12. Removed support for Python 3.8.

API removal

As mentioned in the previous release notes, I have removed parts of the old API in favor of the new API:

instead of toolkit from_smi, to_sdf, etc. use parse_smi, create_smi, etc.
instead of bitops byte_difference and hex_difference use byte_xor and hex_xor.
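For readers unfamiliar with the operation behind the renamed bitops functions, here is a pure-Python sketch of a bitwise exclusive-or on byte and hex fingerprints. It assumes the new names describe what the functions compute; the helpers xor_bytes and xor_hex are hypothetical stand-ins for illustration, not the chemfp functions themselves.

```python
def xor_bytes(fp1: bytes, fp2: bytes) -> bytes:
    """Bitwise exclusive-or of two equal-length byte fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    return bytes(a ^ b for a, b in zip(fp1, fp2))

def xor_hex(hex_fp1: str, hex_fp2: str) -> str:
    """Same operation for hex-encoded fingerprints, returning hex."""
    return xor_bytes(bytes.fromhex(hex_fp1), bytes.fromhex(hex_fp2)).hex()

difference = xor_bytes(b"\x0f\xf0", b"\xff\x00")
hex_difference = xor_hex("0ff0", "ff00")
```

The on-bits in the result mark the positions where the two fingerprints differ.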

There are also performance improvements and bug fixes.

If any of this sounds interesting and you would like more details or an evaluation license then do not hesitate to contact me.