Chemfp 5.0b1 is available for beta testing. Source licensees should have received a download link by email. Binary licensees and those who want to evaluate this release can install it with:
python -m pip install chemfp==5.0b1 -i https://chemfp.com/packages/
Below is a short summary of the changes. What remains is final testing, feedback from beta users, and documentation.
shard search
Instead of searching a single large file, a shard search breaks the dataset up into smaller files, searches each one, then merges the results. This is useful if the dataset is too large to fit into memory, or if file bandwidth is low.
The new "chemfp shardsearch" command does a similarity search across multiple shards. For example, a dataset with 1 billion 1024-bit fingerprints is too large to fit into 32 GB of RAM. Chemfp will memory-map the file, but the data must still get from the disk into memory.
If the data were in RAM, the search would finish in seconds. On my 34 GB desktop with a slow hard disk, a simple wc -l takes 20 minutes and a similarity search takes 10 minutes. By using Zstandard compression of multiple shards, at maximum normal compression, the search takes about 3 minutes. This is because compression decreases the amount of data transferred, Zstandard is very fast to decompress, and by using shards only part of the data set is in memory at any one time.
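The split-then-merge idea can be sketched in plain Python. This is only an illustration of the general technique, not how chemfp implements it: the names search_shard, merge_shard_hits and tanimoto are hypothetical, fingerprints are modelled as Python integers, and each shard is a small in-memory list rather than a compressed file.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (similarity score, record id)


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def search_shard(shard: Iterable[Tuple[str, int]], query: int, k: int) -> List[Hit]:
    """Return the k best hits from one shard (hypothetical helper)."""
    hits = ((tanimoto(query, fp), rec_id) for rec_id, fp in shard)
    return heapq.nlargest(k, hits)


def merge_shard_hits(per_shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge the per-shard top-k lists into one overall top-k list."""
    return heapq.nlargest(k, (hit for hits in per_shard_hits for hit in hits))


# Two toy shards; only one shard needs to be in memory at a time.
shards = [
    [("a", 0b1111), ("b", 0b0011)],
    [("c", 0b1110), ("d", 0b0001)],
]
query = 0b1111
top_hits = merge_shard_hits((search_shard(s, query, 2) for s in shards), 2)
print(top_hits)
```

Because each shard contributes its own k best hits, the merged list is the same as the top k from a search over the unsplit data.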
similarity histogram
The new chemfp simhistogram or chemfp simhist command creates a histogram of similarity scores, either between different pairs of fingerprints in one dataset, or between pairs drawn from two different datasets.
This histogram can be generated from all possible pairs, or from a randomly sampled subset.
This is also available from the API as chemfp.simhistogram().
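As a rough illustration of the two modes (all pairs versus random sampling), here is a pure-Python sketch for a single dataset. It does not use chemfp and makes no claim about the signature or internals of chemfp.simhistogram(). The function names, the integer fingerprint representation, and the choice of 10 equal-width bins are all assumptions made for the example.

```python
import itertools
import random
from typing import List, Sequence


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def _bin_index(score: float, num_bins: int) -> int:
    # A score of exactly 1.0 goes into the last bin.
    return min(int(score * num_bins), num_bins - 1)


def all_pairs_histogram(fps: Sequence[int], num_bins: int = 10) -> List[int]:
    """Count the scores of every distinct pair in one dataset."""
    counts = [0] * num_bins
    for a, b in itertools.combinations(fps, 2):
        counts[_bin_index(tanimoto(a, b), num_bins)] += 1
    return counts


def sampled_histogram(fps: Sequence[int], num_samples: int,
                      num_bins: int = 10, seed: int = 0) -> List[int]:
    """Count the scores of randomly chosen pairs of different fingerprints."""
    rng = random.Random(seed)
    counts = [0] * num_bins
    for _ in range(num_samples):
        a, b = rng.sample(fps, 2)  # two different positions in the dataset
        counts[_bin_index(tanimoto(a, b), num_bins)] += 1
    return counts


fps = [0b1111, 0b0011, 0b1100]
print(all_pairs_histogram(fps))
print(sampled_histogram(fps, 100))
```

Sampling trades exactness for speed: the number of distinct pairs grows quadratically with the dataset size, while the sampled version does a fixed amount of work.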
large FPB files
Previously the FPB format could only support about 250 million fingerprints. This release handles at least 1 billion, and should support up to 2 billion in a single file.
Klekota-Roth fingerprint implementations
Klekota and Roth, in "Chemical substructures that enrich for biological activity" (doi:10.1093/bioinformatics/btn479), list 4860 substructures related to biological activity.
I have implemented these fingerprints for both the RDKit and OpenEye toolkits, available on the command-line using rdkit2fps --KlekotaRoth and oe2fps --KlekotaRoth or through the fingerprint type names KlekotaRoth-RDKit and KlekotaRoth-OpenEye, respectively.
CDK changes
Added support for CDK 2.10 and 2.11. Several fingerprint types now have a new "hashExplicitHydrogens" flag.
By default the structure readers now "prepare" the input structures by perceiving rings and (Daylight) aromaticity. Use the reader parameter prepare=0 to disable this option.
Python and NumPy support
Added support for NumPy 2.0+ and Python 3.12. Removed support for Python 3.8.
API removal
As mentioned in the previous release notes, I have removed parts of the old API in favor of a new API:
- instead of toolkit from_smi, to_sdf, etc. use parse_smi, create_smi, etc.
- instead of bitops byte_difference and hex_difference use byte_xor and hex_xor.
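For readers unfamiliar with the operation behind the renamed bitops functions, the sketch below shows a bytewise XOR of two equal-length fingerprints in plain Python. It assumes the functions do what their names suggest. The helpers xor_bytes and xor_hex are hypothetical stand-ins written for this example, not chemfp's implementation or its exact signatures.

```python
def xor_bytes(fp1: bytes, fp2: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte fingerprints (illustrative helper)."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    return bytes(a ^ b for a, b in zip(fp1, fp2))


def xor_hex(hex1: str, hex2: str) -> str:
    """The same operation on hex-encoded fingerprints (illustrative helper)."""
    return xor_bytes(bytes.fromhex(hex1), bytes.fromhex(hex2)).hex()


# The bits set in the result are the bits where the two inputs differ.
difference = xor_bytes(b"\x0f\xf0", b"\xff\x00")
print(difference.hex())
print(xor_hex("0ff0", "ff00"))
```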
There are also performance improvements and bug fixes.
If any of this sounds interesting and you would like more details or an evaluation license then do not hesitate to contact me.