chemfp 4.2 is available

After over a year of development, I'm pleased to announce the release of chemfp 4.2. Here are the highlights. For full details, read "What's new in chemfp 4.2" in the documentation.

simarray

The biggest addition is "simarray", which computes the entire comparison matrix as a NumPy array.

By default it computes Tanimoto similarities using 64-bit floats, for best compatibility with Python, but if space is important you can use 32-bit floats, or even scaled 16-bit integers. It also supports Dice and cosine similarity, as well as the Hamming distance (as 16-bit integers), and there's an option to convert the similarity values into distances, computed as 1-similarity.
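
To make the contents of each cell concrete, here is a minimal plain-Python sketch of a Tanimoto comparison matrix and its 1-similarity distance form. It is not the chemfp API: fingerprints are modeled as Python integers, the helper names are made up for illustration, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
# Illustration only: plain Python, not the chemfp API.
# Fingerprints are modeled as Python integers used as bit sets.

def popcount(x):
    """Count the on-bits in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(fp1, fp2):
    """Tanimoto similarity: |A & B| / |A | B|."""
    union = popcount(fp1 | fp2)
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return popcount(fp1 & fp2) / union

def similarity_matrix(fps):
    """Full NxN matrix of Tanimoto scores as a list of rows."""
    return [[tanimoto(a, b) for b in fps] for a in fps]

def as_distances(matrix):
    """Convert similarities into distances, computed as 1 - similarity."""
    return [[1.0 - score for score in row] for row in matrix]

fps = [0b1111, 0b0011, 0b1100]
sim = similarity_matrix(fps)
dist = as_distances(sim)

for row in sim:
    print(row)
```

The matrix is symmetric with 1.0 on the diagonal, so the distance form has 0.0 on the diagonal.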

If you prefer to work with integer values, the Tanimoto and Dice scores can be computed as pairs of 16-bit or 32-bit integers, and if you need to write a specialized comparison function, you can have simarray compute the four "a", "b", "c", and "d" components often used in the cheminformatics literature to describe those functions.
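
The sketch below shows, in plain Python, how four such components can be counted and combined into integer (numerator, denominator) pairs or a custom score. It is not the chemfp API. Papers disagree on which letter stands for which count, so the code uses descriptive names (only_1, only_2, both, neither) instead of assuming chemfp's letter assignment, and all function names are hypothetical.

```python
# Illustration only: plain Python, not the chemfp API.
from fractions import Fraction

def popcount(x):
    """Count the on-bits in a non-negative integer."""
    return bin(x).count("1")

def components(fp1, fp2, num_bits):
    """Return (only_1, only_2, both, neither) for two num_bits-long fingerprints."""
    both = popcount(fp1 & fp2)          # on in both fingerprints
    only_1 = popcount(fp1) - both       # on only in the first
    only_2 = popcount(fp2) - both       # on only in the second
    neither = num_bits - both - only_1 - only_2  # off in both
    return only_1, only_2, both, neither

def tanimoto_pair(fp1, fp2, num_bits):
    """Tanimoto as an integer (numerator, denominator) pair."""
    only_1, only_2, both, _ = components(fp1, fp2, num_bits)
    return both, only_1 + only_2 + both

def dice_pair(fp1, fp2, num_bits):
    """Dice as an integer (numerator, denominator) pair."""
    only_1, only_2, both, _ = components(fp1, fp2, num_bits)
    return 2 * both, 2 * both + only_1 + only_2

def simple_matching(fp1, fp2, num_bits):
    """An example of a specialized score built from the four counts."""
    _, _, both, neither = components(fp1, fp2, num_bits)
    return Fraction(both + neither, num_bits)

fp1, fp2 = 0b10110110, 0b00111100
abcd = components(fp1, fp2, 8)
t_pair = tanimoto_pair(fp1, fp2, 8)
d_pair = dice_pair(fp1, fp2, 8)
sm = simple_matching(fp1, fp2, 8)
print(abcd, t_pair, d_pair, sm)
```

Keeping the scores as integer pairs avoids floating-point rounding, so two scores can be compared exactly, for example with fractions.Fraction.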

Chemfp can save the output array to a file in NumPy's "npy" format, in a way which is both compatible with NumPy and which preserves array metadata (like the metric used) and identifiers.

Simarray processes over 100 million comparisons per second on my laptop, so even a 100Kx100K matrix takes under two minutes to generate, though it would require 75 GB of RAM using 64-bit floats.
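
The arithmetic behind those figures can be checked with a few lines of Python, taking the quoted throughput as 100 million comparisons per second. The variable names are only for illustration.

```python
# Back-of-the-envelope check of the 100Kx100K figures.
n = 100_000
comparisons = n * n                  # 10**10 cells in the full matrix
rate = 100_000_000                   # comparisons per second, as quoted

seconds = comparisons / rate         # time to fill the matrix
bytes_f64 = comparisons * 8          # 64-bit floats, 8 bytes each
bytes_f32 = comparisons * 4          # 32-bit floats, 4 bytes each

gib_f64 = bytes_f64 / 2**30          # size in binary gigabytes
gib_f32 = bytes_f32 / 2**30

print(f"{seconds:.0f} s, {gib_f64:.1f} GiB as float64, {gib_f32:.1f} GiB as float32")
```

That is 100 seconds and 80 billion bytes, which is about 74.5 when measured in units of 2^30 bytes, consistent with "under two minutes" and roughly 75 GB. Switching to 32-bit floats halves the memory.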

Bear in mind that you will need a license key or source license for this scenario as the base license agreement limits you to 20Kx20K arrays.

If you need a really large array and don't have that much memory, use the new "chemfp simarray" subcommand to save the output in "raw" format. This processes 2GB of array data at a time, then writes the result to disk. By my estimate you could generate the full NxN comparison of ChEMBL in about a day, using a 31 TB file to store the scores as 32-bit floats.
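
The general idea of computing a bounded band of rows, writing it out, and moving on can be sketched in plain Python. This is not how chemfp is implemented, and the file layout here is an assumption: a headerless, row-major file of native 32-bit floats. The function names are hypothetical.

```python
# Illustration only: chunked output to a headerless float32 file.
# The layout (row-major, native byte order, no header) is an assumption.
import array
import os
import struct
import tempfile

def popcount(x):
    return bin(x).count("1")

def tanimoto(fp1, fp2):
    union = popcount(fp1 | fp2)
    return popcount(fp1 & fp2) / union if union else 0.0

def write_raw_matrix(fps, path, rows_per_chunk):
    """Write the NxN matrix one band of rows at a time, so only
    rows_per_chunk * N scores are held in memory at once."""
    with open(path, "wb") as f:
        for start in range(0, len(fps), rows_per_chunk):
            band = array.array("f")  # 32-bit floats
            for fp1 in fps[start:start + rows_per_chunk]:
                band.extend(tanimoto(fp1, fp2) for fp2 in fps)
            band.tofile(f)

def read_cell(path, n, row, col):
    """Read one score by seeking to its byte offset in the file."""
    with open(path, "rb") as f:
        f.seek((row * n + col) * 4)
        return struct.unpack("=f", f.read(4))[0]

fps = [0b1111, 0b0011, 0b1100, 0b0111]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "matrix.raw")
    write_raw_matrix(fps, path, rows_per_chunk=3)
    file_size = os.path.getsize(path)    # 4 * 4 cells * 4 bytes
    cell_0_1 = read_cell(path, len(fps), 0, 1)
    cell_0_3 = read_cell(path, len(fps), 0, 3)

print(file_size, cell_0_1, cell_0_3)
```

Because every cell sits at a fixed offset, a reader can pull out single scores or whole rows without loading the full file, which is what makes a multi-terabyte result usable on a machine with far less RAM.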

I'm looking forward to someone putting this to the test!

For more details, see the Simarray section of "What's new in 4.2."

Updated RDKit fingerprint types

Back in 2020 the RDKit toolkit started migrating fingerprint generation from a function-based API to a more object-oriented API. The developers used this as an opportunity to change some of the fingerprint parameter names and implementation details, such as support for "count simulation", which captures some count information in a binary fingerprint. RDKit currently supports both APIs, but will be removing the old API in the future.

With this release, chemfp adds support for the new API while still supporting the old API. Existing datasets will still work without a problem, because they have a type string which tells chemfp which version to use.

However, if you ask for the RDKit-Fingerprint (RDKit's Daylight-like fingerprints), the RDKit-Morgan, the RDKit-AtomPair, or RDKit-Torsion fingerprints, and if your RDKit toolkit is less than a few years old, then chemfp will use the new API instead of the old. This means the fingerprint type strings will be different, and the fingerprint bits will also likely be different.

IMPORTANT - the Morgan fingerprints now default to a radius of 3, to match RDKit's default, instead of using a radius of 2.

Chemfp will issue a warning to stderr if the two fingerprint sets being compared have different versions, which should help identify any problems.

If you want to use the old fingerprint types, you can specify them in the type string, or on the rdkit2fps command-line, or by using the right fingerprint type object in the chemfp.rdkit_toolkit package.

For more details, see the RDKit fingerprint generation section of "What's new in 4.2."

jCompoundMapper

Earlier in 2024 I read "Effectiveness of molecular fingerprints for exploring the chemical space of natural products" by Boldini, Ballabio, Consonni, Todeschini, Grisoni, and Sieber, J. Cheminform. (2024) 16:35, which suggested a couple of fingerprint types from "jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints" by Hinselmann, Rosenbaum, Jahn, Fechner, and Zell, J. Cheminform. 3, 3 (2011) might be useful in the natural products space.

I decided it would be useful to include support for some of the jCompoundMapper fingerprint types as part of this release, which required help from Davide Boldini, plus help from John Mayfield to get the 13-year-old jCompoundMapper jar to work with a modern CDK.

It requires starting CDK in a special backwards compatibility mode, but I am still very impressed that it's possible!

For more details, see the jCompoundMapper section of "What's new in 4.2."

Updated CDK interface

Several of the CDK fingerprint types require explicit hydrogens. With this release I've added a new "hydrogens" reader argument to the CDK-based SMILES and SDF readers. The value "make-explicit" converts all implicit hydrogens to explicit, "make-implicit" converts all explicit hydrogens to implicit, and "make-nonchiral-implicit" converts all non-chiral hydrogens to implicit.

Chemfp also now uses "CDK-Pubchem/2.9" as the version number for the improved PubChem fingerprint implementation in CDK 2.9.

For more details, see the CDK updates section of "What's new in 4.2."

Modern Python

This release adds support for Python 3.12. This will be the last chemfp release to support Python 3.8.

I've migrated installation from the "legacy" system based on setup.py to the PEP 517 approach using pyproject.toml.

I've started adding type annotations. This will take a few releases to fully flesh out.

For more details, see the Modern Python section of "What's new in 4.2."