chemfp 4.2 documentation

chemfp is the package you’ve been looking for, if you work with binary cheminformatics fingerprints in Python.

Chemfp is perhaps best known for its high-performance fingerprint similarity search. Its Taylor/Butina clustering, MaxMin diversity selection, and sphere exclusion (including directed sphere exclusion) are equally world-class, and in some cases more than an order of magnitude faster than its competitors. Or, if you simply need a 100K by 100K distance array to pass into scikit-learn, chemfp’s simarray can generate that in less than a minute.

All of that functionality is available using command-line tools or, for those who need more customization, through a comprehensive Python API. Research scientists and IT developers will both enjoy chemfp’s extensive integration with NumPy, SciPy, and Pandas.

Do you want to evaluate the effectiveness of different fingerprint types? No other system has built-in support for RDKit, OEChem/OEGraphSim, CDK, Open Babel, and jCompoundMapper fingerprints. You can also import your own fingerprints using the text-based FPS format. Chemfp even includes tools to extract fingerprints from SDF tag data, or to extract or process fields from CSV columns.

Do you want to develop your own analysis methods? Let chemfp handle fingerprint generation and file I/O and give you a NumPy view of the fingerprint data. You can also build on chemfp’s own components, including its cross-toolkit wrapper API for molecule I/O and its text toolkit for processing SDF and SMILES files directly as text records.

At its heart, chemfp is a Python library, designed to be a component in a larger system. Do you need to add similarity search to a Django component? Create a Jupyter widget? Write a desktop application GUI with PyQt? If you can “import chemfp” then all those are possible.

Summary and recent release details

chemfp is a set of command-line tools and a Python package for working with binary cheminformatics fingerprints, typically between several hundred and a few thousand bits long.

Chemfp 4.2 was released on 10 July 2024 with simarray support to generate an all-by-all NumPy array, updates to the latest RDKit fingerprint generators, including count simulation, and jCompoundMapper support. See What’s new in chemfp 4.2 for details.

It was tested with Python 3.9-3.12. It requires the “click” and “tqdm” third-party packages, which should be installed automatically as part of the normal installation process. Some optional features, like the NumPy, SciPy, and Pandas integration, only work if those packages are installed separately.

Chemfp 4.1 was released on 17 May 2023 with CXSMILES support, methods to save and load similarity search results to a SciPy sparse matrix, Butina clustering, methods to work with CSV files, and a tool to convert between structure formats. See What’s new in chemfp 4.1 for details.

Chemfp 4.0 added new methods for diversity selection and improved API usability with new high-level functions and better feedback for interactive use (including progress bars!). See What’s new in chemfp 4.0 for details.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit. The supported toolkits are OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter). jCompoundMapper requires the CDK jar on the CLASSPATH followed by either “jCMapperCLI.jar” or “jCMapperLibOnly.jar”, available from its SourceForge distribution.

Background

Chemfp started because around 2007 a project I worked on needed a way to include nearest-neighbor information for a property prediction calculator. The cheminformatics toolkits at the time didn’t include that as a built-in tool, though they did supply the components to build your own. Indeed, asking showed that nearly everyone had built their own similar sorts of tools, each with a different format, and varying levels of performance that were nowhere near the hardware limits.

The first step was to develop the FPS format, a human-readable text-based exchange format for fingerprint data that is easy for software to read and write. It stores fingerprint records containing a hex-encoded fingerprint and a record identifier, as well as metadata like the associated fingerprint type.
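As a rough illustration of how simple such a format is to read, here is a minimal sketch of a reader written with only the Python standard library. It assumes header lines of the form "#key=value" after a "#FPS1" signature line, and tab-separated records with the hex fingerprint first and the identifier second. The function name read_fps and the sample data (including the "Example/1" type string) are made up for this example; this is not chemfp's own reader, which handles many more details.

```python
import io


def read_fps(lines):
    """Parse FPS-style lines into (metadata dict, list of (id, bytes) records).

    Hypothetical helper for illustration only; not part of chemfp.
    """
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines: keep "#key=value" pairs, skip the "#FPS1" signature.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Data lines: hex-encoded fingerprint, a tab, then the record identifier.
        hex_fp, record_id = line.split("\t")[:2]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


# Made-up sample data with two 16-bit fingerprints.
sample = "#FPS1\n#num_bits=16\n#type=Example/1\n0f80\trecord1\nffff\trecord2\n"
metadata, records = read_fps(io.StringIO(sample))
print(metadata)
print(records)
```

Storing each fingerprint as a bytes object keeps the sketch small; a real application would more likely let chemfp load the file.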

People don’t use an alternate format just because it exists, so the next step was to develop useful command-line tools for fingerprint generation and similarity search, as well as a Python API for working with fingerprints in a discovery setting - like adding similarity search to a web app! Alternatively, the sdf2fps tool can extract fingerprint data from an SDF tag field.

People don’t use alternative tools just because they exist, so the third step was to improve similarity search performance. This was done by improving the search algorithm and implementation and adding multi-threaded support for NxN or NxM search cases. Similarity search with modern chemfp is over 10x faster than chemfp 1.0!
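For readers who have not written one of these tools, the following sketch shows roughly what a naive similarity search computes. It assumes Tanimoto similarity over equal-length byte-string fingerprints and a simple linear scan. The function names and the sample fingerprints are invented for this example; it is the kind of unoptimized baseline described above, not chemfp's algorithm or API.

```python
def popcount(data: bytes) -> int:
    """Count the number of 1 bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity: bits in common divided by bits set in either.

    Assumes both fingerprints have the same length.
    """
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = popcount(fp1) + popcount(fp2) - both
    return both / either if either else 0.0


def threshold_search(query, targets, threshold):
    """Linear scan returning (id, score) hits at or above the threshold,
    sorted from most to least similar."""
    hits = []
    for target_id, target_fp in targets:
        score = tanimoto(query, target_fp)
        if score >= threshold:
            hits.append((target_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Made-up 8-bit fingerprints, just to exercise the functions.
targets = [
    ("a", bytes.fromhex("0f")),
    ("b", bytes.fromhex("f0")),
    ("c", bytes.fromhex("1f")),
]
hits = threshold_search(bytes.fromhex("0f"), targets, 0.5)
print(hits)
```

Every query in this sketch touches every target one byte at a time, which is why a scan like this falls far short of what the hardware can do.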

Similarity search is fast enough that for many cases the FPS read performance became the limiting factor. This is especially noticeable in web development where modern practices restart the web app after every change. The FPB binary format was developed as a way to quickly load a fingerprint dataset. Its internal layout is identical to what’s needed for a similarity search so the load step needs little additional processing. The fpcat program converts between the FPS and FPB formats.

Chemfp supports four different cheminformatics toolkits, which are used for molecule I/O and fingerprint generation. One of the goals of the chemfp API is to make it easy to work with fingerprints from different toolkits without learning the details of each toolkit. In the usual computer science fashion, this is done with the “toolkit wrapper API”, which gives a consistent API across the supported toolkits.

The “text toolkit” implements a subset of this API, to work with SDF and SMILES files as text records. The text toolkit also includes special support for working with SDF files, for example, to add fingerprint data as tag data to an SDF record without round-tripping the record through a chemistry toolkit.
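To make the idea of working with records as text concrete, here is a minimal sketch that appends a tag to an SDF record purely by string manipulation. It assumes the usual SD file conventions: a data item is written as a "> <TAG>" line, the value, and a blank line, and the record ends with a "$$$$" line. The function name add_sd_tag, the tag name "FP", and the hand-written sample record are invented for this example; chemfp's text toolkit has its own API for this, which is not shown here.

```python
def add_sd_tag(record: str, tag: str, value: str) -> str:
    """Return a copy of an SDF record with a new data item added
    just before the "$$$$" terminator.

    Hypothetical helper for illustration only; not part of chemfp.
    Assumes the text before "$$$$" ends with a newline.
    """
    body, terminator, tail = record.rpartition("$$$$")
    if not terminator:
        raise ValueError("record has no $$$$ terminator")
    return f"{body}> <{tag}>\n{value}\n\n{terminator}{tail}"


# A small hand-written record; the molecule block is never parsed.
record = (
    "methane\n"
    "  example\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "$$$$\n"
)
updated = add_sd_tag(record, "FP", "0f80")
print(updated)
```

Because the molecule block is copied through untouched, nothing in the original record can be altered by a toolkit's parser or writer.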

With the 4.0 release, chemfp added support for diversity selection, including MaxMin, sphere exclusion, and heapsweep. Rather than add a new command-line program for each new tool, the “chemfp” command-line tool was added, with subcommands for each tool. The 4.1 release added the “butina”, “csv2fps” and “translate” subcommands, along with Python API additions for clustering and CSV processing. The 4.2 release added “simarray” for all-by-all NumPy array generation, switched to RDKit’s new fingerprint generators, and added jCompoundMapper fingerprint support.

Citation

For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.

To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .

Thanks

In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield, Jeff van Santen, Jakub Gunera, Davide Boldini.

Thanks also to my wife, Sara Marie, for her many years of support.
