chemfp 4.2 documentation
chemfp is the package you’ve been looking for, if you work with binary cheminformatics fingerprints in Python.
Chemfp is perhaps best known for its high-performance fingerprint similarity search. Its Taylor/Butina clustering, MaxMin diversity selection, and sphere exclusion (including directed sphere exclusion) are equally world-class, and in some cases more than an order of magnitude faster than its competitors. Or, if you simply need a 100K by 100K distance array to pass into scikit-learn, chemfp’s simarray can generate that in less than a minute.
All of that functionality is available using command-line tools or, for those who need more customization, through a comprehensive Python API. Research scientists and IT developers will both enjoy chemfp’s extensive integration with NumPy, SciPy, and Pandas.
Do you want to evaluate the effectiveness of different fingerprint types? No other system has built-in support for RDKit, OEChem/OEGraphSim, CDK, Open Babel, and jCompoundMapper fingerprints. You can also import your own fingerprints using the text-based FPS format. Chemfp even includes tools to extract fingerprints from SDF tag data, or to extract or process fields from CSV columns.
Do you want to develop your own analysis methods? Let chemfp handle fingerprint generation and file I/O and give you a NumPy view of the fingerprint data. You can also build on chemfp’s own components, including its cross-toolkit wrapper API for molecule I/O and its text toolkit for processing SDF and SMILES files directly as text records.
At its heart, chemfp is a Python library, designed to be a component in a larger system. Do you need to add similarity search to a Django component? Create a Jupyter widget? Write a desktop application GUI with PyQt? If you can “import chemfp” then all those are possible.
Summary and recent release details
chemfp is a set of command-line tools and a Python package for working with binary cheminformatics fingerprints, typically between several hundred and a few thousand bits long.
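To make "binary fingerprint" concrete, here is a minimal pure-Python sketch of the Tanimoto similarity between two fingerprints stored as byte strings. It is an illustration only and does not use the chemfp API. The function names and the toy 16-bit fingerprints are made up for this example, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
def popcount(data: bytes) -> int:
    """Count the number of on-bits in a byte string."""
    return bin(int.from_bytes(data, "little")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    if union == 0:
        # Assumed convention for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return bin(a & b).count("1") / union


# Two toy 16-bit fingerprints, written as hex for readability.
fp_a = bytes.fromhex("0f03")  # 6 bits set
fp_b = bytes.fromhex("0701")  # 4 bits set, all of them also set in fp_a

score = tanimoto(fp_a, fp_b)  # 4 shared bits / 6 bits in the union
```

Real fingerprints are usually much longer, and a production search does this bit counting in optimized native code rather than in Python.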
Chemfp 4.2 was released on 10 July 2024 with simarray support to generate an all-by-all NumPy array, updates to the latest RDKit fingerprint generators, including count simulation, and jCompoundMapper support. See What’s new in chemfp 4.2 for details.
It was tested with Python 3.9-3.12. It requires the “click” and “tqdm” third-party packages, which should be installed automatically as part of the normal installation process. Some optional features, like the NumPy, SciPy, and Pandas integration, only work if those packages are installed separately.
Chemfp 4.1 was released on 17 May 2023 with CXSMILES support, methods to save and load similarity search results to a SciPy sparse matrix, Butina clustering, methods to work with CSV files, and a tool to convert between structure formats. See What’s new in chemfp 4.1 for details.
Chemfp 4.0 added new methods for diversity selection and improved API usability with new high-level functions and better feedback for interactive use (including progress bars!). See What’s new in chemfp 4.0 for details.
Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit. The supported toolkits are OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter). jCompoundMapper requires the CDK jar on the CLASSPATH followed by either “jCMapperCLI.jar” or “jCMapperLibOnly.jar”, available from its SourceForge distribution.
Table of Contents
- Installing chemfp
- Working with the chemfp command-line tools
- Generate fingerprint files from PubChem SD tags
- Generate fingerprint files from CSV files
- k-nearest neighbor search
- simsearch CSV output
- Threshold search
- simsearch CSV output when no hits
- Combined k-nearest and threshold search
- Saving simsearch results to “npz” format
- NxN (self-similar) searches
- Using a toolkit to process the ChEBI dataset
- Use structures as input to simsearch
- Make new fingerprints matching the type in an existing file
- Alternate error handlers
- chemfp’s two cross-toolkit substructure fingerprints
- substruct fingerprints
- Generate binary FPB files from a structure file
- Convert between FPS and FPB formats
- Specify the fpcat output format
- Alternate fingerprint file formats
- The FPB format
- Get licensed FPB files containing ChEMBL 34 fingerprints
- Similarity search with the FPB format
- Multi-core similarity search
- Converting large data sets to FPB format
- Faster gzip decompression
- Generate fingerprints in parallel and merge to FPB format
- Generate a full Tanimoto similarity array
- Generate the array using another metric or datatype
- Generate a simarray using an “abcd” metric
- Simarray and arrays larger than memory
- Butina on the command-line
- Alternate Butina output formats
- Butina parameter tuning with npz files
- Sphere exclusion
- Sphere exclusion output options
- Sphere exclusion with initial picks, and saving candidates
- Directed sphere exclusion
- MaxMin diversity selection
- The initial MaxMin pick
- MaxMin diversity selection including references
- Heapsweep diversity selection
- Structure file format translation
- Structure translation reader and writer args
- Structure translation with two toolkits
- The chemfp command-line tools
- Getting started with the API
- Get ChEMBL 34 fingerprints in FPB format
- Similarity search of ChEMBL by id
- Similarity search of ChEMBL using a SMILES
- Sorting the search results
- Fingerprints as byte strings
- Generating Fingerprints
- Similarity Search of ChEMBL by fingerprint
- Load fingerprints into an arena
- Generate an NxN similarity matrix
- Generate an NxM similarity matrix
- Exporting SearchResults to SciPy and NumPy
- Save simsearch results to an npz file
- Use simarray to generate a NumPy array
- Compute your own metric with “abcd”
- Save a simarray to an npy file
- Generating fingerprint files
- Generating fingerprints with an alternative type
- Extracting fingerprints from SDF tags
- Butina clustering
- Butina clustering parameters
- Butina clustering with a precomputed matrix
- The effect of Butina threshold size when clustering ChEMBL
- Select diverse fingerprints with MaxMin
- Use MaxMin with references
- Select diverse fingerprints with Heapsweep
- Sphere exclusion
- Directed sphere exclusion
- Fingerprint family and type examples
- Fingerprint families and types
- Fingerprint family
- Fingerprint family discovery
- get_fingerprint_type() and get_type()
- Create a fingerprint using text settings
- FingerprintType properties and methods
- Convert a structure record to a fingerprint
- Convert a structure record to an id and fingerprint
- Make a specialized id and molecule fingerprint parser
- Read a structure file and compute fingerprints
- Structure-based fingerprint reader location
- Read fingerprints from a string containing structures
- Structure-based fingerprint reader errors
- Use your own error handler
- Compute a fingerprint for a native toolkit molecule
- Fingerprint many native toolkit molecules
- Make a specialized molecule fingerprinter
- Toolkit API examples
- Get a chemfp toolkit
- Parse and create SMILES
- Canonical, non-isomeric, and arbitrary SMILES
- Use format to create a record in SDF format
- Use zlib record compression
- Use zst record compression
- Get a list of available formats and distinguish between input and output formats
- Determine the format for a given filename
- Parse the id and the molecule at the same time
- Specify alternate error behavior
- Specify a SMILES delimiter through reader_args
- Specify an output SMILES delimiter through writer_args
- RDKit-specific SMILES reader_args and writer_args
- OpenEye-specific SMILES reader_args and writer_args
- OpenEye-specific aromaticity
- Open Babel-specific SMILES reader_args and writer_args
- CDK-specific SMILES reader_args and writer_args
- Get the default reader_args or writer_args for a format
- Convert text settings into reader and writer arguments
- Multi-toolkit reader_args and writer_args
- Qualified reader and writer parameters names
- Qualified parameter priorities
- Qualified names and text settings
- Read molecules from an SD file or stdin
- Read ids and molecules from an SD file at the same time
- Read ids and molecules using an SD tag for the id
- Read from a string instead of a file
- The reader may reuse molecule objects!
- Write molecules to a SMILES file
- Reader and writer context managers
- Write molecules to stdout in a specified format
- Write molecules to a string (and a bit of InChI)
- Handling errors when reading molecules from a string
- Handling errors when reading molecules from a file
- Ignore errors in create_string() and create_bytes()
- Ignore errors when writing molecules
- Reader and writer format metadata
- Location information: filename, record_format, recno and output_recno
- Location information: record position and content
- Writing your own error handler (Advanced)
- A Babel-like structure format converter
- argparse text settings to reader and writer args
- Creating a specialized record parser
- Molecule API: Get and set the molecule id
- Molecule API: Copy a molecule
- Molecule API: Working with SD tags
- Add fingerprints to an SD file using a toolkit
- Text toolkit examples
- Toolkits may modify the molecular structure
- Toolkits may modify SDF syntax
- The text toolkit “molecules”
- The text toolkit implements the toolkit API
- Reading and adding SD tags with the text_toolkit
- Synchronizing readers from different toolkits through the text toolkit
- Add multiple toolkit fingerprints to an SD file
- Text toolkit and SDF files
- Read id and tag value pairs from an SD file
- Extract the id and atom and bond counts from an SD file
- SDF-specific parser parameters
- Working with SD records as strings
- Unicode and other character encoding
- Mixed encodings and raw bytes
- chemfp API
- chemfp top-level
- chemfp.arena
- chemfp.base_toolkit
- chemfp.bitops
- chemfp.cdk_toolkit
- chemfp.cdk_types
- chemfp.clustering
- chemfp.csv_readers
- chemfp.diversity
- chemfp.encodings
- chemfp.fpb_io
- chemfp.fps_io
- chemfp.fps_search
- chemfp.highlevel.clustering
- chemfp.highlevel.conversion
- chemfp.highlevel.diversity
- chemfp.highlevel.simarray
- chemfp.highlevel.similarity
- chemfp.io
- chemfp.jcmapper_types
- chemfp.openbabel_toolkit
- chemfp.openbabel_types
- chemfp.openeye_toolkit
- chemfp.openeye_types
- chemfp.rdkit_toolkit
- chemfp.rdkit_types
- chemfp.search
- chemfp.simarray_io
- chemfp.text_records
- chemfp.text_toolkit
- chemfp.toolkit
- chemfp.types
- Overview
- Licenses
- What’s new in chemfp 4.2
- What’s new in chemfp 4.1
- Required dependencies on click and tqdm
- CXSMILES
- New structure format specifiers
- New SearchResult(s) attributes
- Save simsearch results to “npz” files
- chemfp.npy file entry in an npz file
- Working with npz files in chemfp
- Importing SciPy csr matrices to chemfp
- Butina clustering
- Butina on the command-line
- Butina parameter tuning
- High-level Butina API
- Changed default output format name
- Output metadata options
- spherex changes
- csv background
- csv2fps command
- csv2fps TODO
- chemfp.csv_readers module
- New toolkit wrapper functions to read CSV files
- translate command
- translate_record function
- Structure I/O helper functions
- Other API changes
- What’s new in chemfp 4.0
Background
Chemfp started because around 2007 a project I worked on needed a way to include nearest-neighbor information for a property prediction calculator. The cheminformatics toolkits at the time didn’t include that as a built-in tool, though they did supply the components to build your own. Indeed, asking showed that nearly everyone had built their own similar sorts of tools, each with a different format, and varying levels of performance that were nowhere near the hardware limits.
The first step was to develop the FPS format, a human-readable text-based exchange format for fingerprint data that is easy for software to read and write. It stores fingerprint records containing a hex-encoded fingerprint and a record identifier, as well as metadata like the associated fingerprint type.
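As a rough illustration of how simple the format is to process, the following sketch reads FPS text using only the Python standard library. It assumes the basic layout described above: header lines starting with "#" (the first being "#FPS1", the rest "#key=value"), followed by records made of a hex-encoded fingerprint, a tab, and an identifier. The `read_fps` helper, the fingerprint type name, and the record ids are hypothetical. This is not chemfp's own reader, which handles validation and edge cases omitted here.

```python
import io


def read_fps(lines):
    """Return (metadata dict, list of (id, fingerprint bytes)) from FPS text lines."""
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; the "#FPS1" line has no "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Data lines: hex-encoded fingerprint, a tab, then the identifier.
        hex_fp, record_id = line.split("\t")[:2]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


# A made-up two-record file with a hypothetical fingerprint type name.
example = io.StringIO(
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example-Fingerprint/1\n"
    "0f03\tcompound_1\n"
    "0701\tcompound_2\n"
)
metadata, records = read_fps(example)
```

In practice you would let chemfp read the file for you. The point is only that the format can be inspected and produced with ordinary text tools.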
People don’t use an alternate format just because it exists, so the next step was to develop useful command-line tools for fingerprint generation and similarity search, as well as a Python API for working with fingerprints in a discovery setting - like adding similarity search to a web app! Alternatively, the sdf2fps tool can extract fingerprint data from an SDF tag field.
People don’t use alternative tools just because they exist, so the third step was to improve similarity search performance. This was done by improving the search algorithm and implementation and adding multi-threaded support for NxN or NxM search cases. Similarity search with modern chemfp is over 10x faster than chemfp 1.0!
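One widely published algorithmic idea for Tanimoto threshold searches is to skip any target whose bit count alone shows the threshold cannot be reached, since the Tanimoto score can never exceed min(popcounts) / max(popcounts). The sketch below illustrates that bound in plain Python, with fingerprints held as integers. It is a teaching example under the assumption of a threshold greater than zero, not a description of chemfp's actual implementation, and the function names are made up.

```python
def popcount(value: int) -> int:
    """Count the on-bits of a fingerprint stored as an integer."""
    return bin(value).count("1")


def threshold_search(query, targets, threshold):
    """Return (hits, num_scored) for Tanimoto >= threshold (threshold > 0 assumed).

    hits is a list of (target index, score) pairs; num_scored counts how many
    targets needed a full Tanimoto calculation after the popcount check.
    """
    q_bits = popcount(query)
    hits = []
    num_scored = 0
    for index, target in enumerate(targets):
        t_bits = popcount(target)
        # Upper bound: Tanimoto <= min(popcounts) / max(popcounts).
        largest = max(q_bits, t_bits)
        if largest == 0 or min(q_bits, t_bits) / largest < threshold:
            continue  # cannot reach the threshold, so skip the full comparison
        num_scored += 1
        score = popcount(query & target) / popcount(query | target)
        if score >= threshold:
            hits.append((index, score))
    return hits, num_scored


query = 0b11110000
targets = [0b11110000, 0b11100000, 0b00000001, 0b11111111, 0b00111100]
hits, num_scored = threshold_search(query, targets, 0.7)
```

Here two of the five targets are rejected by the bound without being scored. The benefit comes when popcounts are precomputed and the targets are grouped by popcount, so that whole ranges can be skipped at once.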
Similarity search is fast enough that for many cases the FPS read performance became the limiting factor. This is especially noticeable in web development where modern practices restart the web app after every change. The FPB binary format was developed as a way to quickly load a fingerprint dataset. Its internal layout is identical to what’s needed for a similarity search so the load step needs little additional processing. The fpcat program converts between the FPS and FPB formats.
Chemfp supports four different cheminformatics toolkits, which are used for molecule I/O and fingerprint generation. One of the goals of the chemfp API is to make it easy to work with fingerprints from different toolkits without learning the details of each toolkit. In the usual computer science fashion, this is done with the “toolkit wrapper API”, which gives a consistent API across the supported toolkits.
The “text toolkit” implements a subset of this API, to work with SDF and SMILES files as text records. The text toolkit also includes special support for working with SDF files, for example, to add fingerprint data as tag data to an SDF record without round-tripping the record through a chemistry toolkit.
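To show what "working with a record as text" means, here is a small standard-library sketch that adds a data item to an SD record by inserting lines just before the terminating "$$$$" line, without parsing the connection table. The `add_sd_tag` helper and the "FP" tag name are hypothetical and are not part of the chemfp text toolkit API. The sketch also assumes "\n" line endings and a single-line value.

```python
def add_sd_tag(record: str, tag: str, value: str) -> str:
    """Insert a data item just before the "$$$$" line that ends an SD record."""
    lines = record.splitlines(keepends=True)
    # Search from the end, because "$$$$" is the last line of a record.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip("\r\n") == "$$$$":
            # A data item is a "> <tag>" header, the value, and a blank line.
            new_lines = [f"> <{tag}>\n", f"{value}\n", "\n"]
            return "".join(lines[:i] + new_lines + lines[i:])
    raise ValueError("record does not end with a $$$$ line")


# A minimal single-atom record, used only to exercise the function.
record = (
    "methane\n"
    "\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "$$$$\n"
)
updated = add_sd_tag(record, "FP", "0f03")
```

Because the molecule block is never interpreted, the rest of the record is passed through byte-for-byte, which is the advantage of avoiding a round trip through a chemistry toolkit.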
With the 4.0 release, chemfp added support for diversity selection, including MaxMin, sphere exclusion, and heapsweep. Rather than add a new command-line program for each new tool, the “chemfp” command-line tool was added, with subcommands for each tool. The 4.1 release added the “butina”, “csv2fps” and “translate” subcommands, along with Python API additions for clustering and CSV processing. The 4.2 release added “simarray” for all-by-all NumPy array generation, switched to RDKit’s new fingerprint generators, and added jCompoundMapper fingerprint support.
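For readers unfamiliar with MaxMin, the basic idea can be sketched in a few lines: start from one pick, then repeatedly choose the candidate whose highest similarity to anything already picked is the lowest. The code below is a naive pure-Python illustration of that textbook idea, with fingerprints held as integers. It is far slower than chemfp's implementation and is not based on its code, and the function names and toy fingerprints are made up.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def maxmin_picks(fingerprints, num_picks, first=0):
    """Return indices of up to num_picks diverse fingerprints, starting from `first`."""
    picks = [first]
    # For each fingerprint, track its highest similarity to anything picked so far.
    max_sim = [tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picks) < min(num_picks, len(fingerprints)):
        picked = set(picks)
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        # The next pick is the candidate least similar to the existing picks.
        best = min(candidates, key=lambda i: max_sim[i])
        picks.append(best)
        for i in candidates:
            max_sim[i] = max(max_sim[i], tanimoto(fingerprints[i], fingerprints[best]))
    return picks


fps = [0b11110000, 0b11100000, 0b00001111, 0b00111100]
picks = maxmin_picks(fps, 3)
```

Starting from the first fingerprint, the sketch next picks the one that shares no bits with it, then the one that partially overlaps both, and leaves the near-duplicate of the first for last.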
Citation
For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .
Advertisement
This program was developed by Andrew Dalke <dalke@dalkescientific.com>, Andrew Dalke Scientific, AB. The Base License Agreement does not allow you to:
- generate FPB files;
- create in-memory fingerprint arenas with more than 50,000 fingerprints;
- use the simarray functionality to process fingerprint sets with more than 20,000 fingerprints;
- use other search methods on fingerprint arenas with more than 50,000 fingerprints;
- perform Tversky searches;
- perform Tanimoto searches of FPS files with more than 20 queries at a time.
See https://chemfp.com/license/ for details on licensing options, which includes no-cost academic licensing and source code licensing.
These functions are also enabled and allowed when using a licensed FPB file containing a chemfp authorization key.
If you have questions, or wish to request a demo license or purchase a license, send an email to sales@dalkescientific.com.
I also maintain the chemfp-1.x series under a no-cost open source license. Version chemfp-1.6.1 is available at no cost from chemfp.com. This version requires Python 2.7 and is meant to give an open source baseline for benchmarking purposes.
Thanks
In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield, Jeff van Santen, Jakub Gunera, Davide Boldini.
Thanks also to my wife, Sara Marie, for her many years of support.