A fast and comprehensive Python package for cheminformatics fingerprints.
chemfp is the package you’ve been looking for, if you work with binary cheminformatics fingerprints in Python.
NEW! Chemfp 4.2 was released on 10 July 2024. See the documentation for the full list of notable changes or go to the download page.
Chemfp is perhaps best known for its high-performance fingerprint similarity search. Its Taylor/Butina clustering, MaxMin diversity selection, and sphere exclusion (including directed sphere exclusion) are equally world-class, and in some cases more than an order of magnitude faster than its competitors. Or, if you simply need a 100K by 100K distance array to pass into scikit-learn, chemfp’s simarray can generate that in less than a minute.
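To make the idea of a similarity or distance array concrete, here is a minimal NumPy sketch of pairwise Tanimoto similarity between binary fingerprints. It is not chemfp's implementation or API, and it will be far slower than chemfp on large data sets. The function name `tanimoto_matrix`, the packed-byte layout, and the choice to score two all-zero fingerprints as 0 are assumptions made for illustration.

```python
import numpy as np

def tanimoto_matrix(fps: np.ndarray) -> np.ndarray:
    """Pairwise Tanimoto similarity for fingerprints stored as rows of packed bytes.

    Hypothetical helper for illustration only; not part of chemfp.
    """
    # Expand each byte into 8 bits so that rows are 0/1 vectors.
    bits = np.unpackbits(fps, axis=1).astype(np.int64)
    # common[i, j] = number of bits set in both fingerprint i and fingerprint j.
    common = bits @ bits.T
    counts = bits.sum(axis=1)
    # union[i, j] = number of bits set in either fingerprint.
    union = counts[:, None] + counts[None, :] - common
    # Assumption: when both fingerprints are empty, report a similarity of 0.
    return np.divide(
        common, union,
        out=np.zeros(common.shape, dtype=float),
        where=union > 0,
    )

# Three toy 8-bit fingerprints, one byte each.
fps = np.array([[0b11110000], [0b11000000], [0b00001111]], dtype=np.uint8)
similarity = tanimoto_matrix(fps)

# Many clustering methods expect distances rather than similarities.
distance = 1.0 - similarity
```

The resulting `distance` array is a square, symmetric matrix of the kind that scikit-learn estimators accept when configured for precomputed distances.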
All of that functionality is available using command-line tools or, for those who need more customization, through a comprehensive Python API. Research scientists and IT developers will both enjoy chemfp’s extensive integration with NumPy, SciPy, and Pandas.
Do you want to evaluate the effectiveness of different fingerprint types? No other system has built-in support for RDKit, OEChem/OEGraphSim, CDK, Open Babel, and jCompoundMapper fingerprints. You can also import your own fingerprints using the text-based FPS format. Chemfp even includes tools to extract fingerprints from SDF tag data, or to extract or process fields from CSV columns.
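As a rough illustration of what a text-based fingerprint file looks like, the sketch below reads a small FPS-style sample using only the standard library. It assumes the common layout of `#`-prefixed header lines followed by one record per line containing a hex-encoded fingerprint, a tab, and an identifier. The function `read_fps` and the sample data are hypothetical; for real files, use chemfp's own readers, which handle the full format.

```python
import io

def read_fps(lines):
    """Parse FPS-style text into (metadata, records).

    Hypothetical helper for illustration only; not part of chemfp.
    """
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines such as "#num_bits=16"; lines without "=" are skipped.
            if "=" in line:
                key, _, value = line[1:].partition("=")
                metadata[key] = value
            continue
        # Data line: hex fingerprint, tab, identifier (extra fields are ignored).
        hex_fp, _, rest = line.partition("\t")
        fp_id = rest.split("\t")[0]
        records.append((fp_id, bytes.fromhex(hex_fp)))
    return metadata, records

# Made-up sample data with two 16-bit fingerprints.
sample = "#FPS1\n#num_bits=16\n#type=Example/1\nf00f\tmol1\n0003\tmol2\n"
metadata, records = read_fps(io.StringIO(sample))
```

Each record pairs an identifier with the fingerprint as a `bytes` object, which is a convenient form for passing to other tools.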
Do you want to develop your own analysis methods? Let chemfp handle fingerprint generation and file I/O and give you a NumPy view of the fingerprint data. You can also build on chemfp’s own components, including its cross-toolkit wrapper API for molecule I/O and its text toolkit for processing SDF and SMILES files directly as text records.
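Once fingerprint bytes are available as a NumPy array, simple analyses take only a few lines. The sketch below uses a hand-built byte string as a stand-in for data that chemfp would supply, so none of the names here are chemfp API. The bit ordering (bit 0 as the least-significant bit of the first byte) is an assumption.

```python
import numpy as np

# Stand-in data: three 16-bit fingerprints packed back to back (2 bytes each).
raw = bytes.fromhex("f00f" "0003" "ff00")
num_bytes = 2

# View the raw bytes as a (num_fingerprints, num_bytes) array without copying.
arr = np.frombuffer(raw, dtype=np.uint8).reshape(-1, num_bytes)

# Expand to one column per bit. The "little" bit order is an assumption.
bits = np.unpackbits(arr, axis=1, bitorder="little")

# Number of bits set in each fingerprint.
popcounts = bits.sum(axis=1)

# Fraction of fingerprints that have each bit set.
bit_frequency = bits.mean(axis=0)
```

Per-bit frequencies like these are a common starting point for judging how much information each fingerprint bit carries across a data set.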
At its heart, chemfp is a Python library, designed to be a component in a larger system. Do you need to add similarity search to a Django component? Create a Jupyter widget? Write a desktop application GUI with PyQt? If you can “import chemfp” then all those are possible.
Its market-leading performance and comprehensive API make it easy for you to add fast similarity search and fingerprint analysis components anywhere you use Python.
Licensed for long-term integration
The chemfp source license gives you the assurance that you can embed chemfp in your systems and workflows while minimizing long-term risk. We all know stories about how a license key expired unexpectedly, causing the software to stop working, or about vendors changing their pricing model to extract more revenue because they know that changing vendors would be more expensive.
With chemfp, if you purchase the unlimited source code license, you get the full source code (except for the license manager) plus a year of support, which includes no-cost updates for any new releases during that time. Most customers opt to renew support, but even if you do not, you may continue to use chemfp, and even modify and maintain it on your own.
Time-limited and binary-only licensing are also available, which may be a better fit for a small research group or for short-term projects.
If that sounds interesting
You can get started by downloading the pre-compiled Linux version of chemfp using the following:
python -m pip install chemfp -i https://chemfp.com/packages/
A few features are either limited or disabled. Visit the licensing page to see the licensing terms, to request an evaluation key to unlock those features, and to learn about some of the available licensing options.
You do not need to request a license key for Tanimoto searches of the licensed FPB files available from the datasets page, so long as you follow the terms of the Chemfp Base License Agreement.
More information
Chemfp includes extensive documentation. For a more scholarly description, see: Dalke, A. The chemfp project. J. Cheminformatics 11, 76 (2019). doi: 10.1186/s13321-019-0398-8
Open source reference baseline for benchmarking
Chemfp 1.6.1 is the latest version of the no-cost/open source chemfp development track. It only supports Python 2.7. It is being maintained in order to provide a good reference baseline to evaluate similarity search performance, and to support the dwindling number of legacy users who haven't moved to Python 3. See the download page for download details.