ChEMBL distributes precomputed RDKit-Morgan fingerprints in FPS format. I've reformatted the ChEMBL 34 fingerprints into FPB format along with an embedded chemfp license key to enable the full range of chemfp functionality when working with that data set.
I've also removed the ChEMBL 31 dataset. If you need it, please contact me.
The FPB file is distributed under the terms of the ChEMBL license, which is CC BY-SA 3.0.
What is chemfp?
Chemfp is a Python package for working with cheminformatics fingerprints, including high-performance Tanimoto similarity search, built-in support for RDKit, OEChem/OEGraphSim, Open Babel, and the CDK, and integration with NumPy/SciPy. It contains an extensive and well-documented Python API and a set of command-line tools for fingerprint generation, search, and format conversion.
Chemfp natively supports both the text-oriented FPS and the binary-oriented FPB fingerprint file formats. A licensed FPB file contains an authorization token which enables chemfp's Tanimoto search functionality for that data set.
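To make the terms above concrete, here is a minimal pure-Python sketch of a Tanimoto comparison between two fingerprints read from FPS data lines. It is only an illustration, not chemfp's implementation, which is far faster. It assumes each FPS data line holds a hex-encoded fingerprint, a tab, and an identifier. The helper names parse_fps_line and tanimoto are made up for this example, and returning 0.0 when neither fingerprint has any bits set is a convention chosen for this sketch.

```python
def parse_fps_line(line):
    """Split one FPS data line into (identifier, fingerprint bytes).

    Assumes the layout "<hex fingerprint><TAB><identifier>".
    """
    hex_fp, fp_id = line.rstrip("\n").split("\t")[:2]
    return fp_id, bytes.fromhex(hex_fp)


def tanimoto(fp1, fp2):
    """Tanimoto similarity: bits set in both / bits set in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    if union == 0:
        # Neither fingerprint has any bits set; this sketch returns 0.0.
        return 0.0
    return bin(a & b).count("1") / union


# Two toy 8-bit fingerprints: 0x0f has 4 bits set, 0x03 has 2, and they share 2.
_, fp_a = parse_fps_line("0f\tA\n")
_, fp_b = parse_fps_line("03\tB\n")
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 4 bits in the union = 0.5
```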
How to get started with the ChEMBL 34 FPB file
If you are on a Linux-based OS with RDKit already installed, and you are working in a Python virtual environment, follow these steps to get started:
1) Install a pre-compiled version of chemfp for Linux using the following:
python -m pip install chemfp -i https://chemfp.com/packages/
2) Download the ChEMBL data set in FPB format using one of the following:
wget https://chemfp.com/datasets/chembl_34.fpb.gz
-or-
curl -O https://chemfp.com/datasets/chembl_34.fpb.gz
or use your browser to save chembl_34.fpb.gz directly.
3) (Optional but highly recommended) Uncompress it:
gunzip chembl_34.fpb.gz
4) Do a similarity search, for example, with a query SMILES or a query file:
simsearch --query c1ccccc1O chembl_34.fpb
simsearch --queries your_queries.sdf chembl_34.fpb
For more help with the simsearch command, use simsearch --help on the command-line or see the chapter "Working with the Command-line Tools" in the chemfp documentation.
5) The ChEMBL license, attribution, and legal notice of adaptation are included with the dataset. They can be viewed with the following:
python -m chemfp fpb_text chembl_34.fpb
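Steps 2 through 4 can also be driven from Python. The following is a minimal sketch using only the standard library. The helper names (download, gunzip_file, parse_simsearch_output, run_simsearch) are hypothetical and are not part of chemfp. It assumes the simsearch command is on your PATH and that it writes its default text output: header lines starting with "#", then one tab-separated line per query holding the hit count, the query id, and alternating target id and score fields. Check that assumption against the chemfp documentation for your version.

```python
import gzip
import shutil
import subprocess
import urllib.request


def download(url, dst):
    """Save the file at `url` to the local path `dst` (like wget or curl -O)."""
    urllib.request.urlretrieve(url, dst)


def gunzip_file(src, dst):
    """Decompress the gzip file `src` into `dst`, leaving `src` in place."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def parse_simsearch_output(text):
    """Parse simsearch text output into [(query_id, [(target_id, score), ...])].

    Assumed line layout: count <TAB> query id <TAB> target id <TAB> score ...
    Lines starting with "#" are treated as header lines and skipped.
    """
    results = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        count, query_id, rest = int(fields[0]), fields[1], fields[2:]
        hits = [(rest[i], float(rest[i + 1])) for i in range(0, 2 * count, 2)]
        results.append((query_id, hits))
    return results


def run_simsearch(smiles, targets="chembl_34.fpb"):
    """Run `simsearch --query SMILES TARGETS` and return the parsed hits."""
    proc = subprocess.run(
        ["simsearch", "--query", smiles, targets],
        capture_output=True, text=True, check=True,
    )
    return parse_simsearch_output(proc.stdout)
```

A typical session would call download("https://chemfp.com/datasets/chembl_34.fpb.gz", "chembl_34.fpb.gz"), then gunzip_file("chembl_34.fpb.gz", "chembl_34.fpb"), then run_simsearch("c1ccccc1O"). Unlike gunzip, gunzip_file keeps the compressed file, so delete it yourself if disk space matters.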
Licensing
Chemfp's base license agreement lets you use most chemfp functionality for in-house use, except that you may not use it to:
- generate FPB files;
- create in-memory fingerprint arenas with more than 50,000 fingerprints;
- search in-memory fingerprint arenas with more than 50,000 fingerprints, unless they are licensed FPB files;
- perform Tversky searches;
- perform Tanimoto searches of FPS files with more than 20 queries at a time.
These features are present but disabled in the pre-compiled Linux distribution unless a time-based chemfp license key is found.
As an alternative, most customers choose to purchase a source code license, which has no time limit (you may continue to use it even after your support period ends) and can also be used under macOS.
No-cost academic licensing is available.
See the chemfp licensing page for more details on the licensing options and for information about how to request an evaluation license.