Installing chemfp

Chemfp 4.2 is available as a pre-compiled package or a source distribution.

Installing a pre-compiled package

Pre-compiled chemfp distributions are available for Python 3.8 through Python 3.12. They were compiled under the “manylinux2014” Docker build environment, which means they should work for most Linux-based operating systems. While chemfp supports macOS, pre-compiled macOS distributions are not available.

These binary packages are NOT open source. By default they are distributed under the Chemfp Base License Agreement v1.3, which lets you use some of the chemfp functionality for internal purposes, including the ability to create FPS files and use the “toolkit” APIs.

However, in general the following features require a time-limited license key:

  • generate FPB files;

  • create in-memory fingerprint arenas with more than 50,000 fingerprints;

  • use the simarray functionality to process fingerprint sets with more than 20,000 fingerprints;

  • use other search methods on fingerprint arenas with more than 50,000 fingerprints;

  • perform Tversky searches;

  • perform Tanimoto searches of FPS files with more than 20 queries at a time.

These features can be enabled with a valid license key, set via the environment variable CHEMFP_LICENSE. Email sales@dalkescientific.com to request an evaluation license or to purchase a license.

These functions are also enabled and allowed when using a licensed FPB file containing a chemfp authorization key.
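Before starting a long job that needs one of the licensed features, it can help to confirm that the CHEMFP_LICENSE environment variable is actually visible to Python. The following is a minimal sketch, not part of chemfp; the helper name license_key_is_set is hypothetical. It only checks that the variable is non-empty and cannot tell whether the key is valid or has expired.

```python
import os

def license_key_is_set(environ=None):
    """Return True if CHEMFP_LICENSE is present and non-empty.

    Hypothetical helper: this looks only at the environment variable.
    It does not ask chemfp whether the key is accepted.
    """
    if environ is None:
        environ = os.environ
    return bool(environ.get("CHEMFP_LICENSE", "").strip())

if __name__ == "__main__":
    if license_key_is_set():
        print("CHEMFP_LICENSE is set")
    else:
        print("CHEMFP_LICENSE is not set; licensed features may be unavailable")
```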

If you use pip you should likely also use a virtual environment. Set that up first (or the equivalent in conda) and activate it.
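If you are not sure whether a virtual environment is active, a quick check from Python is to compare sys.prefix with sys.base_prefix. This is a small sketch with a hypothetical helper name; it detects venv-style environments and, as an assumption to be aware of, does not detect conda environments.

```python
import sys

def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix points at the base Python installation, so the two
    differ. Conda environments are not detected by this comparison.
    """
    return sys.prefix != sys.base_prefix

if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```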

Use the following command to install a pre-compiled version of chemfp and (if not already installed) its “click” and “tqdm” third-party dependencies:

python -m pip install chemfp -i https://chemfp.com/packages/

If you get the message:

ERROR: Could not find a version that satisfies the requirement chemfp (from versions: none)
ERROR: No matching distribution found for chemfp

then you are likely installing from a non-Linux-based operating system like macOS or Microsoft Windows. Pre-compiled installers are not yet available for those OSes. Currently macOS is supported in the source distribution and Windows is not yet supported.
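To see ahead of time whether pip is likely to find a pre-compiled package, you can compare your interpreter against the requirements described in this section. The sketch below is an illustration only: the function name is hypothetical, the version range is copied from the text above, and it ignores other details that affect compatibility, such as the CPU architecture and the C library used by your Linux distribution.

```python
import platform
import sys

# Range taken from the text above: pre-compiled packages cover
# Python 3.8 through Python 3.12 on Linux-based systems.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 12)

def precompiled_package_expected(system=None, version=None):
    """Guess whether pip should find a pre-compiled chemfp package."""
    if system is None:
        system = platform.system()      # "Linux", "Darwin", "Windows", ...
    if version is None:
        version = sys.version_info[:2]  # (major, minor), e.g. (3, 12)
    return system == "Linux" and MIN_VERSION <= tuple(version) <= MAX_VERSION

if __name__ == "__main__":
    if precompiled_package_expected():
        print("A pre-compiled chemfp package is probably available.")
    else:
        print("No pre-compiled package expected; see the source install section.")
```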

Installing from source

This section applies only if you have purchased a source code license to chemfp and have a copy of that distribution.

The chemfp source distribution requires that Python and a C compiler be installed on your machine. Since chemfp doesn’t yet run on Microsoft Windows (for tedious technical reasons), your machine likely already has both Python and a C compiler installed. In case you don’t have Python, or you want to install a newer version, you can download a copy of Python from http://www.python.org/download/ . If you don’t have a C compiler, ... well, do I really need to give you a pointer for that?

chemfp 4.2 supports Python 3.8 or newer and has been tested up to Python 3.12.

The core chemfp functionality does not depend on a third-party library but you will need a chemistry toolkit in order to generate new fingerprints from structure files. chemfp supports the free Open Babel, RDKit, and CDK toolkits and the proprietary OEChem/OEGraphSim toolkits. Make sure you install the Python libraries for the toolkit(s) you select.

The easiest way to install chemfp is with the pip installer. This comes with all supported versions of Python.

If you use pip you should likely also use a virtual environment. Set that up first (or the equivalent in conda) and activate it.

To install the source distribution tar.gz file with pip:

python -m pip install chemfp-4.2.tar.gz

If not already installed this will also install chemfp’s “click” and “tqdm” third-party dependencies.

If you are making in-house modifications, you likely want to use the --editable option.

Configuration options

There are several compile-time options, which can be specified through environment variables.

OpenMP compile-time options

The default is to compile chemfp with OpenMP support. You will want to disable this if your compiler doesn’t support OpenMP.

If you get a message like:

unrecognized command line option "-fopenmp"
   -or-
ld: library not found for -lgomp

then your compiler does not understand OpenMP. You are likely on a Mac where Apple’s version of the clang compiler does not support OpenMP.

You have two options:

1) Install a C compiler which supports OpenMP and use the CC environment variable to specify the compiler:

env CC=gcc-14 python -m pip install chemfp-4.2.tar.gz

On macOS I install gcc using the “brew” packaging system:

brew install gcc@14

2) Set the CHEMFP_OPENMP environment variable to “0” to compile chemfp without OpenMP support, like this:

env CHEMFP_OPENMP=0 python -m pip install chemfp-4.2.tar.gz

This will let you compile chemfp on macOS with Apple’s version of clang.

Multiple architectures

On macOS your Python installation might be compiled for both x86-64 and arm64 architectures. When it builds Python/C extensions like chemfp, it uses the C compiler flags:

-arch arm64 -arch x86_64

My installation of gcc-11 (from 2021) does not understand “arm64” and causes the build process to fail with:

gcc-11: error: this compiler does not support arm64

To install chemfp for this case, set the environment variable CHEMFP_ARCH to “x86_64” to have the chemfp build process remove the existing “-arch” compiler flags and replace them with “-arch x86_64”:

env CC=gcc-11 CHEMFP_ARCH=x86_64 python -m pip install chemfp-4.2.tar.gz

The environment variable may be a comma-separated list of architectures, like:

env CC=gcc-11 CHEMFP_ARCH=arm64,x86_64 python -m pip install chemfp-4.2.tar.gz

which would restore the original architecture list.

Missing architecture

My brew-based installation of gcc-13 (from 2023) gives a warning instead of an error, saying:

gcc-13: warning: this compiler does not support Arm64 ('-arch' option ignored)

It compiles chemfp, but only for x86-64.

If you are on a Mac with Apple Silicon (an ARM-based processor like the M1 or M2), you did not compile chemfp with ARM support, and your version of Python uses native ARM code rather than Rosetta-translated x86-64 code, then you will get an error message like this:

>>> import chemfp.bitops
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/bitops.py", line 40, in <module>
    from . import _bitops
ImportError: dlopen(chemfp/_bitops.cpython-312-darwin.so, 0x0002): tried:
'chemfp/_bitops.cpython-312-darwin.so' (mach-o file, but is an incompatible
architecture (have 'x86_64', need 'arm64e'))

The best solution is to install everything for ARM support.

If that isn’t an option, you can use the “arch” command to start Python in x86-64 mode, like this:

arch -arch x86_64 python

and run the chemfp commands the same way, like:

arch -arch x86_64 sdf2fps

Or if your Python installation supports it, you can use the “*-intel64” variant, like this:

python3.12-intel64

but you’ll need to ensure that python3.12-intel64 exists in your virtual environment, as Python’s “venv” doesn’t include that in the environment. On my system, I’ve installed Python 3.12 using the installer from Python.org, and created the “py312-2024-1” virtual environment. The following shows how the “intel64” variant isn’t included in the environment, and how it does not know how to import “chemfp”:

% which python3.12
/Users/dalke/venvs/py312-2024-1/bin/python3.12
% which python3.12-intel64
/Library/Frameworks/Python.framework/Versions/3.12/bin/python3.12-intel64
% python3.12-intel64
Python 3.12.2 (v3.12.2:6abddd9f6a, Feb  6 2024, 17:02:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import chemfp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'chemfp'
>>> exit()

The fix for me was to add a symbolic link from the virtual environment to the actual binary:

% ln -s \
/Library/Frameworks/Python.framework/Versions/3.12/bin/python3.12-intel64 \
/Users/dalke/venvs/py312-2024-1/bin/

If you want the command-line tools to work then you’ll also need to install chemfp with that version of Python, so the installed tools have the correct “bang path”, that is, so the installed scripts are configured to have the OS run the correct version of Python:

python3.12-intel64 -m pip install chemfp-4.2.tar.gz
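To find out which of the cases above applies to a given interpreter, you can ask Python which machine type it is running as. The following sketch uses the standard platform module; the function name and the wording of the messages are illustrative only. On macOS, platform.machine() reports "arm64" for a native ARM interpreter and "x86_64" for an x86-64 interpreter, including one started through Rosetta.

```python
import platform

def describe_python_architecture(machine=None):
    """Describe which kind of chemfp build the running interpreter needs."""
    if machine is None:
        machine = platform.machine()  # e.g. "arm64" or "x86_64" on macOS
    if machine == "arm64":
        return "arm64: native ARM Python; chemfp must be compiled for arm64"
    if machine == "x86_64":
        return "x86_64: x86-64 Python; an x86-64 build of chemfp will load"
    return f"{machine}: not covered by this sketch"

if __name__ == "__main__":
    print(describe_python_architecture())
```

Run it with each interpreter you care about (for example, both python3.12 and python3.12-intel64) to confirm they report what you expect.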

AVX2 support

On x86-64 architectures, chemfp uses the AVX2 instruction set, if available, for faster Tanimoto calculations on larger fingerprints, like 1024 and 2048 bits.

If you get a message like:

 cc1: error: invalid option avx2
-or-
 cc1: error: unrecognized command line option "-avx2"

then your compiler does not understand the request to use AVX2 intrinsics.

In the unlikely event that happens (AVX2 came out in the early 2010s), set the CHEMFP_AVX2 environment variable to 0:

env CHEMFP_AVX2=0 python -m pip install chemfp-4.2.tar.gz

The AVX2 compiler flag will only be used to compile popcount_avx2.c.
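If you script your installs, the compile-time options in this section can be collected in one place. The sketch below is an assumption-laden illustration rather than anything shipped with chemfp: the function and parameter names are hypothetical, and it uses only the environment variables and values described above (CC, CHEMFP_OPENMP=0, CHEMFP_ARCH, CHEMFP_AVX2=0). Options you do not request are left unset so the chemfp defaults apply.

```python
import os
import subprocess
import sys

def chemfp_build_settings(cc=None, disable_openmp=False, arch=(), disable_avx2=False):
    """Return the environment variables for the requested build options."""
    settings = {}
    if cc:
        settings["CC"] = cc                      # e.g. "gcc-14"
    if disable_openmp:
        settings["CHEMFP_OPENMP"] = "0"          # compile without OpenMP
    if isinstance(arch, str):
        arch = [arch]                            # accept a single name too
    if arch:
        settings["CHEMFP_ARCH"] = ",".join(arch) # comma-separated list
    if disable_avx2:
        settings["CHEMFP_AVX2"] = "0"            # compile without AVX2
    return settings

def install_from_source(sdist, **options):
    """Run pip on the source distribution with the chosen settings."""
    env = dict(os.environ)
    env.update(chemfp_build_settings(**options))
    subprocess.run([sys.executable, "-m", "pip", "install", sdist],
                   env=env, check=True)
```

For example, install_from_source("chemfp-4.2.tar.gz", cc="gcc-11", arch=["x86_64"]) corresponds to the "env CC=gcc-11 CHEMFP_ARCH=x86_64 ..." command shown earlier.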

Installing CDK and JPype

CDK is a Java package. Chemfp is written for Python. Chemfp supports CDK. How can chemfp call into CDK?

There are several ways for Python programs to call into Java. I tried two of them and ended up using JPype, following Noel O’Boyle’s suggestion.

There are a few ways to install JPype. The easiest is likely to use conda (see the documentation for details) or, if you have the Java run-time, you can pip install it with:

python -m pip install JPype1

This installs the jpype module for Python.

You’ll also need to put the CDK JAR on the CLASSPATH. For example, in the following I download the JAR file then set the CLASSPATH using bash syntax:

cd ~/ftps
curl -LO https://github.com/cdk/cdk/releases/download/cdk-2.3/cdk-2.3.jar
export CLASSPATH=/Users/dalke/ftps/cdk-2.3.jar

(I put my manually downloaded packages in ~/ftps/ for historic reasons.)

Use cdk2fps --version to diagnose if things are working. If it’s a success it should look like:

% cdk2fps --version
cdk2fps 4.2

The following message occurs if jpype isn’t installed:

Cannot run cdk2fps: Cannot import jpype, which is required for
chemfp to access the CDK jar: No module named 'jpype'

The following message occurs if jpype is installed (e.g., via pip) but either Java isn’t installed on your machine or jpype couldn’t find your installation:

Cannot run cdk2fps: No JVM shared library file (libjvm.so)
found. Try setting up the JAVA_HOME environment variable properly.

A message like the following occurs on macOS if your OpenJDK installation only supports x86_64 while you are using arm64:

Cannot run cdk2fps: It appears that CDK is not installed: Cannot
start the JVM using jpype: [Errno 0] JVM DLL not found:
/usr/local/Cellar/openjdk@11/11.0.18/libexec/openjdk.jdk/Contents/Home/lib/jli/libjli.dylib

In that case, either install and use the ARM version of Homebrew, or see the earlier section about using python3.12-intel64 to start Python in x86-64 mode through Rosetta.

The following message occurs if the CDK JAR file is not on the CLASSPATH:

Cannot run cdk2fps: It appears that CDK is not installed: Unable to
access the CDK jar via JPype. Is the jar on your CLASSPATH?: Failed
to import 'org.openscience'
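Two of the failure modes above (jpype not importable, CDK jar not on the CLASSPATH) can be checked without starting the JVM. The following is a rough sketch and not how cdk2fps does its own checks. The function name is hypothetical, and it assumes the CDK jar filename starts with "cdk" and ends with ".jar", as in cdk-2.3.jar; a renamed jar will not be recognized.

```python
import importlib.util
import os

def check_cdk_prerequisites(environ=None):
    """Return a list of detected problems; an empty list means none found."""
    if environ is None:
        environ = os.environ
    problems = []

    # Is the jpype module importable by this Python?
    if importlib.util.find_spec("jpype") is None:
        problems.append("jpype is not importable (python -m pip install JPype1)")

    # Look for CLASSPATH entries that look like a CDK jar (assumed naming).
    entries = [e for e in environ.get("CLASSPATH", "").split(os.pathsep) if e]
    cdk_jars = [e for e in entries
                if os.path.basename(e).lower().startswith("cdk")
                and e.lower().endswith(".jar")]
    if not cdk_jars:
        problems.append("no CDK jar is listed on the CLASSPATH")
    for jar in cdk_jars:
        if not os.path.isfile(jar):
            problems.append(f"CLASSPATH entry does not exist: {jar}")
    return problems

if __name__ == "__main__":
    for problem in check_cdk_prerequisites() or ["no problems detected"]:
        print(problem)
```

An empty result does not guarantee that cdk2fps will work, since it does not test whether jpype can find a usable JVM.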

Installing jCompoundMapper

jCompoundMapper is a Java package for generating a variety of chemical fingerprints. It was developed in 2011 using CDK version 1.35.

The 2024 paper “Effectiveness of molecular fingerprints for exploring the chemical space of natural products” by Boldini, Ballabio, Consonni, Todeschini, Grisoni, and Sieber, J. Cheminf. (2024) 16:35, https://doi.org/10.1186/s13321-024-00830-3 benchmarked “20 molecular fingerprints from four different sources” and observed that the “ASP” (“All Shortest Paths”) and “LSTAR” jCompoundMapper fingerprints were promising.

In order to help others investigate them further, chemfp 4.2 added support for jCompoundMapper’s DFS, ASP, LSTAR, RAD2D, PH2, and PH3 fingerprint types.

To get started, download either the binary executable jar jCMapperCLI.jar or the jar library jCMapperLibOnly.jar and place it on your CLASSPATH after the CDK jar. For example, I have:

CLASSPATH=$HOME/cdk_jars/cdk-2.9.jar:$HOME/cdk_jars/jCMapperLibOnly.jar

If chemfp sees the filenames “jCMapperCLI.jar” or “jCMapperLibOnly.jar” on the CLASSPATH then chemfp will start the JVM using -DCdkUseLegacyAtomContainer=t, which configures the modern CDK to use the legacy atom container expected by jCompoundMapper.

This lets chemfp use the modern CDK for file I/O, and jCompoundMapper to generate fingerprints.

However, this mode is incompatible with CDK’s own fingerprint types, which (with one exception) do not work with the legacy atom container.

To test if it works, try:

% echo "CCC id1" | cdk2fps --type "jCMapper-DFS hashsize=128"
[WARN] Using the old AtomContainer implementation.
#FPS1
#num_bits=128
#type=jCMapper-DFS/1 hashsize=128 searchDepth=7
#software=CDK/2.9 jCMapper/1 chemfp/4.2b2
#date=2024-05-31T15:11:58
00100000000000000800008000200000      id1

The line starting “[WARN]” comes from the CDK, and indicates that CDK has successfully started up in legacy mode.
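If the test does not produce that output, one thing to check is the order of the jars on the CLASSPATH, since the jCompoundMapper jar should come after the CDK jar. The sketch below is only an approximation of that check and is not chemfp's own detection logic: the function name is hypothetical, jar names are compared case-insensitively, and the CDK jar is assumed to have a filename starting with "cdk".

```python
import os

# The two jCompoundMapper jar filenames, lowercased for comparison.
JCMAPPER_JARS = {"jcmappercli.jar", "jcmapperlibonly.jar"}

def jcmapper_classpath_status(classpath, sep=os.pathsep):
    """Report whether a jCompoundMapper jar follows a CDK jar."""
    names = [os.path.basename(p).lower() for p in classpath.split(sep) if p]
    cdk_at = [i for i, n in enumerate(names)
              if n.startswith("cdk") and n.endswith(".jar")]
    jcm_at = [i for i, n in enumerate(names) if n in JCMAPPER_JARS]
    if not jcm_at:
        return "no jCompoundMapper jar found"
    if not cdk_at:
        return "jCompoundMapper jar found, but no CDK jar"
    if min(jcm_at) < min(cdk_at):
        return "jCompoundMapper jar is listed before the CDK jar"
    return "ok"

if __name__ == "__main__":
    print(jcmapper_classpath_status(os.environ.get("CLASSPATH", "")))
```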