Chemfp is fast. I'll even say it's very fast. But you probably want concrete numbers to back up that statement.
The results here were generated using the chemfp_benchmark. It uses test data sets with 166-bit (OpenEye MACCS), 881-bit (PubChem), 1024-bit (Open Babel FP2), and 2048-bit (RDKit Morgan) fingerprints. The 881-bit fingerprints were extracted from randomly selected PubChem records and the others were generated using randomly selected ChEMBL records.
Each target data set contains 1 million fingerprints. The corresponding query data set is not a subset of the target data set but is instead drawn from the full set of input fingerprints.
The benchmark then runs a number of different operations to get a sense of the performance.
The benchmark ran on a machine with an "Intel Core i7-4770 CPU @ 3.40GHz" running at 3.7 GHz using Python 3.6.2 and chemfp 3.2.1.
Chemfp currently uses a single-threaded search for each query fingerprint, so the single query searches are not affected by the number of processors. In general the performance is linear in the number of fingerprints, so a 100M fingerprint database should take 100x longer than the times reported here.
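As a rough illustration of the linear scaling rule above, here is a minimal sketch. The function name and the 2 ms input are hypothetical and chosen only to show the arithmetic; they are not measured chemfp results.

```python
def estimate_search_time(measured_seconds, measured_size, target_size):
    """Scale a measured search time linearly with the number of fingerprints."""
    return measured_seconds * (target_size / measured_size)

# Hypothetical input: a search measured at 2 ms against 1M fingerprints,
# extrapolated to a 100M fingerprint database (100x more fingerprints).
estimate = estimate_search_time(0.002, 1_000_000, 100_000_000)
print("estimated seconds for 100M fingerprints:", estimate)
```

The estimate only holds while the data set still fits in memory and the search stays single-threaded, as described above.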
k=10, single fingerprint query, in-memory search
Chemfp is designed to be called directly from a web application, assuming that application is written in Python. For example, you might have an application which displays information about a structure, and you want to change it to also display the nearest 10 neighbors in a target dataset containing a few million fingerprints. In that case you want the single-threaded search performance to be fast enough that it doesn't add noticeable overhead to the overall response time, even in the worst case.
This benchmark reports the timings of 1,000 k=10 nearest-neighbor queries for each of the 1M target data sets. In general the performance should be linear in the number of bits. The graphs below show that the fingerprint type also affects the performance.
It is also possible to ask chemfp to find the k-nearest neighbors with a similarity at or above a given threshold, which would reduce the search time.
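To make the k-nearest and threshold terminology concrete, here is a minimal pure-Python sketch of a k-nearest Tanimoto search with an optional minimum threshold. It is not chemfp's implementation or API: fingerprints are modeled as Python integers, and the names `popcount`, `tanimoto`, and `knearest` are made up for this example.

```python
import heapq

def popcount(x):
    """Number of on-bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints; 0.0 if both are empty."""
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return popcount(a & b) / union

def knearest(query, targets, k=10, threshold=0.0):
    """Return up to k (score, index) pairs with score >= threshold, best first."""
    hits = []
    for i, fp in enumerate(targets):
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((score, i))
    # Highest score first; ties go to the lower target index.
    return heapq.nlargest(k, hits, key=lambda hit: (hit[0], -hit[1]))

targets = [0b1111, 0b1100, 0b0011, 0b1110, 0b0000]
print(knearest(0b1110, targets, k=2))
print(knearest(0b1110, targets, k=10, threshold=0.7))
```

This sketch scores every target, so the threshold only trims the result list. A real implementation can use the threshold to skip targets entirely, which is where the time savings come from.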
k-nearest scaling, single fingerprint query, 1024 bits, in-memory search
A k=1 nearest search is faster than a k=10 or k=100 nearest search. The following benchmark shows the effect of changing the value of k for the 1024-bit data set.
threshold=0.7, single fingerprint query, in-memory search
This benchmark finds all matches at or above 0.7, and not just the first k. This is a high threshold, which means chemfp's sublinear search bounds are able to speed up the search by skipping most of the search space.
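One well-known way to skip part of the search space is a popcount bound: if the query has a bits set and a target has b bits set, the Tanimoto score can be at most min(a, b) / max(a, b), so only targets with a*T <= b <= a/T can reach a threshold T. The sketch below shows that idea in pure Python. It is an assumption that this matches the form of the bounds chemfp uses internally, and the function names are hypothetical.

```python
from fractions import Fraction
import math

def popcount(x):
    """Number of on-bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")

def popcount_bounds(query_popcount, threshold):
    """Range of target popcounts that could reach the threshold (threshold > 0)."""
    # Exact rational arithmetic avoids float rounding at the boundaries.
    t = Fraction(str(threshold))
    lo = math.ceil(query_popcount * t)
    hi = math.floor(query_popcount / t)
    return lo, hi

def candidate_indices(query, targets, threshold):
    """Indices of targets whose popcount falls inside the bounds."""
    lo, hi = popcount_bounds(popcount(query), threshold)
    return [i for i, fp in enumerate(targets) if lo <= popcount(fp) <= hi]

query = (1 << 10) - 1                       # 10 bits set
targets = [0b1, 0b1111111, (1 << 10) - 1, (1 << 20) - 1]
print(popcount_bounds(10, 0.7))
print(candidate_indices(query, targets, 0.7))
```

The higher the threshold, the narrower the popcount window, which is why a 0.7 search can skip most of the targets. If the targets are stored sorted by popcount, the window becomes a contiguous slice rather than a per-target test.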
The previous benchmarks show the timing of a single query against a data set. Chemfp also supports "NxM" search to find similar matches for multiple queries against the database. The NxM search is parallelized using OpenMP, where each query is run in its own thread.
Chemfp also has special support for the "NxN" case, where the same data set is used as both the queries and targets, excluding the diagonal. This is often used to construct a similarity matrix for clustering, along with the optimization that any similarity below a certain threshold can be treated as having no similarity.
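To show what a thresholded NxN result looks like, here is a small pure-Python sketch that builds a sparse symmetric similarity matrix, skips the diagonal, and drops every score below the threshold. It is only an illustration of the data structure, not chemfp's implementation; `nxn_threshold_matrix` and the integer fingerprints are made up for the example.

```python
def popcount(x):
    """Number of on-bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints; 0.0 if both are empty."""
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return popcount(a & b) / union

def nxn_threshold_matrix(fps, threshold):
    """Return rows[i] = list of (j, score) with j != i and score >= threshold."""
    n = len(fps)
    rows = [[] for _ in range(n)]
    for i in range(n):
        # Only visit j > i: the matrix is symmetric and the diagonal is excluded.
        for j in range(i + 1, n):
            score = tanimoto(fps[i], fps[j])
            if score >= threshold:
                rows[i].append((j, score))
                rows[j].append((i, score))
    return rows

fps = [0b1110, 0b1111, 0b0001]
print(nxn_threshold_matrix(fps, 0.7))
```

Using the symmetry halves the number of comparisons, but the work still grows with the square of the number of fingerprints.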
In general, the similarity matrix construction time for the NxN case grows as N² in the number of elements, linear in the number of fingerprint bits, and roughly linear in the number of processors until the memory bandwidth is saturated.
That is, each core pulls data out of main memory as fast as it can. Let's say each core processes 1 Gbit/s and the main memory bandwidth is 20 Gbit/s. Each additional core adds another 1 Gbit/s of demand, so at 20 cores the memory bandwidth is fully used and there is simply no way for memory to deliver data any faster to additional cores. This model is overly simple, but it helps explain why doubling the number of CPUs doesn't always double the performance.
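The same simplified model can be written as a few lines of Python. The 1 Gbit/s and 20 Gbit/s figures are the illustrative numbers from the paragraph above, not measurements, and `modeled_speedup` is a hypothetical name.

```python
def modeled_speedup(cores, per_core_gbit_s=1.0, memory_gbit_s=20.0):
    """Speedup over one core when total throughput is capped by memory bandwidth."""
    demand = cores * per_core_gbit_s          # what the cores could consume
    delivered = min(demand, memory_gbit_s)    # what memory can actually supply
    return delivered / per_core_gbit_s

for cores in (1, 10, 20, 40):
    print(cores, "cores ->", modeled_speedup(cores), "x")
```

In this model the speedup is linear up to 20 cores and flat afterwards. Real machines have caches and multiple memory channels, so the transition is more gradual.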
In practice there is decent speedup on a 15-core machine. No one has reported their experience on larger configurations.
The following benchmarks try to convey an idea of how chemfp scales with the number of processors, fingerprint size/type, and threshold value. For the 250K and 500K cases, the fingerprint subsets were randomly sampled and the timings run several times; the timing variability was at the level of the least significant digits.
(= 2m 54s)
(= 8m 24s)
(= 5m 12s)
(= 22m 40s)