It’s easy to write a Tanimoto search program. Chemfp goes another few steps to make that search fast. It dynamically tests the CPU to determine the fastest popcount method. It will use the POPCNT instruction, if possible, or fall back to SSSE3.2, SSSE2, and lookup table implementations depending on the processor type and fingerprint size. It uses OpenMP to take advantage of multiple processors. It uses a variation of the sublinear search method described in Swamidass and Baldi, and organizes the fingerprint data to increase cache coherency.
While a dedicated GPU machine might be faster, the chemfp package runs on any desktop and laptop computer and can easily achieve sub-second search performance on multi-million compound data sets.
A happy chemfp customer forwarded one of their success cases. They wanted to double the number of compounds they were clustering. Their locally developed software took almost 5 days to process 1 million compounds on their compute cluster. With a chemfp-based implementation they are now able to cluster ~2.5 million 1024-bit fingerprints in under 6 hours using a Tanimoto threshold of 0.8. This uses 15 threads of a machine containing multiple Intel® Xeon® Processor E7340 processors (8M Cache, 2.40 GHz, 1066 MHz FSB).
Single query performance
Chemfp is fast, even without using high-end specialized hardware. The plots above were done using a 3.2GHz Intel Core i3 from 2010. Both the nearest-neighbor search and the threshold count searches take a fraction of a second. Human factors research shows that users feel that a system responds instantaneously if the response times is under 0.1 seconds. That means that with chemfp you can implement an interactive search which shows the k=100 neighbors, or all structures with at least 0.80 similarity, “instantly” – even as the chemist sketches a structure!
Here are some details about the plots. The target data set is ChEMBL 14, which has 1,212,539 structures. These were turned into 4096-bit OpenEye path fingerprints using the command-line
oe2fps --id-tag chembl_id chembl_14.sdf.gz -o oe_chembl_14.fps.gz The query set comes from the BindingDB similarity queries contributed to the Structure Query Collection. These were also converted into 4096-bit path fingerprints.
The plots show the distribution of the single compound search performance over a range of search parameters. For each parameter, 1,000 of the 33,907 unique query fingerprints were selected at random and used as a query. The average and standard deviation of these times were recorded, as well as the absolute minimum and maximum times. These are plotted to show the performance trend. For example, this shows that a k=10 search takes about 0.06 seconds, with a standard deviation of about 0.04 seconds. Even the worst case of 1,000 tests took only 0.15 seconds.
Your search performance will depends on the fingerprint size and the hardware. Performance is roughly linear in the number of bits, so a 1024-bit fingerprint should be about four times faster than the times reported here. A newer processor, or one with more cache, should of course be faster. On the other hand, an older processor likely doesn’t have the POPCNT machine instruction, so chemfp will instead use a slower alternative.top
The chemfp distribution includes programs which let you generate and search fingerprints on the command line. The oe2fps, ob2fps, and rdkit2fps programs convert structures into fingerprints using the OEChem, Open Babel, and RDKit toolkits. The sdf2fps program extracts fingerprints encoded in SD tag fields. The simsearch program does high-performance Tanimoto searches of an FPS file.
The command-line programs are based on the chemfp Python library.top
The chemfp distribution includes a Python package so you can write your own programs based on the chemfp functionality. You must have Python installed in order for chemfp to work.
Here’s an example of loading target fingerprints into an ‘arena’, which is a data structure for fast Tanimoto searches, then searching the arena with a set of query fingerprints from another file.
>>> import chemfp >>> targets = chemfp.load_fingerprints("pubchem_targets.fps") >>> len(targets) 3119 >>> queries = chemfp.open("pubchem_queries.fps") >>> for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7): ... print query_id, len(hits), hits[:2] ... 27575433 0  27575577 18 [('14570945', 0.74874371859296485), ('14570946', 0.73762376237623761)] 27575602 3 [('14572463', 0.72560975609756095), ('14553070', 0.75935828877005351)] 27575603 3 [('14572463', 0.72560975609756095), ('14553070', 0.75935828877005351)] 27575880 9 [('14569876', 0.72307692307692306), ('14567856', 0.73076923076923073)] 27575897 0  27577227 1 [('14570135', 0.7142857142857143)]
The chemfp documentation contains many examples of how to use the command-line tools and Python API.top
A goal of the chemfp project is to promote the FPS format as the common format for fingerprint data exchange. It is a line-oriented text format designed to be simple to read and write. Each fingerprint record is stored on its own line, where the hex-encoded fingerprint is the first column and the record identifier is the second. FPS files have an optional header with metadata such as the fingerprint size and type, the software used to create the fingerprints, and the creation time.
The chemfp fingerprint generation tools save their output as an FPS file, and simsearch searches for matches in an FPS file. If you save your fingerprint data as an FPS file then you can use simsearch to do high-speed Tanimoto searches directly on that data set. Or if you want to compare different fingerprint data sets, you can use the chemfp tools (or your own) to generate the fingerprints and in just a few lines of code, write a fingerprint reader for your analysis program.
In many cases you don’t even need write your own I/O routines. The chemfp package includes a Python library for reading and writing FPS files, and CACTVS, OpenBabel, RDKit, and R are some of the tools with existing FPS support.top
The FPS format is a portable exchange format. It’s easy to read and write, but it takes a few seconds to read a million structures into memory. This is especially annoying when the actual similarity search only takes a fraction of a second. Of course the input performance isn’t a issue for long running programs and web servers. It’s more important for web server development if the server is configured to restart each time a file changes. The extra five seconds of loading time quickly gets annoying.
Chemfp also supports the FPB binary file format. It’s much faster to load because the fingerprint records are organized so they be memory-mapped directly to the data structure chemfp uses for similarity search. This works well with chemfp’s sublinear similarity search. When there’s a high threshold, the search only needs a small fraction of the records, and not the entire contents of the file. In this case, most of the time is spent in Python startup, not in similarity search.
Because the FPB files are memory-mapped, the operating system may cache the file across multiple runs in the file system cache. The file is read-only and can be shared by multiple processes, so a compute node or web server can run multiple chemfp programs using the same data set without dedicating extra memory for each program to have is own copy of the data.top
Multiple toolkit support
The chemfp package uses third-party cheminformatics toolkits to convert structure files into fingerprints. You must install a desired toolkit yourself. The currently supported toolkits are OEChem, OpenBabel, and RDKit. When you install one of these toolkits, make sure that you either use the download which has Python support or that you enable Python bindings if compiling from source.