I've just released chemfp 1.4, the newest version of the chemfp 1.x development track.
This version mostly contains bug fixes and internal improvements. The biggest additions are the fpcat command-line program, support for Dave Cosgrove's 'flush' fingerprint file format, and support for 'fromAtoms' in some of the RDKit fingerprints.
chemfp 1.4 is available at no cost under the MIT license.
Detailed description of the changes
The configuration has changed to use setuptools.
Previously the command-line programs were installed as small scripts. Now they are created and installed using the "console_scripts" entry_point as part of the install process. This is more in line with the modern way of installing command-line tools for Python.
If these scripts are no longer installed correctly, please let me know.
The fpcat command-line tool was back-ported from chemfp 3.1. It can be used to merge a set of FPS files together, and to convert to/from the flush file format. This version does not support the FPB file format.
If you have installed the chemfp_converters package then chemfp will use it to read and write fingerprint files in flush format. The flush format can be used as output from the *2fps programs, as input to and output from fpcat, and as query input to simsearch.
Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and pair fingerprints. This is primarily useful if you want to generate the circular environment around specific atoms of a single molecule, and you know the atom indices. If you pass in multiple molecules then the same indices will be used for all of them. Out-of-range values are ignored.
The command-line option is "--from-atoms", which takes a comma-separated list of non-negative integer atom indices. For example:
--from-atoms 0
--from-atoms 29,30
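To make the accepted syntax concrete, here is a minimal sketch of how such a value could be parsed and validated. The function name parse_from_atoms is hypothetical and this is not chemfp's own parser; it only illustrates the "comma-separated list of non-negative integers" rule.

```python
import re

# Matches one non-negative integer written with ASCII digits only.
_INDEX_PAT = re.compile(r"[0-9]+\Z")

def parse_from_atoms(value):
    """Parse a value like "29,30" into a list of atom indices.

    Hypothetical helper for illustration; not part of chemfp.
    """
    indices = []
    for term in value.split(","):
        term = term.strip()
        if not _INDEX_PAT.match(term):
            raise ValueError(
                "atom index must be a non-negative integer: %r" % (term,))
        indices.append(int(term))
    return indices

print(parse_from_atoms("0"))      # [0]
print(parse_from_atoms("29,30"))  # [29, 30]
```

Negative numbers and empty terms are rejected with a ValueError. Range checking is not done here, because an index can only be judged out of range once the molecule is known.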
The corresponding fingerprint type strings have also been updated. If fromAtoms is specified then the string "fromAtoms=i,j,k,..." is added to the type string. If it is not specified then the fromAtoms term is not present, in order to maintain compatibility with older type strings. (The philosophy is that two fingerprint types are equivalent if and only if their type strings are equivalent.)
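The rule for the optional term can be sketched as follows. The helper make_type_string is hypothetical, and the base name and parameters in the example are only illustrative; the point is that the fromAtoms term appears only when indices are given, so type strings without it stay identical to the older ones.

```python
def make_type_string(base, params, from_atoms=None):
    """Build a type string, adding fromAtoms only when it is specified.

    Hypothetical helper for illustration; not chemfp's implementation.
    """
    terms = [base]
    terms.extend("%s=%s" % (name, value) for name, value in params)
    if from_atoms:
        # Only present when indices were given, so that type strings
        # without fromAtoms are unchanged from earlier versions.
        terms.append("fromAtoms=" + ",".join(str(i) for i in from_atoms))
    return " ".join(terms)

params = [("radius", 2), ("fpSize", 2048)]  # illustrative parameters
print(make_type_string("RDKit-Morgan/1", params))
# RDKit-Morgan/1 radius=2 fpSize=2048
print(make_type_string("RDKit-Morgan/1", params, from_atoms=[29, 30]))
# RDKit-Morgan/1 radius=2 fpSize=2048 fromAtoms=29,30
```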
The --from-atoms option is only useful when there's a single query and when you have some other mechanism to determine which subset of the atoms to use. For example, you might parse a SMILES, use a SMARTS pattern to find the subset, get the indices of the SMARTS match, and pass the SMILES and indices to rdk2fps to generate the fingerprint for that substructure.
Be aware that the union of the fingerprint for --from-atoms X and the fingerprint for --from-atoms Y might not be equal to the fingerprint for --from-atoms X,Y. However, if a bit is present in the union of the X and Y fingerprints then it will be present in the X,Y fingerprint.
Why? The fingerprint implementation first generates a sparse count fingerprint, then converts that to a bitstring fingerprint. The conversion is affected by the feature count. If a feature is present in both X and Y then the X,Y fingerprint may have additional bits set compared to the individual fingerprints.
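A toy model shows the effect. This is not RDKit's actual algorithm: I assume that each feature sets one bit per count threshold it reaches, that the thresholds are 1, 2, 4 and 8, and that the counts for X,Y are the sum of the counts for X and for Y. The feature names are made up.

```python
from collections import Counter

# Assumed count thresholds for this toy model.
THRESHOLDS = (1, 2, 4, 8)

def to_bits(counts):
    """Convert a sparse count fingerprint to a set of (feature, threshold) bits."""
    return {(feature, t)
            for feature, count in counts.items()
            for t in THRESHOLDS
            if count >= t}

x = Counter({"A": 1, "B": 1})  # features around the atoms in X
y = Counter({"A": 1, "C": 1})  # features around the atoms in Y
xy = x + y                     # A is seen twice when both are used

union = to_bits(x) | to_bits(y)
combined = to_bits(xy)

# Every bit in the union is also in the combined fingerprint ...
assert union <= combined
# ... but the combined fingerprint has an extra bit for "A seen twice".
print(sorted(combined - union))  # [('A', 2)]
```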
The ob2fps, rdk2fps, and oe2fps programs now also include the chemfp version information on the software line of the metadata. This improves data provenance because the fingerprint output might be affected by a bug in chemfp.
The Metadata 'date' is now always a datetime instance, and not a string. If you pass a string into the Metadata constructor, like Metadata(date="datestr"), then the date will be converted to a datetime instance. Use "metadata.datestamp" to get the ISO string representation of the Metadata date.
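The behavior can be pictured with a small stand-in class. MetadataSketch is hypothetical and is not chemfp's Metadata class; I also assume the date string uses the "YYYY-MM-DDTHH:MM:SS" layout.

```python
from datetime import datetime

class MetadataSketch(object):
    """Stand-in showing the date handling; not chemfp's Metadata class."""

    def __init__(self, date=None):
        if isinstance(date, str):
            # Assumed layout for the date string.
            date = datetime.strptime(date, "%Y-%m-%dT%H:%M:%S")
        self.date = date

    @property
    def datestamp(self):
        """The date as an ISO string, or None if there is no date."""
        if self.date is None:
            return None
        return self.date.replace(microsecond=0).isoformat()

metadata = MetadataSketch(date="2017-09-14T16:10:38")
print(type(metadata.date).__name__)  # datetime
print(metadata.datestamp)            # 2017-09-14T16:10:38
```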
Fixed a bug where a k=0 similarity search using an FPS file as the targets caused a segfault. The code assumed that k would be at least 1. With the fix, a k=0 search will read the entire file, checking for format errors, and return no hits.
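Here is a pure Python analog of the fixed behavior, assuming FPS records of the form "hex fingerprint, tab, id". The real search is not written this way, and the function names are hypothetical. Without the k == 0 guard the heap code below would index an empty list, which is the same kind of "k is at least 1" assumption.

```python
import heapq

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / float(union)

def knearest_search(query_hex, fps_lines, k):
    """Return up to k (id, score) hits, best first. Sketch only."""
    query = int(query_hex, 16)
    heap = []  # min-heap holding the best k hits seen so far
    for lineno, line in enumerate(fps_lines, 1):
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError("line %d: expected a fingerprint and an id" % lineno)
        try:
            target = int(fields[0], 16)
        except ValueError:
            raise ValueError("line %d: fingerprint is not hex" % lineno)
        if k == 0:
            continue  # keep checking the format, but there is nothing to keep
        item = (tanimoto(query, target), -lineno, fields[1])
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [(id_, score) for score, _, id_ in sorted(heap, reverse=True)]

lines = ["#FPS1\n", "ff\tfirst\n", "0f\tsecond\n"]
print(knearest_search("ff", lines, 1))  # [('first', 1.0)]
print(knearest_search("ff", lines, 0))  # []
```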
Fixed a bug where only the first ~100 queries against an FPS target search would return the correct ids. (Forgot to include the block offset when extracting the ids.)
Fixed a bug in the sublinear bounds for Tanimoto searches (used when there is a popcount index): if the query fingerprint had 1 bit set and the threshold was 0.0, the search failed to check targets with 0 bits set.
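For context, the bounds come from the fact that the Tanimoto score of two fingerprints with popcounts p and q can be at most min(p, q)/max(p, q), so for a threshold t only targets with q*t <= p <= q/t need to be checked. The following is a minimal sketch of that calculation with the threshold 0.0 case handled explicitly. The function name is hypothetical and this is not chemfp's code.

```python
import math

def popcount_bounds(query_popcount, threshold, num_bits):
    """Return the (low, high) target popcounts which might reach the threshold.

    Sketch only. The small epsilon keeps the bounds conservative in the
    face of floating point rounding; a slightly wider range only means
    a few extra targets get checked.
    """
    if threshold <= 0.0:
        # Every target scores at least 0.0, including those with 0 bits
        # set, and q/t is undefined, so search every popcount.
        return 0, num_bits
    eps = 1e-9
    low = int(math.ceil(query_popcount * threshold - eps))
    high = int(math.floor(query_popcount / threshold + eps))
    return max(0, low), min(num_bits, high)

print(popcount_bounds(1, 0.0, 1024))  # (0, 1024)
print(popcount_bounds(4, 0.5, 1024))  # (2, 8)
```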