I've just released chemfp 3.2, available immediately for paying chemfp customers. If you are interested in purchasing a chemfp license, email me.
This version mostly contains bug fixes and internal improvements. The biggest additions are support for Dave Cosgrove's 'flush' fingerprint file format, and support for 'fromAtoms' in some of the RDKit fingerprints.
Detailed description of the changes
The configuration has changed to use setuptools.
Previously the command-line programs were installed as small scripts. Now they are created and installed using the "console_scripts" entry_point as part of the install process. This is more in line with the modern way of installing command-line tools for Python.
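As a rough illustration of the mechanism, a "console_scripts" entry point maps a command name to a "module:function" target, and the installer generates a small launcher which imports the module and calls the function. The sketch below only shows how such specifications are structured. The command names come from this post, but the module paths and function names are hypothetical and are not chemfp's actual layout.

```python
# Sketch of "console_scripts" specifications as they might appear in a
# setup.py. The right-hand sides (module paths and function names) are
# hypothetical placeholders, not chemfp's real modules.
CONSOLE_SCRIPTS = [
    "rdk2fps = example_pkg.commandline.rdk2fps:main",
    "simsearch = example_pkg.commandline.simsearch:main",
    "fpcat = example_pkg.commandline.fpcat:main",
]


def parse_console_script(spec):
    """Split 'name = module:function' into (name, module, function)."""
    name, sep, target = spec.partition("=")
    if not sep:
        raise ValueError("missing '=' in %r" % (spec,))
    module, sep, function = target.strip().partition(":")
    if not sep:
        raise ValueError("missing ':' in %r" % (spec,))
    return name.strip(), module, function


# In a real setup.py the list would be passed to setuptools as:
#   setup(..., entry_points={"console_scripts": CONSOLE_SCRIPTS})
for spec in CONSOLE_SCRIPTS:
    print(parse_console_script(spec))
```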
If these scripts are no longer installed correctly, please let me know.
If you have installed the chemfp_converters package then chemfp will use it to read and write fingerprint files in flush format. It can be used as output from the *2fps programs, as input and output to fpcat, and as query input to simsearch.
Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and pair fingerprints. This is primarily useful if you want to generate the circular environment around specific atoms of a single molecule, and you know the atom indices. If you pass in multiple molecules then the same indices will be used for all of them. Out-of-range values are ignored.
The command-line option is "--from-atoms", which takes a comma-separated list of non-negative integer atom indices. For example:
--from-atoms 0
--from-atoms 29,30
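The rules for the option value (comma-separated, non-negative integers, with out-of-range indices ignored) can be sketched as a small parser. This is not chemfp's own parsing code; the function names are hypothetical and only illustrate the behavior described here.

```python
import re


def parse_from_atoms(text):
    """Parse a value like "29,30" into a list of non-negative atom indices."""
    indices = []
    for term in text.split(","):
        term = term.strip()
        # Only plain decimal digits are accepted, which rejects "", "-1", "1.5".
        if not re.fullmatch(r"[0-9]+", term):
            raise ValueError(
                "atom index must be a non-negative integer: %r" % (term,))
        indices.append(int(term))
    return indices


def keep_in_range(indices, num_atoms):
    """Drop indices which do not exist in a molecule with num_atoms atoms."""
    return [i for i in indices if i < num_atoms]


print(parse_from_atoms("0"))
print(parse_from_atoms("29,30"))
# The same indices are applied to every molecule, so a 10-atom molecule
# simply ignores indices 29 and 30.
print(keep_in_range([0, 29, 30], 10))
```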
The corresponding fingerprint type strings have also been updated. If fromAtoms is specified then the string "fromAtoms=i,j,k,..." is added to the type string. If it is not specified then the fromAtoms term is not present, in order to maintain compatibility with older type strings. (The philosophy is that two fingerprint types are equivalent if and only if their type strings are equivalent.)
The --from-atoms option is only useful when there's a single query and when you have some other mechanism to determine which subset of the atoms to use. For example, you might parse a SMILES, use a SMARTS pattern to find the subset, get the indices of the SMARTS match, and pass the SMILES and indices to rdk2fps to generate the fingerprint for that substructure.
Be aware that the union of the fingerprint for --from-atoms X and the fingerprint for --from-atoms Y might not be equal to the fingerprint for --from-atoms X,Y. However, if a bit is present in the union of the X and Y fingerprints then it will be present in the X,Y fingerprint.
Why? The fingerprint implementation first generates a sparse count fingerprint, then converts that to a bitstring fingerprint. The conversion is affected by the feature count. If a feature is present in both X and Y then the X,Y fingerprint may have additional bits set compared to the individual fingerprints.
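A toy model makes the effect easier to see. It assumes that each feature sets one bit per count threshold reached, and that the counts from X and Y simply add when both atom sets are used. The thresholds, feature names, and counts are made up for illustration; this is not the RDKit implementation.

```python
from collections import Counter

# Toy count thresholds: a feature seen n times sets one bit for each
# threshold t with n >= t. These values are an assumption for illustration.
COUNT_THRESHOLDS = (1, 2, 4, 8)


def to_bits(counts):
    """Convert a sparse count fingerprint into a set of (feature, threshold) bits."""
    return {(feature, t)
            for feature, n in counts.items()
            for t in COUNT_THRESHOLDS if n >= t}


# Hypothetical feature counts generated from atom set X and from atom set Y.
counts_x = Counter({"A": 1, "B": 1})
counts_y = Counter({"A": 1, "C": 1})
# With both atom sets, the shared feature "A" is counted twice.
counts_xy = counts_x + counts_y

union_bits = to_bits(counts_x) | to_bits(counts_y)
xy_bits = to_bits(counts_xy)

print(union_bits <= xy_bits)          # True: every union bit is in X,Y
print(sorted(xy_bits - union_bits))   # [('A', 2)]: extra bit from the higher count
```

In this model the union of the two fingerprints is always a subset of the combined fingerprint, because adding counts can only reach more thresholds, never fewer.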
Fixed a bug in FPB identifier index lookup. When the id's hash didn't exist, it got stuck in an infinite loop. There is a special token to identify the end of the hash chain. Unfortunately, that token wasn't marked as a b"byte string" during the Python 2to3 conversion, so that token was never found, causing the code to loop over the chain forever. It is now a byte string, and a check was added to prevent infinite loops.
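The failure mode can be reproduced with a toy hash table. The table layout, the end-of-chain token, and the function name below are hypothetical and do not describe the actual FPB format; the sketch only shows why a str token is never found among bytes entries under Python 3, and how a step limit prevents an endless loop.

```python
EMPTY = b""   # hypothetical end-of-chain token; it must be bytes, not str


def lookup(table, start, wanted, end_token=EMPTY):
    """Probe `table` (a list of bytes) linearly, starting at slot `start`.

    Returns the slot index of `wanted`, or None if the chain ends first.
    """
    i = start
    # Guard: never take more steps than there are slots in the table.
    for _ in range(len(table)):
        entry = table[i]
        if entry == end_token:
            return None
        if entry == wanted:
            return i
        i = (i + 1) % len(table)   # wrap around, as a circular probe would
    raise RuntimeError("no end-of-chain token found")


table = [b"id1", b"id2", b"", b"id3"]
print(lookup(table, 0, b"id2"))       # 1
print(lookup(table, 0, b"missing"))   # None: reached the end token

# In Python 3 a str never compares equal to a bytes object, so a str
# token is never found. Without the guard this search would never stop.
print("" == b"")                      # False
try:
    lookup(table, 0, b"missing", end_token="")
except RuntimeError as err:
    print("guard triggered:", err)
```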
Fixed a bug where a k=0 similarity search using an FPS file as the targets caused a segfault. The code assumed that k would be at least 1. If you do a k=0 search, it will currently read the entire file, checking for format errors, and return no hits.
Chemfp no longer generates Python warnings. That is, the regression tests all pass under "python -W error unit2 discover". The biggest problem was the ResourceWarning from all of the files which were never explicitly closed. They used to depend on the garbage collector to close the file but are now closed either through a context manager or with close(). In addition, several strings contained invalid escape characters and some regression tests used deprecated APIs.
The context manager and close() method for the FPBFingerprintArena now close the underlying file object/mmap rather than depending on the garbage collector.
Readers and writers that wrap an iterator which may hold a file object now close() the wrapped iterator when processing is over, provided the file object was created by chemfp.
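The general pattern can be sketched as follows. The class and function names are hypothetical and this is not chemfp's code; it shows a wrapper whose close() method and context manager close the wrapped generator, which in turn closes the file that the generator opened.

```python
import os
import tempfile


def read_records(path):
    """Generator which owns the file it opens and closes it when finished."""
    with open(path) as infile:
        for line in infile:
            yield line.rstrip("\n")


class Reader:
    """Hypothetical wrapper around an iterator which may hold an open file."""

    def __init__(self, iterator):
        self._iterator = iterator

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iterator)

    def close(self):
        # Closing a generator runs its pending "with" cleanup, so the file
        # is closed now instead of whenever the garbage collector runs.
        close = getattr(self._iterator, "close", None)
        if close is not None:
            close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
path = tmp.name

with Reader(read_records(path)) as reader:
    first = next(reader)   # stop early; the file is still open here

# After the "with" block the generator is closed, so nothing more is read.
after_close = next(reader, "closed")
os.remove(path)
print(first, after_close)
```

Stopping partway through the records is the case where relying on the garbage collector would otherwise leave the file open and trigger a ResourceWarning.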
Added a check that the arenas used in the threshold and count symmetric searches have a popcount index. Unordered arenas, which lack one, previously caused the code to segfault.