After over a year of development, I'm pleased to announce the release of chemfp 4.1. For full details, read "What’s new in chemfp 4.1" in the documentation.
Here are some of the highlights.
- CXSMILES is now the default for SMILES input, across all of the supported toolkits, with the option to disable CXSMILES extension processing if needed.
- Similarity search results can be exported to and imported from NumPy's npz format, using a layout compatible with SciPy's sparse compressed row (csr) matrix. Chemfp can also import a csr matrix using the Python API.
- Butina clustering, both on the command line and through the [Python API](https://web.archive.org/web/20230605140818/https://chemfp.readthedocs.io/en/latest/chemfp_butina_command.html), with several variations. While generating the similarity matrix may take an hour, the core clustering algorithm takes only a few seconds for a 2 million fingerprint data set. If the matrix is saved as an npz file then it's quick to re-cluster in order to tune the clustering parameters.
- Improvements to sphere exclusion, including parallelism, new options for ranking directed exclusion, and an output format compatible with the new Butina clustering.
- The new csv2fps command-line tool to extract ids and molecules from columns in a CSV file and use them to generate fingerprints.
- The new translate command for simple structure conversion, using any of the four supported cheminformatics toolkits.
- Under the covers, the biggest change was to switch command-line processing from Python's argparse module to click. The main external differences should be the error and help messages, and the order of option parsing. On the inside it resulted in a lot of clean-up of years-old code, which made it easier to add new options and new command-line tools.
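
To give a feel for why re-clustering is quick once the similarity matrix exists, here is a minimal pure-Python sketch of one common formulation of Butina clustering. It is not chemfp's implementation or API: the function names `neighbors_at` and `butina_cluster`, the `(i, j, score)` pair layout, and the toy scores are all made up for illustration. The idea is that the expensive scored pairs are computed once, and each new threshold only needs a cheap pass over them.

```python
def neighbors_at(threshold, n, scored_pairs):
    """Build neighbor sets for n items from (i, j, score) pairs.

    Only pairs with score >= threshold count as neighbors.
    (Hypothetical helper; the pair layout is an assumption.)
    """
    neighbors = [set() for _ in range(n)]
    for i, j, score in scored_pairs:
        if score >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def butina_cluster(neighbors):
    """Cluster items from precomputed neighbor sets.

    Items are ranked once by decreasing neighbor count (ties broken
    by index). Each still-unassigned item in that order becomes a
    centroid and claims its still-unassigned neighbors.
    Returns a list of (centroid, members) tuples.
    """
    order = sorted(range(len(neighbors)),
                   key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        members = [center] + sorted(
            j for j in neighbors[center] if j not in assigned)
        assigned.update(members)
        clusters.append((center, members))
    return clusters


# Toy similarity data for 6 items; item 5 has no neighbors at all.
scored_pairs = [
    (0, 1, 0.90),
    (0, 2, 0.80),
    (1, 2, 0.75),
    (3, 4, 0.85),
    (2, 3, 0.72),
]

# Re-cluster at two thresholds without recomputing any similarities.
for threshold in (0.7, 0.8):
    clusters = butina_cluster(neighbors_at(threshold, 6, scored_pairs))
    print(threshold, clusters)
```

Raising the threshold from 0.7 to 0.8 drops the weaker links, so the one large cluster splits into smaller ones while the singleton stays a singleton. The same pattern applies when the scored pairs come from a saved matrix rather than a hard-coded list.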