chemfp 4.1 is available

After over a year of development, I'm pleased to announce the release of chemfp 4.1. For full details, read "What’s new in chemfp 4.1" in the documentation.

Here are some of the highlights.

  • CXSMILES is now the default for SMILES input, across all of the supported toolkits, with the option to disable CXSMILES extension processing if needed.

  • Similarity search results can be exported to and imported from NumPy's npz format, in a layout compatible with SciPy's compressed sparse row (csr) matrix. Chemfp can also import a csr matrix using the Python API.

  • Butina clustering, both on the command-line and through the Python API, with several variations. While generating the similarity matrix may take an hour, the core clustering algorithm takes only a few seconds for a 2 million fingerprint data set. If the matrix is saved as an npz file then it's quick to re-cluster in order to tune the clustering parameters.

  • Improvements to sphere exclusion, including parallelism, new options for ranking directed exclusion, and an output format compatible with the new Butina clustering.

  • The new csv2fps command-line tool to extract ids and molecules from columns in a CSV file and use them to generate fingerprints.

  • The new translate command for simple structure conversion, using any of the four supported cheminformatics toolkits.

  • Under the covers, the biggest change was to switch command-line processing from Python's argparse module to click. The main external differences should be the error and help messages, and order of option parsing. On the inside it resulted in a lot of clean-up of years-old code which made it easier to add new options and new command-line tools.
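
To give a feel for why the clustering step is fast once the similarity matrix exists, here is a minimal sketch of one common Butina variant in plain Python. It is not chemfp's implementation or API. The function name butina_cluster and the toy data are made up for illustration, and the neighbor lists are assumed to be already thresholded, symmetric, and stored as the two index arrays (indptr and indices) of a compressed sparse row layout.

```python
def butina_cluster(indptr, indices):
    """Cluster items from CSR-style neighbor lists (hypothetical helper).

    The neighbors of item i are indices[indptr[i]:indptr[i+1]].
    Returns a list of clusters; each cluster starts with its centroid.
    """
    n = len(indptr) - 1
    # Number of neighbors within the similarity threshold for each item.
    counts = [indptr[i + 1] - indptr[i] for i in range(n)]
    # Visit candidates from most to fewest neighbors; ties go to the lower index.
    order = sorted(range(n), key=lambda i: (-counts[i], i))

    assigned = [False] * n
    clusters = []
    for center in order:
        if assigned[center]:
            continue  # already a member of an earlier cluster
        assigned[center] = True
        members = [center]
        # Claim every neighbor that no earlier centroid has taken.
        for j in indices[indptr[center]:indptr[center + 1]]:
            if not assigned[j]:
                assigned[j] = True
                members.append(j)
        clusters.append(members)
    return clusters


# Toy neighbor lists for six items (made-up data):
# 0: 1, 2, 3   1: 0, 2   2: 0, 1   3: 0, 4   4: 3   5: (none)
indptr = [0, 3, 5, 7, 9, 10, 10]
indices = [1, 2, 3, 0, 2, 0, 1, 0, 4, 3]

clusters = butina_cluster(indptr, indices)
print(clusters)
```

Each item and each stored neighbor is visited at most once after the initial sort, so the expensive part is building the matrix rather than clustering it. This sketch ranks candidates once by their initial neighbor count and ignores the similarity scores themselves; other variations re-rank or use the scores to break ties.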