What’s new in chemfp 4.1¶
Released 17 May 2023.
The main themes for chemfp 4.1 are “CXSMILES”, “Butina clustering”, “CSV support”, and “embrace click and required third-party dependencies.”
Required dependencies on click and tqdm¶
Chemfp 4.1 switched to click for command-line processing. Previously it used argparse from the Python standard library, but I found it easier to develop and maintain click-based processing.
The change resulted in a lot of minor changes. The error messages have
changed. Command-line arguments are processed in a different order,
which may change when errors and how are detected. Argparse supported
-h
as an alias for --help
but click does not. Some of the exit
codes have changed.
There is a subtle change in how the --delimiter
and
--has-header
options work with the -R
option for
reader_args. Previously these were processed in left-to-right
order. In chemfp 4.1 the delimiter
and has-header
options are
processed after -R
, and override those settings.
The change took a lot of work, but also led to some improvements. For
example, rdkit2fps now accepts --morgan1
through -morgan4
as
ways to set the Morgan radius (the old --morgan
is now an alias
for --morgan2
), and the click API makes it easier to group
--help
options.
Chemfp specifies click as a required dependency . If click is not installed then tools like pip should install it automatically.
In fact, this is chemfp’s first required dependency. (chemfp optionally supports NumPy, SciPy and other packages if they are installed, but those are not required for core operations).
In order to avoid install-time dependencies, chemfp 4.0 included a copy of the progress bar package tqdm with the distribution, which was used if tqdm was not already installed. With chemfp 4.1, tqdm is the second required dependency, and the vendored copy of tqdm has been removed.
CXSMILES¶
Chemfp 4.1 by default now reads SMILES file as containing CXSMILES. A line of a CXSMILES file looks like:
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example cxsmiles structure
with the SMILES followed by an optional CXSMILES extension followed by
the title or additional columns. There must be a single space between
the SMILES and the CXSMILES extension, and the extension must start
and end with the pipe character |
.
Use --no-cxsmiles
to parse the record as a SMILES. This keeps any
|
term as part of the record id/title. (The default is
--cxsmiles
, which parses the input as CXSMILES.
These options are equivalent to the reader arg -R cxsmiles=0
(to
disable CXSMILES parsing) and -R cxsmiles=1
(to enable), though
they are processed after the -R
option.
An alternative method to separate the structure from the identifier is
to use tab-separated fields and specify ---delimiter tab
.
Toolkit-specific details¶
CXSMILES support proved tricky to handle. For one, there is no easy way to identify the end of a CXSMILES extension without a full CXSMILES parser. Simply looking for a pipe character followed by a space isn’t enough, because a CXSMILES can recursively embed other CXSMILES, which can in turn contain spaces.
In addition, the underlying cheminformatics toolkits have different levels of support for the extensions, and don’t all handle identifiers the same way that chemfp does.
Open Babel does not support CXSMILES. Instead, chemfp uses hueristics to attempt to identify and remove the extension, and pass only the SMILES to Open Babel for processing.
CDK expects CXSMILES by default, with no option to disable the option. If
--no-cxsmiles
is specified then chemfp will extract the SMILES from the rest of the line and have CDK only parse the SMILES.RDKit has the option to enable or disable CXSMILES processing, but doesn’t handle identifiers the same way chemfp does, so chemfp tries to process the identifier and pass the rest to RDKit.
OEChem uses OEFormat_SMI for SMILES files and OEFormat_CXSMILES for CXSMILES files. Chemfp will set the appropriate OEChem format flag depending on the cxsmiles option.
Chemfp’s “text” toolkit is not a cheminformatics-based toolkit. It can only parse records from a SMILES and SDF files and, for SMILES, figure out which is the SMILES and which is the identifier. If the “cxsmiles” reader argument is true then it will use the same hueristics as the Open Babel parser to identify the end of the CXSMILES extension.
CXSMILES output¶
The toolkit wrappers can write SMILES files. By default they do not
include the CXSMILES extensions. Pass the writer argument
cxsmiles=True
to save with the CXSMILES extensions.
with T.open_molecule_writer(
"ouput.smi", writer_args={"cxsmiles": True}) as writer:
writer.write_molecules(mols)
New structure format specifiers¶
Chemfp 4.1 added the “cxsmi”, “cxsmistring”, and “sdf3k” format specifiers.
The “cxsmi” format can be used to read and write files in SMILES format. It is the same as the “smi” format except that “cxsmi” writes CXSMILES by default while “smi” does not. The “cxsmi” format is the default for filenames which use the “.cxsmi” extension.
The “cxsmistring” format parses a SMILES string with optional CXSMILES extension, but does not process any identifier which may be present.
The “sdf3k” format is the same as the “sdf” format except that “sdf3k” always writes records in V3000 format, while “sdf” only uses V3000 if required. The “sdf3k” format is the default for fileanmes which use the “.sdf3k” extension.
New SearchResult(s) attributes¶
The SearchResults
class now has the following new
attributes:
num_bits: the number of bits in the fingerprint
alpha: the Tversky alpha value, which is 1.0 for Tanimoto
beta: the Tversky beta value, which is 1.0 for Tanimoto
matrix_type: an integer enum value describing the matrix type (see below)
matrix_type_name: a string name describing the matrix type
target_start: the start position in the target arena (non-zero only if the targets was a subarena)
These were added so the SearchResults could be saved and loaded from an npz file without losing important chemfp metadata.
Previously the target_ids
attribute returned all of the ids for
the target areana, even if the targets was a subarena slice of the
main arena. This is useful because the search result indices are given
in arena coordinates, not in subarean coordinates.
However, it proved confusing when, for example, wanting a list of just the relevant target ids.
In chemfp 4.1 the target_ids
now returns only the relevant target
identifiers, and the new target_arena_ids
returns the full
identifiers from the target arena.
The matrix_type value is from a new MatrixType
enum in the
chemfp.search
module:
>>> from chemfp.search import MatrixType
>>> list(MatrixType)
[<MatrixType.NxM: 0>, <MatrixType.NxN: 1>,
<MatrixType.NO_DIAGONAL: 2>, <MatrixType.UPPER_TRIANGULAR: 3>,
<MatrixType.STRICTLY_UPPER_TRIANGULAR: 4>]
It describes the expected content of the underlying matrix type. These are used in the new Butina implementation as a check that an imported npz file or csr matrix is the expected type. The values are:
“NxM” - can contain a score for any combination of query and target
“NxN” - the same, except the queries and targets are identical
“NO-DIAGONAL” - same as “NxN” except the diagonal is not included
“UPPER-TRIANGULAR” - same as “NxN” except limited to the upper triangle
“STRICTLY-UPPER-TRIANGULAR” - “upper-triangular” but without the diagonal
The corresponding matrix_type_name
values are “NxM”, “NxN”,
“no-diagonal”, “upper-triangular”, and
“strictly-upper-triangular”.
>>> results.matrix_type
<MatrixType.NxM: 0>
>>> results.matrix_type_name
'NxM'
Chemfp currently does not generate an “upper-triangular” matrix type.
The SearchResult
also has the new target_start
and
target_ids
properties, matching the SearchResults.
Save simsearch results to “npz” files¶
Similarity search results can now be saved as a NumPy npz files, as a SciPy sparse csr matrix. This makes it easier to import search results into other analysis tools.
For example, I’ll do an NxN threshold search of a set of benzodiazepine fingerprints, with a cutoff of 0.4, and save the results to benzodiazepine.npz:
% simsearch --NxN -t 0.4 ../benzodiazepine.fps.gz -o benzodiazepine.npz
queries: 100%|██████████████████████████████████████| 12386/12386 [00:00<00:00, 32889.98 fps/s]
From the Python shell I’ll import the output file into SciPy and show the target indices and scores for row index 686, which has 5 elements:
>>> from scipy import sparse
>>> arr = sparse.load_npz("benzodiazepine.npz")
>>> arr
<12386x12386 sparse matrix of type '<class 'numpy.float64'>'
with 9541836 stored elements in Compressed Sparse Row format>
>>> arr[686]
<1x12386 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
>>> arr[686].indices
array([ 744, 950, 6083, 77, 196], dtype=int32)
>>> arr[686].data
array([0.49090909, 0.48214286, 0.53225806, 0.49019608, 0.625 ])
The indices aren’t that useful on their own. You’ll likely want to map them back to the original identifiers, which you can do because chemfp stores them in the npz file as well.
This is possible because NumPy npz file is a zipfile containing NumPy arrays in “npy” format. By convention the SciPy package stores the sparse matrix information in the zipfile entries “format.npy”, “shape.npy”, “indptr.npy”, “indices.npy”, and “data.npy”, which you can see on the command-line using the zip program:
% unzip -l benzodiazepine.npz
Archive: benzodiazepine.npz
Length Date Time Name
--------- ---------- ----- ----
600 01-01-1980 00:00 chemfp.npy
38167472 01-01-1980 00:00 indices.npy
49676 01-01-1980 00:00 indptr.npy
140 01-01-1980 00:00 format.npy
144 01-01-1980 00:00 shape.npy
76334816 01-01-1980 00:00 data.npy
396480 01-01-1980 00:00 ids.npy
--------- -------
114949328 7 files
Chemfp adds two or three file entries to the list. The “chemfp.npy” contains chemfp-specific metadata, which I’ll get to later. If the file “ids.npy” exists then it contains an array of identifiers used for both the queries and the targets. The file entries “query_ids.npy” and “target_ids.npy” store the query and target identifiers, respectively, but the are not used here because an NxN search has the same ids for the queries and targets.
The NumPy load function returns an NpzFile object with ways to get a list of the available files and load the contents. I’ll show how that can be used to get the target ids for ths hits of row 686:
>>> import numpy as np
>>> z = np.load("benzodiazepine.npz")
>>> z.files
['chemfp', 'indices', 'indptr', 'format', 'shape', 'data', 'ids']
>>> ids = z["ids"]
>>> ids
array(['76175', '2802905', '18631850', ..., '16198616', '16153475',
'11979873'], dtype='<U8')
>>> len(ids)
12386
>>> [ids[i] for i in arr[686].indices]
['21327973', '21327970', '13106808', '12991903', '21817691']
>>> z.close()
(The explicit NpzFile close ensures the underlying zipfile gets closed at that point, rather than depend on Python’s garbage collector. The NpzFile is also a context manager, which means you could use it in a “with” statement instead.)
chemfp.npy file entry in an npz file¶
The “chemfp.npy” entry in the npz file contains chemfp-specific metadata. Its a JSON dictionary, stored as a string, in a NumPy array, encoded in “npy” format, in a zipfile. While that sounds complicated, its meant to be relatively easy to load using NumPy’s load command, which the previous section used to load the array identifiers:
>>> import json
>>> import numpy as np
>>> with np.load("benzodiazepine.npz") as z:
... d = json.loads(z["chemfp"].item())
...
>>> import pprint
>>> pprint.pprint(d)
{'alpha': 1.0,
'beta': 1.0,
'format': 'multiple',
'matrix_type': 'no-diagonal',
'method': 'Tversky',
'num_bits': 2048}
The most important of these is format, which can be “single” or
“multiple”. It is “single” if the npz file stores a single chemfp
SearchResult
, and “multiple” if it stores a chemfp
SearchResults
.
The method indicates the scores were calculated with Tversky similarity, using the alpha and beta values of 1.0. This setting is much better known as “the Tanimoto”.
The matrix_type describes the expected matrix type, described previously.
The num_bits is the number of bits in the fingerprints used to
generate the Tanimoto/Tversky scores. Chemfp uses it in two
places. First, it determines the default number of digits to display
when printing a Tanimoto score as a decimal. All 166-bit fingerprints
can be distinguished with only 5 bits of precision, while a 2048-bit
fingerprint needs 7. These values are available from the
get_tanimoto_precision()
function:
>>> import chemfp.bitops
>>> chemfp.bitops.get_tanimoto_precision(166)
5
>>> chemfp.bitops.get_tanimoto_precision(2048)
7
Second, Tanimoto scores don’t use the full range of the 64-bit double that chemfp uses internally to store the score. If the number of fingerprint bits is small enough then it’s possible to sort the numbers by only comparing parts of the bit-representation of the score, which improves the overall sort performance. Chemfp also doesn’t use special values like the 64-bit double representations for -0.0, infinite, or “NaN”/”not a number” values, which further improves perfomance.
If you plan to generate Tanimoto similarity scores from another tool, and import them into chemfp then you should set this value appropriately. If the scores come from another sort of similarity calculation, or don’t know how the number of bits affects things, then it is safe to use the value 2**31-1, which is the default chemfp uses if num_bits is not specified.” NaN and infinity are not supported.
Working with npz files in chemfp¶
The new load_npz()
and save_npz()
functions in the
chemfp.search
module will load and save, respectively
similarity search results to an npz file, along with the identifiers
and the “chemfp.npy” metadata.
The search results can be a SearchResults
returned when
using multiple queries against multiple targets, or a single
SearchResult
returned when using a single fingerprint to
search the targets, and returned as a row of a SearchResults. They can
also be the high-level objects returned from chemfp.simsearch()
,
though only the low-level SearchResult or SearchResults are loaded,
without the timing information and other metadata in the high-level
objects:
>>> import chemfp
>>> results = chemfp.simsearch(queries="queries.fps", targets="targets.fps", k=5)
targets.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 33.7Mbytes/s]
queries.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 75.7Mbytes/s]
queries: 100%|█████████████████████████████████████| 100/100 [00:00<00:00, 63138.70 fps/s]
>>> results
MultiQuerySimsearch('5-nearest Tanimoto search. #queries=100,
#targets=100 (search: 1.87 ms total: 39.59 ms)', result=SearchResults(#queries=100, #targets=100))
>>>
>>> import chemfp.search
>>> chemfp.search.save_npz("k5.npz", results)
>>> k5_results = chemfp.search.load_npz("k5.npz")
>>> k5_results
SearchResults(#queries=100, #targets=100)
You can also save the results using the new save()
method for the
similarity search result objects:
>>> results.save("k5.npz")
Importing SciPy csr matrices to chemfp¶
The new from_csr()
function in the chemfp.search
module
converts a SciPy csr (“compressed sparse row”) array into a chemfp
SearchResults
instance. In the following I’ll start with a
regular NumPy array, turn it into a sparse matrix, then convert that
to a SearchResults:
>>> import numpy
>>> import scipy.sparse
>>> import chemfp.search
>>> arr = numpy.array([[0.0, 0.0, 0.0, 0.9, 0.8], [0.7, 0.0, 0.0, 0.6, 0.0]])
>>> arr
array([[0. , 0. , 0. , 0.9, 0.8],
[0.7, 0. , 0. , 0.6, 0. ]])
>>> csr = scipy.sparse.csr_array(arr)
>>> csr
<2x5 sparse array of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> sr = chemfp.search.from_csr(csr)
>>> sr
SearchResults(#queries=2, #targets=5)
>>> sr[0].get_indices_and_scores()
[(3, 0.9), (4, 0.8)]
>>> sr[1].get_indices_and_scores()
[(0, 0.7), (3, 0.6)]
Additional from_csr parameters let you specify the query and target ids, matrix type, the number of fingerprint bits, and the Tversky alpha and beta parameters. See the warnings earlier about how num_bits affects chemfp.
Butina clustering¶
Chemfp 4.1 includes Butina clustering, both from the command-line and as a high-level API function. More complete examples of both are given below. Here’s an idea to get you started, first on the command-line:
% chemfp butina chembl_30.fpb --threshold 0.7 -o chembl_30.centroids
and then using the Python API:
import chemfp
clusters = chemfp.butina(fingerprints="chembl_30.fpb", NxN_threshold=0.7)
print(f"#clusters: {len(clusters)}")
clusters.save("chembl_30.centroids")
The basic Butina clustering algorithm is from: Darko Butina, “Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets”, Journal of Chemical Information and Computer Sciences 1999 39 (4), 747-750 DOI: 10.1021/ci9803381.
(This is sometimes referred to as “Taylor” or “Taylor-Butina” clustering due to Robin Taylor, “Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals”, Journal of Chemical Information and Computer Sciences 1995 35 (1) 59–67. DOI: 10.1021/ci00023a009. The methods are similar, but Taylor did not apply it to clustering.)
In short, compute the NxN similarity matrix. For each fingerprint, count the number of neighbors which are within some minimum similarity threshold. Sort those by count, from most neighbors to least. Pick the first fingerprint as the first cluster center, and include all of its neighbors as cluster members. Iteratively repeat until done for all remaining fingerprints which are not already assigned as members of another cluster.
Chemfp supports several variants of the Butina algorithm: 1) breaking ties in the Butina ordering, 2) dealing with false singletons, and 3) pruning the final number of clusters.
tiebreaker¶
The initial ordering by count, from largest to smallest, nearly always results in ties, where many fingerprints have the same number of neighbors. How are ties broken?
By default chemfp will “randomize” the order, using an initial seed from Python’s random number generator, which means each time you run the algorithm you are likely to get different results. You can specify the seed if you want consistent output, or you can choose to break ties by order in chemfp’s internal fingerprint arena, either “first” or “last”. The fingerprints are ordered from the smallest number of bits to the largest, so “first” will bias the initial pick towards small molecules, while “last” will bias it towards larger ones with more internal diversity.
Here’s an example how to specify the tiebreaker on the command-line:
% chemfp butina dataset.fps --tiebreaker first
and an example using the API, which also sets the seed:
import chemfp
chemfp.butina("dataset.fps", tiebreaker="randomize", seed=12345)
false-singletons¶
If there is only one fingerprint in the cluster, then that fingerprint (which is the cluster center) is termed a “singleton.” A “false singleton” is a singleton with one or more neighbors within the minimum similarity threshold but where those neighbors were all assigned to one or more other clusters.
Chemfp supports three ways to deal with a false singleton. The default is “follow-neighbor”: move the false singleton to the same cluster as its first nearest-neighor. (If there are multiple equally-nearest neighbors then currently chemfp makes an arbitrary, implementation-defined choice. A future version may break ties using a random choice.)
The second option is to “keep” the false singleton as a singleton.
The third is to move the false singleton to the cluster with the “nearest-center”. (Ties are currently broken by an arbitrary, implementation-defined choice. A future version may break ties using a random choice.) This method was described by Niklas Blomberg, David A. Cosgrove, and Peter W. Kenny, JCAMD 23, pp 513–525 (2009) doi:10.1007/s10822-009-9264-5 though chemfp’s implementation does not yet support a user-specified minimum required center threshold.
The “nearest-center” variant requires the original fingerprints in order to find the nearest cluster center. The other two options only require the NxN similarity matrix.
Here’s an example of keeping false singletons using the command-line
tool, and the --fs
short form of writing --false-singletons
:
% chemfp butina chembl_30.fpb --threshold 0.7 --fs keep
Here’s what it looks like in the Python API:
import chemfp
clusters = chemfp.butina("chembl_30.fpb", NxN_threshold=0.7, false_singletons="keep")
clusters.save(None) # write to stdout
Pruning¶
Chemfp also implements a post-processing step to reduce the number of clusters. The clusters are ordered by size, from smallest to largest. The smallest cluster is selected, with ties broken by selecting the first created cluster. Each fingerprint in the cluster is processed (currently from last to first) to find its nearest-neighbor in another cluster, with the fingerprint added to that cluster before processing the next fingerprint in the smallest cluster. (If fingerprint X moved to cluster C and fingerprint Y’s nearest neigbhor is X then Y wil also be moved to cluster C.)
The post-processing step requires the original fingerprints and cannot be done with the NxN similarity matrix.
This algorithm was devised by a chemfp customer, who also funded its integration into chemfp.
Here’s a command-line example of pruning a dataset down to 1,000 clusters:
% chemfp butina dataset.fpb -t 0.6 -n 1000 -o dataset.centroids
and the equivalent version using the API:
import chemfp
with chemfp.butina("dataset.fpb", NxN_threshold=0.6, num_clusters=1000) as clusters:
clusters.save("daset.centroids")
Other options¶
If fingerprints are available then the default will rescore moved fingerprints (ie, false singletons moved to a “nearest-center”, and fingerprints moved by the post-processing pruning step) to their cluster center. Rescoring can be disabled.
By default the output clusters are ordered and numbered so the largest cluster (after any moves) are first, though there’s an option to show the original Butina ordering, which chemfp uses internally. By default the cluster center is labeled “CENTER” and the others labeled “MEMBER”. If renaming is not used then other labels are “FALSE_SINGLETON” (for singletons which are cluster centers), “MOVED_FALSE_SINGLETON” for false singletons which were reassigned to another cluster), and “MERGED” for fingerprints which were moved to another cluster as part of the post-processing cluster pruning step.
Butina on the command-line¶
The Butina algorithm is available as a subcommand of chemfp. Using a small set of fingerprints as an example:
% cat distinct.fps
1200ea4311018100 id1
894dc00a870b0800 id2
82b180a0943d0254 id3
d04a450013d8a479 id4
8025719c2c907c2e id5
918125cf96a2a360 id6
6b0d4215ca47fc03 id7
2eb000ce2e4337ef id8
849a6de873554c59 id9
b67121e96f48a365 id10
2ab0f5b64cb4b8cf id11
e7fedcefa4e7560e id12
I’ll cluster at a threshold of 0.41:
% chemfp butina distinct.fps -t 0.41
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
i center_id count members
1 id8 5 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256
2 id9 2 id9 1.0000 id4 0.4103
3 id3 1 id3 1.0000
4 id7 1 id7 1.0000
5 id1 1 id1 1.0000
6 id2 1 id2 1.0000
7 id6 1 id6 1.0000
This output follows the usual chemfp output format style where the first line describes the format, followed by header lines starting with ‘#’, then the output data proper, using tab-separated columns.
The default uses a variable number of columns, with one line per cluster. The first column is the column id (an integer), the second is the fingerprint id for the center fingerprint, and the third is the number of elements in the cluster (including the center).
After the count are the id and score of every fingerprint in the cluster, starting with the cluster center.
Pruning the number of clusters¶
Use --num-clusters
or simply -n
option to prune the clusters
down to a smaller value. In the following I’ve specified I want only 3
clusters, so four of the clusters will be pruned, and their
fingerprints moved to other clusters:
% chemfp butina distinct.fps -t 0.41 --num-clusters 3
#Centroid/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=3601395311 false-singletons=follow-neighbor num-clusters=3 rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
i center_id count members
1 id8 8 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256 id3 0.2381 id4 0.1702 id9 0.2917
2 id6 2 id6 1.0000 id1 0.2353
3 id7 2 id7 1.0000 id2 0.2973
Alternate output formats¶
Some other options include disabling the metadata in the output (the lines starting with “#”), only showing the cluster centers, and specifying the initial seed for random number generation:
% chemfp butina distinct.fps -t 0.41 --no-members --no-metadata --seed 19573998
i center_id count
1 id8 5
2 id9 2
3 id3 1
4 id7 1
5 id1 1
6 id2 1
7 id6 1
The “flat” variant of the centroid format writes one output line for each fingerprint, with the centroid id in the first column, the fingerprint id in the second, the cluster element label in the third, and the score in the fourth:
% chemfp butina distinct.fps -t 0.41 --out flat --seed 19573998
#Centroid-flat/1 include-members=1
#type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1
#software=chemfp/4.1
#fingerprints=distinct.fps
centroid id type score
5 id1 CENTER 1.0000
6 id2 CENTER 1.0000
3 id3 CENTER 1.0000
2 id4 MEMBER 0.4103
1 id5 MEMBER 0.3256
7 id6 CENTER 1.0000
4 id7 CENTER 1.0000
1 id8 CENTER 1.0000
2 id9 CENTER 1.0000
1 id10 MEMBER 0.4651
1 id11 MEMBER 0.4130
1 id12 MEMBER 0.4400
The -no-rename
option tells the “flat” output to use the full
internal labels for each fingerprint, which in this case shows that
“id5” was a false singleton moved to cluster 1, and “id1”, “id2”,
“id3”, and “id4” were moved by the pruning step:
% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
--seed 3601395311 --no-metadata --out flat --no-rename
centroid id type score
2 id1 MERGED 0.2353
3 id2 MERGED 0.2973
1 id3 MERGED 0.2381
1 id4 MERGED 0.1702
1 id5 MOVED_FALSE_SINGLETON 0.3256
2 id6 CENTER 1.0000
3 id7 CENTER 1.0000
1 id8 CENTER 1.0000
1 id9 MERGED 0.2917
1 id10 MEMBER 0.4651
1 id11 MEMBER 0.4130
1 id12 MEMBER 0.4400
The “csv” and “tsv” formats are comma- and tab-separated formats which are a hybrid between the “centroid” and “flat” formats. They have one output line for each fingerprint, ordered by cluster number (largest first) and further ordered by score, with the cluster center always first.
The output fields are the cluster id, the fingerprint id, the number
of elements in the cluster, the fingerprint type (“CENTER”, “MEMBER”,
and more if using --no-rename
), and the score.
% chemfp butina distinct.fps -t 0.41 --num-clusters 3 \
--seed 3601395311 --out csv
cluster,id,count,type,score
1,id8,8,CENTER,1.0000
1,id10,8,MEMBER,0.4651
1,id12,8,MEMBER,0.4400
1,id11,8,MEMBER,0.4130
1,id5,8,MEMBER,0.3256
1,id3,8,MEMBER,0.2381
1,id4,8,MEMBER,0.1702
1,id9,8,MEMBER,0.2917
2,id6,2,CENTER,1.0000
2,id1,2,MEMBER,0.2353
3,id7,2,CENTER,1.0000
3,id2,2,MEMBER,0.2973
The output was formatted using Python’s csv module, with the “excel” and “excel-tab” dialects, respectively, which means the results should be easy to import into Excel. The writer may use special quoting rules should the identifier contain a comma, newline, or other special character.
Butina parameter tuning¶
The Butina parameters almost certainly require some tuning to find the appropriate threshold, and to decide if cluster pruning is appropriate.
The Butina algorithm is quite fast. It takes just a few seconds, in the default case, to cluster the NxN similarity matrix from the 2+ million fingerprints in ChEMBL. Most of the time is spent formatting the output correctly!
On the other hand, even with chemfp it can take 30 minutes or even a couple of hours to generate the matrix in the first place - making tuning a practice in patience!
This is where the new “npz” format comes in handy. You can generate the NxN matrix once, using the lower-bound similarity threshold for your search space, and save the result to an npz file. This pre-computed matrix can then be used as input to Butina clustering. All chemfp need to do then is count the number of entries at or above the Butina threshold, which is must faster then computing milions of Tanimoto calculations.
(Note: currently only a “no-diagonal” NxN matrix is supported. A future version may allow a full NxN matrix. Also, the entries should be ordered by decreasing score. If they are not in that order, chemfp will reorder them.)
You can use the simsearch command to generate the npz file but it’s
easier to use the --save-matrix
option of the Butina command. In
the following I’ll write the output to /dev/null and I’ll show timing
details for the individual steps:
% chemfp butina --threshold 0.4 benzodiazepine.fps.gz \
--save-matrix benzodiazepine.npz --times -o /dev/null
load_fingerprints 0.06 NxN 0.50 save_matrix 2.79 cluster 0.00
rescore 0.00 output 0.01 total 3.43
What took the longest was the time to write the matrix to an npz file. The relatively low threshold of 0.4, combined with the high similarity of the benzodiazepines, mean there are 9,541,836 non-zero entries out of 153,400,610 possible non-zero values (the diagonals are not included), giving a density of 6.2%.
I can now cluster with the saved NxN matrix instead of using fingerprints, first, using all of the entries in the matrix:
% chemfp butina benzodiazepine.npz --times -o /dev/null
load_matrix 0.25 cluster 0.00 output 0.01 total 0.34
and then using --butina-threshold
to change the minimum similarity
threshold for the Butina algorithm:
% chemfp butina benzodiazepine.npz --times -o /dev/null \
--butina-threshold 0.5
load_matrix 0.25 cluster 0.00 output 0.02 total 0.35
Number of Butina clusters in ChEMBL¶
Those timing numbers probably don’t look that impressive for the small benzodiazepine data set. I’ll switch to “chembl_30_60.npz”, which contains the NxN similarity matrix for CheMBL 30 at threshold of 0.60 using RDKit’s default Morgan2 fingerprints, and see how the number of clusters changes for for 0.60, 0.61, 0.62, … 0.70. I can get this number by looking at the last line of the output. I’ll also show the breakdown of the times for each clustering run:
% chemfp butina chembl_30_60.npz --butina-threshold 0.6 --times | tail -1
load_matrix 0.77 cluster 0.35 output 3.11 total 4.51
327022 CHEMBL4790513 1 CHEMBL4790513 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.61 --times | tail -1
load_matrix 0.73 cluster 0.37 output 2.96 total 4.35
354362 CHEMBL1980515 1 CHEMBL1980515 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.62 --times | tail -1
load_matrix 0.75 cluster 0.39 output 2.96 total 4.40
377894 CHEMBL1477846 1 CHEMBL1477846 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.63 --times | tail -1
load_matrix 0.74 cluster 0.42 output 3.00 total 4.47
405647 CHEMBL1497171 1 CHEMBL1497171 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.64 --times | tail -1
load_matrix 0.78 cluster 0.45 output 3.12 total 4.67
433390 CHEMBL287360 1 CHEMBL287360 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.65 --times | tail -1
load_matrix 0.75 cluster 0.45 output 3.05 total 4.56
464418 CHEMBL3472504 1 CHEMBL3472504 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.66 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4235635.50/s]
load_matrix 0.75 cluster 0.50 output 3.22 total 4.79
499199 CHEMBL1387287 1 CHEMBL1387287 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.67 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4124939.25/s]
load_matrix 0.74 cluster 0.52 output 3.13 total 4.70
536382 CHEMBL471678 1 CHEMBL471678 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.68 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4162832.16/s]
load_matrix 0.74 cluster 0.51 output 3.17 total 4.73
573835 CHEMBL1402501 1 CHEMBL1402501 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.69 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4030396.68/s]
load_matrix 0.74 cluster 0.53 output 3.24 total 4.81
617398 CHEMBL4552562 1 CHEMBL4552562 1.0000000
% chemfp butina chembl_30_60.npz --butina-threshold 0.70 --times | tail -1
elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 3597378.39/s]
load_matrix 0.76 cluster 0.59 output 3.34 total 4.99
657984 CHEMBL1724866 1 CHEMBL1724866 1.0000000
The progress bars appear once clustering takes more than 0.5
seconds. You can use --progress
or --no-progress
to always
enable or diable them, or set the environment variable
“CHEMFP_PROGRESS” to “1” or “0”.
Each of these clustering jobs took about 5 seconds to run, with most of the time spent writing the output (!). That’s a lot of overhead when I only want the count.
High-level Butina API¶
You can also do Butina clustering using chemfp’s Python API, using the
new chemfp.butina()
function, as in the following:
>>> import chemfp
>>> clusters = chemfp.butina(matrix="chembl_30_60.npz" , butina_threshold=0.70, progress=False)
>>> clusters
ButinaClusters('2136187x2136187 matrix, butina-threshold=0.7
tiebreaker=randomize seed=4095063231
false-singletons=follow-neighbor, took 694.85 ms', result=[657865
clusters], picker=...)
There’s a lot going on there! I loaded the npz file to use as the search matrix, then carried out a Butina clustering with a threshold of 0.7. I also disabled the progress bars, which currently don’t have the 0.5 second delay.
The ButinaClusters includes timing information. This analysis did not save the results, so it was much faster at about 0.7 seconds. For a more detailed description, see
The returned ButinaClusters
repr(), which is
designed for interactive use like this, summarizes the analysis
parameters, and gives a list-like API to the 657,870 clusters found.
>>> clusters.butina_threshold
0.7
>>> clusters.seed
4095063231
>>> clusters[0]
ButinaCluster(cluster_idx=0, #members=1540)
>>> clusters[0].members[:10]
[2123772, 2119185, 2119187, 2119188, 2119190, 2119192, 2119199, 2119202, 2119206, 2119215]
>>> [m.target_ids[i] for i in clusters[0].members[:3]]
['CHEMBL3958999', 'CHEMBL3907845', 'CHEMBL3908293']
The API makes it easy (and significantly faster!) to produce the parameter scan of the previous section. One optimization is to load the matrix once, so it can be re-used for each of the butina calls:
>>> import chemfp.search
>>> m = chemfp.search.load_npz("chembl_30_60.npz")
>>> m
SearchResults(#queries=2136187, #targets=2136187)
>>> m.matrix_type
<MatrixType.NO_DIAGONAL: 2>
>>> min(row.min() for row in m if row)
0.6
>>> max(row.max() for row in m if row)
1.0
>>> chemfp.butina(matrix=m, progress=False)
ButinaClusters('2136187x2136187 matrix, tiebreaker=randomize
seed=2054201120 false-singletons=follow-neighbor, took 513.50 ms',
result=[327076 clusters], picker=...)
With that in place, I’ll count the number of Butina clusters for every threshold from 0.6 to 0.7:
>>> import time
>>> if 1:
... start_time = time.time()
... for threshold in range(60, 71, 1):
... t = threshold/100.0
... print(t, len(chemfp.butina(matrix=m, butina_threshold=t, progress=False)))
... print("Elapsed time:", time.time()-start_time)
...
0.6 326954
0.61 354204
0.62 377855
0.63 405589
0.64 433320
0.65 464492
0.66 499330
0.67 536273
0.68 573910
0.69 617330
0.7 657888
Elapsed time: 6.543123960494995
That’s significantly faster!
Use the ButinaClusters.save()
method to save the results to a
file. The output format can be specified directly, or inferred by the
filename extension:
>>> clusters.save("output.txt")
>>> open("output.txt").readline()
'#Centroid/1 include-members=1\n'
>>> clusters.save("output.flat")
'#Centroid-flat/1 include-members=1\n'
>>> clusters.save("output.txt", format="flat")
>>> open("output.txt").readline()
'#Centroid-flat/1 include-members=1\n'
>>> clusters.save("output.csv")
>>> open("output.csv").readline()
'cluster,id,count,type,score\n'
Butina clustering with fingerprints¶
When fingerprints are passed to the butina function, it will generate the NxN similarity matrix at the given threshold (default: 0.7) before doing the clustering:
>>> with chemfp.butina(fingerprints="benzodiazepine.fps.gz", NxN_threshold=0.9) as clusters:
... print(f"#clusters: {len(clusters)}")
... print(f"#elements in largest cluster: {len(clusters[0])}")
... ids = clusters.matrix.target_ids
... print("ids:", ", ".join(ids[i] for i in clusters[0]))
...
#clusters: 9845
#elements in largest cluster: 13
ids: 9806545, 15978353, 18346686, 18346717, 21941544, 9914920,
21941549, 23376564, 9806982, 9828811, 9850353, 21941556, 21941589
Some of the Butina clustering variants require fingerprints and cannot work with only a similarity matrix. These can both be passed to the function, in this case using filenames:
>>> chemfp.butina(matrix="benzodiazepine.npz", false_singletons="nearest-center")
Traceback (most recent call last):
...
ValueError: Cannot use false_singletons='nearest-center' without fingerprints
>>> chemfp.butina(matrix="benzodiazepine.npz", fingerprints="benzodiazepine.fps.gz",
... false_singletons="nearest-center")
ButinaClusters('12386 fingerprints, NxN-threshold=0.7
tiebreaker=randomize seed=3668915671 false-singletons=nearest-center
rescore=1, took 333.22 ms', result=[314 clusters], picker=...)
Butina clusters and pandas¶
The ButinaClusters.to_pandas()
method for the clusters returned
by the chemfp.butina()
function exports the fingerprint details
to a pandas data frame, with one row per fingerprint:
>>> import chemfp
>>> clusters = chemfp.butina(matrix="benzodiazepine.npz")
>>> clusters.to_pandas()
cluster id type score
1632 1 10572033 CENTER 1.000000
1720 1 18968942 MEMBER 1.000000
5632 1 18665677 MEMBER 0.814815
5637 1 18968923 MEMBER 0.814815
4441 1 10223808 MEMBER 0.811321
... ... ... ... ...
9995 225 20360404 CENTER 1.000000
9645 225 20360405 MEMBER 0.984127
11260 226 20360429 CENTER 1.000000
11069 226 20360430 MEMBER 0.985294
12385 227 11979873 CENTER 1.000000
[12386 rows x 4 columns]
The method takes several parameters, including the column titles to use, and the choice to rename the type to “CENTER” and “MEMBER” or to keep the internal names, like the following which shows only the moved false singletons:
>>> clusters.to_pandas(columns=["i", "ID", "type", "score"],
... rename=False).query("type == 'MOVED_FALSE_SINGLETON'")
i ID type score
7097 1 16001255 MOVED_FALSE_SINGLETON 0.786885
2421 1 10980601 MOVED_FALSE_SINGLETON 0.711538
12088 1 20711717 MOVED_FALSE_SINGLETON 0.636364
12341 1 9854335 MOVED_FALSE_SINGLETON 0.631579
3046 1 20063249 MOVED_FALSE_SINGLETON 0.629630
... ... ... ... ...
12210 118 21797426 MOVED_FALSE_SINGLETON 0.509091
7844 124 25262692 MOVED_FALSE_SINGLETON 0.690141
7245 144 21489312 MOVED_FALSE_SINGLETON 0.588235
3526 149 19865801 MOVED_FALSE_SINGLETON 0.560606
7302 162 22153741 MOVED_FALSE_SINGLETON 0.746032
[86 rows x 4 columns]
If you want to be really hard-core, you can dig down to the low-level fingerprint assignments:
>>> clusters.assignments.to_pandas()
assignment_type cluster_idx score
0 2 11 0.477273
1 2 11 0.413043
2 2 11 0.500000
3 2 11 0.500000
4 2 11 0.404255
... ... ... ...
12381 2 121 0.402878
12382 2 121 0.402878
12383 2 134 0.493243
12384 2 134 0.439490
12385 1 226 1.000000
[12386 rows x 3 columns]
or work with them directly, in-place, as NumPy array with a complex dtype:
>>> clusters.assignments.as_numpy()
array([(2, 11, 0.47727273), (2, 11, 0.41304348), (2, 11, 0.5 ),
..., (2, 134, 0.49324324), (2, 134, 0.43949045),
(1, 226, 1. )],
dtype={'names': ['assignment_type', 'cluster_idx', 'score'],
'formats': ['<i4', '<i4', '<f8'], 'offsets': [0, 4, 8],
'itemsize': 16, 'aligned': True})
This part of the API is still under development and aspects may change in the next release.
Changed default output format name¶
The command-line tools simsearch, maxmin, spherex, and heapsweep, have a new default format name. The default output format hasn’t changed, only its name, specified by the –out option or the filename extension. The previous default format name was “chemfp”. The new defaults are:
“simsearch” for simsearch
“diversity” for maxmin and heapsweep
“spherex” for spherex
These names corresponding to the name used in the first line of their respective outputs.
This change paves the way for supporting new output formats, without having “chemfp” mean different things depending on the context.
For backwards compatibility, the old “chemfp” format is still recognized. It should not be used in new code.
Output metadata options¶
The chemfp formats generally start with “metadata” lines which describe the output format, the parameters used to generate the output, the relevant software versions, and so on. They are meant to provide data provenance, while being easy to filter out - the metadata lines all start with a “#” while the data lines do not.
If this provenance data is not needed, for example, in a data pipeline
where the results are immediately consumed by another program, then
use the new --no-metadata
option to omit metadata from the
output. The default is --include-metadata
.
% echo "C methane" | cdk2fps --MACCS --no-metadata
000000000000000000000000000000000000008000 methane
The chemfp.open_fingerprint_writer()
now supports an
include_metadata parameter. If False, the header metadata is not
included in the FPS output.
>>> import chemfp
>>> with chemfp.open_fingerprint_writer(None, include_metadata=False) as writer:
... writer.write_fingerprint("AB123", b"\xfa\x49")
...
fa49 AB123
There are new command-line for the “date” line in the metadata. The default saves the current time (in UTC) to the output, which means the output is not bytewise reproducible, which is sometimes useful.
The new --date
option lets you specify the date to use, as an ISO
date string. The new --no-date
omits the date line from the
metadata.
NOTE: A future version of chemfp will support local time zones in FPS files. This will use the zoneinfo module added to Python 3.9, with fallback to tzdata for Python 3.8. (Note: Python 3.8 end-of-life for Python core maintainer support is October 2024.)
spherex changes¶
There are several changes related to chemfp’s spherical exclusion implementation, called “spherex”.
Multi-threaded¶
For each iteration, the spherex algorithm first picks a fingerprint (using one of several methods) then finds all other remaining fingerprints which are sufficiently close to the newly picked fingerprint, and assigns them to a new cluster.
In chemfp 4.1 the “finds all other remaining fingerprints” step has been parallelized using OpenMP.
The default number of threads depends on the OpenMP configuration, and likely the number of available cores. I’ve found that you can use several times more threads than cores and still get useful speedup, so if performance is important you should try larger values.
As with simsearch, you can specify the number of threads using the
OMP_NUM_THREADS environment variable. You can also set it on the
spherex command-line using the --num-threads
/-j
option, and with the chemfp.spherex()
API using the
num_threads
parameter.
Parallelism introduces some non-determinism in the order that sphere
members are added to a cluster. The diversity result classes now
implement PicksAndScores.reorder()
and
Neighbors.reorder()
with the same orderings as
SearchResult.reorder()
, which can be used to sort the list by
decreasing score, and generate a canonical ordering.
The new PicksAndScores.move_pick_index_to_first()
function
moves the given member to the first element. Together with the reorder
function these can be used to put the list in canonical order, with
the cluster center as the first element.
Output format¶
By default the spherex command generates output in “spherex” format (previously called “chemfp” format). While the format hasn’t changed, there are additional guarantees about the default output order. The sphere center is always the first element in the members list (if present), and the remaining members are always given in a canonical order, sorted by decreasing score.
Spherex now also supports the “centroid” output format, matching the same format for Butina clustering. The idea is to make it easier to switch between the two algorithms.
In this release “centroid” output by default only includes the picked
sphere centers. Use --include-members
to include all of the sphere
members in the output. NOTE! the default “centroid” output from
chemfp butina is currently different than the one from
chemfp spherex. The butina output includes both the center
and the members, and requires the --no-members
option to only
output the centers.
The long-term goal is to make these the same, where the default
includes the members. For future compatibility, please use
--no-members
if you really want the current default.
The --include-members
and --no-members
options are aliases for
the older --include-hits
and --no-hits
options. The older
options are currently supported, for backwards compatibility, but
hidden from the --help
. They should not be used and will likely
eventually be removed.
spherex does not yet implement the “flat” centroid format of the butina command.
New ranking formats¶
There are two new ways to specify fingerprint ranks for directed
sphere exclusion when using the --ranks
option. Instead of
specifying a rank order or weight they order the picks by file
position - the first pick is the first id in the file, the second pick
is the next available id, and so on.
The “txt” format contains one line per id, with an optional
header. Use --ranks-has-header
if the first line is a header.
Alternatively you can specify a fingerprint file, which will be processed in input order. (As a reminder, by default FPB files are sorted in popcount order. It’s possible to store the FPB file in input order using the fpcat command-line tool, but in that case it’s probably better to use the original FPS directly.)
The fingerprint ranking option exists for cases where the fingerprints
in the candidate file are in rank order. In that case, the file is
specified once for the candidates and again for --ranks
:
chemfp spherex dataset.fps --ranks dataset.fps
csv background¶
Here is some background about what “csv” means in a chemfp context. It sets the stage for the next section.
The term csv comes from the initialism for “Comma-Separated Values”, but is also used to describe several other related formats, like tab-separated values. I will use to mean the files that can be parsed with Python’s csv module.
In chemfp, the “csv” format is the dialect of the comma-separated format following the quoting rules the Excel uses. (Quoting rules are used to, for example, allow a comma in a field without it being intepreted as a separated.) It is the same as Python’s “excel” dialect, which chemfp accepts as an alias.
In chemfp, the “tsv” format is the dialect of the tab-separated format following the quoting rules the Excel uses. It is the same as Python’s “excel-tab” format, which chemfp accepts as an alias.
Python’s csv module lets you specify your own csv dialect, with options for which delimiter character to use, how escaping and quoting work, and whether or not initial spaces should be ignored. These are:
delimiter: The character used to separate fields
escapechar: On reading, the escapechar removes any special meaning from the following character
quotechar: A one-character string used to quote fields containing special characters
quoting: Controls if the reader should recognize quoting
skipinitialspace: If True, ignore leading spaces in a field.
For details see the Python documentation,
The chemfp csv2fps command-line tool lets you specify these via command-line options.
It starts with the specified --dialect
(defaulting to “csv”), then
applies the other specified terms, like --separator
. Certain
characters are difficult to write on the command-line so the
--separator
, --quotechar
and --escapechar
also accept the
aliases “tab”, “backslash”, “space”, “quote”, “doublequote”,
“singlequote”, or “bang”.
The get_dialect()
function in the new chemfp.csv_readers
module is the equivalent in the Python API. The returned
CSVDialect
can be used with both the chemfp API and Python’s
csv module.
“sniffing”¶
The csv module has a way to “sniff” the dialect. It reads the start of the file and uses a number of heuristics to determine the dialect parameters, and to determine if the first line is a header.
Chemfp implements the special “sniff” dialect which will use these method to get the (hopefully) correct settings.
My experience has been a bit iffy, and it strongly depends on the dataset you have. It’s best used with chemfp csv2fps --describe to get an initial idea of the actual format than to use it as the default dialect.
MolPort csv file¶
Some datasets are in csv format with a structure (usually SMILES or InChI) stored in one of the columns, and an identifier stored in another. One such example is the MolPort distribution, which I’ll use for these examples.
It is a tab-delimited format with a header line containing titles, followed by the data. Here’s an example, which unfortunately overflows the available display space, so I’ve trimmed it the first 70 or so characters:
% gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | head -2
SMILES SMILES_CANONICAL MOLPORTID STANDARD_INCHI INCHIKEY...
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 Cl.CCC(C(=O)OC1CC2CCC(C...
That’s still enough to see there are a couple of SMILES fields, a MolPort id, and an InChI. You can also see it’s a tab-separated file, though I’ve converted them to multiple spaces for this output.
There are a lot of tools for working with CSV files, including the popular xsv and its variants. Those are very fast and flexible, with ways to filter, join, and manipulate the data. For example, the following line for the Bash shell will select the SMILES and MOLPORTID columns (from a tab-delimited file), then re-format the comma-delimited output to tab-delimited, to get a tab-delimited SMILES file with a header:
% gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | \
xsv select SMILES,MOLPORTID -d $'\t' | xsv fmt -t $'\t' | head -3
SMILES MOLPORTID
Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 MolPort-009-675-630
CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1 MolPort-009-675-631
This uses the Bash’s ANSI-C quoting to insert the tab character. There are other ways to insert the tab character, depending on your shell and/or terminal emulator.
csv2fps command¶
While it’s not hard to convert a CSV file to SMILES for input to chemfp fingerprint generation, with chemfp 4.1 you can now process CSV files directly using the chemfp csv2fps command.
I’ll use “fulldb_smiles-009-000-000–009-499-999.txt.gz” from the MolPort distribution, but because that’s rather long I’ll refer to it using the “$FILENAME” environment variable.
csv2fps --describe¶
When getting started with a new CSV file, use the special
--describe
option to have chemfp describe what it thinks the file
contains. By default it parses the file as the “csv” dialect, that is
comma-delimited, with a header. It tries to show the titles and
columns for the first data line:
% chemfp csv2fps --describe $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: csv
delimiter: ',' (default)
quotechar: '"' (default)
escapechar: None (default)
doublequote: True (default)
skipinitialspace: False (default)
quoting: minimal
has_header: True
Titles:
#1: 'SMILES\tSMILES_CANONICAL\tMOLPORTID\tSTANDARD_INCHI\tINCHIK ... '
Columns for the first row:
#1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1\tCl.CCC(C(=O)OC1CC ... '
#2: '14-17H'
#3: '3'
#4: '9-12H2'
#5: '1-2H3;1H\tVEUYDINLFCGWRM-UHFFFAOYSA-N\tin stock'
Long strings are trimmed, and replace with the “…”, so they easily fit into an 80 column display.
The output fields are printed using Python’s string encoding. The many “\t” characters is strong evidence that the file is tab-separated.
I can use the “sniff” dialect to have Python’s csv module use heuristics to identify the dialect:
% chemfp csv2fps --describe --dialect sniff $FILENAME
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect:
delimiter: '\t'
quotechar: '"' (default)
escapechar: None (default)
doublequote: False
skipinitialspace: False (default)
quoting: minimal
has_header: True
Titles:
#1: 'SMILES'
#2: 'SMILES_CANONICAL'
#3: 'MOLPORTID'
#4: 'STANDARD_INCHI'
#5: 'INCHIKEY'
#6: 'AVAILABILITY'
Columns for the first row:
#1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1'
#2: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)c1ccccc1'
#3: 'MolPort-009-675-630'
#4: 'InChI=1S/C18H25NO2.ClH/c1-3-17(13-7-5-4-6-8-13)18(20)21-16- ... '
#5: 'VEUYDINLFCGWRM-UHFFFAOYSA-N'
#6: 'in stock'
or I can specify the “tsv” dialect, which gives the same titles and same columns for the first line.
Use --no-header
if the CSV file does not start with a header. For
this file that option produces the following:
% chemfp csv2fps --describe --dialect tsv $FILENAME --no-header
=== Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' ===
Dialect: tsv
delimiter: '\t'
quotechar: '"' (default)
escapechar: None (default)
doublequote: True (default)
skipinitialspace: False (default)
quoting: minimal
has_header: False
Titles: not available
Columns for the first row:
#1: 'SMILES'
#2: 'SMILES_CANONICAL'
#3: 'MOLPORTID'
#4: 'STANDARD_INCHI'
#5: 'INCHIKEY'
#6: 'AVAILABILITY'
csv2fps and column specifiers¶
The default for the csv2fps command is to parse the file in “csv” dialect (or “tsv” if the file ends in “.tsv”), treat the first line as a header, get the identifers from the first column and SMILES from the second column, and generate RDKit Morgan2 fingerprints. I know the file is tab-delimited (see the previous section), so I’ll specify “tsv” format and show the first two lines of fingerprint output:
% chemfp csv2fps -d tsv $FILENAME | head -8 | fold -w 70
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0
useBondTypes=1
#software=RDKit/2021.09.4 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T09:10:17
0200000000000000000001000000800000000800000000000000000000000000000000
0040000000000004000000002020000000000000000000080000000000000000000000
4000000000000000000000040000000080000000000000000000000000008800000040
0000000000000000000040800000000000000000000008000000040202000001000000
4000000200000000008000000000000000000000001040000020000000002000100000
0000000000000000000000000000000000040100000020000000000002000000000000
0000000000000000400000000000000000000000000000000200000040300000000000
0000000000000000000000 Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1
0200000000000000008001000000200000000000000000000000001000002000000000
0800000000000000000000000020000000000000000000000000000000000000000000
0020000000000000000000040000000000000000000001000000000000408000000000
0000020000000000000000000000000000000000000002000000000204000001004000
0000000001000000008000000000000000000000000000000000000000002000100000
0000000000000000000000000000000000000000000020000000000108000000000000
0080000000000000400400000000000000000000000000010200000000200002000000
0000000000400000000000 CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1
This ended up interpreting the “‘SMILES” column as the id, and the “SMILES_CANONICAL” column as the SMILES, which is why csv2fps is able to generate output, though likely most people want the id from the MOLPORTID column.
Use --id-column
(or --id-col
for short) to specify the id
column to use. In the following I’ll specify it by title, and I’ll
disable the header lines (the “metadata”) using the --no-metadata
,
so the output contains just the first two lines of fingerprint data:
% chemfp csv2fps -d tsv $FILENAME --id-col MOLPORTID --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000 MolPort-009-675-630
02000000000000000080010000002000000000000000000000000010000020000000000800000000
00000000000000002000000000000000000000000000000000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000000200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000000000000000000001020000000020
00020000000000000000400000000000 MolPort-009-675-631
You can only specify a column title if the file contains headers.
You can also specify a column by its index, starting from 1. I’ll use
--molecule-column (or --mol-col for short) to read the SMILES from
column #2 (“SMILES_CANONICAL”), and --id-col to read the id from
column #3:
% chemfp csv2fps -d tsv $FILENAME --id-col 3 --mol-column 2 --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000 MolPort-009-675-630
02000000000000000080010000002000000000000000000000000010000020000000000800000000
00000000000000002000000000000000000000000000000000000000000000200000000000000000
00040000000000000000000001000000000000408000000000000002000000000000000000000000
00000000000000020000000002040000010040000000000001000000008000000000000000000000
00000000000000000000200010000000000000000000000000000000000000000000000000200000
00000108000000000000008000000000000040040000000000000000000000000001020000000020
00020000000000000000400000000000 MolPort-009-675-631
If the column specifier is an integer then it is treated as an index, with 1 indicating the first column (0 and negative values are out of range).
If the first character of the column specifier is an “@” then the rest of the specifier is used as the column title. For example, if the second column is titled “10” then use “@10” as the column specifier. If the column title is “@A” then use “@@A” as the column specifier.
If the first character of the column specifier is a “#” then the rest of the specifier must be an integer, which is interpreted as the column index. For example, use “#3” to match the third column. If the column title starts with a “#” then use the “@” prefix (eg, if the column is “#entry” then specify it as “@#entry”).
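For example, the two prefixes can be combined against the same MolPort file, taking the id from the third column and the molecule from the column titled “SMILES_CANONICAL”. This is a sketch of the syntax; the quotes keep the shell from treating “#” as a comment, and the result matches the earlier --id-col 3 --mol-column 2 example:
% chemfp csv2fps -d tsv $FILENAME --id-col '#3' --mol-col '@SMILES_CANONICAL' --no-metadata | head -2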
Specify the csv2fps input structure format¶
By default csv2fps parses the molecule column as SMILES (in
“smi” format). Use --format to specify another format. For
example, the MolPort file’s STANDARD_INCHI column contains an InChI
string, and the INCHIKEY column contains the corresponding
InChIKey. I’ll read the structures from the former column and the id
from the latter, and have chemfp parse the molecules as “inchi”:
% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
--format inchi --no-metadata | head -2 | fold
02000000000000000000010000008000000008000000000000000000000000000000000040000000
00000400000000202000000000000000000008000000000000000000000040000000000000000000
00040000000080000000000000000000000000008800000040000000000000000000004080000000
00000000000000080000000402020000010000004000000200000000008000000000000000000000
00104000002000000000200010000000000000000000000000000000000000000401000000200000
00000002000000000000000000000000000040000000000000000000000000000000020000004030
00000000000000000000000000000000 VEUYDINLFCGWRM-UHFFFAOYSA-N
02000000000000000080010000000000000000000000000000000000000000000010000800000000
00000000000000002000000000000000000000000000000800000000000000000004000000000000
00000000000000000000210000000000000000008000000000000002000000004000000000000000
00000000000000020000000002040000010000000000000000000000008000000000000000000000
00000000000000000000200010000000000010000000000000000000000000000100000000200000
00000000000000000000000000000008000440040800080000000000000000000001020000000200
00000000000000000000000000000000 BAHWBLGZSGIKQC-UHFFFAOYSA-N
If you leave out the format, or specify “smi” format, then it will try to interpret the InChI strings as SMILES, which will fail. As a safety mechanism, csv2fps will exit if none of the molecules in the first 100 records can be parsed. That looks like:
% chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \
--format smi --no-metadata | head -2 | fold
... many error lines removed ...
[11:56:09] SMILES Parse Error: syntax error while parsing: InChI=1S/C10H8C
l2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)
[11:56:09] SMILES Parse Error: Failed parsing SMILES 'InChI=1S/C10H8Cl2N4/
c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)' for in
put: 'InChI=1S/C10H8Cl2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5
H,(H3,13,14,15,16)'
Error: Each of the first 100 records contained structure errors.
Final error: RDKit cannot parse the SMILES 'InChI=1S/C10H8Cl2N4/c11-6-3-1-
2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)', column 'STANDAR
D_INCHI' (#4), record 100, line 101, 'fulldb_smiles-009-000-000--009-499-9
99.txt.gz'.
Exiting.
Use the -R command-line option to specify toolkit- and
format-specific reader arguments. As a short-cut to enable or disable
CXSMILES extension support, use --cxsmiles (enabled by default)
and --no-cxsmiles.
structure formats with embedded identifiers¶
The previous examples read the id from one column, and the molecule from another, as a SMILES or InChI string. Sometimes the column contains a more complex record, like a SMILES record with both a SMILES string and identifier, or even as an embedded SDF. (The CSV format supports multi-line column entries.)
Use the --id-from-molecule option (or --id-from-mol for short)
to have csv2fps get the identifier from the parsed molecule
record. For a SMILES or InChI record the identifier defaults to the
rest of the column entry; this can be changed with the --delimiter
option. If the column entry is an SDF record then by default the
identifier comes from the title line; use --id-tag to get it from
the named data value instead.
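For example, here is a sketch using a hypothetical file named records.csv whose “RECORD” column holds complete SMILES records (both the filename and the column name are made up for illustration):
% chemfp csv2fps records.csv --mol-col RECORD --id-from-mol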
Specify an alternate fingerprint type¶
The examples so far generated RDKit Morgan fingerprints. To use a
different fingerprint type, specify a chemfp type string with --type,
or use the --with option to get the fingerprint type from the
metadata of an existing fingerprint file.
The following uses Open Babel to generate FP2 fingerprints:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type OpenBabel-FP2 \
| head -8 | fold
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/3.1.0 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:21:04
00260200002002000600060000020000001000100000000000000080010400001000500080000001
100a001820010100010200000020c004000000020000a20808100800000004400000880002800005
0090015000003008c008080000080000400000200000000010000000400008000007010000011000
0000000140000000 MolPort-009-675-630
00060000004004200b0006000001000c0000009000000000000000c0000800001800408300000000
00862038000100800102000008000000000000020000000000000400000400000020000002880006
00b0004000003008c0080900800800000c1000220000000000000000000000000006010400011000
0032800000000000 MolPort-009-675-631
The following uses CDK to generate CDK’s own implementation of the 881-bit PubChem fingerprints:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type CDK-Pubchem \
| head -8 | fold
#FPS1
#num_bits=881
#type=CDK-Pubchem/2.0
#software=CDK/2.8 chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:22:50
005e0c002000000000000000000000000080060000003c0200000000000000800000007800000000
00b03c8719604c10c10020001140044b100040000004000010118010001110044c01a90861040064
03801111e01913077101000000000000000000000000000000000000000000 MolPort-009-675-
630
00de0c000000000000000000000000000000000000000c0600000000000000800200007800080000
0030148319204c00410300001140844a080041000004000010118111201110064c01898c29041006
69001111e01811017100000000000000000000000000000000000000000000 MolPort-009-675-
631
The chemfp type string may contain fingerprint parameters. The following uses OEChem to generate OEGraphSim circular fingerprints, folded to 128 bits instead of the default 4096 bits:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 \
--type "OpenEye-Circular numbits=128" | head -8 | fold
#FPS1
#num_bits=128
#type=OpenEye-Circular/2 numbits=128 minradius=0 maxradius=5 atype=Arom|AtmNum|C
hiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:24:44
905a04205c4473c607f95c06123b2026 MolPort-009-675-630
a07873c1407c5c7617830d8428910170 MolPort-009-675-631
Finally, here’s an example which gets the fingerprint type from an
existing fingerprint file:
% chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --using reference.fps | \
head -7 | fold
#FPS1
#num_bits=4096
#type=OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 atype=Arom|AtmNum|
Chiral|EqHalo|FCharge|HCount btype=Order
#software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1
#source=fulldb_smiles-009-000-000--009-499-999.txt.gz
#date=2023-05-15T10:29:18
00000000000000000000000000000000000000000000080000000010000000000000000800000000
00200000000040000000000000000000000000000000000000000000000008000004000001080000
00000200000000000000000200000000000000000000100000000000002000200000000000010800
00008000000000000000000000000000000000008000040000000000040000000000000000002000
00008000020000000000000000000000000000040000000000000000000000000000000000800000
00000000000080000000000000000000000000000000000000000000000000000000000000000000
00800000000002000000000000000020002000000100000000080000000000000000000000000000
00000000000000000000000000000000000020000000000000000000000000000000080000000000
00000000000080008000000000000008000000002400000000000000000000000000000000020000
00000000000000004000000100100000000048000801840000000000001000000000020000000000
00000000000000000000000000000000000000000000000000000000000000000000000000001000
20008000000000000000000000200000000000000000000000000000000000000400000000000000
0020000804000000000000000000000000000000000000000000000000000000 MolPort-
009-675-630
CSV files with fingerprint data¶
Instead of storing a molecule, the CSV file might contain fingerprint data, represented in one of various encodings like hex or base64.
If this is the case, use --fingerprint-column
(or --fp-column
or --fp-col
for short) to indicate that the CSV file contains
fingerprint data, and to indicate which column contains that data.
csv2fps supports the same decoder options as sdf2fps. The default,
--hex, treats the fingerprint as hex-encoded with bit 0 as the
first bit of the first byte. Use chemfp csv2fps --help for the
full list of fingerprint decoder options.
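For example, here is a sketch using a hypothetical file named fingerprints.csv whose “FP_HEX” column holds hex-encoded fingerprints (both the filename and the column name are made up for illustration):
% chemfp csv2fps fingerprints.csv --fp-col FP_HEX --hex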
csv character encoding¶
The “character set” describes how the text in the CSV file is encoded as a sequence of bytes. By default csv2fps assumes the CSV file is UTF-8 encoded.
This may be a problem if the file was written by a program that uses another encoding, like “utf16” or “cp1252” on Windows, or “latin1” from older Unix tools.
Use --encoding to specify the input encoding, which must be one of
the supported Python encodings.
Use --encoding-errors to describe how to handle input which could
not be decoded. This must be one of the supported Python encoding
error handlers: “strict”, “ignore”, “replace”, or “backslashreplace”.
csv2fps processing errors¶
csv2fps can run into two main types of problems when processing a row of the file: 1) the row may not have enough columns, and 2) the toolkit may be unable to parse the molecule column or the fingerprint column.
The --errors
option specifies how to handle molecule and
fingerprint parse errors. The default is “report” for molecules and
“strict” for fingerprints. The “strict” handler prints an error
message and stops processing. The “report” handler prints an error
message and keeps on processing. The “keep” handler simply keeps on
processing.
The --csv-errors option specifies how to handle CSV parse
errors. The default is “strict”, which requires the id column and the
molecule or fingerprint column to exist; if not, it prints an error
message and exits. It can also be “report” or “keep”.
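For example, the following sketch stops on the first molecule parse error but only reports, and skips past, rows with missing columns:
% chemfp csv2fps $FILENAME -d tsv --errors strict --csv-errors report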
csv2fps TODO¶
There is currently no high-level chemfp.csv2fps
Python function
which is equivalent to the command-line tool. This will be added in a
future version of chemfp. Until then you can use a CSV reader
described below.
chemfp.csv_readers module¶
Chemfp 4.1 added the csv_readers
module with functions for
working with CSV files.
To start with, there is a utility function for specifying the CSV dialect. It starts with a base dialect and lets you override individual settings. The result can be used as a chemfp dialect and as input to Python’s own csv module.
>>> from chemfp import csv_readers
>>> csv_readers.get_dialect("csv")
CSVDialect(delimiter=',', quotechar='"', escapechar=None,
doublequote=True, skipinitialspace=False, quoting=0 ('minimal'),
lineterminator='\r\n', strict=False)
>>> csv_readers.get_dialect("csv", doublequote=False)
CSVDialect(delimiter=',', quotechar='"', escapechar=None,
doublequote=False, skipinitialspace=False, quoting=0 ('minimal'),
lineterminator='\r\n', strict=False)
The lowest-level interface reads a CSV file, parses any available header, and iterates over rows in the file.
>>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz"
>>> reader = csv_readers.read_csv_rows(filename, dialect="tsv")
>>> reader.titles
['SMILES', 'SMILES_CANONICAL', 'MOLPORTID', 'STANDARD_INCHI',
'INCHIKEY', 'AVAILABILITY']
>>> next(reader)
['Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1', 'Cl.CCC(C(=O)OC1CC2CCC
(C1)N2C)c1ccccc1', 'MolPort-009-675-630', 'InChI=1S/C18H25NO2.ClH/c
1-3-17(13-7-5-4-6-8-13)18(20)21-16-11-14-9-10-15(12-16)19(14)2;/h4-
8,14-17H,3,9-12H2,1-2H3;1H', 'VEUYDINLFCGWRM-UHFFFAOYSA-N', 'in sto
ck']
>>> reader.close()
The reader can also be used as a context manager, if you don’t want to call close() explicitly or depend on Python’s garbage collector.
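For example, here is a minimal sketch of the context-manager form, using the same file:
>>> with csv_readers.read_csv_rows(filename, dialect="tsv") as reader:
...     first_row = next(reader)
...
>>> first_row[2]
'MolPort-009-675-630'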
The read_csv_ids_and_molecules_with_parser()
function gives a
higher-level, albeit somewhat complicated, way to iterate over (id,
molecule) pairs in the file. It requires a function which takes the
text of the molecule field and returns an (id, molecule) pair – the
id is only used if id_column
is None:
>>> def convert(mol_text):
... return (None, (len(mol_text), mol_text))
...
>>> reader = csv_readers.read_csv_ids_and_molecules_with_parser(
... filename, convert, dialect="tsv", id_column=3, mol_column=1)
>>> reader
CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol
from column 'SMILES' (column #1), dialect='tsv', record_format=None,
filename='fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> for title in reader.titles: print(repr(title))
...
'SMILES'
'SMILES_CANONICAL'
'MOLPORTID'
'STANDARD_INCHI'
'INCHIKEY'
'AVAILABILITY'
>>> next(reader)
('MolPort-009-675-630', (40, 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1'))
This is easily modified to return only the molecule text rather than the (length, text) pair.
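For example, a variant which returns only the molecule text:
>>> def convert(mol_text):
...     return (None, mol_text)
...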
The columns can be specified as integers (the value 1 means the first column) or strings (to use the named column title).
This API is designed to be compatible with the “id and molecule” parsers created by the toolkit wrappers:
>>> from chemfp import openeye_toolkit
>>> converter = openeye_toolkit.make_id_and_molecule_parser("smi")
>>> reader = csv_readers.read_csv_ids_and_molecules_with_parser(
... filename, converter, dialect="tsv", id_column=3, mol_column=1)
>>> next(reader)
('MolPort-009-675-630', <oechem.OEGraphMol; proxy of <Swig Object
of type 'OEGraphMolWrapper *' at 0x10e184db0> >)
NOTE: If you want to use a toolkit wrapper to parse molecules from a
CSV file then you almost certainly want to use the new - and simpler -
toolkit wrapper function read_csv_ids_and_molecules
.
Finally, use read_csv_ids_and_fingerprints()
to read the ids and
encoded fingerprints from a file. The default uses the “hex” decoder
(chemfp.bitops.hex_decode()
) but you can pass in your own
function returning the number of bits (which can be None) and the
decoded fingerprint as a byte string:
>>> def decode_fp(fp_text):
... return 32, len(fp_text).to_bytes(4, "big")
...
>>> reader
CSVFingerprintReader(id from column 'MOLPORTID' (column #3), fp from
column 'SMILES_CANONICAL' (column #2), dialect='tsv', filename=
'fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> next(reader)
('MolPort-009-675-630', b'\x00\x00\x00*')
New toolkit wrapper functions to read CSV files¶
The toolkit wrappers have a new
read_csv_ids_and_molecules()
function which reads ids
and molecules from a CSV file, using a specified structure format.
>>> from chemfp import rdkit_toolkit as T
>>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz"
>>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv",
... id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi")
>>>
>>> reader
CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol
from column 'STANDARD_INCHI' (column #4), dialect='tsv', record_format=
'inchi', filename='fulldb_smiles-009-000-000--009-499-999.txt.gz')
>>> next(reader)
('MolPort-009-675-630', <rdkit.Chem.rdchem.Mol object at 0x10b0077d0>)
This works for all of the supported cheminformatics toolkits:
>>> from chemfp import rdkit_toolkit as T
>>> from chemfp import openbabel_toolkit as T
>>> from chemfp import openeye_toolkit as T
>>> from chemfp import cdk_toolkit as T
There is currently no direct way to read the molecules and convert them to fingerprints as an (id, fingerprint) pair. You will need to handle the conversion yourself, like the following:
>>> from chemfp import rdkit_toolkit as T
>>> fptype = T.morgan(fpSize=128)
>>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv",
... id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi")
>>>
>>> id_fp_iter = ((id, fptype.from_mol(mol)) for (id, mol) in reader)
>>> next(id_fp_iter)
('MolPort-009-675-630', b'&\x15HD\xca\xa2\xc0\x00A\x00o\x02P\x00\xc0:')
The next version of chemfp will likely have a function for this operation.
translate command¶
Chemfp uses a “toolkit wrapper” to provide a consistent API for reading and writing structure files for each of the supported toolkits, and for generating fingerprints using the toolkit-specific fingerprint generators.
Chemfp 4.1 adds a new chemfp translate command-line tool which uses that wrapper API to read structure files in one format and write them to a file in another format.
If you do not specify which toolkit to use (via the --in-toolkit
option, or -in-tk
or -T
for short), chemfp will use the value
of the “CHEMFP_TOOLKIT” environment variable, which must be a
comma-separated list of toolkit names. If that variable is not
available then it uses the string “rdkit,openbabel,openeye,cdk”, that
is, it first checks for the RDKit toolkit, then the Open Babel
toolkit, then OpenEye’s OEChem toolkit, and finally the CDK toolkit
(via the jpype Python/Java adapter).
These examples will use the RDKit toolkit.
If unspecified, the translate command reads and writes in “smi” format, including possible CXSMILES extensions, and the output will be canonical SMILES but without CXSMILES:
% echo "OCC ethyl alcohol" | chemfp translate
CCO ethyl alcohol
% echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | chemfp translate
C[C@H](F)[C@H](C)[C@@H](C)Br example
Open Babel generates the same canonical SMILES:
% echo "OCC ethyl alcohol" | chemfp translate -T openbabel
CCO ethyl alcohol
Use --in
and --out
to specify the input and output format:
% echo "OCC phenol" | chemfp translate --out sdf3k
phenol
RDKit
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 3 2 0 0 0
M V30 BEGIN ATOM
M V30 1 O 0.000000 0.000000 0.000000 0
M V30 2 C 0.000000 0.000000 0.000000 0
M V30 3 C 0.000000 0.000000 0.000000 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 1 1 2
M V30 2 1 2 3
M V30 END BOND
M V30 END CTAB
M END
$$$$
% echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | \
chemfp translate --out cxsmi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example
or have the formats inferred from the input and output filename extensions:
% chemfp translate phenol.smi -o phenol.inchi
% cat phenol.inchi
InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 phenol
While chemfp can convert between different formats, it is not meant as a replacement for a tool like Open Babel, which is able to add or remove hydrogens, assign coordinates, and more as part of the translation process.
Translate with the ‘text’ toolkit¶
Chemfp includes the “text” toolkit, which knows just enough about SMILES and SDF formatted files to read and write records. It is not a cheminformatics toolkit and it does not handle chemistry or format conversion.
The text toolkit, as part of chemfp 4.1 support for CXSMILES, now supports heuristics which try to identify the CXSMILES extension in the SMILES file record.
This means the text toolkit can be used with translate as a way to remove the CXSMILES extensions from a SMILES file, without fully processing the SMILES string into a molecule object:
% cat test.smi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example
% chemfp translate test.smi -T text
C[C@H](F)[C@H](C)[C@@H](C)Br example
% chemfp translate test.smi -T text --out cxsmi
C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example
Unlike the chemistry toolkits, which only support UTF-8-encoded input formats, the text toolkit supports other extended ASCII encodings, like latin1, which means it can be used to convert between two different encodings:
% chemfp translate -T text --in-encoding latin1 latin1.sdf -o utf8.sdf
However, you should almost certainly use a dedicated conversion tool for this purpose, like iconv:
% iconv -f latin1 -t utf-8 latin1.sdf > utf8.sdf
Translate via an intermediate format¶
The translate command can also do the conversion in two steps: it reads the record in the input format, saves it internally in an intermediate “via” format (which defaults to “sdf”), then parses that intermediate record and writes it in the output format.
For example, if the input file is in OpenEye’s “oeb” format, and you want to convert it to Java code which will create the corresponding CDK molecule, then you can use OEChem to convert the OEB input to SDF and have CDK convert the SDF to “cdkjava” format, with the following:
% chemfp translate -T openeye theobromine.oeb -U cdk --out cdkjava
{
IChemObjectBuilder builder = DefaultChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newInstance(IAtomContainer.class);
IAtom a1 = builder.newInstance(IAtom.class,"C");
a1.setFormalCharge(0);
a1.setPoint2d(new Point2d(2.1348, 0.7541));
mol.addAtom(a1);
....
Admittedly this is mostly a technically interesting gimmick as I haven’t come up with a good example where this would be a useful way to convert between two chemistry formats.
One place it might be useful is to handle a legacy Latin-1 encoded structure file (or one which uses another extended ASCII encoding) and convert it to another chemistry format. The following uses the “text” toolkit to read the Latin-1 encoded “latin1.sdf” into an intermediate UTF-8-encoded SDF, then has RDKit translate the result into SMILES, writing the results to stdout:
% chemfp translate -T text --in-encoding latin1 -U rdkit latin1.sdf
translate_record function¶
The toolkit wrappers have a new translate_record function which
converts a single record from one format to another. By default it
parses the record in “smi” format to a molecule, then creates and
returns a new record in “smi” format. The following shows how the
four supported cheminformatics toolkits canonicalize phenol:
>>> import chemfp
>>> from chemfp import rdkit_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'
>>> from chemfp import openeye_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'c1ccc(cc1)O phenol\n'
>>> from chemfp import openbabel_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'
>>> from chemfp import cdk_toolkit as T
>>> T.translate_record("c1ccccc1O phenol")
'C1=CC=C(C=C1)O phenol\n'
while here you can see they all generate the same InChI string:
>>> import chemfp
>>> for T in (chemfp.rdkit, chemfp.openeye, chemfp.openbabel, chemfp.cdk):
... print(T.name.rjust(10),
... T.translate_record("c1ccccc1", in_format="smistring", out_format="inchistring"))
...
rdkit InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
openeye InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
openbabel InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
cdk InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
Structure I/O helper functions¶
Chemfp 4.0 introduced a few helper functions to make it easier to read and write structure files, and process structure records, when the format was known.
These were:
from_smi(): parse a SMILES record into a molecule
from_smi_file(): read a SMILES file, iterating over molecules
to_smi(): convert a molecule to SMILES
to_smi_file(): open an output file for writing molecules as SMILES
plus equivalents for SDF and InChI, and functions for reading and writing SMILES strings and InChI strings (these contain only a single structure, with neither identifier nor newline).
These names proved hard to remember because they were not consistent with the rest of the chemfp API and did not include enough information about what they did.
For example, if I wanted to read a SMILES file I would type read_ and then use tab-completion to see what was available. That would give me read_ids_and_molecules and read_molecules, but would not show from_smi_file.
Even if I remembered to use from_smi_file, I would forget if the iterator only returned molecules, or the more useful (id, molecule) pairs.
I would also forget that from_smi was to read a record in “smi” format, and not (as I thought) to read a file in “smi”.
The new shortcut functions in chemfp 4.1 are:
read_smi_ids_and_molecules(): read a SMILES file, iterating over (id, molecule) pairs
read_smi_molecules(): read a SMILES file, iterating over molecules
parse_smi(): parse a “smi” record into a molecule
create_smi(): convert a molecule to a “smi” record
open_smi_writer(): open an output file for writing molecules as SMILES
These are also available for “sdf” and “inchi” formats, as well as the parse/create functions for “smistring” and “inchistring”, which are the format names for just the SMILES and InChI components of the “smi” and “inchi” records, respectively.
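For example, with the RDKit toolkit wrapper the new names might be used like this (a sketch; the canonical SMILES matches the translate_record() output shown earlier):
>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_smi("c1ccccc1O phenol")
>>> T.create_smi(mol)
'Oc1ccccc1 phenol\n'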
A few other formats, like “helm”, and “fasta” are present on a toolkit-specific basis, but are probably not useful and are likely to be removed in the future unless people tell me they like them.
In addition, “*_from_string()” and “*_to_string()” variants are also
now supported, like read_smi_ids_and_molecules_from_string() and
open_smi_writer_to_string().
The old shortcuts are still available but generate a PendingDeprecationWarning. They will be removed in a future version of chemfp.
Other API changes¶
Chemfp uses a random number generator for some of the diversity selection and clustering options. The RNG is initialized with a 64-bit integer seed. If the seed is not specified then chemfp uses Python’s RNG to generate the seed at random (Python’s RNG uses a number of methods to generate its own seed). Chemfp 4.0 requested a 64-bit number from Python, which on average takes about 20 characters to represent in the output parameters and drew the eye far more than it was useful. Chemfp 4.1 requests a 32-bit number as the initial RNG seed, which shouldn’t meaningfully affect the scientific results.
The FingerprintArena.copy() method now accepts an ids
argument. If specified, it contains the list of ids to use in the
copy. This is especially useful when making a subset given a list of
indices, when you want to be able to map the new subset back to the
original indices. In that case use:
b = a.copy(indices=offsets, ids=offsets)
so the fingerprint at position i in the new arena was at position
b.ids[i] in the original arena.
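Here is a slightly fuller sketch, assuming an arena loaded from an FPS file such as the reference.fps used earlier:
>>> import chemfp
>>> arena = chemfp.load_fingerprints("reference.fps")   # any existing FPS file
>>> offsets = [5, 17, 42]   # positions in the original arena
>>> subset = arena.copy(indices=offsets, ids=offsets)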
The diversity pickers now support a “max_time” parameter, to stop picking after a specified number of seconds (given as a 64-bit float). The timeout is checked after each pick has finished. This helps give smooth progress bars or other interactive feedback about the picking process.
DEPRECATION NOTICE: The bitops.byte_difference()
and
bitops.hex_difference()
functions have been renamed
bitops.byte_xor()
and bitops.hex_xor()
, mostly because I
couldn’t remember what “difference” meant but I knew exactly what
“xor” meant. The old functions still work, but generate a
PendingDeprecationWarning. They will likely be removed in a future
version of chemfp.