chemfp simarray

The “chemfp simarray” command-line tool generates the all-by-all comparisons between a set of queries and a set of targets, or between a set of fingerprints and itself, for use as a NumPy array. It can even generate an array which is too large to fit into RAM. The simarray output is either a NumPy “npy” file or a “bin” file containing the raw array bytes plus an optional auxiliary metadata file in “npy” format.

Use chemfp.load_simarray() to read the output file(s) as a NumPy array along with the auxiliary metadata.

The related chemfp.simarray() Python function generates the same all-by-all comparisons as an in-memory NumPy array.

The rest of this chapter contains the output from chemfp simarray --help, chemfp simarray --help-formats, and chemfp simarray --help-metrics.

chemfp simarray command-line options

The following comes from chemfp simarray --help:

Usage: chemfp simarray [OPTIONS] TARGETS

Generate an array containing the full all-by-all comparisons.

Options:
  -m, --metric [Tanimoto|Dice|cosine|Hamming|Sheffield|Willett|Daylight]
                                  Similarity score, distance type, or
                                  comparison values to generate.
  --as-distance                   Compute (1.0-similarity) to convert a
                                  similarity to a distance
  --dtype [float64|float32|rational64|rational32|uint16|abcd]
                                  Specify the output array data type.
  --include-lower-triangle / --no-lower-triangle
                                  In NxN array generation, use --no-lower-
                                  triangle to leave the lower triangle as
                                  zeros.
  -q, --queries PATH              Filename containing the query fingerprints.
  --hex-query, --hex HEX_STR      Query fingerprint in hex.
  --in, --target-format FORMAT    Input target format (default uses the file
                                  extension, else 'fps')
  --query-format FORMAT           Input query format (default uses the file
                                  extension, else 'fps')
  -o, --output FILENAME           Output filename (default is stdout)
  --out FORMAT                    Output format (default uses the file
                                  extension, else 'npy'). Use --help-formats
                                  for details.
  --include-ids / --no-ids        With --no-ids, do not include ids in the
                                  output.
  --metadata-output PATH          Save a zero-sized array, metadata, and ids
                                  to the named file, in npy format. (Use this
                                  to get the metadata when using --out bin.)
  -j, --num-threads N             The number of threads to use. -1 means the
                                  default value (which is 1 for this
                                  computer), and can be set using
                                  $OMP_NUM_THREADS. 0 and 1 both mean single-
                                  threaded. (default: -1)
  --window-size GB                Amount of memory to use when generating
                                  'bin'ary output, in GiB (default: 2)
  --batch-size INTEGER RANGE      Number of comparisons to compute between
                                  status updates. (default: 100000000)  [x>=1]
  --no-mmap                       Don't use mmap to read uncompressed FPB
                                  files. May give better performance on
                                  networked file systems, at the expense of
                                  higher memory use.
  --times / --no-times            Write timing information to stderr
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-metrics                  Describe the metrics and associated dtypes.
  --help-formats                  Describe the output formats.
  --help                          Show this message and exit.

  Generate and write an array containing all NxN or NxM similarity scores.

  By default, generate an array of 64-bit doubles where element [i, j]
  contains the Tanimoto score between fingerprint i and fingerprint j
  (indexing starts from 0) in a target fingerprint data set. This array
  includes the Tanimoto score of the fingerprint with itself.

  The available `--metric` values are: "Tanimoto", "Dice", "cosine",
  "Hamming", "Sheffield", "Willett", and "Daylight".  (I'll use "metric" to
  include similarity and 4-tuple elements, even though those are not
  mathematical metrics.) For full details use `--help-metrics`.

  Some metrics support multiple representations. For example, the "Tanimoto"
  metric supports "float64" (64-bit doubles), "float32" (32-bit floats),
  "rational64" (a pair of 32-bit unsigned integers), "rational32" (a pair of
  16-bit unsigned integers), or "uint16" (a scaled 16-bit unsigned integer).
  Use `--dtype` to specify which type to use.

  The "Sheffield", "Willett", and "Daylight" metrics only support the "abcd"
  dtype, which contains a 4-element (a, b, c, d) term containing individual
  comparison counts as 16-bit unsigned integers.

  Use `--help-metrics` for more details about the available data types and
  these "abcd" metrics.

  Use `--as-distance` to convert the Tanimoto, Dice, or cosine score into a
  distance, computed as 1.0 - similarity or the equivalent for rational and
  scaled dtypes. Note: in chemfp a similarity score of 0/0 is 0, and its
  distance is 1, or B/B for rational dtypes where B is the total number of
  bits, or 65535 for uint16.

  The target fingerprints may be specified as a filename on the command-line,
  or read from stdin. By default the format is inferred from the filename
  extension, defaulting to "fps". Use `--in` or `--target-format` to specify
  the input format.

  If only target fingerprints are specified then the full NxN symmetric array
  is generated. Use `--no-lower-triangle` to only generate the upper-triangle
  and diagonal term, and leave the lower triangle as its default empty value.
  This will still save the full NxN array; *it does not flatten the triangle*!

  If a `--queries` filename is also specified then simarray will generate the
  NxM matrix of all N query fingerprints with all M target fingerprints. The
  format will be inferred from the query filename, defaulting to "fps". Use
  `--query-format` to specify the format.

  Use `--hex-query` to specify a single fingerprint query as a hex string.
  This will generate a 1xM array.

  Use `-o` or `--output` to specify where to write the array. For "npy" output
  this must be a seekable file (eg, not a fifo) due to a NumPy requirement.
  The default is stdout, with a special work-around for the NumPy limitation.
  Chemfp will refuse to write "npy" output, which is a binary format, to a tty
  like the terminal. If that is what you want, pipe the output through cat.

  The two supported `--out` formats are "npy" (which generates the array in-
  memory before writing it to a file) and "bin" (which supports larger-than-
  memory arrays by iteratively processing and saving parts of the array).

  See `--help-formats` for details, including how to use the `--metadata-
  output` option to save useful array metadata for the "bin" output format.

  NOTE: simarray does not support fingerprints with more than 2**15 bits. The
  Dice metric does not support fingerprints with 2**15 bits.

  Examples:

  1) Generate the full Tanimoto matrix for my_data.fps:

    chemfp simarray my_data.fps -o my_array.npy

  2) Generate the upper-triangular cosine distance matrix using 32-bit floats:

    chemfp simarray my_data.fps --metric cosine --dtype float32 \
       --as-distance --no-lower-triangle -o cosine.npy

  3) Generate the Hamming distances between two sets of fingerprints but do
  not include the metadata or identifiers:

    chemfp simarray --queries query_data.fps my_data.fps \
        --metric Hamming --no-metadata --no-ids -o hamming.npy
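As the help text above notes, `--no-lower-triangle` still saves the full NxN array and leaves the lower triangle as zeros. The following is a minimal NumPy sketch of how such an array might be post-processed, either filled in to make it symmetric or flattened to just the upper triangle. It assumes a plain float or uint16 dtype (not a rational or "abcd" structured dtype), and the small array and the `fill_lower_triangle` helper are hypothetical stand-ins, not part of chemfp:

```python
import numpy as np

def fill_lower_triangle(upper):
    """Return a full symmetric copy of an upper-triangular NxN array."""
    upper = np.asarray(upper)
    if upper.ndim != 2 or upper.shape[0] != upper.shape[1]:
        raise ValueError("expected a square NxN array")
    # np.triu(upper, k=1) is the strict upper triangle (diagonal excluded),
    # so adding its transpose fills the lower triangle without doubling
    # the diagonal.
    return upper + np.triu(upper, k=1).T

# Hypothetical 3x3 result with the lower triangle left as zeros
upper = np.array([[1.0, 0.5, 0.25],
                  [0.0, 1.0, 0.75],
                  [0.0, 0.0, 1.0]])
full = fill_lower_triangle(upper)

# Flattened upper triangle (including the diagonal), row by row
rows, cols = np.triu_indices(upper.shape[0])
flat = upper[rows, cols]
```

The flattened form holds N*(N+1)/2 values, which is the "flattened triangle" that simarray itself does not produce.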

Supported simarray formats

The following comes from chemfp simarray --help-formats:

The "chemfp simarray" command supports two output formats: "npy" and "bin".

The "npy" format generates the full simarray in-memory before writing the
data.  The "bin" format processes and saves the array one part at a time, so
only that part needs to be in memory, and builds up the full final array in
the output file.

The "npy" format is the standard NumPy format which can be read with
`numpy.load`. The format allows zero or more NumPy arrays, stored
successively. The simarray output stores up to four arrays: the similarity
array, an array containing JSON metadata, the target identifiers, and the
query identifiers.

The "bin" format stores only the raw bytes for the similarity array. If there
are 20 similarity scores, each as an 8-byte double, then the file will be 160
bytes long, with no NumPy metadata describing the array size or data type.
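To make the size arithmetic concrete, here is a small sketch using only NumPy. The 4x5 shape is an arbitrary assumption chosen to give 20 scores; the point is that the raw bytes can only be interpreted once the dtype and shape are supplied from somewhere else:

```python
import numpy as np

# 20 scores stored as 8-byte doubles, arranged as a hypothetical 4x5 array
scores = np.linspace(0.0, 1.0, 20).reshape(4, 5)
raw = scores.tobytes()     # the raw bytes, with no header
num_bytes = len(raw)       # 20 * 8 = 160

# The bytes carry no dtype or shape, so both must be given to read them back
restored = np.frombuffer(raw, dtype=np.float64).reshape(4, 5)
```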

If not specified by the `--out` option then the format is inferred by looking
at the `--output` filename extension (if present), otherwise it defaults to
"npy".

The "npy" format contains up to four arrays, each also containing NumPy shape
and dtype information.

1) the simarray NumPy array, which can be 1-D (when given a query fingerprint)
or 2-D (if processing a set of fingerprints against itself, or a set of
queries against targets);

2) the simarray metadata stored as a JSON string in a 1-element NumPy string
array (see below for details);

3) the fingerprint identifiers (if processing a set of fingerprints against
itself) or the target identifiers (if processing a set of query fingerprints
against a set of target fingerprints);

4) the query fingerprint identifiers, if there are query fingerprints.

Use `--no-metadata` to not save the metadata array. Use `--no-ids` to not save
the fingerprint identifiers.

The JSON string looks like:

  {"format": "multiple",  // can be 'single' for a --hex-query
   "num_bits": 64,        // number of bits in the fingerprint
   "method": "Tanimoto",  // base method used to compute the values
   "metric_description": "Tanimoto similarity", // a human-readable string
   "metric": {
     "name": "Tanimoto",     // the metric name
     "as_distance": false,   // True for --as-distance
     "is_similarity": true,  // True if the array contains similarity scores
     "is_distance": false    // True if the array contains distance values
   },
   "matrix_type": "upper-triangular",  // either 'upper-triangular', 'NxN', or 'NxM'
   "shape": [100, 125]       // the shape of the generated NumPy array
  }

The "bin" format contains a single array, which can be either 1-D (when given
a query fingerprint) or 2-D (if processing a set of fingerprints against
itself, or a set of queries against targets).

The format is identical to the numpy.ndarray.tobytes() representation, which
contains the raw data bytes in "C" order (that is, row-major order, where the
scores for the first query come first, then the scores for the second, and so
on.)

The "bin" format does not contain the array shape or the NumPy dtype
information. To get that information, use `--metadata-output` to save the
metadata in "npy" format.

The `--metadata-output` option saves the simarray metadata in "npy" format
except that the data array (the first array in the file) is empty, that is,
for a 1-D output array it is a 0-length vector, and for a 2-D output array it
is a 0x0 array. This array has the same NumPy dtype as the generated array.

The second array, which contains the JSON metadata, can be parsed to get the
shape.
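Putting the two files together, the following sketch memory-maps a "bin" file using the dtype of the zero-sized array and the "shape" entry of the JSON metadata. It is an illustration under stated assumptions rather than chemfp's own loader (chemfp.load_simarray() is the supported route): the `load_bin_with_metadata` helper and the filenames are hypothetical, and the two small files are written by the example itself as stand-ins for real simarray output.

```python
import json
import os
import tempfile

import numpy as np

def load_bin_with_metadata(bin_filename, metadata_filename):
    """Memory-map a raw "bin" array using the dtype and shape from the metadata file."""
    with open(metadata_filename, "rb") as f:
        empty_array = np.load(f)                   # zero-sized; supplies the dtype
        metadata = json.loads(np.load(f).item())   # JSON string; supplies the shape
    shape = tuple(metadata["shape"])
    data = np.memmap(bin_filename, dtype=empty_array.dtype, mode="r", shape=shape)
    return data, metadata

# Write small stand-in files that mimic the layout described above
tmpdir = tempfile.mkdtemp()
bin_filename = os.path.join(tmpdir, "example.bin")
metadata_filename = os.path.join(tmpdir, "example_metadata.npy")

expected = np.arange(6, dtype=np.float32).reshape(2, 3) / 8
expected.tofile(bin_filename)                          # raw bytes in C order
with open(metadata_filename, "wb") as f:
    np.save(f, np.zeros((0, 0), dtype=np.float32))     # empty data array
    np.save(f, np.array([json.dumps({"shape": [2, 3]})]))

data, metadata = load_bin_with_metadata(bin_filename, metadata_filename)
```

Because np.memmap reads pages on demand, this approach also works when the array is larger than the available memory.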

Here is an example "npy" loader which assumes the data, metadata, and
appropriate identifier arrays are present:

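  import json
  import numpy as np

  def load_chemfp_npy(filename):
      # The arrays are stored one after another in the same file.
      with open(filename, "rb") as f:
          data_array = np.load(f)
          metadata_array = np.load(f)
          # item() extracts the JSON string from the 1-element array
          metadata = json.loads(metadata_array.item())
          target_ids = np.load(f)
          try:
              query_ids = np.load(f)
          except EOFError:
              # No fourth array: there were no query fingerprints
              query_ids = None
      return (data_array, metadata, query_ids, target_ids)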

Simarray metrics

The following comes from chemfp simarray --help-metrics:

The simarray method implements several of the most common comparison metrics,
as well as options to generate the individual fingerprint comparison
components in one of three "abcd" conventions.

The standard metrics (specified by `--metric`), and their supported data types
(specified by `--dtype` with the default dtype listed first) are:

  Tanimoto = popcount(fp1 & fp2) / popcount(fp1 | fp2)
     dtypes = [float64, float32, rational64, rational32, uint16]

  Dice = 2 * popcount(fp1 & fp2) / (popcount(fp1) + popcount(fp2))
     dtypes = [float64, float32, rational64, rational32, uint16]

  cosine = popcount(fp1 & fp2) / sqrt(popcount(fp1) * popcount(fp2))
     dtypes = [float64, float32, uint16]

  Hamming = popcount(fp1 ^ fp2)
     dtypes = [uint16]

where given two fingerprints fp1 and fp2:

  popcount(fp1) = the number of 1-bits in fp1
  popcount(fp2) = the number of 1-bits in fp2
  popcount(fp1 & fp2) = the number of 1-bits common to both fp1 and fp2
  popcount(fp1 | fp2) = the number of distinct 1-bits in fp1 and fp2
  popcount(fp1 ^ fp2) = the number of bits which differ between fp1 and fp2
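The definitions above can be checked by hand with a few lines of plain Python. This sketch stores each fingerprint as a Python integer and is only meant to illustrate the formulas; it is not how chemfp computes them. The `popcount` and `compare` helpers and the two hex fingerprints are made up for the example. Cosine uses the standard definition for binary vectors, with the square root of the product in the denominator, and 0/0 is treated as 0 to match the chemfp behavior described in the `--help` text.

```python
import math

def popcount(value):
    """Number of 1-bits in a non-negative integer."""
    return bin(value).count("1")

def compare(fp1, fp2):
    """Return the four standard metrics for two fingerprints stored as ints."""
    both = popcount(fp1 & fp2)
    either = popcount(fp1 | fp2)
    n1, n2 = popcount(fp1), popcount(fp2)
    return {
        # A score of 0/0 is reported as 0
        "Tanimoto": both / either if either else 0.0,
        "Dice": 2 * both / (n1 + n2) if (n1 + n2) else 0.0,
        "cosine": both / math.sqrt(n1 * n2) if (n1 and n2) else 0.0,
        "Hamming": popcount(fp1 ^ fp2),
    }

fp1 = int("f0f0", 16)   # 8 on-bits
fp2 = int("ff00", 16)   # 8 on-bits, 4 of them shared with fp1
scores = compare(fp1, fp2)
# Tanimoto = 4/12, Dice = 8/16, cosine = 4/sqrt(64), Hamming = 8
```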

The corresponding dtypes are:

  float64 - a 64-bit float (a.k.a "double")
  float32 - a 32-bit float (a.k.a "float")
  rational64 - a structured NumPy dtype with the structure (uint32, uint32)
  rational32 - a structured NumPy dtype with the structure (uint16, uint16)
  uint16 - an unsigned 16-bit integer

The rational formats contain the numerator and the denominator. The rational
value is not necessarily in reduced form (eg, it may store (2, 4) instead of
(1, 2)).
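A rational array can be converted to floating point with NumPy once it is loaded. The sketch below builds a stand-in "rational32"-style array; the field names "numerator" and "denominator" are assumptions for illustration only (the names chemfp uses are not given here), so the helper relies on field order instead. It maps 0/0 to 0.0, following the chemfp convention for similarity scores.

```python
import numpy as np

# Stand-in for a "rational32" array; these field names are hypothetical
rational32 = np.dtype([("numerator", np.uint16), ("denominator", np.uint16)])
values = np.array([(2, 4), (1, 2), (0, 0), (3, 3)], dtype=rational32)

def rational_to_float(array):
    """Convert (numerator, denominator) pairs to float64, with 0/0 as 0.0."""
    num_name, den_name = array.dtype.names   # rely on field order, not names
    num = array[num_name].astype(np.float64)
    den = array[den_name].astype(np.float64)
    result = np.zeros(array.shape, dtype=np.float64)
    # Only divide where the denominator is non-zero; other entries stay 0.0
    np.divide(num, den, out=result, where=(den != 0))
    return result

floats = rational_to_float(values)
```

Note that (2, 4) and (1, 2) give the same float even though they are stored differently, which is why comparing rational entries field by field is not a reliable equality test.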

For the Tanimoto, Dice, and cosine metrics the "uint16" is the
floor(float64_score * 65535) to give a 2-byte value between 0 (no similarity)
and 65535 (100% similarity).
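The scaling is easy to reproduce in NumPy, which also shows how much precision is lost. This is a sketch of the formula as stated above, using a few hand-picked scores:

```python
import numpy as np

scores = np.array([0.0, 0.25, 0.5, 1.0])

# floor(score * 65535) stored as an unsigned 16-bit integer
scaled = np.floor(scores * 65535).astype(np.uint16)

# Dividing by 65535 recovers an approximation of the original score
approx = scaled / 65535
max_error = np.abs(approx - scores).max()
```

Because of the floor, the recovered value is never larger than the original score and is off by less than 1/65535.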

Dozens of other metrics have been proposed over the decades, and most of the
cheminformatics papers which describe them use an "a", "b", "c", "d" notation
to describe the metric given four individual components for how to compare two
fingerprints. These are available in simarray as a structured NumPy dtype with
four uint16 terms labeled "a", "b", "c", and "d".

Unfortunately, different papers use different notations, and even the same
author might use a different notation in two different papers.

I've reviewed about 40 years of publications and found that nearly all of them
use one of three conventions, which I've termed "Sheffield", "Willett", and
"Daylight".

The "Sheffield" convention uses the following definition:

  "a" = the number of on-bits in common
      = popcount(fp1 & fp2)
  "b" = the number of on-bits in fp1 which are off-bits in fp2
      = popcount(fp1) - popcount(fp1 & fp2)
  "c" = the number of on-bits in fp2 which are off-bits in fp1
      = popcount(fp2) - popcount(fp1 & fp2)
  "d" = the number of off-bits in fp1 which are also off-bits in fp2
      = popcount((~fp1) & (~fp2))

where "~fp" inverts the bits in the fingerprint. Note that a+b+c+d equals the
number of bits in the fingerprint.

This was used at Sheffield during the 1970s and early 1980s, came back into
use by 2002, and was in full swing after 2010. If John Holliday is one of
the authors of a post-2000 Sheffield paper then it probably uses this
convention.

The "Willett" convention uses the following definition:

  "a" = the number of on-bits in the first fingerprint
      = popcount(fp1)
  "b" = the number of on-bits in the second fingerprint
      = popcount(fp2)
  "c" = the number of on-bits in common
      = popcount(fp1 & fp2)
  "d" = the number of off-bits in fp1 which are also off-bits in fp2
      = popcount((~fp1) & (~fp2))

This was first used by Peter Willett in the 1980s and was the primary
definition used in Sheffield publications until around 2010. It is also the
most common definition used by non-Sheffield researchers. Note that a+b-c+d
equals the number of bits in the fingerprint.

The "Daylight" convention uses the following definition:

  "a" = the number of on-bits unique to the first fingerprint
      = popcount(fp1) - popcount(fp1 & fp2)
  "b" = the number of on-bits unique to the second fingerprint
      = popcount(fp2) - popcount(fp1 & fp2)
  "c" = the number of on-bits in common
      = popcount(fp1 & fp2)
  "d" = the number of off-bits in fp1 which are also off-bits in fp2
      = popcount((~fp1) & (~fp2))

This was first used by John Bradshaw in 1997 when he presented the Tversky
similarity at a Daylight user group meeting. It was later used by the Daylight
toolkit to allow for custom metric definitions. This convention is mostly
used by people or documentation closely related to Daylight. Note that a+b+c+d
equals the number of bits in the fingerprint.
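The three conventions can be compared side by side with a short plain-Python sketch. It stores fingerprints as Python integers and follows the definitions above; the `abcd` helper and the two 8-bit fingerprints are invented for the example and are not part of chemfp.

```python
def popcount(value):
    """Number of 1-bits in a non-negative integer."""
    return bin(value).count("1")

def abcd(fp1, fp2, num_bits, convention):
    """Return the (a, b, c, d) counts for two fingerprints stored as ints."""
    n1, n2 = popcount(fp1), popcount(fp2)
    common = popcount(fp1 & fp2)
    # Off in both = total bits minus the bits set in either fingerprint
    both_off = num_bits - popcount(fp1 | fp2)
    if convention == "Sheffield":
        return (common, n1 - common, n2 - common, both_off)
    if convention == "Willett":
        return (n1, n2, common, both_off)
    if convention == "Daylight":
        return (n1 - common, n2 - common, common, both_off)
    raise ValueError("unknown convention: %r" % (convention,))

fp1, fp2, num_bits = 0b11111000, 0b00011110, 8
sheffield = abcd(fp1, fp2, num_bits, "Sheffield")   # (2, 3, 2, 1)
willett = abcd(fp1, fp2, num_bits, "Willett")       # (5, 4, 2, 1)
daylight = abcd(fp1, fp2, num_bits, "Daylight")     # (3, 2, 2, 1)
```

The same pair of fingerprints gives three different tuples, which is why a formula copied from a paper only makes sense once its convention is known. For Sheffield and Daylight a+b+c+d is 8, and for Willett a+b-c+d is 8.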

If you want to use an "abcd" notation from an existing paper, figure out which
convention they use, ask simarray to compute the appropriate components using
`--metric`, then use NumPy (or Numba or whatever else) to compute the
expression based on those terms.
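As an illustration of that last step, the sketch below evaluates two well-known expressions in the Sheffield convention, Tanimoto as a/(a+b+c) and simple matching as (a+d)/(a+b+c+d), from a structured array with uint16 fields "a", "b", "c", and "d". The small array is a hand-made stand-in for real simarray output.

```python
import numpy as np

abcd_dtype = np.dtype([("a", np.uint16), ("b", np.uint16),
                       ("c", np.uint16), ("d", np.uint16)])

# Stand-in 1x2 array of Sheffield-convention counts for 8-bit fingerprints
counts = np.array([[(2, 3, 2, 1), (0, 0, 0, 8)]], dtype=abcd_dtype)

# Widen to float64 first so the sums cannot overflow uint16
a = counts["a"].astype(np.float64)
b = counts["b"].astype(np.float64)
c = counts["c"].astype(np.float64)
d = counts["d"].astype(np.float64)

# Tanimoto = a / (a + b + c), with 0/0 treated as 0
denom = a + b + c
tanimoto = np.divide(a, denom, out=np.zeros_like(a), where=(denom != 0))

# Simple matching = (a + d) / (a + b + c + d)
simple_matching = (a + d) / (a + b + c + d)
```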