chemfp fpc2fps

The “chemfp fpc2fps” command-line tool converts FPC files containing sparse count fingerprints into FPS or FPB files containing binary fingerprints.

No conversion can represent the full range of sparse count fingerprints as binary fingerprints, much less produce binary fingerprints whose pairwise Tanimoto scores equal the generalized Jaccard index of the original fingerprint pairs. The tool instead offers several approximate approaches.
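
To make the two similarity measures concrete, here is a minimal sketch, assuming count fingerprints are held as Python dicts mapping feature id to count and binary fingerprints as sets of on-bits. The function names are illustrative only and are not part of the chemfp API. Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
def generalized_jaccard(fp1, fp2):
    """Generalized Jaccard index of two count fingerprints.

    fp1 and fp2 are dicts mapping feature id -> count (an assumption
    of this sketch, not the FPC file representation).
    """
    ids = set(fp1) | set(fp2)
    numerator = sum(min(fp1.get(i, 0), fp2.get(i, 0)) for i in ids)
    denominator = sum(max(fp1.get(i, 0), fp2.get(i, 0)) for i in ids)
    return numerator / denominator if denominator else 0.0


def tanimoto(bits1, bits2):
    """Binary Tanimoto of two fingerprints given as sets of on-bits."""
    union = len(bits1 | bits2)
    return len(bits1 & bits2) / union if union else 0.0


fp1 = {10: 1, 11: 7}
fp2 = {10: 1, 11: 2}
print(generalized_jaccard(fp1, fp2))   # (1 + 2) / (1 + 7)
print(tanimoto(set(fp1), set(fp2)))    # counts ignored: same features
```

The two fingerprints share the same feature ids, so a conversion that ignores counts scores them as identical, while the generalized Jaccard index of the counts is well below 1. The conversion methods differ in how much of that count information they try to keep.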

The default “superimpose” method assumes the total number of features is well under the output fingerprint size. The “seq” method assumes there are a small number of distinct features (a few dozen), with a relatively low maximum count (under 100). The “rdkit” method implements RDKit’s “count simulation” method.

For full details about these, plus a couple of others, use chemfp fpc2fps --help-methods.

The rest of this chapter contains the output from chemfp fpc2fps --help and chemfp fpc2fps --help-methods.

chemfp fpc2fps command-line options

The following comes from chemfp fpc2fps --help:

Usage: chemfp fpc2fps [OPTIONS] [FILENAMES]...

  Convert count fingerprints to binary.

Options:
  --fold                          Map feature i to bit (i % num_bits). Ignore
                                  count.
  --rdkit-count-sim, --rdkit      Use the RDKit count simulation algorithm.
  --scaled                        Like --superimpose, but with a user-defined
                                  mapping from count to the number of samples
                                  to use.
  --seq                           Sequentially map feature i counts to the
                                  first N_i bits using unary count encoding.
  --seq-scaled                    Like --seq, but with a user-defined mapping
                                  from count to unary encoding.
  --superimpose                   For each feature i, seed RNG(i) then repeat
                                  count*bits_per_count times to set bit
                                  rng.range(num_bits) to 1. [default]
  --num-bits INT                  Output fingerprint size. If the size cannot
                                  be inferred from other options then default
                                  to 2048 bits.
  --bits-per-count INT            Multiplier for the number of samples to
                                  generate for each count. Default: 1
                                  [superimpose]
  --max-count INT|none            Maximum count to use, or 'none'. Default:
                                  'none'. [superimpose]
  --sizes N0,N1,N2,...            Comma-separated list of counts for
                                  sequential features i=0,1,... [seq]
  --countBounds INT,INT,...       List of minimum counts needed to set the
                                  corresponding bit in RDKit count simulation,
                                  eg, '1,2,4,8' [rdkit-count-sim]
  --table STR                     Table mapping 1 or more feature ids to a
                                  count scale. Eg, '0,98->1:2,5:2/7->3:1'. See
                                  below for details. [scaled, seq-scaled]
  --scale STR                     Default count scale. Eg,
                                  '1:1,4:2,8:4,16:32'. See below for details.
                                  [scaled]
  --in FORMAT                     Input fingerprint format. One of fpc,
                                  fpc.gz, or fpc.zst. (default guesses from
                                  filename or is fpc)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output format, one of 'fps', 'fps.gz',
                                  'fps.zst', 'fpb', or 'flush' (default
                                  guesses from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-methods                  Give more details about the conversion
                                  methods
  --help                          Show this message and exit.

  Convert the count fingerprint into binary fingerprints using one of several
  methods. The count fingerprints contain a sequence of 0 or more feature
  entries, each with a feature id and non-negative count. The output is a
  binary fingerprint of size --num-bits.

  Here are the supported conversion methods. Use --help-methods for more
  details about each one.

  * The --superimpose method distributes the counts randomly. It takes three
  parameters: --num-bits, --bits-per-count, and --max-count. For each feature
  id it seeds an RNG then selects count * bits_per_count randomly chosen
  values up to num_bits and sets the corresponding output bit to 1. This is
  the default method.

  * The --scaled method is a variation of the --superimpose method with
  rescaled counts. It takes three parameters: --num-bits, --table and --scale.
  It might be used to implement a form of TF-IDF (Term Frequency-Inverse
  Document Frequency). Contact me if you want to explore this option.

  * The --fold method uses the feature id modulo the number of bits to set the
  output bit. The counts are ignored. It takes one parameter: --num-bits.

  * The --rdkit-count-sim method emulates the RDKit count simulation
  algorithm. It takes two parameters: --num-bits and --countBounds. The
  countBounds contains a list of minimum counts, like "1,2,4,8". See
  --help-methods for full details.

  * The --seq method maps dense count fingerprints into a sequence of bins
  using unary count encoding. It takes two parameters: --sizes containing a
  list of bin sizes, and an optional --num-bits. If not specified then
  --num-bits is the sum of the bin sizes.

  * The --seq-scaled method is a variation of the --seq method with rescaled
  counts. It takes two parameters: --table, and an optional --num-bits.

  The "scaled" and "seq-scaled" methods use a scale to convert from the
  original feature count to the actual count to use, called "repeat". Each
  scale contains 1 or more scale terms. Each scale term contains a minimum
  count and a repeat. The scale terms must be in increasing minimum count
  order. Given a feature count, the repeat for the scale term with the largest
  minimum count <= count is used. The repeat is 0 if there is no matching
  scale term.

  The scale "2:1" uses a repeat of 1 if the count >= 2.

  The scale "1:1,2:2,4:3,8:4,16:5,32:6,64:7,128:8" is equivalent to
  1 + int(log2(min(128, count))). (The count must be a positive integer.)

  The "--scale" option specifies the default scale for the scaled method. By
  default it is "1:1".

  The "--table" option contains the scales to use for one or more named
  feature ids. For example, "123->1:1,10:2/456,789->1:1,8:2,64:3" uses the
  scale "1:1,10:2" for sparse feature bit 123 and the scale "1:1,8:2,64:3" for
  the sparse feature bits 456 and 789.

  NOTE: The fpc2fps tool is EXPERIMENTAL and the command-line API is NOT
  STABLE. Please provide feedback to help guide its development.

Supported fpc2fps methods

The following description of methods to convert sparse count fingerprints to binary fingerprints comes from chemfp fpc2fps --help-methods:

Here are more details about the fpc2fps methods to convert a sparse
fingerprint into a binary fingerprint.

To start, an input sparse count fingerprint contains zero or more features.
Each feature contains a feature id (also called bit number) and a count. The
input features are represented by a string. The string "*" means there are no
features. The string "505:47" means there is one feature with feature id 505
and count 47. The string "10,11:7,12,15:2" means there are four features;
features 10 and 12 have a count of 1, feature 11 has a count of 7, and feature
15 has a count of 2. (If the count isn't specified, it's assumed to be 1.)
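
As an illustration of this notation, here is a minimal sketch of a parser for the feature string alone. The name parse_features is hypothetical; this is not chemfp's FPC reader and it does no error checking.

```python
def parse_features(text):
    """Parse a feature string like "10,11:7,12,15:2".

    Returns a list of (feature_id, count) pairs. A missing count
    means 1, and "*" means there are no features.
    """
    if text == "*":
        return []
    features = []
    for term in text.split(","):
        feature_id, sep, count = term.partition(":")
        features.append((int(feature_id), int(count) if sep else 1))
    return features


print(parse_features("10,11:7,12,15:2"))
```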

The output binary fingerprint is "num_bits" bits long. This can be specified
with --num-bits or for the "seq" and "seq-scaled" methods computed from the
other arguments.

* The --superimpose method distributes the counts randomly. It takes three
parameters: --num-bits, --bits-per-count, and --max-count. For each feature id
it seeds an RNG then selects count * bits_per_count randomly chosen values up
to num_bits and sets the corresponding output bit to 1. This is the default
method.

This pseudocode describes the details:

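    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        # Seeding by feature id makes the bit choices reproducible.
        rng = RNG(feature_id)
        # A max_count of 'none' means the count is not capped.
        for _ in range(min(count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)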

Example:

  fpc2fps --superimpose --num-bits 2048 --bits-per-count 2 --max-count 100
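
The following runnable sketch mirrors the pseudocode. It assumes Python's random.Random as a stand-in for the RNG, which the help text does not specify, so the bits it sets will generally not match what fpc2fps produces. The function name, the use of a set of on-bits for the output, and the use of None for "no maximum count" are choices made for this sketch.

```python
import random


def superimpose(features, num_bits=2048, bits_per_count=1, max_count=None):
    """Return the set of on-bits for a list of (feature_id, count) pairs.

    Assumption: random.Random stands in for the unspecified RNG.
    """
    bits = set()
    for feature_id, count in features:
        if max_count is not None:
            count = min(count, max_count)
        # Same feature id -> same seed -> same sequence of bit positions.
        rng = random.Random(feature_id)
        for _ in range(count * bits_per_count):
            bits.add(rng.randrange(num_bits))
    return bits


low = superimpose([(7, 3)])
high = superimpose([(7, 5)])
print(low <= high)  # the bits for the lower count are a subset
```

Because each feature id always seeds the same sequence, a feature with a higher count sets a superset of the bits set by the same feature with a lower count. Two samples may land on the same bit, so the number of on-bits can be smaller than count * bits_per_count.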

* The --scaled method is a variation of the --superimpose method with rescaled
counts. It takes three parameters: --num-bits, --table and --scale. It might
be used to implement a form of TF-IDF (Term Frequency-Inverse Document
Frequency). Contact me if you want to explore this option.

For each feature id and count, the method gets the corresponding scale for the
feature id from --table, if present, otherwise it uses the --scale scale, then
uses the count to get the repeat value, which is used to generate the
superimposed bits. This pseudocode describes the details:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        # Use the feature's own scale from --table, else the --scale default.
        scale = table.get(feature_id, default)
        repeat_to_use = 0
        # Scale terms are in increasing min_count order.
        for min_count, repeat in scale:
            if min_count <= count:
                repeat_to_use = repeat
            else:
                break
        rng = RNG(feature_id)
        for _ in range(repeat_to_use):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

Example:

  fpc2fps --scaled --table "1,5,30->4:1,6:2,10:3/22,23->1:1,2:2" \
      --scale "1:1"

This says that feature ids 1, 5, and 30 are interpreted using the scale
"4:1,6:2,10:3" and feature ids 22 and 23 are interpreted using the scale
"1:1,2:2", while all other features are interpreted using the default scale
"1:1".
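
To show how a table string and a scale resolve to a repeat value, here is a minimal sketch. The helper names parse_scale, parse_table, and get_repeat are hypothetical, the parsing does no validation, and the real option parser may accept or reject inputs differently.

```python
def parse_scale(text):
    """Parse "4:1,6:2,10:3" into [(4, 1), (6, 2), (10, 3)]."""
    scale = []
    for term in text.split(","):
        min_count, _, repeat = term.partition(":")
        scale.append((int(min_count), int(repeat)))
    return scale


def parse_table(text):
    """Parse "1,5,30->4:1,6:2/22,23->1:1,2:2" into {feature_id: scale}."""
    table = {}
    for entry in text.split("/"):
        ids, _, scale_text = entry.partition("->")
        scale = parse_scale(scale_text)
        for feature_id in ids.split(","):
            table[int(feature_id)] = scale
    return table


def get_repeat(scale, count):
    """Repeat for the term with the largest min_count <= count, else 0."""
    repeat_to_use = 0
    for min_count, repeat in scale:
        if min_count <= count:
            repeat_to_use = repeat
        else:
            break
    return repeat_to_use


table = parse_table("1,5,30->4:1,6:2,10:3/22,23->1:1,2:2")
default = parse_scale("1:1")
for feature_id, count in [(5, 3), (5, 7), (22, 9), (99, 50)]:
    scale = table.get(feature_id, default)
    print(feature_id, count, get_repeat(scale, count))
```

Feature 5 with a count of 3 falls below the smallest minimum count in its scale, so its repeat is 0 and it sets no bits. Feature 99 is not in the table, so it uses the default scale.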

* The --fold method uses the feature id modulo the number of bits to set the
output bit. The counts are ignored. It takes one parameter: --num-bits. This
pseudocode describes the details:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        output_fp.SetOnBit(feature_id % num_bits)

Example:

  fpc2fps --fold --num-bits 1024
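
A short sketch of folding, with the output held as a set of on-bits (a representation chosen for this sketch), shows both kinds of information loss: counts are dropped, and feature ids that differ by a multiple of num_bits collide.

```python
def fold(features, num_bits=2048):
    """Return the set of on-bits; counts are ignored."""
    return {feature_id % num_bits for feature_id, _count in features}


# Feature ids 10 and 1034 differ by 1024, so they share a bit.
print(fold([(10, 1), (1034, 7)], num_bits=1024))
```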

* The --rdkit-count-sim method emulates the RDKit count simulation algorithm.
It takes two parameters: --num-bits and --countBounds. The countBounds
contains a list of *k* minimum counts, like "1,2,4,8".

It folds the sparse count fingerprint into a summed count fingerprint with an
effective size of num_bits // k, then for each folded bitno and summed count
it uses the countBounds to set 0 or more bits in the final binary fingerprint,
along the lines of this pseudocode:

    output_fp = BinaryFingerprint(num_bits)
    k = len(countBounds)
    effective_size = num_bits // k

    # Create a dense summed count fingerprint
    counts = [0] * effective_size
    for feature_id, count in features:
        counts[feature_id % effective_size] += count

    # Convert to the binary fingerprint
    for folded_bitno, count in enumerate(counts):
        for offset, min_count in enumerate(countBounds):
            if min_count <= count:
                output_fp.SetOnBit(folded_bitno*k + offset)

Example:

  fpc2fps --rdkit-count-sim --countBounds 1,2,4,8,16
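
Here is a runnable version of the pseudocode above, worked on a deliberately tiny fingerprint so the result can be checked by hand. It follows the description given here and is not tested against RDKit itself. The function name and the set-of-on-bits output are choices made for this sketch.

```python
def rdkit_count_sim(features, num_bits=2048, count_bounds=(1, 2, 4, 8)):
    """Return the set of on-bits for a list of (feature_id, count) pairs."""
    k = len(count_bounds)
    effective_size = num_bits // k

    # Fold into a dense summed count fingerprint.
    counts = [0] * effective_size
    for feature_id, count in features:
        counts[feature_id % effective_size] += count

    # Each folded position owns k consecutive output bits, one per bound.
    bits = set()
    for folded_bitno, count in enumerate(counts):
        for offset, min_count in enumerate(count_bounds):
            if min_count <= count:
                bits.add(folded_bitno * k + offset)
    return bits


# With 16 bits and 4 bounds the effective size is 4, so feature ids
# 1 and 5 fold together and their counts sum to 5.
print(sorted(rdkit_count_sim([(1, 3), (5, 2), (2, 1)], num_bits=16)))
```

The summed count of 5 at folded position 1 passes the bounds 1, 2, and 4 but not 8, so it sets bits 4, 5, and 6. The count of 1 at folded position 2 sets only bit 8.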

* The --seq method maps dense count fingerprints into a sequence of bins using
unary count encoding. It takes two parameters: --sizes containing a list of
bin sizes, and an optional --num-bits. If not specified then --num-bits is the
sum of the bin sizes.

A "dense" count fingerprint type has N distinct features (up to a few
hundred), with feature ids in sequential order 0, 1, ... (N-1).

Unary encoding uses one on-bit for each count, up to the total available size.
If the bin size is 5 then the unary encoding for 2 is "11000" and the unary
encoding for 5 is "11111", when expressed in binary.

This method maps each count feature to a bin, in sequential order. (Feature id
0 maps to the first bin, feature id 1 maps to the second, etc.) The feature
count is unary encoded into the bin, up to the size of the bin. This
pseudocode describes the details:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        bin_size = sizes[feature_id]
        bin_start = sum(sizes[:feature_id])
        for bitno in range(bin_start, bin_start+min(count, bin_size)):
            output_fp.SetOnBit(bitno)

Example:

  fpc2fps --seq --sizes 100,100,50

This places the counts for feature 0 into the first 100 bits, the counts for
feature 1 into the second 100 bits, and the counts for feature 2 into the last
50 bits.
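
The sequential unary encoding can be sketched as follows. The names seq and to_bitstring are illustrative, and the bit string is written with bit 0 first, matching the way the unary examples are written in this section.

```python
def seq(features, sizes):
    """Return the set of on-bits for dense (feature_id, count) pairs."""
    bits = set()
    for feature_id, count in features:
        bin_size = sizes[feature_id]
        bin_start = sum(sizes[:feature_id])
        # Unary encoding, truncated at the size of the bin.
        bits.update(range(bin_start, bin_start + min(count, bin_size)))
    return bits


def to_bitstring(bits, num_bits):
    """Render on-bits as a string of 0s and 1s, bit 0 first."""
    return "".join("1" if i in bits else "0" for i in range(num_bits))


sizes = [5, 3]
print(to_bitstring(seq([(0, 2), (1, 9)], sizes), sum(sizes)))
```

In the usage line, feature 0 has a count of 2 in a bin of size 5, and feature 1 has a count of 9, which is truncated to fill its bin of size 3.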

* The --seq-scaled method is a variation of the --seq method with rescaled
counts. It takes two parameters: --table, and an optional --num-bits.

This method maps each count feature to a bin, in sequential order. (Feature id
0 maps to the first bin, feature id 1 maps to the second, etc.) The feature
count is converted into a repeat count using the scale table for the feature
id. The repeat count is unary encoded into the bin, up to the size of the bin.
This pseudocode describes the details:

    output_fp = BinaryFingerprint(num_bits)
    sizes = determine_bin_sizes(table)
    for feature_id, count in features:
        repeat_to_use = 0
        # Scale terms are in increasing min_count order.
        for min_count, repeat in table[feature_id]:
            if min_count <= count:
                repeat_to_use = repeat
            else:
                break
        bin_size = sizes[feature_id]
        bin_start = sum(sizes[:feature_id])
        # Unary encode the repeat, up to the size of the bin.
        for bitno in range(bin_start, bin_start+min(repeat_to_use, bin_size)):
            output_fp.SetOnBit(bitno)

Example:

  fpc2fps --seq-scaled --table "0,2->1:1,2:2,4:3,8:4,16:5/1->1:1,3:2,7:3"

This defines two scales: "1:1,2:2,4:3,8:4,16:5" and "1:1,3:2,7:3" with sizes 5
and 3 bits, respectively. Features 0 and 2 use the first scale and feature 1
uses the second scale, giving three bins, of sizes 5, 3, and 5.

If the count for feature 0 is 16 or larger then the 5 bits of bin 0 are
"11111". Otherwise, if the count is 8 or larger then the bits are "11110".
Otherwise, if the count is 4 or larger then the bits are "11100". Otherwise,
if the count is 2 or larger then the bits are "11000". Otherwise, if the
count is at least 1, the bits are "10000".
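
The example can be traced with a short sketch. It assumes each bin is as wide as the largest repeat in its scale, which is consistent with the sizes 5, 3, and 5 given above but is not stated as the general rule. The table is written directly as a dict rather than parsed from the --table string, and the function name is illustrative.

```python
scale_a = [(1, 1), (2, 2), (4, 3), (8, 4), (16, 5)]
scale_b = [(1, 1), (3, 2), (7, 3)]
table = {0: scale_a, 1: scale_b, 2: scale_a}


def seq_scaled(features, table):
    """Return (sizes, on_bits) for dense (feature_id, count) pairs."""
    # Assumption: bin size = largest repeat in that feature's scale.
    sizes = [max(repeat for _, repeat in table[i])
             for i in range(max(table) + 1)]
    bits = set()
    for feature_id, count in features:
        repeat_to_use = 0
        for min_count, repeat in table[feature_id]:
            if min_count <= count:
                repeat_to_use = repeat
            else:
                break
        bin_start = sum(sizes[:feature_id])
        bits.update(range(bin_start, bin_start + repeat_to_use))
    return sizes, bits


sizes, bits = seq_scaled([(0, 9), (1, 3), (2, 1)], table)
print(sizes)
print("".join("1" if i in bits else "0" for i in range(sum(sizes))))
```

With counts of 9, 3, and 1 the three bins hold "11110", "110", and "10000", which are concatenated into a 13-bit fingerprint.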