chemfp fpc2fps
The “chemfp fpc2fps” command-line tool converts FPC files containing sparse count fingerprints into FPS or FPB files containing binary fingerprints.
Sparse count fingerprints cannot, in general, be converted into binary fingerprints without losing information, much less into binary fingerprints whose pairwise Tanimoto scores are identical to the generalized Jaccard index of the original fingerprint pairs. The tool instead offers several approximate methods, each suited to a different kind of input.
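To make the two similarity measures concrete, here is a small Python sketch. It is only an illustration, not chemfp code: the function names are hypothetical, count fingerprints are modeled as dicts from feature id to count, binary fingerprints are modeled as sets of on-bit positions, and returning 0.0 for two empty fingerprints is a choice made for this sketch.

```python
def generalized_jaccard(fp1, fp2):
    """Generalized Jaccard index of two count fingerprints (dict: feature id -> count)."""
    ids = set(fp1) | set(fp2)
    numerator = sum(min(fp1.get(i, 0), fp2.get(i, 0)) for i in ids)
    denominator = sum(max(fp1.get(i, 0), fp2.get(i, 0)) for i in ids)
    return numerator / denominator if denominator else 0.0


def tanimoto(bits1, bits2):
    """Tanimoto score of two binary fingerprints (sets of on-bit positions)."""
    union = len(bits1 | bits2)
    return len(bits1 & bits2) / union if union else 0.0


a = {10: 1, 11: 7, 12: 1}
b = {10: 1, 11: 3, 15: 1}
print(generalized_jaccard(a, b))  # prints 0.4
# Dropping the counts and keeping only the feature ids gives a different score.
print(tanimoto(set(a), set(b)))   # prints 0.5
```

The gap between 0.4 and 0.5 is the information lost when the counts are simply discarded; the conversion methods differ in how they try to keep some of it.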
The default “superimpose” method assumes the total number of features is well under the output fingerprint size. The “seq” method assumes there are a small number of distinct features (a few dozen), with a relatively low maximum count (under 100). The “rdkit” method implements RDKit’s “count simulation” method.
For full details about these, plus a couple of others, use chemfp fpc2fps --help-methods.
The rest of this chapter contains the output from chemfp fpc2fps --help and chemfp fpc2fps --help-methods.
chemfp fpc2fps command-line options
The following comes from chemfp fpc2fps --help:
Usage: chemfp fpc2fps [OPTIONS] [FILENAMES]...
Convert count fingerprints to binary.
Options:
  --fold                          Map feature i to bit (i % num_bits). Ignore
                                  count.
  --rdkit-count-sim, --rdkit      Use the RDKit count simulation algorithm.
  --scaled                        Like --superimpose, but with a user-defined
                                  mapping from count to the number of samples
                                  to use.
  --seq                           Sequentially map feature i counts to the
                                  first N_i bits using unary count encoding.
  --seq-scaled                    Like --seq, but with a user-defined mapping
                                  from count to unary encoding.
  --superimpose                   For each feature i, seed RNG(i) then repeat
                                  count*bits_per_count times to set bit
                                  rng.range(num_bits) to 1. [default]
  --num-bits INT                  Output fingerprint size. If the size cannot
                                  be inferred from other options then default
                                  to 2048 bits.
  --bits-per-count INT            Multiplier for the number of samples to
                                  generate for each count. Default: 1
                                  [superimpose]
  --max-count INT|none            Maximum count to use, or 'none'. Default:
                                  'none'. [superimpose]
  --sizes N0,N1,N2,...            Comma-separated list of counts for
                                  sequential features i=0,1,... [seq]
  --countBounds INT,INT,...       List of minimum counts needed to set the
                                  corresponding bit in RDKit count simulation,
                                  eg, '1,2,4,8' [rdkit-count-sim]
  --table STR                     Table mapping 1 or more feature ids to a
                                  count scale. Eg, '0,98->1:2,5:2/7->3:1'. See
                                  below for details. [scaled, seq-scaled]
  --scale STR                     Default count scale. Eg,
                                  '1:1,4:2,8:4,16:32'. See below for details.
                                  [scaled]
  --in FORMAT                     Input fingerprint format. One of fpc,
                                  fpc.gz, or fpc.zst. (default guesses from
                                  filename or is fpc)
  -o, --output FILENAME           Save the fingerprints to FILENAME
                                  (default=stdout)
  --out FORMAT                    Output format, one of 'fps', 'fps.gz',
                                  'fps.zst', 'fpb', or 'flush' (default
                                  guesses from output filename, or is 'fps')
  --include-metadata / --no-metadata
                                  With --no-metadata, do not include the
                                  header metadata for FPS output.
  --no-date                       Do not include the 'date' metadata in the
                                  output header
  --date STR                      An ISO 8601 date (like
                                  '2025-02-07T11:10:15') to use for the 'date'
                                  metadata in the output header
  --progress / --no-progress      Show a progress bar (default: show unless
                                  the output is a terminal)
  --help-methods                  Give more details about the conversion
                                  methods
  --help                          Show this message and exit.
Convert the count fingerprint into binary fingerprints using one of several
methods. The count fingerprints contain a sequence of 0 or more feature
entries, each with a feature id and non-negative count. The output is a
binary fingerprint of size --num-bits.
Here are the supported conversion methods. Use --help-methods for more
details about each one.
* The --superimpose method distributes the counts randomly. It takes three
parameters: --num-bits, --bits-per-count, and --max-count. For each feature
id it seeds an RNG then selects count * bits_per_count randomly chosen
values up to num_bits and sets the corresponding output bit to 1. This is
the default method.
* The --scaled method is a variation of the --superimpose method with
rescaled counts. It takes three parameters: --num-bits, --table and --scale.
It might be used to implement a form of TF-IDF (Term Frequency-Inverse
Document Frequency). Contact me if you want to explore this option.
* The --fold method uses the feature id modulo the number of bits to set the
output bit. The counts are ignored. It takes one parameter: --num-bits.
* The --rdkit-count-sim method emulates the RDKit count simulation
algorithm. It takes two parameters: --num-bits and --countBounds. The
countBounds contains a list of minimum counts, like "1,2,4,8". See
--help-methods for full details.
* The --seq method maps dense count fingerprints into a sequence of bins
using unary count encoding. It takes two parameters: --sizes containing a
list of bin sizes, and an optional --num-bits. If not specified then
--num-bits is the sum of the bin sizes.
* The --seq-scaled method is a variation of the --seq method with rescaled
counts. It takes two parameters: --table, and an optional --num-bits.
The "scaled" and "seq-scaled" methods use a scale to convert from the
original feature count to the actual count to use, called "repeat". Each
scale contains 1 or more scale terms. Each scale term contains a minimum
count and a repeat. The scale terms must be in increasing minimum count
order. Given a feature count, the repeat for the scale term with the largest
minimum count <= count is used. The repeat is 0 if there is no matching
scale term.
The scale "2:1" uses a repeat of 1 if the count >= 2.
The scale "1:1,2:2,4:3,8:4,16:5,32:6,64:7,128:8" is equivalent to
1 + int(log2(min(128, count))). (The count must be a positive integer.)
The "--scale" option specifies the default scale for the scaled method. By
default it is "1:1".
The "--table" option contains the scales to use for one or more named
feature ids. For example, "123->1:1,10:2/456,789->1:1,8:2,64:3" uses the
scale "1:1,10:2" for sparse feature bit 123 and the scale "1:1,8:2,64:3" for
the sparse feature bits 456 and 789.
NOTE: The fpc2fps tool is EXPERIMENTAL and the command-line API is NOT
STABLE. Please provide feedback to help guide its development.
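The scale and table notation used by --scale and --table can be sketched in a few lines of Python. This is a minimal illustration based only on the description in the help text, not chemfp's own parser; the names parse_scale, parse_table, and get_repeat are hypothetical.

```python
def parse_scale(text):
    """Parse a scale such as "1:1,4:2,8:4" into (min_count, repeat) pairs."""
    terms = []
    for term in text.split(","):
        min_count, repeat = term.split(":")
        terms.append((int(min_count), int(repeat)))
    # The help text requires the terms to be in increasing minimum count order.
    if any(a[0] >= b[0] for a, b in zip(terms, terms[1:])):
        raise ValueError("scale terms must be in increasing minimum count order")
    return terms


def parse_table(text):
    """Parse a table such as "123->1:1,10:2/456,789->1:1,8:2" into {feature_id: scale}."""
    table = {}
    for entry in text.split("/"):
        ids, scale_text = entry.split("->")
        scale = parse_scale(scale_text)
        for feature_id in ids.split(","):
            table[int(feature_id)] = scale
    return table


def get_repeat(scale, count):
    """Return the repeat of the term with the largest min_count <= count, or 0."""
    repeat_to_use = 0
    for min_count, repeat in scale:
        if min_count > count:
            break
        repeat_to_use = repeat
    return repeat_to_use


scale = parse_scale("1:1,2:2,4:3,8:4,16:5,32:6,64:7,128:8")
print([get_repeat(scale, n) for n in (1, 2, 3, 4, 127, 128, 1000)])
# prints [1, 2, 2, 3, 7, 8, 8]
print(get_repeat(parse_scale("2:1"), 1))  # prints 0

table = parse_table("123->1:1,10:2/456,789->1:1,8:2,64:3")
print(get_repeat(table[789], 70))  # prints 3
```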
Supported fpc2fps methods
The following description of methods to convert sparse count
fingerprints to binary fingerprints comes from chemfp fpc2fps
--help-methods:
Here are more details about the fpc2fps methods to convert a sparse
fingerprint into a binary fingerprint.
To start, an input sparse count fingerprint contains zero or more features.
Each feature contains a feature id (also called bit number) and a count. The
input features are represented by a string. The string "*" means there are no
features. The string "505:47" means there is one feature with feature id 505
and count 47. The string "10,11:7,12,15:2" means there are four features;
features 10 and 12 have a count of 1, feature 11 has a count of 7, and feature
15 has a count of 2. (If the count isn't specified, it's assumed to be 1.)
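As an illustration of this notation, the following Python sketch turns a feature string into (feature id, count) pairs. It covers only the strings described here, not the FPC file format as a whole, and the name parse_features is hypothetical.

```python
def parse_features(text):
    """Parse a feature string like "10,11:7,12,15:2" into (feature_id, count) pairs."""
    if text == "*":
        return []  # "*" means there are no features
    features = []
    for term in text.split(","):
        feature_id, sep, count = term.partition(":")
        # A missing count is taken to be 1.
        features.append((int(feature_id), int(count) if sep else 1))
    return features


print(parse_features("*"))                # prints []
print(parse_features("505:47"))           # prints [(505, 47)]
print(parse_features("10,11:7,12,15:2"))  # prints [(10, 1), (11, 7), (12, 1), (15, 2)]
```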
The output binary fingerprint is "num_bits" bits long. This can be specified
with --num-bits or for the "seq" and "seq-scaled" methods computed from the
other arguments.
* The --superimpose method distributes the counts randomly. It takes three
parameters: --num-bits, --bits-per-count, and --max-count. For each feature id
it seeds an RNG then selects count * bits_per_count randomly chosen values up
to num_bits and sets the corresponding output bit to 1. This is the default
method.
This pseudocode describes the details:
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(min(count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
Example:
fpc2fps --superimpose --num-bits 2048 --bits-per-count 2 --max-count 100
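A runnable Python version of the superimpose idea might look like the sketch below. It is an illustration only: Python's random.Random stands in for the RNG, which is not specified here, so the bit positions will not match what fpc2fps produces. The function name and the use of a set of on-bit positions are also assumptions.

```python
import random


def superimpose(features, num_bits=2048, bits_per_count=1, max_count=None):
    """Return the set of on-bit positions for a list of (feature_id, count) pairs."""
    on_bits = set()
    for feature_id, count in features:
        if max_count is not None:
            count = min(count, max_count)
        # Assumption: random.Random is a stand-in for the tool's RNG.
        rng = random.Random(feature_id)
        for _ in range(count * bits_per_count):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


low = superimpose([(505, 3)])
high = superimpose([(505, 5)])
print(low <= high)  # prints True
```

Because the RNG is re-seeded with the feature id, a larger count for the same feature sets a superset of the bits set by a smaller count, which is how the binary fingerprint retains some of the count information. Random collisions mean the number of on-bits can be smaller than the number of samples.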
* The --scaled method is a variation of the --superimpose method with rescaled
counts. It takes three parameters: --num-bits, --table and --scale. It might
be used to implement a form of TF-IDF (Term Frequency-Inverse Document
Frequency). Contact me if you want to explore this option.
For each feature id and count, the method gets the corresponding scale for the
feature id from --table, if present, otherwise it uses the --scale scale, then
uses the count to get the repeat value, which is used to generate the
superimposed bits. This pseudocode describes the details:
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        scale = table.get(feature_id, default_scale)
        repeat_to_use = 0
        for min_count, repeat in scale:
            if min_count <= count:
                repeat_to_use = repeat
            else:
                break
        rng = RNG(feature_id)
        for _ in range(repeat_to_use):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)
Example:
fpc2fps --scaled --table "1,5,30->4:1,6:2,10:3/22,23->1:1,2:2" \
    --scale "1:1"
This says that feature ids 1, 5, and 30 are interpreted using the scale
"4:1,6:2,10:3" and feature ids 22 and 23 are interpreted using the scale
"1:1,2:2", while all other features are interpreted using the default scale
"1:1".
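The scaled method can be sketched in Python as follows, with the example table written out by hand as (min_count, repeat) pairs. As before, this is an illustration under assumptions rather than chemfp's code: random.Random is a stand-in RNG, and the function names are hypothetical.

```python
import random


def get_repeat(scale, count):
    """Return the repeat of the term with the largest min_count <= count, or 0."""
    repeat_to_use = 0
    for min_count, repeat in scale:
        if min_count > count:
            break
        repeat_to_use = repeat
    return repeat_to_use


def scaled(features, table, default_scale, num_bits=2048):
    """Return the set of on-bit positions for a list of (feature_id, count) pairs."""
    on_bits = set()
    for feature_id, count in features:
        scale = table.get(feature_id, default_scale)
        rng = random.Random(feature_id)  # assumption: stand-in for the tool's RNG
        for _ in range(get_repeat(scale, count)):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


# The example table and default scale, already parsed.
first = [(4, 1), (6, 2), (10, 3)]
second = [(1, 1), (2, 2)]
table = {1: first, 5: first, 30: first, 22: second, 23: second}
default_scale = [(1, 1)]

print(get_repeat(table[5], 3))       # prints 0
print(get_repeat(table[5], 7))       # prints 2
print(get_repeat(table[22], 9))      # prints 2
print(get_repeat(default_scale, 9))  # prints 1
```

With this table, a count of 3 for feature 5 sets no bits at all because the smallest minimum count in its scale is 4, while any feature not in the table sets one bit regardless of how large its count is.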
* The --fold method uses the feature id modulo the number of bits to set the
output bit. The counts are ignored. It takes one parameter: --num-bits. This
pseudocode describes the details:
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        output_fp.SetOnBit(feature_id % num_bits)
Example:
fpc2fps --fold --num-bits 1024
* The --rdkit-count-sim method emulates the RDKit count simulation algorithm.
It takes two parameters: --num-bits and --countBounds. The countBounds
contains a list of *k* minimum counts, like "1,2,4,8".
It folds the sparse count fingerprint into a summed count fingerprint with an
effective size of num_bits // k, then for each folded bitno and summed count
it uses the countBounds to set 0 or more bits in the final binary fingerprint,
along the lines of this pseudocode:
    output_fp = BinaryFingerprint(num_bits)
    k = len(countBounds)
    effective_size = num_bits // k
    # Create a dense summed count fingerprint
    counts = [0] * effective_size
    for feature_id, count in features:
        counts[feature_id % effective_size] += count
    # Convert to the binary fingerprint
    for folded_bitno, count in enumerate(counts):
        for offset, min_count in enumerate(countBounds):
            if min_count <= count:
                output_fp.SetOnBit(folded_bitno*k + offset)
Example:
fpc2fps --rdkit-count-sim --countBounds 1,2,4,8,16
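The pseudocode translates almost directly into runnable Python. The sketch below follows the description given here rather than RDKit's or chemfp's source, and the function name and the set-of-bits return type are assumptions.

```python
def rdkit_count_sim(features, num_bits=2048, count_bounds=(1, 2, 4, 8)):
    """Return the set of on-bit positions for a list of (feature_id, count) pairs."""
    k = len(count_bounds)
    effective_size = num_bits // k
    # Fold the sparse counts into a dense summed count fingerprint.
    counts = [0] * effective_size
    for feature_id, count in features:
        counts[feature_id % effective_size] += count
    # Each folded position owns k consecutive output bits, one per bound.
    on_bits = set()
    for folded_bitno, count in enumerate(counts):
        for offset, min_count in enumerate(count_bounds):
            if min_count <= count:
                on_bits.add(folded_bitno * k + offset)
    return on_bits


# With 64 bits and 4 bounds the effective size is 16, so feature ids 3 and 19
# fold to the same position and their counts are summed (5 + 1 = 6).
print(sorted(rdkit_count_sim([(3, 5), (19, 1)], num_bits=64)))
# prints [12, 13, 14]
```

The summed count of 6 meets the bounds 1, 2, and 4 but not 8, so three of the four bits for folded position 3 are set.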
* The --seq method maps dense count fingerprints into a sequence of bins using
unary count encoding. It takes two parameters: --sizes containing a list of
bin sizes, and an optional --num-bits. If not specified then --num-bits is the
sum of the bin sizes.
A "dense" count fingerprint type has N distinct features (up to a few
hundred), with feature ids in sequential order 0, 1, ... (N-1).
Unary encoding uses one on-bit for each count, up to the total available size.
If the bin size is 5 then the unary encoding for 2 is "11000" and the unary
encoding for 5 is "11111", when expressed in binary.
This method maps each count feature to a bin, in sequential order. (Feature id
0 maps to the first bin, feature id 1 maps to the second, etc.) The feature
count is unary encoded into the bin, up to the size of the bin. This
pseudocode describes the details:
    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        bin_size = sizes[feature_id]
        bin_start = sum(sizes[:feature_id])
        for bitno in range(bin_start, bin_start+min(count, bin_size)):
            output_fp.SetOnBit(bitno)
Example:
fpc2fps --seq --sizes 100,100,50
This places the counts for feature 0 into the first 100 bits, the counts for
feature 1 into the second 100 bits, and the counts for feature 2 into the last
50 bits.
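Here is a small Python sketch of the same idea, using smaller bins so the result is easy to read. It returns the fingerprint as a string of "0" and "1" characters with bit 0 first, which is a display choice for this illustration and not the FPS hex encoding. The function name is hypothetical.

```python
from itertools import accumulate


def seq(features, sizes):
    """Unary-encode (feature_id, count) pairs into consecutive bins of the given sizes."""
    starts = [0] + list(accumulate(sizes))  # starts[i] is the first bit of bin i
    bits = ["0"] * starts[-1]               # num_bits defaults to the sum of the sizes
    for feature_id, count in features:
        start = starts[feature_id]
        # Counts larger than the bin size are capped at the bin size.
        for bitno in range(start, start + min(count, sizes[feature_id])):
            bits[bitno] = "1"
    return "".join(bits)


print(seq([(0, 2), (1, 7), (2, 1)], sizes=[5, 5, 3]))
# prints 1100011111100
```

Reading the output in bins of 5, 5, and 3 bits gives "11000", "11111", and "100": the count of 7 for feature 1 saturates its 5-bit bin.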
* The --seq-scaled method is a variation of the --seq method with rescaled
counts. It takes two parameters: --table, and an optional --num-bits.
This method maps each count feature to a bin, in sequential order. (Feature id
0 maps to the first bin, feature id 1 maps to the second, etc.) The feature
count is converted into a repeat count using the scale table for the feature
id. The repeat count is unary encoded into the bin, up to the size of the bin.
This pseudocode describes the details:
    output_fp = BinaryFingerprint(num_bits)
    sizes = determine_bin_sizes(table)
    for feature_id, count in features:
        repeat_to_use = 0
        for min_count, repeat in table[feature_id]:
            if min_count <= count:
                repeat_to_use = repeat
            else:
                break
        bin_size = sizes[feature_id]
        bin_start = sum(sizes[:feature_id])
        for bitno in range(bin_start, bin_start+min(repeat_to_use, bin_size)):
            output_fp.SetOnBit(bitno)
Example:
fpc2fps --seq-scaled --table "0,2->1:1,2:2,4:3,8:4,16:5/1->1:1,3:2,7:3"
This defines two scales: "1:1,2:2,4:3,8:4,16:5" and "1:1,3:2,7:3" with sizes 5
and 3 bits, respectively. Features 0 and 2 use the first scale and feature 1
uses the second scale, giving three bins, of sizes 5, 3, and 5.
If the count for feature 0 is 16 or larger then the 5 bits of bin 0 are
"11111". Otherwise, if the count is 8 or larger then the bits of bin 0 are
"11110". Otherwise, if the count is 4 or larger then the bits are "11100".
Otherwise, if the count is 2 or larger then the bits are "11000". Otherwise,
the bits are "10000".
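The worked example can be reproduced with a short Python sketch. Two things here are assumptions: the bin size for a feature is taken to be the largest repeat in its scale, which is inferred from the sizes quoted in the example, and the table is written out by hand as (min_count, repeat) pairs. The function names are hypothetical.

```python
def get_repeat(scale, count):
    """Return the repeat of the term with the largest min_count <= count, or 0."""
    repeat_to_use = 0
    for min_count, repeat in scale:
        if min_count > count:
            break
        repeat_to_use = repeat
    return repeat_to_use


def seq_scaled(features, table):
    """Unary-encode rescaled counts into consecutive bins, one bin per feature id."""
    num_features = max(table) + 1
    # Assumption: a bin is as large as the largest repeat in the feature's scale.
    sizes = [max(repeat for _, repeat in table[i]) for i in range(num_features)]
    starts = [sum(sizes[:i]) for i in range(num_features)]
    bits = ["0"] * sum(sizes)
    for feature_id, count in features:
        repeat = min(get_repeat(table[feature_id], count), sizes[feature_id])
        for bitno in range(starts[feature_id], starts[feature_id] + repeat):
            bits[bitno] = "1"
    return "".join(bits)


first = [(1, 1), (2, 2), (4, 3), (8, 4), (16, 5)]
second = [(1, 1), (3, 2), (7, 3)]
table = {0: first, 1: second, 2: first}

print(seq_scaled([(0, 9), (1, 3), (2, 1)], table))
# prints 1111011010000
```

Split into bins of 5, 3, and 5 bits, the output reads "11110" (count 9 is at least 8), "110" (count 3 is at least 3), and "10000" (count 1).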