FPC format specification

The FPC format is an exchange format for sparse count cheminformatics fingerprints. It lets different fingerprint tools, from different vendors and data sources, work together. It is a text format designed to be easy for software to create and parse, and easy for humans to read.

It is not designed for high-performance use.

Chemfp 5.0 can generate FPC files using the chemfp rdkit2fpc command, and convert FPC files to FPS or FPB files using the chemfp fpc2fps command.

For binary fingerprints see the FPS or FPB formats.

Overall layout

Here is an example of an FPC file containing two MorganCount fingerprints generated by RDKit:

#FPC1
#type=RDKit-MorganCount/2 radius=3 useFeatures=0
#software=RDKit/2024.09.5 chemfp/5.0
#date=2025-08-15T11:17:47+00:00
847433064,2245897107,2551683561	CHEMBL183419
864674487,1215180924,1600340860:2,2215059400,2245384272:2,2246728737:2,3542456614:2,3994088662:2	CHEMBL16264

An FPC file contains UTF-8 encoded text. The FPC format consists of an optional header followed by zero or more sparse count fingerprint records. The header consists of an optional version line followed by zero or more metadata lines. The first character of each header line is "#". A fingerprint record line will never start with a "#". Each line must end with a newline, which may be either a linefeed character (ASCII 10) or the two-character sequence carriage return + linefeed (ASCII 13 followed by ASCII 10). The rest of this specification will omit the explicit mention of the newline.
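As an illustration of the layout rules, here is a minimal Python sketch that separates already-decoded FPC text into header lines and record lines. The function name split_fpc_text is hypothetical and is not part of chemfp. Rejecting a "#" line that appears after the first record is one reading of the rule that record lines never start with "#".

```python
def split_fpc_text(text):
    """Split decoded FPC text into (header_lines, record_lines).

    The trailing newline (linefeed or carriage return + linefeed)
    is removed from each returned line.
    """
    if text and not text.endswith("\n"):
        raise ValueError("the last line does not end with a newline")

    header_lines = []
    record_lines = []
    # Every line ends with "\n", so the final split element is always empty.
    for line in text.split("\n")[:-1]:
        if line.endswith("\r"):
            line = line[:-1]  # the line used carriage return + linefeed
        if line.startswith("#"):
            if record_lines:
                # Assumption: a "#" line after the first record is an error.
                raise ValueError("header line found after a fingerprint record")
            header_lines.append(line)
        else:
            record_lines.append(line)
    return header_lines, record_lines
```

The sketch splits on the linefeed character only, rather than using str.splitlines(), because splitlines() also breaks on other characters which this specification does not treat as newlines.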

Header

The header contains the optional version line followed by the metadata lines.

version line

The version line, if present, must be "#FPC1". This line indicates that the rest of the file can be interpreted by this specification.

metadata lines

The format of the FPC metadata lines is identical to the format of the FPS metadata lines. See the FPS format specification for details.

Note: Currently the num_bits metadata line is ignored.
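The following Python sketch shows one way to interpret the header lines. It assumes each metadata line has the "#key=value" shape seen in the example file; the FPS format specification remains the authority on that syntax. The name parse_fpc_header is hypothetical, and treating a first line without "=" as the version line is an assumption of this sketch.

```python
def parse_fpc_header(header_lines):
    """Return (has_version_line, metadata) for a list of header lines.

    metadata is a list of (key, value) pairs in file order, kept as a
    list so that a repeated key is not silently overwritten.
    """
    lines = list(header_lines)
    has_version_line = False

    # Assumption: a first line without "=" is meant to be the version line.
    if lines and "=" not in lines[0]:
        if lines[0] != "#FPC1":
            raise ValueError("unsupported version line: %r" % (lines[0],))
        has_version_line = True
        lines = lines[1:]

    metadata = []
    for line in lines:
        if not line.startswith("#"):
            raise ValueError("header line must start with '#': %r" % (line,))
        # Split on the first "=" only; the value may itself contain "=".
        key, sep, value = line[1:].partition("=")
        if not sep:
            raise ValueError("metadata line has no '=': %r" % (line,))
        metadata.append((key, value))
    return has_version_line, metadata
```

A reader which follows the note above would simply skip any ("num_bits", value) pair in the returned list.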

Fingerprint records

The fingerprint records appear after the header, with one fingerprint record per line.

Each record contains two or more tab-separated fields. The first field contains the text-encoded sparse count fingerprint and the second field contains the identifier.
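As a small sketch of the record layout, the Python function below splits one record line, with its newline already removed, into its fields. The name split_record is hypothetical.

```python
def split_record(line):
    """Split a record line into (fingerprint_field, identifier, extra_fields)."""
    fields = line.split("\t")
    if len(fields) < 2:
        raise ValueError("a record needs at least a fingerprint and an identifier")
    # Anything after the second field is additional per-record data.
    return fields[0], fields[1], fields[2:]
```

For example, split_record("23:2,73\tCHEMBL1") returns ("23:2,73", "CHEMBL1", []).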

Sparse count fingerprint field

A sparse count fingerprint contains zero or more features. Each feature contains a bit number (also called a feature id) and a count. The bit number must be in the range 0 <= bitno < 2**64 and the count must be in the range 0 <= count < 2**32.

The empty fingerprint is encoded as the asterisk "*".

A feature may be encoded as "{bitno}:{count}". If the count is 1 then the feature may be encoded as "{bitno}", which has an implicit count of 1. For example, the feature with bit number 4847 and count 1 may be encoded as "4847:1" or "4847", while the feature with bit number 134 and count 5 may be encoded as "134:5".

Leading zeros should not be included. Features with a count of 0 should not be included. Features with a count of 1 should use the implicit notation and not use ":1".

A non-empty fingerprint is encoded as comma-separated encoded features, which must be in increasing bit number order. For example, the fingerprint with two features "73" and "23:2" is encoded as "23:2,73" and not "73,23:2".
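The encoding rules for this field can be sketched in Python as a pair of functions. The names encode_features and decode_features are hypothetical, and a fingerprint is represented here as a dict mapping bit number to count. The decoder is lenient about the "should not" recommendations (leading zeros, ":1", a count of 0) and strict about the "must" rules. Treating a repeated bit number as an ordering error is an assumption of this sketch.

```python
def encode_features(features):
    """Encode a {bitno: count} dict as an FPC fingerprint field."""
    terms = []
    for bitno in sorted(features):  # increasing bit number order
        count = features[bitno]
        if not 0 <= bitno < 2**64:
            raise ValueError("bit number out of range: %r" % (bitno,))
        if not 0 <= count < 2**32:
            raise ValueError("count out of range: %r" % (count,))
        if count == 0:
            continue  # features with a count of 0 should be left out
        # A count of 1 uses the implicit notation, without ":1".
        terms.append(str(bitno) if count == 1 else "%d:%d" % (bitno, count))
    return ",".join(terms) if terms else "*"


def decode_features(field):
    """Decode an FPC fingerprint field into a {bitno: count} dict."""
    if field == "*":
        return {}
    features = {}
    previous = -1
    for term in field.split(","):
        bit_text, sep, count_text = term.partition(":")
        for text in ((bit_text, count_text) if sep else (bit_text,)):
            if not (text.isascii() and text.isdigit()):
                raise ValueError("not a decimal number in %r" % (term,))
        bitno = int(bit_text)
        count = int(count_text) if sep else 1
        if bitno >= 2**64 or count >= 2**32:
            raise ValueError("value out of range in %r" % (term,))
        if bitno <= previous:
            # Assumption: "increasing" is read as strictly increasing.
            raise ValueError("bit numbers are not in increasing order")
        previous = bitno
        if count:
            features[bitno] = count
    return features
```

With these definitions, encode_features({73: 1, 23: 2}) returns "23:2,73", and decode_features("23:2,73") returns {23: 2, 73: 1}.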

Identifier field

The identifier is UTF-8 encoded. It may contain a space character because some identifiers, like an IUPAC name, contain space characters.

It must not contain a tab character, one of the newline sequences, or a NUL character. This specification does not define how to encode these unsupported values.
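A writer may want to check identifiers before output. Here is a minimal Python sketch; the name is_valid_identifier is hypothetical. It does not decide what to do about a carriage return which is not followed by a linefeed, because this specification does not address that case.

```python
def is_valid_identifier(identifier):
    """Return True if the identifier can be stored in an FPC record."""
    # Both newline sequences contain a linefeed, so testing for "\n"
    # covers the linefeed and carriage return + linefeed cases.
    forbidden = ("\t", "\n", "\0")
    return not any(ch in identifier for ch in forbidden)
```

For example, is_valid_identifier("acetic acid") is True because spaces are allowed, while is_valid_identifier("acetic\tacid") is False.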

Duplicate identifiers are allowed, though this does not usually make sense.

Additional fields

Additional tab-separated fields may be used to associate additional data values with a given record.

There should be the same number of fields in each record.
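Because a consistent field count is a recommendation rather than a requirement, a reader might report mismatches instead of rejecting the file. The Python sketch below takes that approach; the name field_count_mismatches is hypothetical, and using the first record as the reference count is an assumption.

```python
def field_count_mismatches(record_lines):
    """Return (record_index, field_count) for each record whose number of
    tab-separated fields differs from that of the first record."""
    mismatches = []
    expected = None
    for index, line in enumerate(record_lines):
        count = len(line.split("\t"))
        if expected is None:
            expected = count  # the first record sets the reference count
        elif count != expected:
            mismatches.append((index, count))
    return mismatches
```

An empty result means every record has the same number of fields.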

A future version of this specification may include a mechanism to specify field names and data types in the header.

Changes:

15 August 2025 - first public description