FPS format specification

The FPS format is an exchange format for cheminformatics fingerprints. It lets different fingerprint tools, from different vendors and data sources, work together. It is a text format designed to be easy to create and parse by software, and easy to read by humans.

It is not designed for high-performance use. For that you should look at the FPB format, available in the commercial version of chemfp.

Overall layout

Here is an example of an FPS file containing two 166-bit MACCS fingerprints generated by RDKit:

#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2018.03.1 chemfp/3.2.1
#date=2018-05-15T15:18:48
000000003000000001d414d91323915380f138ea1f  58-08-2
00000000000000000002000000000404001410d61e  15687-27-1

An FPS file contains UTF-8 encoded text. The FPS format consists of an optional header followed by zero or more fingerprint records. The header consists of an optional version line followed by zero or more metadata lines. The first character of each header line is "#". A fingerprint record line will never start with a "#". Each line must end with a newline, which may be either a linefeed character (ASCII 10) or the two character sequence carriage return + linefeed (ASCII 13 followed by ASCII 10). The rest of this specification will omit the explicit mention of the newline.

Each metadata line is of the form:

'#' <key> '=' <value>

where <key> matches the regular expression pattern [A-Za-z_][A-Za-z0-9_]+ and <value> is an arbitrary string.

Note: the value must neither start nor end with a whitespace character, and a parser should strip all leading and trailing whitespace characters.
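As an illustration, a metadata line can be split with a single regular expression. This is a minimal sketch and not part of the specification; the function name parse_metadata_line is hypothetical, and it assumes the line has already been decoded as UTF-8.

```python
import re

# '#' <key> '=' <value>, using the key pattern given in the text above.
_METADATA_RE = re.compile(r"#([A-Za-z_][A-Za-z0-9_]+)=(.*)\Z", re.DOTALL)


def parse_metadata_line(line):
    """Return (key, value) for a metadata line, or None if it is not one."""
    # Remove the line terminator (LF or CR+LF) before matching.
    line = line.rstrip("\r\n")
    match = _METADATA_RE.match(line)
    if match is None:
        return None
    key, value = match.groups()
    # A parser should strip leading and trailing whitespace from the value.
    return key, value.strip()
```

With this sketch the version line "#FPS1" gives None, because it has no "=", so a reader would need to handle it separately.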

Users of this format, and updated versions of this specification, may add new header lines so long as those header lines can be ignored without affecting the semantics of the format documented here. This requirement is meant to allow people to extend the format without breaking backwards compatibility.

Header

The header should contain the version line followed by the metadata lines. The metadata lines should be in the order num_bits, type, software, source, and date, for those lines which are present in the metadata. The source line may occur more than once. It is an error if any of the other lines occur more than once.
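The ordering and duplication rules can be checked with a few lines of code. The following is a sketch only; the names KNOWN_ORDER and check_metadata_keys are hypothetical, and it assumes that the "occur more than once" rule applies to the five keys named above while extension keys are left alone.

```python
# Recommended order of the metadata lines defined by this specification.
KNOWN_ORDER = ["num_bits", "type", "software", "source", "date"]


def check_metadata_keys(keys):
    """Return a list of problems found in a sequence of metadata keys."""
    problems = []
    seen = set()
    last_rank = -1
    for key in keys:
        if key not in KNOWN_ORDER:
            continue  # extension lines are not checked in this sketch
        if key in seen and key != "source":
            problems.append(f"duplicate {key} line")
        seen.add(key)
        rank = KNOWN_ORDER.index(key)
        if rank < last_rank:
            problems.append(f"{key} line is out of order")
        last_rank = max(last_rank, rank)
    return problems
```

A duplicate is an error, while an out-of-order line only goes against a "should", so a real reader may want to report the two kinds of problem differently.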

version line

The version line, if present, must be "#FPS1". This line indicates that the rest of the file can be interpreted by this specification.

num_bits metadata

The num_bits metadata value is a positive integer describing the number of bits in the fingerprint. It exists as a way to record the size of fingerprints whose length is not an integer multiple of 8 bits, like the 166-bit MACCS keys.

Let N be the size of the first fingerprint, in bytes, so 2*N is the number of hex characters.

If num_bits is not present then it is assumed to be 8*N.

If num_bits is present then it must be in the range 8*(N-1)<num_bits<=8*N.

If num_bits is not present and there are no fingerprints then the number of bits in a fingerprint is implementation defined.

The num_bits metadata should be present to prevent the last case from occurring.
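The rules above can be expressed as a short check against the first fingerprint. This is a sketch under the assumption that the first fingerprint is available as its hex string; the function name resolve_num_bits is hypothetical.

```python
def resolve_num_bits(first_hex_fingerprint, num_bits=None):
    """Return the number of bits, validating num_bits if it was given."""
    if len(first_hex_fingerprint) % 2 != 0:
        raise ValueError("a hex fingerprint needs an even number of characters")
    n = len(first_hex_fingerprint) // 2  # N, the fingerprint size in bytes
    if num_bits is None:
        # No num_bits line: assume every bit of every byte is used.
        return 8 * n
    if not (8 * (n - 1) < num_bits <= 8 * n):
        raise ValueError(
            f"num_bits={num_bits} does not fit a {n}-byte fingerprint")
    return num_bits
```

For the 21-byte MACCS fingerprints shown earlier, any num_bits value from 161 to 168 passes the check, and 166 is the value actually recorded.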

type metadata

The type metadata describes the fingerprint algorithm and the parameters used to generate the fingerprints.

The type value may be treated as an opaque string, bearing in mind that it may not start or end with a whitespace character. In principle it could be a description like "Attempt #81".

In practice it should be machine parseable, so that a program can use the type information to generate the appropriate fingerprint given a new structure. For example, given a SMILES string and the file ABC.fps, a similarity search program should be able to open ABC.fps, read the metadata, find the type line, parse the fingerprint information, figure out the appropriate toolkit, use the toolkit to parse the SMILES string, use the fingerprint parameters to generate the fingerprint, and finally use the fingerprint for a similarity search of the fingerprints in ABC.fps.

In addition, the type should be in canonical form, such that two fingerprint types have the same sequence of bytes if and only if they describe the same fingerprint generation method and its parameters. The goal is to make it possible to use a string comparison to test if two sets of fingerprints were generated with the same method.

Chemfp uses type strings which look like:

OpenBabel-FP2/1
OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1

They follow this grammar:

<base-name> ('/' <version>)? (<WS> <argument-name> '=' <argument-value>)*

where base-name and version may be any sequence of one or more printable ASCII characters except for the space and '/' characters (i.e., they match the regular expression pattern [!-.0-~]+); WS contains 1 or more whitespace characters (defined as CR, LF, TAB, SPACE, and VT); argument-name matches the pattern [a-zA-Z_][a-zA-Z0-9_]* and argument-value is one or more printable ASCII characters except for the space character (i.e., it matches the pattern [!-~]+).

In canonical form, only a single space character may be used for WS.
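The grammar is small enough to parse with a split and two regular expressions. The sketch below is illustrative and is not chemfp's own parser; the function name parse_type and the shape of the return value are assumptions made for this example.

```python
import re

_WS_RE = re.compile(r"[\r\n\t\v ]+")        # CR, LF, TAB, VT, SPACE
_NAME_RE = re.compile(r"[!-.0-~]+")         # printable ASCII, no space, no '/'
_ARG_RE = re.compile(r"([a-zA-Z_][a-zA-Z0-9_]*)=([!-~]+)")


def parse_type(type_string):
    """Return (base_name, version or None, [(arg_name, arg_value), ...])."""
    terms = _WS_RE.split(type_string)
    base_name, slash, version = terms[0].partition("/")
    if _NAME_RE.fullmatch(base_name) is None:
        raise ValueError(f"bad base name in {type_string!r}")
    if slash and _NAME_RE.fullmatch(version) is None:
        raise ValueError(f"bad version in {type_string!r}")
    arguments = []
    for term in terms[1:]:
        match = _ARG_RE.fullmatch(term)
        if match is None:
            raise ValueError(f"bad argument {term!r}")
        arguments.append(match.groups())
    return base_name, (version if slash else None), arguments
```

The arguments are kept as an ordered list rather than a dictionary, because the canonical form is compared as a string and so the argument order matters.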

Other canonical type strings are possible. For example, the following shows an OEGraphSim type string for the Path fingerprint, which corresponds to the "OpenEye-Path/2" fingerprint type string shown earlier:

Path,ver=2.0.0,size=4096,bonds=0-5,atype=AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo,btype=Order|Chiral

Chemfp decided to use a different format for the OEGraphSim fingerprints because when chemfp started, OEGraphSim didn't have a fingerprint type string. In addition, chemfp supports multiple toolkits, so it uses the "OpenEye-" prefix to the name to indicate which "Path" fingerprint it refers to.

Most fingerprint types are toolkit-specific, so new fingerprint types should also use a toolkit-specific prefix in the name.

software metadata

The software metadata describes the software components used to make the fingerprint. As a rough guideline, these are the components which are "fingerprint aware" and where a bug in the implementation may affect the fingerprints as they appear in the FPS file. By convention the components are listed so that the most significant components come first.

For example, the following comes from rdkit2fps using the RDKit release 2018.03.1, and chemfp version 3.2.1:

#software=RDKit/2018.03.1 chemfp/3.2.1

and the following comes from oe2fps using the OEGraphSim 2.3 release from 2017 and chemfp version 1.4:

#software=OEGraphSim/2.3.0 (20170613) chemfp/1.4

The comment term (the part inside the parentheses) is meant as a place to add extra information about the component. For OEGraphSim, the version term (after the '/') comes from OEGraphSimGetRelease() and the comment term comes from OEGraphSimGetVersion().

The software line is modeled on the "User-Agent" header of HTTP/1.1. The expected grammar is:

software ::= <product> (<WS> <product>)*
product ::= <PRODUCT-NAME> ('/' <PRODUCT-VERSION>)? (<WS> <comment>)*
comment ::= '(' <text> ')'
PRODUCT-NAME ::= /[!-.0-~]+/
PRODUCT-VERSION ::= /[!-.0-~]+/
WS ::= /[\r\n\t\v ]+/
text ::= any sequence of printable ASCII characters containing
  balanced parentheses
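A software value can be split into its products with a small hand-written scanner. This is a rough sketch only; the name parse_software and the returned tuple layout are hypothetical. It assumes that a token starting with "(" is always a comment, which resolves an ambiguity in the grammar since "(" is also a legal PRODUCT-NAME character.

```python
_WS = "\r\n\t\v "


def parse_software(value):
    """Return a list of (name, version or None, [comment, ...]) tuples."""
    products = []
    i, n = 0, len(value)
    while i < n:
        if value[i] in _WS:
            i += 1
        elif value[i] == "(":
            # Scan forward to the parenthesis that balances this one.
            depth, j = 0, i
            while j < n:
                if value[j] == "(":
                    depth += 1
                elif value[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses in comment")
            if not products:
                raise ValueError("comment found before any product")
            products[-1][2].append(value[i + 1:j])
            i = j + 1
        else:
            # A product term runs up to the next whitespace character.
            j = i
            while j < n and value[j] not in _WS:
                j += 1
            name, slash, version = value[i:j].partition("/")
            products.append((name, version if slash else None, []))
            i = j
    return products
```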

source metadata

The source metadata records the location of the input structures used to generate the fingerprints. It helps keep track of data provenance. The source metadata line may occur 0 or more times, and it should be interpreted as a list of sources.

In most cases the value on the source line will be a filename, though it may be a URL, database URI, or other identifier.

A filename may contain a newline or a byte sequence which cannot be encoded as UTF-8, making it impossible to be represented on an FPS source line. These unsupported characters/bytes must be removed or substituted.

If you would like the source value to be machine interpretable then the source should be written as a URI. Bear in mind that relative filenames, or absolute paths on different file systems, may still make it difficult to locate the original file.

File paths which look like URIs (for example, the files "file:example.smi" and "http:example.sdf") should be encoded as file URLs. Currently none of the chemfp tools do this because of a lack of demand for machine-interpretable source data.

date metadata

The date metadata keeps track of the approximate time that the fingerprints were generated. It should be some time between when the fingerprint generation program started and ended, and most likely is a time shortly after the file was opened.

The time is written as an ISO 8601 timestamp in the form:

YYYY-MM-DDThh:mm:ss

Timezone designators are not allowed. The time should be given in UTC instead of local time. Fractional seconds are not allowed.

For example:

#date=2017-12-25T13:05:26

indicates a time in the early afternoon of Christmas Day 2017 in Greenwich, UK.
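In Python the date value can be written and read with the standard datetime module. This is a minimal sketch; the names FPS_DATE_FORMAT, format_fps_date and parse_fps_date are hypothetical, and format_fps_date assumes it is given a timezone-aware datetime.

```python
from datetime import datetime, timezone

# YYYY-MM-DDThh:mm:ss with no timezone designator and no fractional seconds.
FPS_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def format_fps_date(when):
    """Format a timezone-aware datetime as an FPS date value, in UTC."""
    return when.astimezone(timezone.utc).strftime(FPS_DATE_FORMAT)


def parse_fps_date(value):
    """Parse an FPS date value and mark the result as UTC."""
    parsed = datetime.strptime(value, FPS_DATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)
```

Because strptime() must consume the whole string, a value with a trailing "Z" or with fractional seconds raises ValueError, which matches the restrictions above.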

Metadata extensions

Anyone may add new metadata lines so long as those lines may be ignored without affecting the semantic understanding of the rest of the file.

For example, in some of my files I use #comment as a way to store free-form comments.

CACTVS uses an extension to indicate if the fingerprints are molecule or reaction fingerprints.

An earlier version of this specification proposed that new header lines be marked with an "x-" or "X-". Following the advice of RFC 6648, that suggestion is withdrawn.

Fingerprint records

The fingerprint records appear after the header, with one fingerprint record per line.

Each record contains 2 or more tab-separated fields. The first field contains the hex-encoded fingerprint, the second field contains the identifier.
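A record line can be split on the tab character and the first field decoded from hex. This is a sketch rather than a reference parser; the function name parse_record is hypothetical. The explicit pattern check is there because Python's bytes.fromhex() on its own accepts embedded spaces.

```python
import re

_HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})+")


def parse_record(line):
    """Return (fingerprint_bytes, identifier, extra_fields) for one record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a record needs a fingerprint and an identifier")
    hex_fingerprint, identifier = fields[0], fields[1]
    if _HEX_RE.fullmatch(hex_fingerprint) is None:
        raise ValueError(f"not a hex-encoded fingerprint: {hex_fingerprint!r}")
    return bytes.fromhex(hex_fingerprint), identifier, fields[2:]
```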

Fingerprint field

The hex encoding may use the upper-case characters 'A'-'F' or lower-case characters 'a'-'f'. It should use lower-case characters to make it easier for humans to read.

The first two (left-most) hex characters encode the first fingerprint byte, the second two hex characters encode the second fingerprint byte, and so on. This is sometimes referred to as a little-endian byte order.

The bits in the byte are recorded in big-endian order. The hex value "01" encodes a byte value of 1, the hex value "0a" encodes the byte value 10, and the hex value "c0" encodes the byte value 192.

The mixed-endian encoding means that the first hex character corresponds to the second nibble (bits 4-7) of the fingerprint, the second hex character corresponds to the first nibble (bits 0-3) of the fingerprint, the third hex character corresponds to the fourth nibble (bits 12-15), the fourth hex character corresponds to the third nibble (bits 8-11), and so on.

If the fingerprint length is not an integer multiple of 8 bits then additional 0 bits must be used to pad the fingerprint up to the next multiple of 8 bits. The most significant bits of the last byte are used as the pad bits. For example, the hex-encoded 166-bit MACCS fingerprint will have bits 166 and 167 set to zero, to pad the fingerprint up to 168 bits.

Note: non-zero padding may cause some programs to give incorrect results.
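A reader that wants to detect non-zero padding only needs to look at the last byte. The following sketch assumes the fingerprint has already been decoded to bytes; the function name has_zero_padding is hypothetical.

```python
def has_zero_padding(fingerprint, num_bits):
    """Return True if the pad bits in the last byte are all zero."""
    if len(fingerprint) != (num_bits + 7) // 8:
        raise ValueError("fingerprint length does not match num_bits")
    used_bits = num_bits % 8  # bits of the last byte which carry data
    if used_bits == 0:
        return True  # every byte is fully used, so there is no padding
    # The pad bits are the most significant bits of the last byte.
    pad_mask = (0xFF << used_bits) & 0xFF
    return (fingerprint[-1] & pad_mask) == 0
```

For a 166-bit fingerprint the mask is 0xc0, which covers bits 166 and 167.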

The following shows an example of an FPS file containing a single 44-bit fingerprint where bits 0, 1, 4, 6, 9, 12, 16, 19, 29, 30, 31, 33, 34, 35, and 41 are set to 1:

#FPS1
#num_bits=44
531209e00e02    example
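The example can be reproduced from the list of set bits. This is a short sketch of the encoding described in this section; the function name encode_fingerprint is hypothetical.

```python
def encode_fingerprint(on_bits, num_bits):
    """Hex-encode a fingerprint given the positions of its 1 bits."""
    data = bytearray((num_bits + 7) // 8)  # starts as all zeros, pad included
    for bitno in on_bits:
        if not 0 <= bitno < num_bits:
            raise ValueError(f"bit {bitno} is out of range")
        # Byte bitno//8 from the left, bit bitno%8 within that byte.
        data[bitno // 8] |= 1 << (bitno % 8)
    return data.hex()


on_bits = [0, 1, 4, 6, 9, 12, 16, 19, 29, 30, 31, 33, 34, 35, 41]
print(encode_fingerprint(on_bits, 44))  # prints 531209e00e02
```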

Why this hex encoding?

Two common questions people have about the FPS format are "why hex encoding instead of ...?" and "why mixed endian?"

The FPS format is meant to be a simple format which replaces the sorts of ad hoc formats that researchers often develop when working with fingerprint data. Any encoding must be easy to create, and easy to understand.

Hex encoding is available as a built-in operation in nearly every programming language. In those rare cases where it is not available, it's easy to find a function which does the conversion, or simply create one from scratch.

Hex encoding is also easy to understand. It's hard to think of something easier, except perhaps a list of "0" and "1" characters.

A list of 0 and 1 characters uses 8 bits to encode 1 bit. A hex character uses 2 bits to encode 1 bit, so it's four times more compact. The hex-encoding is a worthwhile tradeoff between space savings and complexity.

There are even more compact text representations, like base64 and Ascii85. Base64 encoding is part of the standard library of most programming languages, and Ascii85 implementations are easy to find.

On the other hand, these encodings are harder for a human to understand. Consider the 64-bit fingerprint with bit 0 set. The different encodings are:

hex: 0100000000000000
base64: AQAAAAAAAAA=
Ascii85: 0RR9100000

It's easier for someone to look at the hex string and understand something about the content than it would be for the other formats.

It's true that base64 and Ascii85 are more compact. On the other hand, if space is an issue then the files should be compressed. My tests with gzip compression show that all three encodings compress to roughly the same size.
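The three strings in the comparison can be regenerated with the Python standard library. This is only an illustration. Note that the third string matches the output of base64.b85encode(), a base-85 variant whose alphabet differs from the one used by base64.a85encode(); treating that as the encoder behind the "Ascii85" line is an assumption based on the output.

```python
import base64

# A 64-bit fingerprint with only bit 0 set: the first byte is 0x01.
fingerprint = bytes([0x01, 0, 0, 0, 0, 0, 0, 0])

hex_text = fingerprint.hex()
base64_text = base64.b64encode(fingerprint).decode("ascii")
base85_text = base64.b85encode(fingerprint).decode("ascii")

print("hex:", hex_text)        # 16 characters
print("base64:", base64_text)  # 12 characters
print("base85:", base85_text)  # 10 characters
```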

Why the mixed endianness? The actual bit order doesn't matter for something like a Tanimoto search or a substructure screen, so long as everything is consistent.

Where it does matter is for tasks like finding all MACCS fingerprints where key 20 (bit 19) is set. (This key is "contains silicon".) With little-endian byte order the test is simply:

bitno = 19;
bit_is_set = byte_fingerprint[bitno/8] & (1<<(bitno % 8));

With big-endian byte order the byte offset occurs from the right end. The indexing is slightly more complex because it requires a subtraction from the total fingerprint length.

The bit order in the byte comes naturally from the usual hex encoding of a byte value.
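The difference between the two byte orders can be seen by writing both tests side by side. This is a sketch for illustration; both function names are hypothetical, and the second function describes a byte order that the FPS format does not use.

```python
def bit_is_set(fingerprint, bitno):
    """Test a bit when the first byte holds bits 0-7 (the FPS byte order)."""
    return ((fingerprint[bitno // 8] >> (bitno % 8)) & 1) == 1


def bit_is_set_reversed(fingerprint, bitno):
    """Test a bit when the last byte holds bits 0-7 (big-endian bytes)."""
    # The byte offset is counted from the right end, so the total
    # fingerprint length is needed to find the byte.
    byte_index = len(fingerprint) - 1 - bitno // 8
    return ((fingerprint[byte_index] >> (bitno % 8)) & 1) == 1
```

Given the bytes decoded from "531209e00e02", bit_is_set(fp, 19) is true because byte 2 is 0x09 and its bit 3 is set.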

Identifier field

The identifier is UTF-8 encoded. It may contain a space character because some identifiers, like an IUPAC name, contain space characters.

It must not contain a tab character, one of the newline sequences, or the NUL character. This specification does not define how to encode these unsupported values.

Duplicate identifiers are allowed, though this does not usually make sense.

Additional fields

Additional tab-separated fields may be used to associate additional data values with a given record.

There should be the same number of fields in each record.

A future version of this specification may include a mechanism to specify field names and data types in the header.