FPS format specification
The FPS format is an exchange format for cheminformatics fingerprints. It lets different fingerprint tools, from different vendors and data sources, work together. It is a text format designed to be easy to create and parse by software, and easy to read by humans.
It is not designed for high-performance use. For that you should look at the FPB format, available in the commercial version of chemfp.
Overall layout
Here is an example of an FPS file containing two 166-bit MACCS fingerprints generated by RDKit:
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2018.03.1 chemfp/3.2.1
#date=2018-05-15T15:18:48
000000003000000001d414d91323915380f138ea1f 58-08-2
00000000000000000002000000000404001410d61e 15687-27-1
An FPS file contains UTF-8 encoded text. The FPS format consists of an optional header followed by zero or more fingerprint records. The header consists of an optional version line followed by zero or more metadata lines. The first character of each header line is "#". A fingerprint record line will never start with a "#". Each line must end with a newline, which may be either a linefeed character (ASCII 10) or the two-character sequence carriage return + linefeed (ASCII 13 followed by ASCII 10). The rest of this specification will omit the explicit mention of the newline.
Each metadata line is of the form:
'#' <key> '=' <value>
where <key> matches the regular expression pattern
[A-Za-z_][A-Za-z0-9_]+
and <value> is an arbitrary string.
Note: the value must neither start nor end with a whitespace character, and a parser should strip all leading and trailing whitespace characters.
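As an illustration of the rules above, here is a minimal sketch of a metadata line parser. The function name parse_metadata_line is hypothetical and is not part of chemfp; the key pattern is copied from this specification.

```python
import re

# Key pattern copied from the specification text; the value is "anything".
_METADATA_RE = re.compile(r"#([A-Za-z_][A-Za-z0-9_]+)=(.*)\Z", re.DOTALL)


def parse_metadata_line(line):
    """Return (key, value) for a '#key=value' header line.

    Hypothetical helper for illustration only.
    """
    # Accept either newline convention, then drop it.
    if line.endswith("\r\n"):
        line = line[:-2]
    elif line.endswith("\n"):
        line = line[:-1]
    match = _METADATA_RE.match(line)
    if match is None:
        raise ValueError(f"not a metadata line: {line!r}")
    key, value = match.groups()
    # A parser should strip leading and trailing whitespace from the value.
    return key, value.strip()
```

The version line "#FPS1" has no '=' and so is rejected by this helper; a real reader would check for it before trying to parse metadata.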
Users of this format, and updated versions of this specification, may add new header lines so long as those header lines can be ignored without affecting the semantics of the format documented here. This requirement is meant to allow people to extend the format without breaking backwards compatibility.
Header
The header should contain the version line followed by the metadata lines. The metadata lines should be in the order num_bits, type, software, source, and date, for those lines which are present in the metadata. The source line may occur more than once. It is an error if any of the other lines occur more than once.
version line
The version line, if present, must be "#FPS1". This line indicates that the rest of the file can be interpreted by this specification.
num_bits metadata
The num_bits metadata value is a positive integer describing the number of bits in the fingerprint. It exists as a way to record the size of fingerprints which are not an integer multiple of 8 bits, like the 166-bit MACCS keys.
Let N be the size of the first fingerprint, in bytes, so 2*N is the number of hex characters.
If num_bits is not present then it is assumed to be 8*N.
If num_bits is present then it must be in the range 8*(N-1) < num_bits <= 8*N.
If num_bits is not present and there are no fingerprints then the number of bits in a fingerprint is implementation defined.
The num_bits metadata should be present to prevent the last case from occurring.
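The num_bits rules can be sketched as follows. The helper name resolve_num_bits and its signature are assumptions made for this example; it only covers the cases where a first fingerprint is available.

```python
def resolve_num_bits(first_hex_fingerprint, num_bits=None):
    """Return the effective number of bits, validating num_bits if given.

    Hypothetical helper; the name and signature are illustrative only.
    """
    if len(first_hex_fingerprint) % 2 != 0:
        raise ValueError("hex fingerprint must have an even number of characters")
    n = len(first_hex_fingerprint) // 2  # N, the fingerprint size in bytes
    if num_bits is None:
        # No num_bits line in the header: assume 8*N.
        return 8 * n
    if not (8 * (n - 1) < num_bits <= 8 * n):
        raise ValueError(f"num_bits={num_bits} does not fit in {n} bytes")
    return num_bits
```

For the 42-character MACCS fingerprints in the example at the top of this document, N is 21, so the accepted range is 161 through 168 and the declared value of 166 is valid.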
type metadata
The type metadata describes the fingerprint algorithm and the parameters used to generate the fingerprints.
The type value may be treated as an opaque string, bearing in mind that it may not start or end with a whitespace character. In principle it could be a description like "Attempt #81".
In practice it should be machine parseable, so that a program can use the type information to generate the appropriate fingerprint given a new structure. For example, given a SMILES string and the file ABC.fps, a similarity search program should be able to open ABC.fps, read the metadata, find the type line, parse the fingerprint information, figure out the appropriate toolkit, use the toolkit to parse the SMILES string, use the fingerprint parameters to generate the fingerprint, and finally use the fingerprint for a similarity search of the fingerprints in ABC.fps.
In addition, the type should be in canonical form, such that two fingerprint types have the same sequence of bytes if and only if they describe the same fingerprint generation method and its parameters. The goal is to make it possible to use a string comparison to test if two sets of fingerprints were generated with the same method.
Chemfp uses type strings which look like:
OpenBabel-FP2/1
OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
They follow the following grammar:
<base-name> ('/' <version>)? (<WS> <argument-name> '=' <argument-value>)*
where base-name and version may be any sequence of one or more printable ASCII characters except for the space and '/' characters (i.e., they match the regular expression pattern [!-.0-~]+); WS contains 1 or more whitespace characters (defined as CR, LF, TAB, SPACE, and VT); argument-name matches the pattern [a-zA-Z_][a-zA-Z0-9_]* and argument-value is one or more printable ASCII characters except for the space character (i.e., it matches the pattern [!-~]+).
In canonical form, only a single space character may be used for WS.
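The grammar above can be turned into a small parser. The following is only a sketch under the stated grammar; parse_type and has_canonical_whitespace are hypothetical names, not chemfp functions, and checking whitespace is only one part of what makes a type string canonical.

```python
import re

# Patterns taken from the grammar in this section.
_NAME_RE = re.compile(r"([!-.0-~]+)(?:/([!-.0-~]+))?\Z")
_ARG_RE = re.compile(r"([a-zA-Z_][a-zA-Z0-9_]*)=([!-~]+)\Z")
_WS_RE = re.compile(r"[\r\n\t\v ]+")


def parse_type(type_str):
    """Split a type string into (base_name, version, args).

    Hypothetical helper. `version` is None when absent and `args` is a
    list of (name, value) pairs in their original order.
    """
    terms = _WS_RE.split(type_str)
    match = _NAME_RE.match(terms[0])
    if match is None:
        raise ValueError(f"bad fingerprint name: {terms[0]!r}")
    base_name, version = match.groups()
    args = []
    for term in terms[1:]:
        match = _ARG_RE.match(term)
        if match is None:
            raise ValueError(f"bad argument: {term!r}")
        args.append(match.groups())
    return base_name, version, args


def has_canonical_whitespace(type_str):
    """True if every WS in the string is exactly one space character."""
    return re.search(r"[\r\n\t\v]| {2,}", type_str) is None
```

For example, parse_type("OpenBabel-FP2/1") gives the base name "OpenBabel-FP2", the version "1", and an empty argument list.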
Other canonical type strings are possible. For example, the following shows an OEGraphSim type string for the Path fingerprint, which corresponds to the "OpenEye-Path/2" fingerprint type string shown earlier:
Path,ver=2.0.0,size=4096,bonds=0-5,atype=AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo,btype=Order|Chiral
Chemfp decided to use a different format for the OEGraphSim fingerprints because when chemfp started, OEGraphSim didn't have a fingerprint type string. In addition, chemfp supports multiple toolkits, so it uses the "OpenEye-" prefix to the name to indicate which "Path" fingerprint it refers to.
Most fingerprint types are toolkit-specific, so new fingerprint types should also use a toolkit-specific prefix in the name.
software metadata
The software metadata describes the software components used to make the fingerprint. As a rough guideline, these are the components which are "fingerprint aware" and where a bug in the implementation may affect the fingerprints as they appear in the FPS file. By convention the components are listed so that the most significant components come first.
For example, the following comes from rdkit2fps using the RDKit release 2018.03.1 and chemfp version 3.2.1:
#software=RDKit/2018.03.1 chemfp/3.2.1
and the following comes from oe2fps using the OEGraphSim 2.3 release from 2017 and chemfp version 1.4:
#software=OEGraphSim/2.3.0 (20170613) chemfp/1.4
The comment term (the part inside the parentheses) is meant as a place to add extra information about the component. For OEGraphSim, the version term (after the '/') comes from OEGraphSimGetRelease() and the comment term comes from OEGraphSimGetVersion().
The software line is modeled on the "User-Agent" header of HTTP/1.1. The expected grammar is:
software ::= <product> (<WS> <product>)*
product ::= <PRODUCT-NAME> ('/' <PRODUCT-VERSION>)? (<WS> <comment>)*
comment ::= '(' <text> ')'
PRODUCT-NAME ::= /[!-.0-~]+/
PRODUCT-VERSION ::= /[!-.0-~]+/
WS ::= /[\r\n\t\v ]+/
text ::= any sequence of printable ASCII characters containing balanced parentheses
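A minimal sketch of a parser for this grammar follows. The name parse_software is hypothetical. Because the PRODUCT-NAME pattern also permits '(', the sketch assumes that any term starting with '(' is a comment attached to the preceding product; that disambiguation is an assumption of this example, not something the grammar states.

```python
import re

_PRODUCT_RE = re.compile(r"([!-.0-~]+)(?:/([!-.0-~]+))?")
_WS_RE = re.compile(r"[\r\n\t\v ]+")


def parse_software(value):
    """Return a list of (name, version, comments) tuples.

    Hypothetical helper. `version` is None when absent and `comments`
    is a list of the comment texts without the outer parentheses.
    """
    products = []
    pos, end = 0, len(value)
    while pos < end:
        if value[pos] == "(":
            if not products:
                raise ValueError("comment before any product")
            # Scan forward to the matching close parenthesis.
            start, depth = pos, 0
            while pos < end:
                if value[pos] == "(":
                    depth += 1
                elif value[pos] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                pos += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses")
            products[-1][2].append(value[start + 1:pos])
            pos += 1
        else:
            match = _PRODUCT_RE.match(value, pos)
            if match is None:
                raise ValueError(f"bad product at offset {pos}")
            products.append((match.group(1), match.group(2), []))
            pos = match.end()
        # Terms must be separated by whitespace.
        match = _WS_RE.match(value, pos)
        if match is not None:
            pos = match.end()
        elif pos < end:
            raise ValueError(f"expected whitespace at offset {pos}")
    return products
```

Applied to the oe2fps example above, this yields two products: OEGraphSim with version "2.3.0" and the comment "20170613", followed by chemfp with version "1.4" and no comments.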
source metadata
The source metadata records the location of the input structures used to generate the fingerprints. It helps keep track of data provenance. The source metadata line may occur 0 or more times, and it should be interpreted as a list of sources.
In most cases the value on the source line will be a filename, though it may be a URL, database URI, or other identifier.
A filename may contain a newline or a byte sequence which cannot be encoded as UTF-8, making it impossible to be represented on an FPS source line. These unsupported characters/bytes must be removed or substituted.
If you would like the source value to be machine interpretable then the source should be written as a URI. Bear in mind that relative filenames, or absolute paths on different file systems, may still make it difficult to locate the original file.
File paths which look like URIs (for example, the files "file:example.smi" and "http:example.sdf") should be encoded as a file URL. Currently none of the chemfp tools do this because of a lack of demand for machine-interpretable source data.
date metadata
The date metadata keeps track of the approximate time that the fingerprints were generated. It should be some time between when the fingerprint generation program started and ended, and most likely is a time shortly after the file was opened.
The time is written as an ISO 8601 timestamp in the form:
YYYY-MM-DDThh:mm:ss
-or-
YYYY-MM-DDThh:mm:ss.sss
Timezone designators are not allowed. (2024 June: this decision is under review and will likely change.) The time should be given in UTC instead of local time. Fractional seconds are allowed but optional. (Note: before the release of chemfp 1.6/3.4, fractional seconds were neither allowed nor supported. In practice, it's often easier to generate a timestamp with fractional seconds than without, and some FPS tools ended up including it.)
For example:
#date=2017-12-25T13:05:26
indicates a time in the early afternoon of Christmas Day 2017 in Greenwich, UK.
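One way to produce such a timestamp with the Python standard library is sketched below. The function name fps_date is hypothetical, naive datetime values are assumed to already be in UTC, and the use of exactly three fractional digits is an assumption based on the ".sss" form shown above.

```python
from datetime import datetime, timezone


def fps_date(now=None, fractional=False):
    """Format a timestamp for the date metadata line.

    Hypothetical helper. The result is in UTC and carries no timezone
    designator. Naive datetimes are assumed to already be in UTC.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if now.tzinfo is not None:
        # Convert to UTC, then drop tzinfo so that isoformat() does not
        # append a "+00:00" designator.
        now = now.astimezone(timezone.utc).replace(tzinfo=None)
    timespec = "milliseconds" if fractional else "seconds"
    return now.isoformat(timespec=timespec)
```

A header line can then be written as "#date=" followed by the returned string.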
Metadata extensions
Anyone may add new metadata lines so long as those lines may be ignored without affecting the semantic understanding of the rest of the file.
For example, in some of my files I use #comment as a way to store free-form comments.
CACTVS uses an extension to indicate if the fingerprints are molecule or reaction fingerprints.
An earlier version of this specification proposed that new header lines be marked with an "x-" or "X-". Following the advice of RFC 6648, that suggestion is withdrawn.
Fingerprint records
The fingerprint records appear after the header, with one fingerprint record per line.
Each record contains 2 or more tab-separated fields. The first field contains the hex-encoded fingerprint, the second field contains the identifier.
Fingerprint field
The hex encoding may use the upper-case characters 'A'-'F' or lower-case characters 'a'-'f'. It should use lower-case characters to make it easier for humans to read.
The first two (left-most) hex characters encode the first fingerprint byte, the second two hex characters encode the second fingerprint byte, and so on. This is sometimes referred to as a little-endian byte order.
The bits in the byte are recorded in big-endian order. The hex value "01" encodes a byte value of 1, the hex value "0a" encodes the byte value 10, and the hex value "c0" encodes the byte value 192.
The mixed-endian encoding means that the first hex character corresponds to the second nibble (bits 4-7) of the fingerprint, the second hex character corresponds to the first nibble (bits 0-3) of the fingerprint, the third hex character corresponds to the fourth nibble (bits 12-15), the fourth hex character corresponds to the third nibble (bits 8-11), and so on.
If the fingerprint length is not an integer multiple of 8 bits then additional 0 bits must be used to pad the fingerprint up to the next multiple of 8 bits. The most significant bits of the last byte are used as the pad bits. For example, the hex-encoded 166-bit MACCS fingerprint will have bits 166 and 167 set to zero, to pad the fingerprint up to 168 bits.
Note: non-zero padding may cause some programs to give incorrect results.
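A reader that wants to detect non-zero padding could use a check like the following sketch. The name padding_is_zero is hypothetical; the check relies only on the rule that the pad bits are the most significant bits of the last byte.

```python
def padding_is_zero(hex_fingerprint, num_bits):
    """Return True if all pad bits in the final byte are 0.

    Hypothetical helper for illustration only.
    """
    fp = bytes.fromhex(hex_fingerprint)
    num_pad = 8 * len(fp) - num_bits
    if not 0 <= num_pad < 8:
        raise ValueError("num_bits is inconsistent with the fingerprint length")
    if num_pad == 0:
        return True
    # The pad bits are the top `num_pad` bits of the last byte.
    pad_mask = (0xFF << (8 - num_pad)) & 0xFF
    return (fp[-1] & pad_mask) == 0
```

For the first MACCS fingerprint in the example at the top of this document, the last byte is 0x1f, so bits 166 and 167 are 0 and the check passes.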
The following shows an example of an FPS file containing a single 44-bit fingerprint where bits 0, 1, 4, 6, 9, 12, 16, 19, 29, 30, 31, 33, 34, 35, and 41 are set to 1:
#FPS1
#num_bits=44
531209e00e02 example
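The example can be checked with a short sketch that converts between the hex encoding and a list of bit positions. The function names on_bits and to_hex are hypothetical.

```python
def on_bits(hex_fingerprint):
    """Return the sorted list of bit positions which are set to 1."""
    fp = bytes.fromhex(hex_fingerprint)
    return [
        8 * byteno + bitno
        for byteno, byte in enumerate(fp)  # first hex pair is byte 0
        for bitno in range(8)              # bit 0 is the least significant bit
        if byte & (1 << bitno)
    ]


def to_hex(bits, num_bits):
    """Hex-encode the given bit positions, padding with 0 bits to a whole byte."""
    fp = bytearray((num_bits + 7) // 8)
    for bit in bits:
        fp[bit // 8] |= 1 << (bit % 8)
    return fp.hex()
```

Calling on_bits("531209e00e02") returns the fifteen bit positions listed above, and to_hex() with those positions and num_bits=44 gives back the same hex string.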
Why this hex encoding?
Two common questions people have about the FPS format are "why hex encoding instead of ...?" and "why mixed endian?"
The FPS format is meant to be a simple format which replaces the sorts of ad hoc formats that researchers often develop when working with fingerprint data. Any encoding must be easy to create, and easy to understand.
Hex encoding is available as a built-in operation in nearly every programming language. In those rare cases where it is not available, it's easy to find a function which does the conversion, or simply create one from scratch.
Hex encoding is also easy to understand. It's hard to think of something easier, except perhaps a list of "0" and "1" characters.
A list of 0 and 1 characters uses 8 bits to encode 1 bit. A hex character uses 2 bits to encode 1 bit, so it's four times more compact. The hex-encoding is a worthwhile tradeoff between space savings and complexity.
There are even more compact text representations, like base64 and Ascii85. Base64 encoding is part of the standard library of most programming languages, and Ascii85 implementations are easy to find.
On the other hand, these encodings are harder for a human to understand. Consider the 64-bit fingerprint with bit 0 set. The different encodings are:
hex: 0100000000000000
base64: AQAAAAAAAAA=
Ascii85: 0RR9100000
It's easier for someone to look at the hex string and understand something about the content than it would be for the other formats.
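These three strings can be reproduced with the Python standard library, as in the sketch below. Note one assumption: the base-85 string shown above matches the output of base64.b85encode (the alphabet used by git-style binary diffs) rather than base64.a85encode (the Adobe variant), so that is the function used here.

```python
import base64

# A 64-bit fingerprint where only bit 0 is set.
fp = bytes([0x01]) + bytes(7)

hex_text = fp.hex()
base64_text = base64.b64encode(fp).decode("ascii")
base85_text = base64.b85encode(fp).decode("ascii")

print("hex:   ", hex_text)
print("base64:", base64_text)
print("base85:", base85_text)
```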
It's true that base64 and Ascii85 are more compact. On the other hand, if space is an issue then the files should be compressed. My tests with gzip compression show that all three encodings compress to roughly the same size.
Why the mixed endianness? The actual bit order doesn't matter for something like a Tanimoto search or a substructure screen, so long as everything is consistent.
Where it does matter is for tasks like finding all MACCS fingerprints where key 20 (bit 19) is set. (This key is "contains silicon".) With little-endian byte order the test is simply:
bitno = 19;
bit_is_set = byte_fingerprint[bitno/8] & (1<<(bitno % 8));
With big-endian byte order the byte offset occurs from the right end. The indexing is slightly more complex because it requires a subtraction from the total fingerprint length.
The bit order in the byte comes naturally from the usual hex encoding of a byte value.
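For comparison, here is a Python sketch of the same test for both byte orders. The function names are hypothetical, and the big-endian variant exists only to show the extra subtraction; it is not part of the FPS format.

```python
def bit_is_set(fp, bitno):
    """Test a bit using the FPS (little-endian) byte order."""
    return bool(fp[bitno // 8] & (1 << (bitno % 8)))


def bit_is_set_big_endian(fp, bitno):
    """Test a bit in a hypothetical big-endian byte order.

    The byte offset counts from the right end, so the total length
    is needed to find the byte.
    """
    return bool(fp[len(fp) - 1 - bitno // 8] & (1 << (bitno % 8)))
```

With fp = bytes.fromhex("531209e00e02"), the 44-bit example from earlier, bit_is_set(fp, 19) is true and bit_is_set(fp, 20) is false.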
Identifier field
The identifier is UTF-8 encoded. It may contain a space character because some identifiers, like an IUPAC name, contain space characters.
It must not contain a tab character, one of the newline sequences, or a NUL character. This specification does not define how to encode these unsupported values.
Duplicate identifiers are allowed, though this does not usually make sense.
Additional fields
Additional tab-separated fields may be used to associate additional data values with a given record.
There should be the same number of fields in each record.
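Putting the record rules together, a single record line could be parsed as in the following sketch. The name parse_record and the shape of its return value are assumptions made for this example.

```python
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_record(line):
    """Split one record into (fingerprint_bytes, identifier, extra_fields).

    Hypothetical helper for illustration only.
    """
    # Accept either newline convention, then drop it.
    if line.endswith("\r\n"):
        line = line[:-2]
    elif line.endswith("\n"):
        line = line[:-1]
    fields = line.split("\t")
    if len(fields) < 2:
        raise ValueError("a record needs a fingerprint and an identifier")
    hex_fp, identifier, *extra = fields
    # bytes.fromhex() tolerates embedded whitespace, so validate strictly first.
    if len(hex_fp) % 2 != 0 or not set(hex_fp) <= _HEX_DIGITS:
        raise ValueError(f"bad hex fingerprint: {hex_fp!r}")
    # Tabs cannot remain after the split; check for newline and NUL.
    if "\n" in identifier or "\0" in identifier:
        raise ValueError("identifier contains an unsupported character")
    return bytes.fromhex(hex_fp), identifier, extra
```

A reader that wants to enforce the same number of fields in each record can compare the length of the extra fields list across records.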
A future version of this specification may include a mechanism to specify field names and data types in the header.
Changes:
3 Sept 2020: Documented that fractional seconds are now allowed in the date metadata.