chemfp.csv_readers module

This module contains CSV file readers and methods to work with CSV dialects.

The CSV readers are based on Python’s ‘csv’ module, described at https://docs.python.org/3/library/csv.html . I found it hard to configure a new dialect using that module, and prefer using the get_dialect() function defined here.

The read_csv_rows() function returns a CSVRowReader which iterates over the rows of a CSV file, each as a list of column values.

The read_csv_ids_and_molecules_with_parser() function returns a CSVIdAndMoleculeReader which iterates over (id, molecule) pairs from columns in a CSV file. You probably want to use the “read_csv_ids_and_molecules()” function for a given toolkit wrapper rather than use this function directly.

The read_csv_ids_and_fingerprints() function returns a CSVFingerprintReader which extracts pre-computed fingerprints from the CSV file and iterates over them as (id, fingerprint) pairs.

This module is used by the chemfp csv2fps command-line tool.

exception chemfp.csv_readers.CSVColumnDecodeError(argname, titles, row, column, title, record_id, decoder_name, decode_err)

Bases: CSVColumnError

Exception raised if the fingerprint column could not be decoded

The additional public attributes (beyond those in the parent CSVColumnError) are:

decoder_name

the name of the decoder used

decode_err

a string describing the decoding error

exception chemfp.csv_readers.CSVColumnError(argname, titles, row, column, title, record_id)

Bases: IndexError, ChemFPError

Base class for CSV column error exceptions

The public attributes are:

msg

a string describing the error

argname

the internal/parameter name for the missing column

titles

the list of column titles, or None if no titles

row

the list of row values if processing rows, or None

column

the missing column index, starting from 1

title

the missing column title, or None

record_id

the record id for this row, if applicable and available

exception chemfp.csv_readers.CSVColumnIndexError(argname, titles, column)

Bases: CSVColumnError

Exception raised if the column specified by index is larger than the number of header titles

exception chemfp.csv_readers.CSVColumnTitleError(argname, titles, title)

Bases: CSVColumnError

Exception raised if the column specified by name is not found in the header titles

exception chemfp.csv_readers.CSVConfigurationError

Bases: Error, TypeError

Exception raised due to a CSV configuration error

class chemfp.csv_readers.CSVDialect(*, delimiter=_default_dialect.delimiter, quotechar=_default_dialect.quotechar, escapechar=_default_dialect.escapechar, doublequote=_default_dialect.doublequote, skipinitialspace=_default_dialect.skipinitialspace, quoting=_default_dialect.quoting, lineterminator=_default_dialect.lineterminator, strict=_default_dialect.strict)

Bases: Dialect

This is a subclass of csv.Dialect

For details about how the configuration attributes work see https://docs.python.org/3/library/csv.html .

classmethod from_dialect(dialect)

Return a new CSVDialect given a dialect name or Dialect-like object.

Raise a CSVDialectError if a named dialect or alias is unknown.

A Dialect-like object must have all of the expected CSVDialect attributes (‘delimiter’, ‘quotechar’, and so on.)

get_dialect_name()

Return the dialect name, if known, otherwise return None

exception chemfp.csv_readers.CSVDialectError

Bases: Error

Exception raised when a requested named dialect is not known

class chemfp.csv_readers.CSVFingerprintReader(metadata, id_fp_iterator, location, close, dialect, has_header, titles, id_column, fp_column)

Bases: FingerprintIterator

Read fingerprints from columns in a CSV file and iterate over the (id, fingerprint) pairs

The additional attributes beyond the FingerprintIterator are:

dialect

a CSVDialect describing the CSV reader configuration

has_header

True if the CSV reader was configured to read header titles

titles

a list of title names, or None if there was no header

id_column

the column index for the id

fp_column

the column index for the fingerprint record

The CSVFingerprintReader is also a context manager which closes the CSV file when done.

property fp_title: str | None

Get the fingerprint column title, or None if no titles

property id_title: str | None

Get the id column title, or None if no titles

class chemfp.csv_readers.CSVIdAndMoleculeReader(metadata, structure_reader, close, location, dialect, has_header, titles, id_column, mol_column)

Bases: IdAndMoleculeReader

Read structures from columns in a CSV file and iterate over the (id, toolkit molecule) pairs

The additional attributes beyond the parent IdAndMoleculeReader are:

dialect

a CSVDialect describing the CSV reader configuration

has_header

True if the CSV reader was configured to read header titles

titles

a list of title names, or None if there was no header

id_column

the column index for the id, or None if it comes from the molecule record

mol_column

the column index for the molecule record

Note

The id_column and mol_column values start with 1, so a value of 1 means the first column, 2 means the second, and so on.
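The 1-based column convention can be illustrated with a small stdlib-only sketch. The helper name resolve_column() is hypothetical and is not part of chemfp; it only shows how a column given as a position or a title might map to a 0-based list index, raising IndexError (the base class of CSVColumnError) when the column is missing.

```python
def resolve_column(spec, titles):
    """Hypothetical helper: map a 1-based position or a title to a 0-based index."""
    if isinstance(spec, int):
        # Positions start at 1, so 1 is the first column.
        if spec < 1 or (titles is not None and spec > len(titles)):
            raise IndexError(f"column position {spec} is out of range")
        return spec - 1
    # Otherwise treat the specifier as a header title.
    if titles is None or spec not in titles:
        raise IndexError(f"column title {spec!r} not found in the header")
    return titles.index(spec)


titles = ["id", "smiles", "fingerprint"]
row = ["CHEMBL25", "CC(=O)Oc1ccccc1C(=O)O", "00ff"]
first = row[resolve_column(1, titles)]          # first column, by position
by_title = row[resolve_column("smiles", titles)]  # second column, by title
```

Converting to a 0-based index once, before iterating, keeps the per-row lookup to a plain list index.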

The CSVIdAndMoleculeReader is also a context manager which closes the CSV file when done.

close()

Close the reader

If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set self.closed to True.

property id_title

Get the id column title, or None if no titles or id column is None

property mol_title

Get the molecule record column title, or None if there are no titles

class chemfp.csv_readers.CSVRowReader(dialect, has_header, titles, row_iter, close, location)

Bases: object

Information about the CSV row reader

This iterates across rows of a CSV file. When used as a context manager it closes the CSV file when done.

The public attributes are:

dialect: CSVDialect

A CSVDialect describing the CSV reader configuration.

has_header: bool

True if the CSV reader was configured to read header titles

titles: list of string, or None

A list of title names, or None if there was no header.

location: chemfp.io.Location

The current processing location.

closed: bool

True if the reader has been closed

close()

Close the reader

If the reader wasn’t previously closed then close it.
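As a rough illustration of the behavior described above (row iteration, an optional header, and closing the file on exit), here is a stdlib-only stand-in. MiniRowReader is a hypothetical class written for this sketch; it is not chemfp's CSVRowReader and omits the dialect and location attributes.

```python
import csv
import io


class MiniRowReader:
    """Hypothetical stand-in showing the iterator and context-manager pattern."""

    def __init__(self, fileobj, dialect="excel", has_header=True):
        self._fileobj = fileobj
        self._rows = csv.reader(fileobj, dialect=dialect)
        self.has_header = has_header
        # Consume the first record as titles only when a header is expected.
        self.titles = next(self._rows, None) if has_header else None
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._rows)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def close(self):
        # Closing twice is harmless; only the first call does any work.
        if not self.closed:
            self._fileobj.close()
            self.closed = True


with MiniRowReader(io.StringIO("id,fp\nA,00ff\nB,1a2b\n")) as reader:
    header = reader.titles
    rows = list(reader)
```

Using the reader in a with statement means the underlying file is closed even if iteration stops early because of an exception.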

exception chemfp.csv_readers.CSVUnicodeDecodeError(encoding: str, object: bytes, start: int, end: int, reason: str, bytes_read: int)

Bases: UnicodeDecodeError

A subclass of UnicodeDecodeError used to get the error position in the CSV file

The normal UnicodeDecodeError only reports the location relative to the text block. The CSVUnicodeDecodeError includes the total number of bytes read before raising the exception (as bytes_read). Use the following to compute the start and end locations relative to the entire file:

start_in_file = err.bytes_read - len(err.object) + err.start
end_in_file = err.bytes_read - len(err.object) + err.end
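To see why the block length is subtracted, the following stdlib-only sketch decodes a byte string in fixed-size blocks and applies the same arithmetic to an ordinary UnicodeDecodeError. The block size and the locally tracked bytes_read counter are assumptions for illustration; the counter is taken to include the failing block, as the formula implies.

```python
# ASCII data with one invalid UTF-8 byte (0xff) near the end.
data = b"id,name\n" + b"1,ok\n" * 3 + b"2,bad\xff\n"
block_size = 16  # hypothetical block size, chosen only for this example

bytes_read = 0
start_in_file = end_in_file = None
for offset in range(0, len(data), block_size):
    block = data[offset:offset + block_size]
    bytes_read += len(block)  # counts through the end of the current block
    try:
        block.decode("utf8")
    except UnicodeDecodeError as err:
        # err.start and err.end are relative to err.object (the block), so
        # step back to the start of the block before adding them.
        block_start = bytes_read - len(err.object)
        start_in_file = block_start + err.start
        end_in_file = block_start + err.end
        break
```

The sketch avoids multi-byte characters that straddle a block boundary; a real block decoder has to handle that case separately.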

The attributes inherited from the UnicodeDecodeError are:

encoding: str

The name of the encoding.

object: bytes

The text block being decoded, as bytes.

start: int

The start location of the error in the block.

end: int

The end location of the error in the block.

reason: str

The reason for the error.

This class adds the following attribute:

bytes_read: int

The number of bytes read before reaching this text block.

property end_in_file: int

Return the end position relative to the start of the file

property start_in_file: int

Return the start position relative to the start of the file

chemfp.csv_readers.get_dialect(dialect: str | Dialect | CSVDialect = 'csv', **kwargs) CSVDialect

Create a new CSVDialect given a base dialect and optional modifiers.

If a configuration value is not specified in the keyword arguments then use the corresponding attribute of the dialect.

The dialect may be a string with the dialect name, or a Dialect-like object with the required CSV attributes. If the named dialect is unknown then raise a CSVDialectError exception. Raise an AttributeError if the Dialect-like object does not contain a required attribute.

Raise a TypeError if an unsupported keyword argument is passed in.

Parameters:
  • dialect (a string or Dialect instance) – the default dialect properties for the returned dialect

  • kwargs (keyword arguments) – override the default configuration properties

Returns:

a CSVDialect
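The "base dialect plus keyword overrides" idea can be sketched with only Python's csv module. The function derive_dialect() below is a hypothetical stand-in, not chemfp's get_dialect(); it copies the attributes of a registered csv dialect, applies any overrides, and rejects unknown keywords with a TypeError.

```python
import csv

# The configuration attributes shared by csv.Dialect and the CSVDialect signature.
_FIELDS = ("delimiter", "quotechar", "escapechar", "doublequote",
           "skipinitialspace", "quoting", "lineterminator", "strict")


def derive_dialect(base="excel", **overrides):
    """Hypothetical stand-in: copy a registered csv dialect, then apply overrides."""
    unknown = set(overrides) - set(_FIELDS)
    if unknown:
        raise TypeError(f"unsupported dialect option(s): {sorted(unknown)}")
    base_dialect = csv.get_dialect(base)  # raises csv.Error for unknown names
    settings = {name: getattr(base_dialect, name) for name in _FIELDS}
    settings.update(overrides)
    # Build a csv.Dialect subclass carrying the merged settings.
    return type("DerivedDialect", (csv.Dialect,), settings)


# Start from the comma-separated "excel" dialect but split on tabs instead.
tab_dialect = derive_dialect("excel", delimiter="\t")
parsed = list(csv.reader(["a\tb,c"], dialect=tab_dialect))
```

Every value not named in the keyword arguments comes from the base dialect, which is the same fallback rule described for get_dialect().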

chemfp.csv_readers.get_dialect_name(dialect: Dialect | CSVDialect) str | None

Given a csv.Dialect or CSVDialect, try to figure out its name.

The default supported names are ‘csv’ and ‘tsv’, as well as the dialects registered in csv.list_dialects(), of which ‘unix’ is the only relevant one.

Returns None if the dialect is unknown.

Parameters:

dialect (a csv.Dialect or a CSVDialect) – a dialect object

Returns:

str or None

chemfp.csv_readers.read_csv_ids_and_fingerprints(source: _typing.Source, *, id_column: int = 1, fp_column: int = 2, decoder: _Union[str, _typing.FingerprintDecoder] = 'hex', decoder_name: _Optional[str] = None, dialect: _Optional[DialectType] = None, has_header: bool = True, compression: _typing.Literal['auto', 'gz', 'zst', ''] = 'auto', errors: _typing.ErrorsNames = 'strict', csv_errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None, encoding: str = 'utf8', encoding_errors: str = 'strict')

Read ids and fingerprints from columns of a CSV file using a fingerprint decoder function.

Read from source, which may be a filename, a file-like object, or None to read from stdin.

Use id_column and fp_column to specify the columns containing the record identifier and fingerprint. By default the identifiers come from column 1 (the first column) and the fingerprints from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line.

Use decoder to describe how to decode the fingerprint. This is either a named decoder (see chemfp.encodings.get_decoder()) or a function which takes a string and returns a 2-element tuple of the number of bits and byte-string fingerprint. The number of bits may be None if the fingerprint size can be inferred from the fingerprint length. When decoder_name is not None, use it as the decoder name during error reporting, otherwise use decoder if it is a string.
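A custom decoder only needs to follow the contract described above: take a string and return a (number of bits, byte string) pair. The sketch below is a hypothetical decoder for fingerprints stored as text made of "0" and "1" characters. The bit order (the first character is the lowest bit of the first byte) and the use of ValueError for bad input are assumptions of this example, not documented chemfp behavior.

```python
def decode_bitstring(text):
    """Hypothetical decoder: '0'/'1' text -> (num_bits, fingerprint bytes)."""
    text = text.strip()
    if set(text) - {"0", "1"}:
        raise ValueError("bit string may only contain '0' and '1'")
    num_bits = len(text)
    fp = bytearray((num_bits + 7) // 8)  # round up to whole bytes
    for i, ch in enumerate(text):
        if ch == "1":
            # Assumed bit order: character i sets bit (i % 8) of byte (i // 8).
            fp[i // 8] |= 1 << (i % 8)
    return num_bits, bytes(fp)


num_bits, fp = decode_bitstring("1000000001")
```

Because this format records the exact bit count, the decoder returns it rather than None; a decoder for a byte-aligned encoding such as hex could return None and let the size be inferred from the fingerprint length.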

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source

  • id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier

  • fp_column (integer position (starting from 1), string) – the column position or column title containing the fingerprint

  • decoder (a string or a function) – the decoder name or function used to parse the fingerprint record

  • decoder_name (a string, or None) – a label for the decoder text to use during error reporting

  • dialect (None, a string name, or a Dialect instance) – the CSV dialect

  • has_header (bool) – True if the first record contains titles, False if it does not

  • compression (string or None) – file compression format

  • csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string) – the name of the file’s character encoding

  • encoding_errors (string) – the method used to handle decoding errors

Returns:

a CSVFingerprintReader iterating (id, fingerprint) pairs

chemfp.csv_readers.read_csv_ids_and_molecules_with_parser(source: _typing.Source, parse_id_and_mol: _typing.IdAndMolParser, *, id_column: int = 1, mol_column: int = 2, dialect: _typing.Optional[DialectType] = None, has_header: bool = True, compression: _typing.Literal['auto', 'gz', 'zst', ''] = 'auto', csv_errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None, encoding: str = 'utf8', encoding_errors: str = 'strict', record_format: _Optional[str] = None, record_args: _Optional[_typing.ReaderArgs] = None)

Read ids and molecules from column(s) of a CSV file using a molecule parser function.

Read from source, which may be a filename, a file-like object, or None to read from stdin.

The required parse_id_and_mol is a function which parses the molecule record (as a string) and returns a 2-element tuple containing the record id and molecule. The id is only used if id_column is None. If the molecule is None then the record will be skipped. If the parser raises a ParseError exception then the current location will be attached to the exception, which is then re-raised. The toolkit make_id_and_molecule_parser() function returns an appropriate function.
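The parser contract can be sketched without a chemistry toolkit. In the example below, parse_smiles_record() and iter_ids_and_molecules() are hypothetical names, and the "molecule" is just the SMILES string standing in for a toolkit molecule object. The sketch shows the two rules from the paragraph above: the parsed id is used only when no id column is given, and a molecule of None causes the record to be skipped.

```python
def parse_smiles_record(record):
    """Hypothetical parser: 'SMILES [id]' text -> (id, molecule stand-in)."""
    fields = record.split(None, 1)
    if not fields:
        return None, None  # nothing to parse, so the record is skipped
    smiles = fields[0]
    record_id = fields[1].strip() if len(fields) > 1 else None
    return record_id, smiles  # a real parser would return a toolkit molecule


def iter_ids_and_molecules(rows, mol_index, id_index=None):
    """Yield (id, molecule) pairs; indices here are 0-based list positions."""
    for row in rows:
        parsed_id, mol = parse_smiles_record(row[mol_index])
        if mol is None:
            continue
        # The parsed id is a fallback used only when there is no id column.
        yield (parsed_id if id_index is None else row[id_index]), mol


rows = [["x1", "CCO ethanol"], ["x2", ""], ["x3", "c1ccccc1 benzene"]]
from_column = list(iter_ids_and_molecules(rows, mol_index=1, id_index=0))
from_record = list(iter_ids_and_molecules(rows, mol_index=1))
```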

Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.
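The three csv_errors names can be pictured as policies for a row that is too short. The sketch below is a stdlib-only illustration with a hypothetical select_columns() generator. It assumes that "strict" raises, "report" writes a message to stderr and continues, and "ignore" continues silently; only the "strict" default is stated in the text above.

```python
import sys


def select_columns(rows, id_index, value_index, csv_errors="strict"):
    """Hypothetical sketch: yield (id, value) pairs, applying an error policy."""
    if csv_errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported csv_errors value: {csv_errors!r}")
    needed = max(id_index, value_index) + 1
    for row_number, row in enumerate(rows, 1):
        if len(row) < needed:
            msg = f"row {row_number} has {len(row)} column(s), need at least {needed}"
            if csv_errors == "strict":
                raise IndexError(msg)  # stop at the first short row
            if csv_errors == "report":
                print(msg, file=sys.stderr)  # note the problem, then move on
            continue  # "report" and "ignore" both skip the row
        yield row[id_index], row[value_index]


rows = [["a", "CCO"], ["b"], ["c", "CCN"]]
kept = list(select_columns(rows, 0, 1, csv_errors="ignore"))
```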

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

The record_format and record_args are used to set the “record_format” and “args” values of the returned reader’s FormatMetadata metadata object.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source

  • parse_id_and_mol (it must take a string and return an (id, molecule) pair) – the function used to parse a molecule record

  • id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier

  • mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record

  • dialect (None, a string name, or a Dialect instance) – the CSV dialect

  • has_header (bool) – True if the first record contains titles, False if it does not

  • compression (string or None) – file compression format

  • csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string) – the name of the file’s character encoding

  • encoding_errors (string) – the method used to handle decoding errors

  • record_format (string) – the molecular structure format name

  • record_args (None or a dictionary of format reader or writer args) – the format arguments to store as the “args” value of the reader’s metadata

Returns:

a CSVIdAndMoleculeReader iterating (id, molecule) pairs

chemfp.csv_readers.read_csv_rows(source=None, dialect=None, has_header=True, compression='auto', location=None, encoding='utf8', encoding_errors='strict')

Read rows from a CSV file

Read from source, which may be a filename, a file-like object, or None (the default) to read from stdin.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
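Extension-based inference for both the compression and the dialect might look like the following stdlib-only sketch. The function name is hypothetical, and returning None for an unrecognized extension is an assumption of this example; the text above does not say what chemfp does in that case.

```python
import os


def infer_compression_and_dialect(filename):
    """Hypothetical sketch: guess (compression, dialect name) from a filename."""
    compression = ""
    root, ext = os.path.splitext(filename.lower())
    if ext in (".gz", ".zst"):
        compression = ext[1:]                # "gz" or "zst"
        root, ext = os.path.splitext(root)   # look at the extension underneath
    dialect = {".csv": "csv", ".tsv": "tsv"}.get(ext)  # None if unrecognized
    return compression, dialect


guess = infer_compression_and_dialect("compounds.tsv.gz")
```

Stripping the compression suffix first lets a name like compounds.tsv.gz resolve to the tab-separated dialect.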

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source

  • dialect (None, a string name, or a Dialect instance) – the CSV dialect

  • has_header (bool) – True if the first record contains titles, False if it does not

  • compression (string or None) – file compression format

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string) – the name of the file’s character encoding

  • encoding_errors (string) – the method used to handle decoding errors

Returns:

a CSVRowReader iterating rows