chemfp.csv_readers module¶
This module contains CSV file readers and methods to work with CSV dialects.
The CSV readers are based on Python’s ‘csv’ module, described at
https://docs.python.org/3/library/csv.html . I found it hard to
configure a new dialect using that module, and prefer using the
get_dialect()
function defined here.
The read_csv_rows()
function returns a CSVRowReader
which iterates over the rows of a CSV file.
The read_csv_ids_and_molecules_with_parser()
function returns a
CSVIdAndMoleculeReader
which iterates over (id, molecule)
pairs from columns in a CSV file. You probably want to use the
“read_csv_ids_and_molecules()” function for a given toolkit
wrapper rather than use this function directly.
The read_csv_ids_and_fingerprints()
function returns a
CSVFingerprintReader
which extracts pre-computed fingerprints
from the CSV file and iterates over them as (id, fingerprint) pairs.
This module is used by the chemfp csv2fps command-line tool.
- exception chemfp.csv_readers.CSVColumnDecodeError(argname, titles, row, column, title, record_id, decoder_name, decode_err)¶
Bases:
CSVColumnError
Exception raised if the fingerprint column could not be decoded
The additional public attributes (beyond those in the parent CSVColumnError) are:
- decoder_name¶
the name of the decoder used
- decode_err¶
a string describing the decoding error
- exception chemfp.csv_readers.CSVColumnError(argname, titles, row, column, title, record_id)¶
Bases: IndexError, ChemFPError
Base class for CSV column error exceptions
The public attributes are:
- msg¶
a string describing the error
- argname¶
the internal/parameter name for the missing column
- titles¶
the list of column titles, or None if no titles
- row¶
the list of row values if processing rows, or None
- column¶
the missing column index, starting from 1
- title¶
the missing column title, or None
- record_id¶
the record id for this row, if applicable and available
- exception chemfp.csv_readers.CSVColumnIndexError(argname, titles, column)¶
Bases:
CSVColumnError
Exception raised if the column specified by index is larger than the number of header titles
- exception chemfp.csv_readers.CSVColumnTitleError(argname, titles, title)¶
Bases:
CSVColumnError
Exception raised if the column specified by name is not found in the header titles
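The two exceptions above correspond to the two ways a column can be specified: by 1-based position or by header title. The following is a minimal sketch of that lookup using only the standard library. The `resolve_column` helper is hypothetical and not part of chemfp, and it raises a plain IndexError where chemfp raises its own CSVColumnError subclasses.

```python
def resolve_column(column, titles):
    """Return the 0-based list index for a column given by position or title.

    Hypothetical helper for illustration; not a chemfp function.
    `column` is a 1-based integer position or a header title string.
    `titles` is the list of header titles, or None if there is no header.
    """
    if isinstance(column, int):
        # Positions start from 1; the upper bound is only known with titles.
        if column < 1 or (titles is not None and column > len(titles)):
            raise IndexError(f"column {column} is out of range")
        return column - 1
    if titles is None or column not in titles:
        raise IndexError(f"column title {column!r} not found")
    return titles.index(column)


titles = ["id", "smiles", "fp"]
print(resolve_column(2, titles))     # position 2 is list index 1
print(resolve_column("fp", titles))  # title "fp" is list index 2
```

Keeping the public interface 1-based and converting once at lookup time means the row-processing loop can use ordinary list indexing.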
- exception chemfp.csv_readers.CSVConfigurationError¶
Bases: Error, TypeError
Exception raised due to a CSV configuration error
- class chemfp.csv_readers.CSVDialect(*, delimiter=_default_dialect.delimiter, quotechar=_default_dialect.quotechar, escapechar=_default_dialect.escapechar, doublequote=_default_dialect.doublequote, skipinitialspace=_default_dialect.skipinitialspace, quoting=_default_dialect.quoting, lineterminator=_default_dialect.lineterminator, strict=_default_dialect.strict)¶
Bases:
Dialect
This is a subclass of csv.Dialect
For details about how the configuration attributes work see https://docs.python.org/3/library/csv.html .
- classmethod from_dialect(dialect)¶
Return a new CSVDialect given a dialect name or Dialect-like object.
Raise a CSVDialectError if a named dialect or alias is unknown.
A Dialect-like object must have all of the expected CSVDialect attributes (‘delimiter’, ‘quotechar’, and so on.)
- get_dialect_name()¶
Return the dialect name, if known, otherwise return None
- exception chemfp.csv_readers.CSVDialectError¶
Bases:
Error
Exception raised when a requested named dialect is not known
- class chemfp.csv_readers.CSVFingerprintReader(metadata, id_fp_iterator, location, close, dialect, has_header, titles, id_column, fp_column)¶
Bases:
FingerprintIterator
Read fingerprints from columns in a CSV file and iterate over the (id, fingerprint) pairs
The additional attributes beyond the FingerprintIterator are:
- dialect¶
a
CSVDialect
describing the CSV reader configuration
- has_header¶
True if the CSV reader was configured to read header titles
- titles¶
a list of title names, or None if there was no header
- id_column¶
the column index for the id
- fp_column¶
the column index for the fingerprint record
The CSVFingerprintReader is also a context manager which closes the CSV file when done.
- property fp_title: str | None¶
Get the fingerprint column title, or None if no titles
- property id_title: str | None¶
Get the id column title, or None if no titles
- class chemfp.csv_readers.CSVIdAndMoleculeReader(metadata, structure_reader, close, location, dialect, has_header, titles, id_column, mol_column)¶
Bases:
IdAndMoleculeReader
Read structures from columns in a CSV file and iterate over the (id, toolkit molecule) pairs
The additional attributes beyond the parent IdAndMoleculeReader are:
- dialect¶
a
CSVDialect
describing the CSV reader configuration
- has_header¶
True if the CSV reader was configured to read header titles
- titles¶
a list of title names, or None if there was no header
- id_column¶
the column index for the id, or None if it comes from the molecule record
- mol_column¶
the column index for the molecule record
Note
The id_column and mol_column values start with 1, so a value of 1 means the first column, 2 means the second, and so on.
The CSVIdAndMoleculeReader is also a context manager which closes the CSV file when done.
- close()¶
Close the reader
If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set self.closed to True.
- property id_title¶
Get the id column title, or None if no titles or id column is None
- property mol_title¶
Get the molecule record column title, or None if there are no titles
- class chemfp.csv_readers.CSVRowReader(dialect, has_header, titles, row_iter, close, location)¶
Bases:
object
Information about the CSV row reader
This iterates across rows of a CSV file. When used as a context manager it closes the CSV file when done.
The public attributes are:
- dialect: CSVDialect¶
A
CSVDialect
describing the CSV reader configuration.
- has_header: bool¶
True if the CSV reader was configured to read header titles
- titles: list of string, or None¶
A list of title names, or None if there was no header.
- location: chemfp.io.Location¶
The current processing location.
- closed: bool¶
True if the reader has been closed
- close()¶
Close the reader
If the reader wasn’t previously closed then close it.
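As a rough illustration of the row reader behavior described above (header titles, iteration over rows, and closing on exit from a `with` block), here is a small sketch built on Python's `csv` module. The `RowReaderSketch` class is hypothetical; it is not the chemfp implementation and omits the dialect and location handling.

```python
import csv
import io


class RowReaderSketch:
    """Hypothetical stand-in showing the iterator and context-manager pattern."""

    def __init__(self, infile, has_header=True):
        self._infile = infile
        self._rows = csv.reader(infile)
        self.has_header = has_header
        # Consume the first record as the titles when a header is expected.
        self.titles = next(self._rows, None) if has_header else None
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._rows)

    def close(self):
        # Closing twice is harmless; only the first call closes the file.
        if not self.closed:
            self._infile.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


with RowReaderSketch(io.StringIO("id,fp\nA,00ff\nB,1234\n")) as reader:
    print(reader.titles)
    for row in reader:
        print(row)
print(reader.closed)
```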
- exception chemfp.csv_readers.CSVUnicodeDecodeError(encoding: str, object: bytes, start: int, end: int, reason: str, bytes_read: int)¶
Bases:
UnicodeDecodeError
A subclass of UnicodeDecodeError used to get the error position in the CSV file
The normal UnicodeDecodeError only reports the location relative to the text block. The CSVUnicodeDecodeError includes the total number of bytes read before raising the exception (as bytes_read). Use the following to compute the start and end locations relative to the entire file:
    start_in_file = err.bytes_read - len(err.object) + err.start
    end_in_file = err.bytes_read - len(err.object) + err.end
The attributes inherited from the UnicodeDecodeError are encoding, object, start, end, and reason.
This class adds the following attribute:
- bytes_read: int¶
The number of bytes read before reaching this text block.
- property end_in_file: int¶
Return the end position relative to the start of the file
- property start_in_file: int¶
Return the start position relative to the start of the file
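The position arithmetic above can be tried out with the standard library alone. The sketch below decodes a byte stream block by block, keeps its own running count of bytes read, and applies the same formula when a plain UnicodeDecodeError is raised. The `find_decode_error` function and its `block_size` parameter are hypothetical names for this illustration and are not part of chemfp.

```python
import codecs
import io


def find_decode_error(stream, encoding="utf8", block_size=4096):
    """Return (start_in_file, end_in_file) for the first undecodable span.

    Returns None if the whole stream decodes cleanly.
    Hypothetical helper for illustration; not a chemfp function.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    bytes_read = 0
    while True:
        block = stream.read(block_size)
        bytes_read += len(block)
        try:
            # An empty block means end of input, so flush the decoder.
            decoder.decode(block, final=not block)
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to err.object, which is
            # the most recently read data, so shift by where it began.
            offset = bytes_read - len(err.object)
            return offset + err.start, offset + err.end
        if not block:
            return None


data = b"id,fp\n" + b"A,\xff\n"
print(find_decode_error(io.BytesIO(data), block_size=6))
```

With a block size of 6 the invalid byte falls in the second block, where the exception alone reports position 2. Adding the offset of that block gives its position in the whole input.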
- chemfp.csv_readers.get_dialect(dialect: str | Dialect | CSVDialect = 'csv', **kwargs) → CSVDialect ¶
Create a new CSVDialect given a base dialect and optional modifiers.
If a configuration value is not specified in the keyword arguments then use the corresponding attribute of the dialect.
The dialect may be a string with the dialect name, or a Dialect-like object with the required CSV attributes. If the named dialect is unknown then raise a CSVDialectError exception. Raise an AttributeError if the Dialect-like object does not contain a required attribute.
Raise a TypeError if an unsupported keyword argument is passed in.
- Parameters:
dialect (a string or Dialect instance) – the default dialect properties for the returned dialect
kwargs (keyword arguments) – override the default configuration properties
- Returns:
a CSVDialect
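To make the "base dialect plus keyword overrides" idea concrete, here is a sketch of the same pattern written directly against Python's `csv` module. It is only an approximation of the behavior described above: `derive_dialect` is a hypothetical function, it accepts only names registered with the `csv` module (so "excel" rather than "csv"), and an unknown name raises `csv.Error` rather than a chemfp exception.

```python
import csv
import io

# The dialect attributes listed in the CSVDialect signature.
_FIELDS = ("delimiter", "quotechar", "escapechar", "doublequote",
           "skipinitialspace", "quoting", "lineterminator", "strict")


def derive_dialect(base="excel", **overrides):
    """Return a csv.Dialect instance copied from `base` with overrides applied.

    Hypothetical helper for illustration; not a chemfp function.
    """
    unknown = sorted(set(overrides) - set(_FIELDS))
    if unknown:
        raise TypeError(f"unsupported dialect option(s): {unknown}")
    base_dialect = csv.get_dialect(base) if isinstance(base, str) else base
    # A missing attribute on a Dialect-like object raises AttributeError.
    settings = {name: getattr(base_dialect, name) for name in _FIELDS}
    settings.update(overrides)
    # Instantiating a csv.Dialect subclass validates the settings.
    return type("DerivedDialect", (csv.Dialect,), settings)()


tab_dialect = derive_dialect("excel", delimiter="\t")
rows = list(csv.reader(io.StringIO("id\tfp\r\nA\t00ff\r\n"), dialect=tab_dialect))
print(rows)
```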
- chemfp.csv_readers.get_dialect_name(dialect: Dialect | CSVDialect) → str | None ¶
Given a csv.Dialect or CSVDialect, try to figure out its name.
The default supported names are ‘csv’ and ‘tsv’, as well as the dialects registered in csv.list_dialects(), of which ‘unix’ is the only relevant one.
Returns None if the dialect is unknown.
- Parameters:
dialect (a csv.Dialect or a CSVDialect) – a dialect object
- Returns:
str or None
- chemfp.csv_readers.read_csv_ids_and_fingerprints(source: _typing.Source, *, id_column: int = 1, fp_column: int = 2, decoder: _Union[str, _typing.FingerprintDecoder] = 'hex', decoder_name: _Optional[str] = None, dialect: _Optional[DialectType] = None, has_header: bool = True, compression: _typing.Literal['auto', 'gz', 'zst', ''] = 'auto', errors: _typing.ErrorsNames = 'strict', csv_errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None, encoding: str = 'utf8', encoding_errors: str = 'strict')¶
Read ids and fingerprints from columns of a CSV file using a fingerprint decoder function.
Read from source, which may be a filename, a file-like object, or None to read from stdin.
Use id_column and fp_column to specify the columns containing the record identifier and fingerprint. By default the identifiers come from column 1 (the first column) and the fingerprints from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line.
Use decoder to describe how to decode the fingerprint. This is either a named decoder (see chemfp.encodings.get_decoder()) or a function which takes a string and returns a 2-element tuple of the number of bits and byte-string fingerprint. The number of bits may be None if the fingerprint size can be inferred from the fingerprint length. When decoder_name is not None, use it as the decoder name during error reporting, otherwise use decoder if it is a string.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
- Parameters:
source (a filename, file object, or None to read from stdin) – the CSV source
id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
fp_column (integer position (starting from 1), string) – the column position or column title containing the fingerprint
decoder (a string or a function) – the decoder name or function used to parse the fingerprint record
decoder_name (a string, or None) – a label for the decoder text to use during error reporting
dialect (None, a string name, or a Dialect instance) – the CSV dialect
has_header (bool) – True if the first record contains titles, False if it does not
compression (string or None) – file compression format
csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string) – the name of the file’s character encoding
encoding_errors (string) – the method used to handle decoding errors
- Returns:
a
CSVFingerprintReader
iterating (id, fingerprint) pairs
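The decoder contract described above (a function that takes a string and returns a 2-element tuple of the number of bits and the fingerprint bytes) can be sketched with the standard library. Everything here is hypothetical and for illustration only: `decode_hex` stands in for a decoder function, and `iter_ids_and_fingerprints` is a simplified reader that ignores dialects, compression, and error handling.

```python
import csv
import io


def decode_hex(text):
    """Hypothetical decoder: hex string to (num_bits, fingerprint bytes).

    The bit count is returned as None, leaving the size to be inferred
    from the length of the byte string.
    """
    return None, bytes.fromhex(text)


def iter_ids_and_fingerprints(infile, id_column=1, fp_column=2,
                              decoder=decode_hex, has_header=True):
    """Yield (id, fingerprint) pairs; columns are numbered from 1."""
    reader = csv.reader(infile)
    if has_header:
        next(reader, None)  # skip the title record
    for row in reader:
        num_bits, fingerprint = decoder(row[fp_column - 1])
        yield row[id_column - 1], fingerprint


source = io.StringIO("id,fp\nA,00ff\nB,1234\n")
for record_id, fingerprint in iter_ids_and_fingerprints(source):
    print(record_id, fingerprint.hex())
```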
- chemfp.csv_readers.read_csv_ids_and_molecules_with_parser(source: _typing.Source, parse_id_and_mol: _typing.IdAndMolParser, *, id_column: int = 1, mol_column: int = 2, dialect: _typing.Optional[DialectType] = None, has_header: bool = True, compression: _typing.Literal['auto', 'gz', 'zst', ''] = 'auto', csv_errors: _typing.ErrorsNames = 'strict', location: _typing.OptionalLocation = None, encoding: str = 'utf8', encoding_errors: str = 'strict', record_format: _Optional[str] = None, record_args: _Optional[_typing.ReaderArgs] = None)¶
Read ids and molecules from column(s) of a CSV file using a molecule parser function.
Read from source, which may be a filename, a file-like object, or None to read from stdin.
The required parse_id_and_mol is a function which parses the molecule record (as a string) and returns the 2-element tuple containing the record id and molecule. The id is only used if id_column is None. If the molecule is None then the record will be skipped. If the parser raises a ParseError exception then the current location will be attached to the exception, which is then re-raised. The toolkit make_id_and_molecule_parser() function returns an appropriate function.
Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
The record_format and record_args are used to set the “record_format” and “args” values of the returned reader’s FormatMetadata metadata object.
- Parameters:
source (a filename, file object, or None to read from stdin) – the CSV source
parse_id_and_mol (it must take a string and return an (id, molecule) pair) – the function used to parse a molecule record
id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record
dialect (None, a string name, or a Dialect instance) – the CSV dialect
has_header (bool) – True if the first record contains titles, False if it does not
compression (string or None) – file compression format
csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string) – the name of the file’s character encoding
encoding_errors (string) – the method used to handle decoding errors
record_format (string) – the molecular structure format name
record_args (None or a dictionary of format reader or writer args) – the arguments for the molecular structure format
- Returns:
a
CSVIdAndMoleculeReader
iterating (id, molecule) pairs
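The rules for parse_id_and_mol described above (the parsed id is used only when id_column is None, and a molecule of None skips the record) can be shown with a toy example. This sketch uses only the standard library: `toy_parse_id_and_mol` is a hypothetical parser whose "molecule" is just the raw SMILES-like string, and `iter_ids_and_molecules` is a simplified stand-in for the reader, not the chemfp implementation.

```python
import csv
import io


def toy_parse_id_and_mol(record):
    """Hypothetical parser: 'CCO ethanol' becomes ('ethanol', 'CCO').

    A real parser would return a toolkit molecule object; here the
    molecule is the first whitespace-separated field, kept as a string.
    """
    fields = record.split(None, 1)
    if not fields:
        return None, None  # nothing to parse, so no molecule
    record_id = fields[1] if len(fields) > 1 else None
    return record_id, fields[0]


def iter_ids_and_molecules(infile, parse_id_and_mol, id_column=1,
                           mol_column=2, has_header=True):
    """Yield (id, molecule) pairs; columns are numbered from 1."""
    reader = csv.reader(infile)
    if has_header:
        next(reader, None)  # skip the title record
    for row in reader:
        parsed_id, mol = parse_id_and_mol(row[mol_column - 1])
        if mol is None:
            continue  # skip records that could not be parsed
        # Only fall back to the parsed id when no id column is given.
        record_id = parsed_id if id_column is None else row[id_column - 1]
        yield record_id, mol


text = "id,smiles\nX1,CCO ethanol\nX2,\nX3,c1ccccc1 benzene\n"
print(list(iter_ids_and_molecules(io.StringIO(text), toy_parse_id_and_mol)))
print(list(iter_ids_and_molecules(io.StringIO(text), toy_parse_id_and_mol,
                                  id_column=None)))
```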
- chemfp.csv_readers.read_csv_rows(source=None, dialect=None, has_header=True, compression='auto', location=None, encoding='utf8', encoding_errors='strict')¶
Read rows from a CSV file
Read from source, which may be a filename, a file-like object, or None (the default) to read from stdin.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
- Parameters:
source (a filename, file object, or None to read from stdin) – the CSV source
dialect (None, a string name, or a Dialect instance) – the CSV dialect
has_header (bool) – True if the first record contains titles, False if it does not
compression (string or None) – file compression format
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string) – the name of the file’s character encoding
encoding_errors (string) – the method used to handle decoding errors
- Returns:
a
CSVRowReader
iterating rows
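The extension-based defaults described above (dialect=None and compression="auto") amount to a small filename inspection. The sketch below shows one plausible way to do that; `infer_format` is a hypothetical helper, and returning None for an unrecognized extension is an assumption of this example, since the text above does not say what chemfp does in that case.

```python
def infer_format(filename):
    """Return (dialect_name, compression) guessed from a filename.

    Hypothetical helper for illustration; not a chemfp function.
    The compression is "gz", "zst", or "" for no compression. The
    dialect name is "csv", "tsv", or None if it is not recognized.
    """
    name = filename.lower()
    compression = ""
    for extension in (".gz", ".zst"):
        if name.endswith(extension):
            compression = extension[1:]
            name = name[:-len(extension)]  # look at the inner extension
            break
    if name.endswith(".tsv"):
        return "tsv", compression
    if name.endswith(".csv"):
        return "csv", compression
    return None, compression


for filename in ("data.csv", "data.tsv.gz", "DATA.CSV.zst", "notes.txt"):
    print(filename, infer_format(filename))
```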