chemfp.io module¶
Helper module for working with file I/O.
Only the Location class is part of the public API and meant to be used directly. The public API also returns ProgressBar objects, which are therefore part of the public API as well.
The other parts of this module are not part of the public API. (Let me know if anything else should be part of the public API.)
- class chemfp.io.ErrorHandler¶
Bases:
object
- class chemfp.io.Location(filename=None)¶
Bases:
object
Get location and other internal reader and writer state information
A Location instance gives a way to access information like the current record number, line number, and molecule object.
>>> import chemfp
>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166",
...             "ChEBI_lite.sdf.gz", id_tag="ChEBI ID") as reader:
...     for id, fp in reader:
...         if id == "CHEBI:3499":
...             print("Record starts at line", reader.location.lineno)
...             print("Record byte range:", reader.location.offsets)
...             print("Number of atoms:", reader.location.mol.GetNumAtoms())
...             break
...
[08:18:12] S group MUL ignored on line 103
Record starts at line 3599
Record byte range: (138171, 141791)
Number of atoms: 36
Most of the readers and writers do not support all of the properties. Unsupported properties return None. The filename is a read/write attribute and the other attributes are read-only.
If you don’t pass a location to the readers and writers then they will create a new one based on the source or destination, respectively. You can also pass in your own Location, created with Location(filename) if you have an actual filename, or with Location.from_source() or Location.from_destination() if you have a more generic source or destination.
- obj¶
A user-specified object connected with this location and any of its copies. You may assign or modify this value at any time.
- __init__(filename=None)¶
Use filename as the location’s filename
- property bytes_read: int | None¶
The number of bytes read to this point.
This is only valid in the location attached to a UnicodeDecodeError, which currently only occurs during CSV processing. It can be used with the error’s start, end, and object to get the error location relative to the entire file.
- property end_position: int | None¶
The (expected) end position in the file
The end_position, along with position, is meant for progress information and is not necessarily the number of bytes in the current file.
For input files this is typically the file size, if available. Note that the actual file size may change while the file is being processed, in which case end_position will be invalid.
The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.
- filename¶
A string describing the source or destination, or None (read/write)
- property first_line: str | None¶
The first line of the current record, as a string.
The newline and any preceding carriage return characters are not included.
The first line is decoded as UTF-8, with unknown characters replaced with ‘?’. Use first_line_bytes if this is not appropriate.
- property first_line_bytes: bytes | None¶
The first line of the current record, as a byte string.
The newline and any preceding carriage return characters are not included.
If you want a text/Unicode string and the input record is UTF-8 encoded then use first_line.
- classmethod from_destination(destination: str | bytes | Path | None | BinaryIO) Location ¶
Create a Location instance based on the destination
If destination is a string then it’s used as the filename. If destination is None then the location filename is “<stdout>”. If destination is a file object then its name attribute is used as the filename, or None if there is no such attribute.
- classmethod from_source(source: str | bytes | Path | None | BinaryIO) Location ¶
Create a Location instance based on the source
If source is a string then it’s used as the filename. If source is None then the location filename is “<stdin>”. If source is a file object then its name attribute is used as the filename, or None if there is no such attribute.
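The filename rule shared by from_source() and from_destination() can be sketched in plain Python. This only illustrates the rule as described above and is not chemfp's code. The helper name guess_filename is hypothetical, and using os.fsdecode() for bytes and Path values is an assumption, since the text only describes the string case.

```python
import io
import os


def guess_filename(obj, default):
    """Mimic the documented filename rule (hypothetical helper, not chemfp code)."""
    if obj is None:
        # None means a standard stream: "<stdin>" or "<stdout>"
        return default
    if isinstance(obj, (str, bytes, os.PathLike)):
        # Assumption: bytes and Path values are converted to a str filename
        return os.fsdecode(obj)
    # File-like object: use its "name" attribute, or None if it has none
    return getattr(obj, "name", None)


print(guess_filename("ChEBI_lite.sdf.gz", "<stdin>"))
print(guess_filename(None, "<stdin>"))
print(guess_filename(None, "<stdout>"))
print(guess_filename(io.BytesIO(b""), "<stdin>"))  # BytesIO has no name
```

With chemfp itself the equivalent calls would be Location.from_source(source) and Location.from_destination(destination).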
- property lineno: int | None¶
The current line number, starting from 1.
This will be 0 if file processing has not yet started.
- property mol: Any | None¶
The molecule object for the current record.
- property offsets: Tuple[int, int] | None¶
The (start, end) byte offsets, starting from 0.
start is the record start byte position and end is one byte past the last byte of the record.
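Because end is one byte past the last byte of the record, the pair works directly as a half-open Python slice. The sketch below uses a small in-memory byte string as a stand-in for the file content. The helper name extract_record is hypothetical, and the sketch assumes the offsets refer to the uncompressed data.

```python
def extract_record(data: bytes, offsets):
    """Return the bytes of one record given its (start, end) offsets."""
    start, end = offsets
    # 'end' is one past the last byte, so a normal half-open slice works
    return data[start:end]


# Three 1-line SMILES records standing in for an uncompressed input file
content = b"C methane\nCC ethane\nCCC propane\n"

# Offsets of the second record, in the form reported by location.offsets
offsets = (10, 20)
record = extract_record(content, offsets)
print(record)
print("record size in bytes:", offsets[1] - offsets[0])
```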
- property output_recno: int | None¶
The number of records actually written to the file or string.
The value recno - output_recno is the number of records which were sent to the writer but had an error and could not be written to the output.
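As a small illustration of that subtraction, the sketch below reports how many records failed to be written. A SimpleNamespace stands in for a writer's location so the example runs without chemfp, and the helper name count_failed_records is hypothetical.

```python
from types import SimpleNamespace


def count_failed_records(location):
    """Return recno - output_recno, or None if either value is unsupported."""
    if location.recno is None or location.output_recno is None:
        # Unsupported properties are reported as None
        return None
    return location.recno - location.output_recno


# Stand-in for writer.location after sending 1000 records, 3 of which failed
location = SimpleNamespace(recno=1000, output_recno=997)
failed = count_failed_records(location)
print("records not written:", failed)
```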
- property position: int | None¶
The (approximate) current position in the file.
The position, along with end_position, should only be used for progress information. They do not correspond to the start or end position of the current record, nor to the sum of the record sizes.
For example, for compressed files this may be the location of the end of the most recently read compressed data.
Several records may have the same position, and position may equal end_position even if more records are available.
The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.
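Since these values are only meant for progress reporting, one cautious way to turn them into a completion fraction is sketched below. The helper name progress_fraction is hypothetical. Clamping the result to the range 0 to 1 is a design choice of this sketch, made because the end position may be stale if the file changes during processing.

```python
def progress_fraction(start_position, position, end_position):
    """Return an approximate completion fraction in [0.0, 1.0], or None."""
    if position is None or end_position is None:
        # Not every reader or writer supports these properties
        return None
    if start_position is None:
        start_position = 0
    span = end_position - start_position
    if span <= 0:
        return None
    fraction = (position - start_position) / span
    # Clamp, since end_position is only the expected end
    return min(max(fraction, 0.0), 1.0)


print(progress_fraction(0, 250, 1000))
print(progress_fraction(0, 1200, 1000))
print(progress_fraction(0, None, 1000))
```

A fraction of 1.0 does not guarantee that the reader is finished, because position may equal end_position while records are still available.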
- property position_units: str | None¶
The units used to measure position and end_position
This is typically “bytes” for input readers and “records” for output writers.
- progress_bar(**kwargs) ProgressBar ¶
Return a ProgressBar given the location and kwargs values.
The ProgressBar is a wrapper around tqdm. If the kwarg progress is True (the default) then the progress bar will be displayed. If False then a progress bar will not be used. If it is callable, then it will be called instead of the tqdm constructor. It must understand the tqdm constructor parameters.
The remaining kwargs are passed to the tqdm constructor. If unspecified, the progress_bar() method will add some kwargs depending on what is available from the Location.
If desc is not specified, then a default is created based on the optional file_count kwarg followed by the location’s filename.
A ProgressBar only shows the progress for a single file. When processing multiple files, use file_count to add information to the default desc about how many files have been processed and how many are going to be processed.
If file_count is not None then it must be a 2-element tuple containing the current file index (starting at 0) and the number of files to process. The number of files to process may be None if that number is not known. The file_count (1, 5) results in the description “(2/5)” and the file_count (1, None) results in the description “(2)”.
The progress bar will show one of several types of progress, depending on what information is available from the location.
If it’s possible to get a position (typically the byte position in the input file or the compressed input file) then that is used. The tqdm unit is set if the location has a position_units and, if position_units is “bytes”, then tqdm’s unit_scale is set to 1. The tqdm initial is set to the current position.
Otherwise, if it’s possible to get a recno then it is used for the progress information, with the tqdm unit set to “recs” and the tqdm initial set to the current record number (which is likely 0).
Otherwise, the tqdm unit will be “it”, which is the tqdm default.
The tqdm total is set to the location’s end_position, if available. This must only be available when it’s possible to get the current position.
The tqdm dynamic_ncols is set to True, to allow terminal resizing to update the tqdm output.
Again, if you set the values in the kwargs then they will not be overwritten but will be passed directly to the tqdm constructor (or to the progress callable).
- Parameters:
progress (boolean or callable) – True to show a progress bar, False to not show one, or a callable to use instead of the tqdm constructor
file_count (None, or (i, None), or (i, N)) – optional file index and number of files, used as the start of ‘desc’
**kwargs – kwargs passed to the tqdm constructor or progress callable.
- Returns:
a ProgressBar
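The defaulting rules for progress_bar() can be summarized in a short sketch. This is not chemfp's implementation. The function names file_count_prefix and default_tqdm_kwargs are hypothetical, and the sketch assumes the tqdm unit is set to the position_units string itself. Note that the tqdm keyword is spelled unit.

```python
def file_count_prefix(file_count):
    """Build the "(i/N)" or "(i)" text used at the start of the default desc."""
    if file_count is None:
        return ""
    index, total = file_count
    # The index starts at 0 but is displayed starting at 1
    if total is None:
        return f"({index + 1})"
    return f"({index + 1}/{total})"


def default_tqdm_kwargs(position=None, position_units=None,
                        end_position=None, recno=None, **user_kwargs):
    """Pick tqdm kwargs from the available location information."""
    kwargs = {"dynamic_ncols": True}
    if position is not None:
        # Position-based progress takes priority
        if position_units is not None:
            kwargs["unit"] = position_units  # assumption: used verbatim
            if position_units == "bytes":
                kwargs["unit_scale"] = 1
        kwargs["initial"] = position
        if end_position is not None:
            kwargs["total"] = end_position
    elif recno is not None:
        # Fall back to record-based progress
        kwargs["unit"] = "recs"
        kwargs["initial"] = recno
    # Otherwise tqdm's own defaults apply. User-supplied values always win.
    kwargs.update(user_kwargs)
    return kwargs


print(file_count_prefix((1, 5)))
print(file_count_prefix((1, None)))
print(default_tqdm_kwargs(position=0, position_units="bytes", end_position=1000))
print(default_tqdm_kwargs(recno=0, unit="mols"))
```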
- property recno: int | None¶
The current record number
For writers this is the number of records sent to the writer, and output_recno is the number of records successfully written to the file or string.
- property record: str | bytes | None¶
The current record as an uncompressed byte or text string.
- property record_format: str | None¶
The name of the record format.
This is a string like “smi”, “fps”, or “sdf”, without any compression suffix.
- property row: Sequence[str] | None¶
The row columns, when reading a CSV file.
- property start_position: int | None¶
The position when the location was attached to the file
The start_position and end_position, along with position, are meant for progress information and are not necessarily the number of bytes in the current file.
For input files this is typically 0.
The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.
- where() str ¶
Return a human readable description about the current reader or writer state.
The description will contain the filename, line number, record number, and up to the first 40 characters of the first line of the record, if those properties are available.
- class chemfp.io.ProgressBar(tqdm, start_position: int | None, get_position: Callable[[], int] | None)¶
Bases:
object
Use a tqdm progress bar to track location progress.
The progress shown depends on the capabilities of the Location. If position information is available, use it, along with optional units and end position information. Otherwise, display the record or iteration progress.
See Location.progress_bar() for details about how to configure and create a ProgressBar. This class should not be called directly.
The public attributes are:
- start_position: int¶
The start position.
- current_position: int¶
The current position, updated after each call to ProgressBar.update().
- tqdm¶
The underlying tqdm or other progress bar.
- __call__(it)¶
Wrap an iterator and update the progress bar after yielding each item.
Example of use:
    with location.progress_bar() as progress_bar:
        writer.write_fingerprints(progress_bar(reader))
- __enter__()¶
Context manager to close the progress bar upon completion.
- __exit__(type, value, traceback)¶
Close the progress bar when the context ends.
- close()¶
Close the progress bar.
- update(ignored: Any = 0)¶
Update the progress bar to the current progress.
Use this to update the progress bar manually when not being used as an iterator wrapper.
Example of use:
    with location.progress_bar() as progress_bar:
        for i, mol in enumerate(mol_reader):
            if i % 100 == 0:  # only update after every 100 molecules
                progress_bar.update()
Note: The function takes an optional ‘ignored’ parameter for API compatibility with the tqdm progress bar update() method. The value is ignored.
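To make the interplay of start_position, get_position, update(), and the iterator wrapper concrete, here is a minimal stand-alone sketch of a position-driven progress wrapper. It is not the chemfp ProgressBar. The class names PositionTracker and CountingBar are hypothetical, and CountingBar only imitates the update(n) and close() methods of a tqdm bar so that the example runs without tqdm installed.

```python
class CountingBar:
    """Tiny stand-in for a tqdm bar: counts progress instead of drawing it."""
    def __init__(self):
        self.n = 0
        self.closed = False

    def update(self, n=1):
        self.n += n

    def close(self):
        self.closed = True


class PositionTracker:
    """Advance a bar to whatever position get_position() currently reports."""
    def __init__(self, bar, start_position, get_position):
        self.tqdm = bar
        self.start_position = start_position
        self.current_position = start_position
        self._get_position = get_position

    def update(self, ignored=0):
        # The argument is ignored; progress comes from get_position()
        new_position = self._get_position()
        delta = new_position - self.current_position
        if delta > 0:
            self.tqdm.update(delta)
        self.current_position = new_position

    def __call__(self, it):
        for item in it:
            yield item
            self.update()  # runs when the consumer asks for the next item

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.tqdm.close()


# Simulated reader whose byte position advances as records are read
records = [b"abc\n", b"de\n", b"fghij\n"]
state = {"pos": 0}

def read_records():
    for rec in records:
        state["pos"] += len(rec)
        yield rec

bar = CountingBar()
with PositionTracker(bar, 0, lambda: state["pos"]) as tracker:
    seen = list(tracker(read_records()))

print(bar.n, tracker.current_position, bar.closed)
```

Driving the bar from a position callback rather than from an item count lets several records share one position, which matches the behavior described for compressed input.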