chemfp.text_toolkit module¶
Methods to work with SD and SMILES files as records rather than molecules.
The text_toolkit implements the chemfp toolkit API, but its “molecules” are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations.
The TextRecord is a base class. The actual records depend on the format, and will be one of:
The text toolkit will let you “convert” between the different SMILES formats, but it doesn’t actually change the SMILES string. The SMILES records have the attributes id, record and smiles.
The toolkit also knows a bit about the SD format. The SDF records have the attributes id, id_bytes and record, and there are methods to get SD tag values and add a tag to the end of the tag data block.
The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord.
The record types also have the attributes encoding and encoding_errors which affect how the record bytes are parsed.
- chemfp.text_toolkit.add_sdf_tag(sdf_record, tag, value)¶
Add an SD tag value to an SD record string
This will append the new tag and value to the end of the tag data block in the sdf_record string.
- Parameters:
sdf_record (string) – an SD record
tag (string) – a tag name
value (string) – the new tag value
- Returns:
a new SD record string with the new tag and value
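The append behaviour described above can be sketched in plain Python. This is not chemfp's implementation: the helper name append_sd_tag is hypothetical, and the sketch assumes the record is a Unicode string that ends with a "$$$$" terminator line.

```python
def append_sd_tag(sdf_record, tag, value):
    """Return a copy of sdf_record with a new data item placed just
    before the '$$$$' terminator line (hypothetical helper)."""
    lines = sdf_record.splitlines(keepends=True)
    # Search backwards for the record terminator.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip("\r\n") == "$$$$":
            break
    else:
        raise ValueError("record has no '$$$$' terminator line")
    # An SD data item is a '> <tag>' header, the value, then a blank line.
    new_item = f"> <{tag}>\n{value}\n\n"
    return "".join(lines[:i]) + new_item + "".join(lines[i:])


record = (
    "methane\n"
    "  example\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <MW>\n"
    "16.04\n"
    "\n"
    "$$$$\n"
)
updated = append_sd_tag(record, "SOURCE", "test")
print(updated)
```

Because the result is a new string, the input record is left unchanged, which matches the "returns a new SD record string" description.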
- chemfp.text_toolkit.add_tag(mol, tag, value)¶
Add an SD tag value to the TextRecord
If the mol is in “sdf” format then this will modify mol.record to append the new tag and value to the end of the tag block. The other tags will not be modified, including tags with the same tag name.
- Parameters:
mol (a TextRecord) – the text record
tag (string) – the SD tag name
value (string) – the text for the tag
- Returns:
None
- chemfp.text_toolkit.copy_molecule(mol)¶
Return a new TextRecord which is a copy of the given TextRecord
- Parameters:
mol (a TextRecord) – the text record
- Returns:
a new TextRecord
- chemfp.text_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors='strict', level=None)¶
Convert a TextRecord into a structure record in the given format as a byte string
If id is not None then use it instead of the molecule’s own id.
- Parameters:
mol (a TextRecord) – the molecule to use for the output
format (a format name string, or Format object) – the output structure format
id (a string, or None to use the molecule's own id) – an alternate record id
writer_args (a dictionary) – writer arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
- Returns:
a byte string
- chemfp.text_toolkit.create_sdf(mol: Any, *, id: str | None = None, errors: str = 'strict') → str | None¶
Generate an SDF record from a TextRecord instance
This is equivalent to calling:
create_string(mol, "sdf", id=id, writer_args={...}, errors=errors)
- Parameters:
mol (a TextRecord instance) – a molecule object
id (None or a string (default: None)) – an alternate identifier for the output record, if relevant
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a string, or None if errors are ignored
- chemfp.text_toolkit.create_smi(mol: Any, *, id: str | None = None, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict') → str | None¶
Generate a SMILES string and its id from a TextRecord instance
This is equivalent to calling:
create_string(mol, "smi", id=id, writer_args={...}, errors=errors)
- Parameters:
mol (a TextRecord instance) – a molecule object
id (None or a string (default: None)) – an alternate identifier for the output record, if relevant
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id
cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a string, or None if errors are ignored
- chemfp.text_toolkit.create_smiles(mol: Any, *, id: str | None = None, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict') → str | None¶
Generate a SMILES string and its id from a TextRecord instance
This is equivalent to calling:
create_string(mol, "smi", id=id, writer_args={...}, errors=errors)
- Parameters:
mol (a TextRecord instance) – a molecule object
id (None or a string (default: None)) – an alternate identifier for the output record, if relevant
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id
cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a string, or None if errors are ignored
- chemfp.text_toolkit.create_smistring(mol: Any, *, id: str | None = None, cxsmiles: bool = False, errors: str = 'strict') → str | None¶
Generate a SMILES string from a TextRecord instance
This is equivalent to calling:
create_string(mol, "smistring", id=id, writer_args={...}, errors=errors)
- Parameters:
mol (a TextRecord instance) – a molecule object
id (None or a string (default: None)) – an alternate identifier for the output record, if relevant
cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a string, or None if errors are ignored
- chemfp.text_toolkit.create_string(mol, format, id=None, writer_args=None, errors='strict')¶
Convert a TextRecord into a structure record in the given format as a Unicode string
If id is not None then use it instead of the molecule’s own id.
- Parameters:
mol (a TextRecord) – the molecule to use for the output
format (a format name string, or Format object) – the output structure format
id (a string, or None to use the molecule's own id) – an alternate record id
writer_args (a dictionary) – writer arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
- Returns:
a Unicode string
- chemfp.text_toolkit.get_format(format_name)¶
Get the named format, or raise a ValueError
This will raise a ValueError for unknown format names.
- Parameters:
format_name – the format name
- Value format_name:
a string
- Returns:
a chemfp.base_toolkit.Format object
- chemfp.text_toolkit.get_formats(include_unavailable=False)¶
Get the list of structure formats that chemfp’s text toolkit supports
This version of chemfp will always support the structure formats available to chemfp so ‘include_unavailable’ does not affect anything. (It may affect other toolkits.)
- Parameters:
include_unavailable – include unavailable formats?
- Value include_unavailable:
True or False
- Returns:
a list of chemfp.base_toolkit.Format objects
- chemfp.text_toolkit.get_id(mol)¶
Get the molecule’s id from the TextRecord’s id field
This is a toolkit-portable way to get mol.id.
- Parameters:
mol (a TextRecord) – the molecule
- Returns:
a string
- chemfp.text_toolkit.get_input_format(format_name)¶
Get the named input format, or raise a ValueError
This will raise a ValueError for unknown format names or if that format is not an input format.
- Parameters:
format_name – the format name
- Value format_name:
a string
- Returns:
a chemfp.base_toolkit.Format object
- chemfp.text_toolkit.get_input_format_from_source(source=None, format=None)¶
Get the most appropriate format given the available source and format information
If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
- Parameters:
source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
format (A Format(-like) object, string, or None) – Format information, if known.
- Returns:
a chemfp.base_toolkit.Format object
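The resolution order described above (explicit format first, then the source name, then an uncompressed SMILES fallback) can be sketched as follows. This is an illustration and not chemfp's code: GuessedFormat, guess_format, and the extension table are hypothetical, and the sketch ignores file objects, which the real function may inspect.

```python
import os
from typing import NamedTuple


class GuessedFormat(NamedTuple):
    """Hypothetical stand-in for an object with name and compression."""
    name: str
    compression: str


# Assumed extension table, for illustration only.
_EXTENSIONS = {"sdf": "sdf", "smi": "smi", "can": "can", "usm": "usm"}
_COMPRESSIONS = {"gz", "zst"}


def guess_format(source=None, format=None):
    """Pick a (name, compression) pair from a format string or filename."""
    from_filename = format is None
    if format is not None:
        text = format                      # an explicit format string wins
    elif isinstance(source, str):
        text = os.path.basename(source)    # otherwise look at the filename
    else:
        return GuessedFormat("smi", "")    # nothing to inspect
    parts = text.lower().split(".")
    compression = ""
    if len(parts) > 1 and parts[-1] in _COMPRESSIONS:
        compression = parts.pop()
    name = _EXTENSIONS.get(parts[-1])
    if name is None:
        if from_filename:
            # Auto-detection failed: assume uncompressed SMILES.
            return GuessedFormat("smi", "")
        raise ValueError(f"unknown format name: {format!r}")
    return GuessedFormat(name, compression)


print(guess_format("dataset.sdf.gz"))
print(guess_format("notes.txt"))
print(guess_format("ignored.sdf", format="smi.zst"))
```

The sketch raises ValueError only for an explicit but unknown format string; a filename it cannot interpret falls back silently, as the text describes.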
- chemfp.text_toolkit.get_input_formats()¶
Get the list of supported chemfp text toolkit input formats
- Returns:
a list of chemfp.base_toolkit.Format objects
- chemfp.text_toolkit.get_output_format(format_name)¶
Get the named output format, or raise a ValueError
This will raise a ValueError for unknown format names or if that format is not an output format.
- Parameters:
format_name – the format name
- Value format_name:
a string
- Returns:
a chemfp.base_toolkit.Format object
- chemfp.text_toolkit.get_output_format_from_destination(destination=None, format=None)¶
Get the most appropriate format given the available destination and format information
If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
- Parameters:
destination (A filename (as a string), a file object, or None to write to stdout) – The structure data destination.
format (A Format(-like) object, string, or None) – format information, if known.
- Returns:
a chemfp.base_toolkit.Format object
- chemfp.text_toolkit.get_output_formats()¶
Get the list of supported chemfp text toolkit output formats
- Returns:
a list of chemfp.base_toolkit.Format objects
- chemfp.text_toolkit.get_sdf_id(sdf_record)¶
Return the id for the SDF record string
The id is the first line of the sdf_record. A future version of this function may support an id_tag parameter. Let me know if that would be useful.
The returned id string will have the same type as the input sdf_record.
- Parameters:
sdf_record (string) – an SD record
- Returns:
the first line of the SD record
- chemfp.text_toolkit.get_sdf_tag(sdf_record, tag)¶
Return the value for a named tag in an SDF record string
Get the value for the tag named tag from the string sdf_record containing an SD record.
- Parameters:
sdf_record (string) – an SD record
tag (string) – a tag name
- Returns:
the corresponding tag value as a string, or None
- chemfp.text_toolkit.get_sdf_tag_pairs(sdf_record)¶
Return the (tag, value) entries in the SDF record string
Parse the sdf_record and return the tag data as a list of (tag, value) pairs. The type of the returned strings will be the same as the type of the input sdf_record string.
- Parameters:
sdf_record (string) – an SDF record
- Returns:
a list of (tag, value) pairs
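To make the record layout behind get_sdf_id(), get_sdf_tag() and get_sdf_tag_pairs() concrete, here is a minimal pure-Python sketch. It is not chemfp's parser: the names sd_id and sd_tag_pairs are hypothetical, and the sketch assumes a Unicode record with an "M  END" line followed by "> <tag>" data items.

```python
import re

# A data header line looks like '> <TAG>'; capture the tag name.
_TAG_LINE = re.compile(r"^>.*?<([^<>]+)>")


def sd_id(sdf_record):
    """Return the first line of the record, which holds the title/id."""
    return sdf_record.split("\n", 1)[0].rstrip("\r")


def sd_tag_pairs(sdf_record):
    """Return the data items after 'M  END' as a list of (tag, value)."""
    lines = sdf_record.splitlines()
    start = next(
        (n + 1 for n, line in enumerate(lines) if line.startswith("M  END")),
        None,
    )
    if start is None:
        return []
    pairs = []
    i = start
    while i < len(lines):
        match = _TAG_LINE.match(lines[i])
        i += 1
        if match is None:
            continue
        # The value runs until a blank line or the record terminator.
        value_lines = []
        while i < len(lines) and lines[i] not in ("", "$$$$"):
            value_lines.append(lines[i])
            i += 1
        pairs.append((match.group(1), "\n".join(value_lines)))
    return pairs


record = (
    "methane\n  example\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <MW>\n16.04\n\n"
    "> <NOTE>\nline one\nline two\n\n"
    "$$$$\n"
)
print(sd_id(record))
print(sd_tag_pairs(record))
# A single-tag lookup returns None when the tag is absent.
print(dict(sd_tag_pairs(record)).get("MISSING"))
```

Note that a dict lookup keeps only the last value for a repeated tag name; which occurrence chemfp's get_sdf_tag() returns is not stated here, so the lookup line is only an approximation.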
- chemfp.text_toolkit.get_tag(mol, tag)¶
Get the named SD tag value, or None if it doesn’t exist
If the mol is in “sdf” format then this will return the corresponding tag value from mol.record, or None if the tag does not exist.
If the record is in any other format then it will return None.
- Parameters:
mol (a TextRecord) – the molecule
tag (string) – the SD tag name
- Returns:
a string, or None
- chemfp.text_toolkit.get_tag_pairs(mol)¶
Get a list of all SD tag (name, value) pairs for the TextRecord
If the mol is in “sdf” format then this will return the list of (tag, value) pairs in mol.record, where the tag and value are strings.
If the record is in any other format then it will return an empty list.
- Parameters:
mol (a TextRecord) – the molecule
- Returns:
a list of (tag name, tag value) pairs
- chemfp.text_toolkit.is_licensed()¶
Return True - chemfp’s text toolkit is always licensed
- Returns:
True
- chemfp.text_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors='strict')¶
Create a specialized function which takes a record and returns an (id, TextRecord) pair
The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and chemfp.text_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)
See chemfp.text_toolkit.read_molecules() for details about the other parameters. The specific TextRecord subclass returned depends on the format.
- Parameters:
format (a format name string, or Format object) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
- Returns:
a function of the form
parser(record string) -> (id, text_record)
- chemfp.text_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', level=None)¶
Return a MoleculeWriter which can write TextRecord instances to a destination.
A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write a TextRecord, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file.
TextRecords are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.
That said, the text toolkit doesn’t know how to convert between SMILES and SDF formats, and will raise an exception if you try.
The writer_args is only used for the “smi”, “can”, and “usm” output formats. The only supported parameter is:
* delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
- Parameters:
destination (a filename, file object, or None to write to stdout) – the structure destination
format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
writer_args (a dictionary) – writer arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track writer state information
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
- Returns:
a chemfp.base_toolkit.MoleculeWriter expecting TextRecord instances
- chemfp.text_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors='strict', location=None, level=None)¶
Return a MoleculeStringWriter which can write TextRecord instances to a string.
See chemfp.text_toolkit.open_molecule_writer() for full parameter details.
Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.
- Parameters:
format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
writer_args (a dictionary) – writer arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track writer state information
level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
- Returns:
a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances
- chemfp.text_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors='strict', location=None)¶
Return a MoleculeStringWriter which can write TextRecord instances to a string.
See chemfp.text_toolkit.open_molecule_writer() for full parameter details.
Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.
- Parameters:
format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
writer_args (a dictionary) – writer arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track writer state information
- Returns:
a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances
- chemfp.text_toolkit.open_sdf_writer(destination: None | str | BinaryIO, *, errors: str = 'strict')¶
Open an SDF file to write instances of TextRecord
This is mostly equivalent to calling:
open_molecule_writer(destination, "sdf", writer_args={...}, errors=errors)
along with compression based on the destination filename’s extension.
- Parameters:
destination (None, a filename string, or a file-like object) – where to write the molecules
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord
- chemfp.text_toolkit.open_sdf_writer_to_string(*, errors: str = 'strict')¶
Open an SDF file to write instances of TextRecord to an in-memory string
This is equivalent to calling:
open_molecule_writer_to_string("sdf", writer_args={...}, errors=errors)
Use write_molecules_to_string() to write compressed output.
- Parameters:
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord
- chemfp.text_toolkit.open_smi_writer(destination: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict')¶
Open a SMILES file to write instances of TextRecord
This is mostly equivalent to calling:
open_molecule_writer(destination, "smi", writer_args={...}, errors=errors)
along with compression based on the destination filename’s extension.
- Parameters:
destination (None, a filename string, or a file-like object) – where to write the molecules
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id
cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord
- chemfp.text_toolkit.open_smi_writer_to_string(*, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict')¶
Open a SMILES file to write instances of TextRecord to an in-memory string
This is equivalent to calling:
open_molecule_writer_to_string("smi", writer_args={...}, errors=errors)
Use write_molecules_to_string() to write compressed output.
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id
cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord
- chemfp.text_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors='strict')¶
Parse the first structure record from content and return the (id, TextRecord) pair.
content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_molecule() if you just want the TextRecord and not the (id, TextRecord) pair.
- Parameters:
content (a string) – the string containing a structure record
format (a format name string, or Format object) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
an (id, TextRecord molecule) pair
- chemfp.text_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')¶
Parse the first structure record from the content string and return a TextRecord.
content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_id_and_molecule() if you want the (id, TextRecord) pair instead of just the text record.
- Parameters:
content (a string) – the string containing a structure record
format (a format name string, or Format object) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
a TextRecord
- chemfp.text_toolkit.parse_sdf(content: str | bytes, *, errors: str = 'strict')¶
Parse an SDF record using the text toolkit
This is equivalent to calling:
parse_molecule(content, "sdf", reader_args={...}, errors=errors)
- Parameters:
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a TextRecord instance
- chemfp.text_toolkit.parse_smi(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Parse a SMILES string and its id using the text toolkit
This is equivalent to calling:
parse_molecule(content, "smi", reader_args={...}, errors=errors)
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a TextRecord instance
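The delimiter option decides how a SMILES line is split into the SMILES and the id. The sketch below shows one plausible reading of four of the delimiter names. The exact semantics are an assumption on my part and are not spelled out in this section: here "to-eol" takes the first whitespace-delimited term as the SMILES and the rest of the line as the id, while "tab", "space" and "whitespace" use the second field as the id. The helper name split_smiles_line is hypothetical, and CXSMILES handling is left out.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES line into (smiles, id); id is None if missing.

    Hypothetical helper showing assumed delimiter semantics.
    """
    line = line.rstrip("\r\n")
    if delimiter == "to-eol":
        # SMILES is the first term; the id is everything after it.
        fields = line.split(None, 1)
    elif delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    elif delimiter in ("whitespace", None):
        fields = line.split()
    else:
        raise ValueError(f"unsupported delimiter in this sketch: {delimiter!r}")
    if not fields or not fields[0]:
        raise ValueError("missing SMILES")
    record_id = fields[1] if len(fields) > 1 else None
    return fields[0], record_id


print(split_smiles_line("CCO ethyl alcohol\n"))
print(split_smiles_line("CCO ethyl alcohol\n", delimiter="whitespace"))
print(split_smiles_line("CCO\tethyl alcohol\textra\n", delimiter="tab"))
```

Under these assumptions the choice matters mostly for ids that contain spaces: "to-eol" and "tab" keep them whole, while "space" and "whitespace" cut the id at the first space.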
- chemfp.text_toolkit.parse_smiles(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Parse a SMILES string and its id using the text toolkit
This is equivalent to calling:
parse_molecule(content, "smi", reader_args={...}, errors=errors)
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a TextRecord instance
- chemfp.text_toolkit.parse_smistring(content: str | bytes, *, cxsmiles: bool = True, errors: str = 'strict')¶
Parse a SMILES string using the text toolkit
This is equivalent to calling:
parse_molecule(content, "smistring", reader_args={...}, errors=errors)
- Parameters:
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a TextRecord instance
- chemfp.text_toolkit.read_csv_ids_and_molecules(source, *, id_column=1, mol_column=2, dialect=None, has_header=True, compression='auto', format='smi', id_tag=None, reader_args=None, errors='report', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict')¶
Read ids and TextRecords from column(s) of a CSV file
Read from source, which may be a filename, a file-like object, or None to read from stdin.
Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension: *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect (see https://docs.python.org/3/library/csv.html, though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
Use format to specify the structure format for how to parse the molecule column. The default of ‘smi’ will parse it as a SMILES string and, if id_column=None, will also parse any identifier.
The id_tag and reader_args arguments contain additional format configuration parameters.
The errors and csv_errors describe how to handle failures in molecule parsing and CSV parsing, respectively. The default is to report molecule parse failures to stderr, and to stop parsing if a CSV row does not contain enough columns.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
- Parameters:
source (a filename, file object, or None to read from stdin) – the CSV source
id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record
dialect (None, a string name, or a Dialect instance) – the CSV dialect
has_header (bool) – True if the first record contains titles, False if it does not
compression (string or None) – file compression format
format (a format name string, or Format object) – the molecule structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle molecule parse errors
csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string) – the name of the file’s character encoding
encoding_errors (string) – the method used to handle decoding errors
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs
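The column-selection rules (1-based integer positions, or titles taken from the header line) can be illustrated with the standard csv module. This is a standalone sketch and not how chemfp reads CSV files: iter_id_and_record is a hypothetical name, it yields raw strings rather than TextRecord instances, and it does not cover id_column=None, compression, or dialect inference.

```python
import csv
import io


def iter_id_and_record(lines, id_column=1, mol_column=2,
                       has_header=True, delimiter=","):
    """Yield (id, structure text) pairs from CSV lines (hypothetical helper)."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None) if has_header else None

    def resolve(column):
        # Integers are 1-based positions; strings are header titles.
        if isinstance(column, int):
            return column - 1
        if header is None:
            raise ValueError("column titles require a header line")
        return header.index(column)  # ValueError if the title is missing

    id_index = resolve(id_column)
    mol_index = resolve(mol_column)
    for row in reader:
        # Mirrors a strict policy: stop when a row is too short.
        if len(row) <= max(id_index, mol_index):
            raise ValueError(f"row does not have enough columns: {row!r}")
        yield row[id_index], row[mol_index]


text = "id,smiles,name\nCHEMBL1,CCO,ethanol\nCHEMBL2,c1ccccc1,benzene\n"
print(list(iter_id_and_record(io.StringIO(text))))
print(list(iter_id_and_record(io.StringIO(text),
                              id_column="name", mol_column="smiles")))
```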
- chemfp.text_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')¶
Return an iterator that reads (id, TextRecord) pairs from a structure file
See chemfp.text_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, TextRecord) pairs instead of just the molecules.
- Parameters:
source (a filename, file object, or None to read from stdin) – the structure source
format (a format name string, or Format object, or None to auto-detect) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
a chemfp.text_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs
- chemfp.text_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)¶
Return an iterator that reads (id, TextRecord) pairs from a string containing structure records
content is a string containing 0 or more records in the format format. See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.read_molecules_from_string() if you just want to read the TextRecord molecules instead of (id, TextRecord) pairs.
- Parameters:
content (a string) – the string containing structure records
format (a format name string, or Format object) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs
- chemfp.text_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')¶
Return an iterator that reads TextRecord instances from a structure file
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)
Only the SMILES formats use the reader_args dictionary. The supported parameters are:
delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
has_header - True or False
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
See read_ids_and_molecules() if you want (id, TextRecord) pairs instead of just the molecules.
- Parameters:
source (a filename, file object, or None to read from stdin) – the structure source
format (a format name string, or Format object, or None to auto-detect) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader parameters passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules
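The three errors settings described above can be pictured with a small stand-alone sketch. It does not use chemfp; `iter_parsed` and `parse_smiles_line` are hypothetical names invented for this illustration, and the real readers may report errors with different exception types and message formats.

```python
import sys


def iter_parsed(lines, parse, errors="strict", log=sys.stderr):
    """Yield parse(line) for each line, applying the given error policy."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    for lineno, line in enumerate(lines, 1):
        try:
            yield parse(line)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                print(f"line {lineno}: {err}", file=log)
            # "report" (after logging) and "ignore" both continue
            # with the next record


def parse_smiles_line(line):
    # Hypothetical parser: a record needs both a SMILES and an id
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"missing id in {line!r}")
    return fields[1], fields[0]
```

With the input `["C methane", "CC", "O water"]`, "strict" raises on the second line, while "report" and "ignore" both yield the two valid records.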
- chemfp.text_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)¶
Return an iterator that reads TextRecord instances from a string containing structure records
content is a string containing 0 or more records in the format format. See read_molecules() for details about the other parameters. See read_ids_and_molecules_from_string() if you want to read (id, TextRecord) pairs instead of just molecules.
- Parameters:
content (a string) – the string containing structure records
format (a format name string, or Format object) – the input structure format
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (a dictionary) – reader arguments passed to the underlying toolkit
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a chemfp.io.Location object, or None) – object used to track parser state information
encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules
- chemfp.text_toolkit.read_sdf_ids_and_molecules(source: None | str | BinaryIO, *, id_tag: None | str = None, errors: str = 'strict')¶
Read ids and molecules from an SDF file using the text toolkit
This is mostly equivalent to calling:
read_ids_and_molecules(source, "sdf", id_tag=id_tag, reader_args={...}, errors=errors)
along with decompression based on the source filename’s extension.
- Parameters:
id_tag (a string, or None to use the title) – get the id from the named data item instead of using the record title
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_sdf_ids_and_molecules_from_string(content: str | bytes, *, id_tag: None | str = None, errors: str = 'strict')¶
Read ids and molecules from a string containing an SDF file using the text toolkit
This is equivalent to calling:
read_ids_and_molecules_from_string(content, "sdf", id_tag=id_tag, reader_args={...}, errors=errors)
Use read_ids_and_molecules_from_string() if the content is compressed.
- Parameters:
id_tag (a string, or None to use the title) – get the id from the named data item instead of using the record title
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_sdf_ids_and_records(source=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)¶
Return an iterator that reads the (id, record string) pairs from an SD file
See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use id_tag to get the record id from the given SD tag instead.
See read_sdf_ids_and_values() if you want to read an identifier and tag value, or two tag values, instead of returning the full record.
- Parameters:
source (a filename, file object, or None to read from stdin) – the SDF source
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a
chemfp.base_toolkit.IdAndRecordReader
iterating (id, record string) pairs
- chemfp.text_toolkit.read_sdf_ids_and_records_from_string(content=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)¶
Return an iterator that reads the (id, record) pairs from a string containing SD records
This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use id_tag to use a given tag value instead. See read_sdf_records() for details about the other parameters.
If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.
If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id.
- Parameters:
content (string or bytes) – a string containing zero or more SD records
id_tag (string, or None to use the record title) – SD tag containing the record id
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a
chemfp.base_toolkit.IdAndRecordReader
iterating over the (id, record string) pairs
- chemfp.text_toolkit.read_sdf_ids_and_values(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)¶
Return an iterator that reads the (id, tag value string) pairs from an SD file
See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs.
By default this uses the title line for both the id and tag value strings. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.
- Parameters:
source (a filename, file object, or None to read from stdin) – the SDF source
id_tag (string, or None to use the record title) – SD tag containing the record id
value_tag (string, or None to use the record title) – SD tag containing the value
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a
chemfp.base_toolkit.IdAndRecordReader
iterating (id, value string) pairs
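As a rough picture of what an (id, tag value) pair means for one record, here is a stand-alone sketch that does not use chemfp. The helper names `get_tag_value` and `get_id_and_value` and the sample record are made up for this illustration; the sketch assumes Unix newlines and data items that follow the "M  END" line, and it is not the toolkit's parser.

```python
SAMPLE_RECORD = (
    "caffeine\n"
    "  hypothetical-program\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <PUBCHEM_CID>\n"
    "2519\n"
    "\n"
    "> <NAME>\n"
    "1,3,7-trimethylxanthine\n"
    "\n"
    "$$$$\n"
)


def get_tag_value(record, tag):
    """Return the first value of the SD tag `tag`, or None if it is absent."""
    lines = record.splitlines()
    # Data items follow the "M  END" line; otherwise scan every line.
    start = lines.index("M  END") + 1 if "M  END" in lines else 0
    for i in range(start, len(lines)):
        line = lines[i]
        # A data header starts with ">" and holds the tag name in <...>
        if line.startswith(">") and f"<{tag}>" in line:
            value_lines = []
            for value_line in lines[i + 1:]:
                if value_line == "" or value_line == "$$$$":
                    break  # a blank line ends the data item
                value_lines.append(value_line)
            return "\n".join(value_lines)
    return None


def get_id_and_value(record, id_tag=None, value_tag=None):
    """Use the title line unless a tag name is given; a missing tag gives None."""
    lines = record.splitlines()
    title = lines[0] if lines else ""
    record_id = title if id_tag is None else get_tag_value(record, id_tag)
    value = title if value_tag is None else get_tag_value(record, value_tag)
    return record_id, value
```

With no tag names, both members of the pair are the title line, which mirrors the default described above.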
- chemfp.text_toolkit.read_sdf_ids_and_values_from_string(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)¶
Return an iterator that reads the (id, value) pairs from a string containing SD records
This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.
If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.
If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value.
See read_sdf_records() for details about the other parameters.
- Parameters:
content (string or bytes) – a string containing zero or more SD records
id_tag (string, or None to use the record title) – SD tag containing the record id
value_tag (string, or None to use the record title) – SD tag containing the value
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a
chemfp.base_toolkit.IdAndRecordReader
iterating over the (id, value) pairs
- chemfp.text_toolkit.read_sdf_molecules(source: None | str | BinaryIO, *, errors: str = 'strict')¶
Read molecules from an SDF file using the text toolkit
This is mostly equivalent to calling:
read_molecules(source, "sdf", reader_args={...}, errors=errors)
along with decompression based on the source filename’s extension.
- Parameters:
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_sdf_molecules_from_string(content: str | bytes, *, errors: str = 'strict')¶
Read molecules from a string containing an SDF file using the text toolkit
This is equivalent to calling:
read_molecules_from_string(content, "sdf", reader_args={...}, errors=errors)
Use read_molecules_from_string() if the content is compressed.
- Parameters:
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_sdf_records(source=None, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)¶
Return an iterator that reads each record from an SD file as a string.
Iterate through the records in source, which must be in SD format. If compression is None or “auto” then auto-detect the compression type based on source, and default to uncompressed when it can’t be determined. Use “gz” when the input is gzip compressed, and “none” or “” if uncompressed.
The reader_args parameter is currently unused. It exists for future compatibility.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The block_size parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn’t need to change this parameter, but if you do, please let me know.
Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value ‘$$$$’. It is not a full validator and it does not know chemistry.
See read_sdf_ids_and_records() if you want (id, record) pairs, and read_sdf_ids_and_values() if you want (id, tag data) pairs. See read_sdf_ids_and_records_from_string() to read from a string instead of a file or file-like object.
- Parameters:
source (a filename, file object, or None to read from stdin) – the SDF source
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a chemfp.base_toolkit.RecordReader iterating over the records as a string
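The block-based reading strategy described for block_size can be sketched in a few lines of plain Python. This is a simplified illustration, not chemfp's implementation: `iter_sdf_records` is a hypothetical name, the sketch treats every "$$$$" line as a record terminator (so it would mis-split tag data whose value is "$$$$", which the real parser handles), it assumes Unix newlines, and its size guard only approximates the limit described above.

```python
import io

TERMINATOR = b"\n$$$$\n"


def iter_sdf_records(infile, block_size=327680):
    """Yield SD records as bytes from a binary file object, block by block."""
    max_pending = max(327680, 5 * block_size)
    pending = b""
    while True:
        block = infile.read(block_size)
        if not block:
            break
        # Prepend the text left over from the previous block
        pending += block
        start = 0
        while True:
            end = pending.find(TERMINATOR, start)
            if end == -1:
                break
            end += len(TERMINATOR)
            yield pending[start:end]
            start = end
        pending = pending[start:]
        # Guard against unbounded memory use on non-SD input
        if len(pending) > max_pending:
            raise ValueError("no complete SD record found; is this an SD file?")
    if pending.strip():
        raise ValueError("unexpected end of file inside a record")
```

For example, `list(iter_sdf_records(io.BytesIO(data), block_size=7))` returns the same records as a single large read, because the incomplete tail of each block is carried over to the next one.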
- chemfp.text_toolkit.read_sdf_records_from_string(content, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)¶
Return an iterator that reads each record from a string containing SD records
See read_sdf_records() for the parameter details. The main difference is that this function reads from content, which is a string containing 0 or more SDF records.
If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported.
See read_sdf_ids_and_records_from_string() to read (id, record) pairs and read_sdf_ids_and_values_from_string() to read (id, tag value) pairs.
- Parameters:
content (string or bytes) – a string containing zero or more SD records
reader_args (currently ignored) – currently ignored
compression (one of "auto", "none", "", or "gz") – the data content compression method
errors (one of "strict", "report", or "ignore") – specify how to handle errors
location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- Returns:
a
chemfp.base_toolkit.RecordReader
iterating over each record as a string
- chemfp.text_toolkit.read_smi_ids_and_molecules(source: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Read ids and molecules from a SMILES file using the text toolkit
This is mostly equivalent to calling:
read_ids_and_molecules(source, "smi", reader_args={...}, errors=errors)
along with decompression based on the source filename’s extension.
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_smi_ids_and_molecules_from_string(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Read ids and molecules from a string containing a SMILES file using the text toolkit
This is equivalent to calling:
read_ids_and_molecules_from_string(content, "smi", reader_args={...}, errors=errors)
Use read_ids_and_molecules_from_string() if the content is compressed.
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_smi_molecules(source: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Read molecules from a SMILES file using the text toolkit
This is mostly equivalent to calling:
read_molecules(source, "smi", reader_args={...}, errors=errors)
along with decompression based on the source filename’s extension.
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord
- chemfp.text_toolkit.read_smi_molecules_from_string(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')¶
Read molecules from a string containing a SMILES file using the text toolkit
This is equivalent to calling:
read_molecules_from_string(content, "smi", reader_args={...}, errors=errors)
Use read_molecules_from_string() if the content is compressed.
- Parameters:
delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id
has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header
cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string
errors (one of "strict", "ignore", or "log") – specify how to handle errors
- Returns:
a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord
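To make the delimiter and has_header options for the SMILES readers more concrete, here is a stand-alone sketch that does not use chemfp. The function names are hypothetical, and the splitting rules are assumptions made for illustration: "to-eol" is taken to mean the id is the rest of the line after the first whitespace, while the other styles take the second field as the id. The "comma" and "native" styles and CXSMILES handling are not covered.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES file line into (smiles, id); id is None if missing."""
    line = line.rstrip("\r\n")
    if delimiter == "to-eol":
        # Assumption: the id is everything after the first whitespace
        fields = line.split(None, 1)
    elif delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    elif delimiter in (None, "whitespace"):
        fields = line.split()
    else:
        raise ValueError(f"delimiter not covered by this sketch: {delimiter!r}")
    if not fields:
        raise ValueError("empty line")
    smiles = fields[0]
    record_id = fields[1] if len(fields) > 1 else None
    return smiles, record_id


def iter_smiles_records(text, delimiter="to-eol", has_header=False):
    """Yield (smiles, id) pairs, optionally skipping a header line."""
    lines = text.splitlines()
    if has_header:
        lines = lines[1:]  # treat the first line as a header
    for line in lines:
        yield split_smiles_line(line, delimiter)
```

Under these assumptions the line "C methane gas" gives the id "methane gas" with "to-eol" but "methane" with "whitespace" or "space", which is why the delimiter choice matters for ids that contain spaces.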
- chemfp.text_toolkit.set_id(mol, id)¶
Set the TextRecord’s id to the new id
This is the toolkit-portable way to write mol.id = id.
Note: this does not modify mol.record. Use chemfp.text_toolkit.create_string() or similar text_toolkit functions to get the record text with a new identifier.
- Parameters:
mol (a TextRecord) – the molecule
id (string) – the new id
- Returns:
None
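The distinction between the record's id and its record text can be illustrated with a stand-in class. This sketch does not use chemfp: `SketchRecord` and `record_with_current_id` are hypothetical names, and the real TextRecord has more attributes and behavior than shown here.

```python
from dataclasses import dataclass


@dataclass
class SketchRecord:
    """Hypothetical stand-in for an SDF TextRecord: an id plus the raw text."""
    id: str
    record: str


def set_id(mol, id):
    # Only the id attribute changes; the record text is left untouched
    mol.id = id


def record_with_current_id(mol):
    """Return the record text with its title line replaced by mol.id."""
    if "\n" in mol.id:
        raise ValueError("the id must not contain a newline")
    first_line, newline, rest = mol.record.partition("\n")
    return mol.id + newline + rest
```

After `set_id(mol, "new")`, `mol.record` still starts with the old title line; only a separately generated string, such as the one from `record_with_current_id(mol)`, carries the new identifier.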
- chemfp.text_toolkit.set_sdf_id(sdf_record, id)¶
Set the id of the SDF record string to a new value
Set the first line of sdf_record to the new id, which must not contain a newline.
The sdf_record and the id must have the same string type.
- Parameters:
sdf_record (string) – an SDF record
id (string) – the new id