chemfp.text_toolkit module

Methods to work with SD and SMILES files as records rather than molecules.

The text_toolkit implements the chemfp toolkit API but where the “molecules” are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations.

The TextRecord is a base class. The actual records depend on the format, and will be one of:

The text toolkit will let you “convert” between the different SMILES formats, but it doesn’t actually change the SMILES string. The SMILES records have the attributes id, record and smiles.

The toolkit also knows a bit about the SD format. The SDF records have the attributes id, id_bytes and record, and there are methods to get SD tag values and add a tag to the end of the tag data block.

The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord.

The record types also have the attributes encoding and encoding_errors which affect how the record bytes are parsed.

chemfp.text_toolkit.add_sdf_tag(sdf_record, tag, value)

Add an SD tag value to an SD record string

This will append the new tag and value to the end of the tag data block in the sdf_record string.

Parameters:
  • sdf_record (string) – an SD record

  • tag (string) – a tag name

  • value (string) – the new tag value

Returns:

a new SD record string with the new tag and value
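As a rough illustration of the append behaviour described above, the sketch below adds a new data item just before the "$$$$" terminator line of an SD record. It is not chemfp's implementation: append_sd_tag is a hypothetical helper, and it assumes a str record that contains a "$$$$" line.

```python
def append_sd_tag(sdf_record: str, tag: str, value: str) -> str:
    """Return a copy of sdf_record with a new data item before the '$$$$' line.

    Hypothetical helper for illustration only, not part of chemfp.
    """
    lines = sdf_record.splitlines(keepends=True)
    # Find the record terminator, searching from the end of the record.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip("\r\n") == "$$$$":
            break
    else:
        raise ValueError("missing '$$$$' record terminator")
    # An SD data item is a '> <tag>' header, the value, and a blank line.
    new_item = [f"> <{tag}>\n", f"{value}\n", "\n"]
    return "".join(lines[:i] + new_item + lines[i:])


record = (
    "benzene\n"
    "  sketch\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <SOURCE>\n"
    "example\n"
    "\n"
    "$$$$\n"
)
updated = append_sd_tag(record, "MW", "78.11")
print(updated)
```

The existing SOURCE item is left untouched and the input string is not modified. With chemfp installed, the corresponding call would be add_sdf_tag(record, "MW", "78.11").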

chemfp.text_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the TextRecord

If the mol is in “sdf” format then this will modify mol.record to append the new tag and value to the end of the tag block. The other tags will not be modified, including tags with the same tag name.

Parameters:
  • mol (a TextRecord) – the text record

  • tag (string) – the SD tag name

  • value (string) – the text for the tag

Returns:

None

chemfp.text_toolkit.copy_molecule(mol)

Return a new TextRecord which is a copy of the given TextRecord

Parameters:

mol (a TextRecord) – the text record

Returns:

a new TextRecord

chemfp.text_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors='strict', level=None)

Convert a TextRecord into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output

  • format (a format name string, or Format object) – the output structure format

  • id (a string, or None to use the molecule's own id) – an alternate record id

  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats

Returns:

a byte string

chemfp.text_toolkit.create_sdf(mol: Any, *, id: str | None = None, errors: str = 'strict') str | None

Generate an SDF record from a TextRecord instance

This is equivalent to calling:

create_string(mol, "sdf", id=id, writer_args={...}, errors=errors)
Parameters:
  • mol (a TextRecord instance) – a molecule object

  • id (None or a string (default: None)) – an alternate identifier for the output record, if relevant

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a string, or None if errors are ignored

chemfp.text_toolkit.create_smi(mol: Any, *, id: str | None = None, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict') str | None

Generate a SMILES string and its id from a TextRecord instance

This is equivalent to calling:

create_string(mol, "smi", id=id, writer_args={...}, errors=errors)
Parameters:
  • mol (a TextRecord instance) – a molecule object

  • id (None or a string (default: None)) – an alternate identifier for the output record, if relevant

  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id

  • cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a string, or None if errors are ignored

chemfp.text_toolkit.create_smiles(mol: Any, *, id: str | None = None, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict') str | None

Generate a SMILES string and its id from a TextRecord instance

This is equivalent to calling:

create_string(mol, "smi", id=id, writer_args={...}, errors=errors)
Parameters:
  • mol (a TextRecord instance) – a molecule object

  • id (None or a string (default: None)) – an alternate identifier for the output record, if relevant

  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id

  • cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a string, or None if errors are ignored

chemfp.text_toolkit.create_smistring(mol: Any, *, id: str | None = None, cxsmiles: bool = False, errors: str = 'strict') str | None

Generate a SMILES string from a TextRecord instance

This is equivalent to calling:

create_string(mol, "smistring", id=id, writer_args={...}, errors=errors)
Parameters:
  • mol (a TextRecord instance) – a molecule object

  • id (None or a string (default: None)) – an alternate identifier for the output record, if relevant

  • cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a string, or None if errors are ignored

chemfp.text_toolkit.create_string(mol, format, id=None, writer_args=None, errors='strict')

Convert a TextRecord into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output

  • format (a format name string, or Format object) – the output structure format

  • id (a string, or None to use the molecule's own id) – an alternate record id

  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

Returns:

a Unicode string

chemfp.text_toolkit.get_format(format_name)

Get the named format, or raise a ValueError

This will raise a ValueError for unknown format names.

Parameters:

format_name – the format name

Value format_name:

a string

Returns:

a chemfp.base_toolkit.Format object

chemfp.text_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that chemfp’s text toolkit supports

This version of chemfp will always support the structure formats available to chemfp so ‘include_unavailable’ does not affect anything. (It may affect other toolkits.)

Parameters:

include_unavailable – include unavailable formats?

Value include_unavailable:

True or False

Returns:

a list of chemfp.base_toolkit.Format objects

chemfp.text_toolkit.get_id(mol)

Get the molecule’s id from the TextRecord’s id field

This is a toolkit-portable way to get mol.id.

Parameters:

mol (a TextRecord) – the molecule

Returns:

a string

chemfp.text_toolkit.get_input_format(format_name)

Get the named input format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an input format.

Parameters:

format_name – the format name

Value format_name:

a string

Returns:

a chemfp.base_toolkit.Format object

chemfp.text_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.

  • format (A Format(-like) object, string, or None) – Format information, if known.

Returns:

a chemfp.base_toolkit.Format object
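The resolution order described above (an explicit format first, then the source name, then a fallback to uncompressed SMILES) can be sketched in plain Python. This is only an approximation: guess_format and the two extension tables are hypothetical, the real function returns a Format object instead of a tuple, and the real toolkit may recognize more names than the ones listed here.

```python
import os

# Hypothetical lookup tables, assumed for this sketch only.
_COMPRESSIONS = {".gz": "gz", ".zst": "zst"}
_FORMATS = {".sdf": "sdf", ".sd": "sdf", ".smi": "smi", ".can": "can", ".usm": "usm"}


def guess_format(source=None, format=None):
    """Return a (format name, compression) pair for the given hints."""
    if format is not None:
        # An explicit format string such as 'sdf.gz' wins over the source.
        name, _, compression = format.partition(".")
        return name, compression
    if isinstance(source, str):
        base, ext = os.path.splitext(source.lower())
        compression = _COMPRESSIONS.get(ext, "")
        if compression:
            # Strip the compression suffix and look at the next extension.
            base, ext = os.path.splitext(base)
        if ext in _FORMATS:
            return _FORMATS[ext], compression
    # Auto-detection was not possible: assume uncompressed SMILES.
    return "smi", ""


print(guess_format("library.sdf.gz"))
print(guess_format(None))
print(guess_format("notes.txt", format="smi"))
```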

chemfp.text_toolkit.get_input_formats()

Get the list of supported chemfp text toolkit input formats

Returns:

a list of chemfp.base_toolkit.Format objects

chemfp.text_toolkit.get_output_format(format_name)

Get the named format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an output format.

Parameters:

format_name – the format name

Value format_name:

a string

Returns:

a chemfp.base_toolkit.Format object

chemfp.text_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (A filename (as a string), a file object, or None to write to stdout) – The structure data destination.

  • format (A Format(-like) object, string, or None) – format information, if known.

Returns:

A chemfp.base_toolkit.Format object

chemfp.text_toolkit.get_output_formats()

Get the list of supported chemfp text toolkit output formats

Returns:

a list of chemfp.base_toolkit.Format objects

chemfp.text_toolkit.get_sdf_id(sdf_record)

Return the id for the SDF record string

The id is the first line of the sdf_record. A future version of this function may support an id_tag parameter. Let me know if that would be useful.

The returned id string will have the same type as the input sdf_record.

Parameters:

sdf_record (string) – an SD record

Returns:

the first line of the SD record
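The "same type as the input" rule can be illustrated with a short sketch that handles both str and bytes records. The helper name sd_title_line is hypothetical, and dropping a trailing carriage return is an assumption of this sketch, not documented chemfp behaviour.

```python
def sd_title_line(sdf_record):
    """Return the first line of an SD record, as str or bytes to match the input."""
    if isinstance(sdf_record, bytes):
        newline, carriage_return = b"\n", b"\r"
    else:
        newline, carriage_return = "\n", "\r"
    line = sdf_record.split(newline, 1)[0]
    # Assumption: a trailing carriage return is not part of the id.
    if line.endswith(carriage_return):
        line = line[:-1]
    return line


text_id = sd_title_line("CHEMBL113\n  sketch\n\nM  END\n$$$$\n")
bytes_id = sd_title_line(b"CHEMBL113\r\n  sketch\r\n\r\nM  END\r\n$$$$\r\n")
print(repr(text_id), repr(bytes_id))
```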

chemfp.text_toolkit.get_sdf_tag(sdf_record, tag)

Return the value for a named tag in an SDF record string

Get the value for the tag named tag from the string sdf_record containing an SD record.

Parameters:
  • sdf_record (string) – an SD record

  • tag (string) – a tag name

Returns:

the corresponding tag value as a string, or None

chemfp.text_toolkit.get_sdf_tag_pairs(sdf_record)

Return the (tag, value) entries in the SDF record string

Parse the sdf_record and return the tag data as a list of (tag, value) pairs. The type of the returned strings will be the same as the type of the input sdf_record string.

Parameters:

sdf_record (string) – an SDF record

Returns:

a list of (tag, value) pairs
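To make the shape of the result concrete, here is a minimal pure-Python sketch of extracting (tag, value) pairs from the tag data block. It is not chemfp's parser: sd_tag_pairs is a hypothetical helper, it only handles str input, and it assumes the data block follows an "M  END" line and ends at "$$$$".

```python
import re

# Data item headers look like '> <TAG>'; other fields may appear on the line.
_TAG_HEADER = re.compile(r"^>.*<([^>]*)>")


def sd_tag_pairs(sdf_record: str) -> list[tuple[str, str]]:
    """Return the (tag, value) pairs found in the record's tag data block."""
    lines = sdf_record.splitlines()
    # The tag data block starts after the molfile's 'M  END' line.
    try:
        i = lines.index("M  END") + 1
    except ValueError:
        i = 0
    pairs = []
    while i < len(lines) and lines[i] != "$$$$":
        match = _TAG_HEADER.match(lines[i])
        i += 1
        if match is None:
            continue
        value_lines = []
        # The value runs until the blank line that ends the data item.
        while i < len(lines) and lines[i] != "" and lines[i] != "$$$$":
            value_lines.append(lines[i])
            i += 1
        pairs.append((match.group(1), "\n".join(value_lines)))
    return pairs


record = (
    "benzene\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <SOURCE>\nexample\n\n"
    "> <MW>\n78.11\n\n"
    "$$$$\n"
)
pairs = sd_tag_pairs(record)
print(pairs)
```

The pairs come back in file order, so a record with a repeated tag name yields one entry per occurrence.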

chemfp.text_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

If the mol is in “sdf” format then this will return the corresponding tag value from mol.record, or None if the tag does not exist.

If the record is in any other format then it will return None.

Parameters:
  • mol (a TextRecord) – the molecule

  • tag (string) – the SD tag name

Returns:

a string, or None

chemfp.text_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the TextRecord

If the mol is in “sdf” format then this will return the list of (tag, value) pairs in mol.record, where the tag and value are strings.

If the record is in any other format then it will return an empty list.

Parameters:

mol (a TextRecord) – the molecule

Returns:

a list of (tag name, tag value) pairs

chemfp.text_toolkit.is_licensed()

Return True - chemfp’s text toolkit is always licensed

Returns:

True

chemfp.text_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors='strict')

Create a specialized function which takes a record and returns an (id, TextRecord) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and chemfp.text_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.text_toolkit.read_molecules() for details about the other parameters. The specific TextRecord subclass returned depends on the format.

Parameters:
  • format (a format name string, or Format object) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

Returns:

a function of the form parser(record string) -> (id, text_record)
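The "validate once, parse many times" idea behind the specialized function can be sketched with a closure. Everything here is hypothetical and simplified: make_title_parser only knows two format names, it returns the raw record text where chemfp would return a TextRecord, and it takes the SMILES id to be the rest of the line after the first whitespace.

```python
def make_title_parser(format):
    """Validate the format once, then return a parser(record) -> (id, record)."""
    if format not in ("sdf", "smi"):
        raise ValueError(f"unsupported format: {format!r}")

    if format == "sdf":
        def parser(record):
            # The id is the title line, which is the first line of the record.
            return record.split("\n", 1)[0], record
    else:
        def parser(record):
            # The id is whatever follows the SMILES on the line, if anything.
            fields = record.rstrip("\r\n").split(None, 1)
            record_id = fields[1] if len(fields) > 1 else None
            return record_id, record

    return parser


parse_smi_line = make_title_parser("smi")
print(parse_smi_line("CCO ethanol\n"))
print(parse_smi_line("C\n"))
```

The format check runs when the parser is created, so a bad format name fails early instead of on the first record.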

chemfp.text_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', level=None)

Return a MoleculeWriter which can write TextRecord instances to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write a TextRecord, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file.

TextRecords are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

That said, the text toolkit doesn’t know how to convert between SMILES and SDF formats, and will raise an exception if you try.

The writer_args is only used for the “smi”, “can”, and “usm” output formats. The only supported parameter is:

* delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination

  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format

  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track writer state information

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats

Returns:

a chemfp.base_toolkit.MoleculeWriter expecting TextRecord instances

chemfp.text_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors='strict', location=None, level=None)

Return a MoleculeStringWriter which can write TextRecord instances to a byte string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format

  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track writer state information

  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats

Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

chemfp.text_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors='strict', location=None)

Return a MoleculeStringWriter which can write TextRecord instances to a string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format

  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track writer state information

Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

chemfp.text_toolkit.open_sdf_writer(destination: None | str | BinaryIO, *, errors: str = 'strict')

Open an SDF file to write instances of TextRecord

This is mostly equivalent to calling:

open_molecule_writer(destination, "sdf", writer_args={...}, errors=errors)

along with compression based on the destination filename’s extension.

Parameters:
  • destination (None, a filename string, or a file-like object) – where to write the molecules

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord

chemfp.text_toolkit.open_sdf_writer_to_string(*, errors: str = 'strict')

Open an SDF file to write instances of TextRecord to an in-memory string

This is equivalent to calling:

open_molecule_writer_to_string("sdf", writer_args={...}, errors=errors)

Use write_molecules_to_string() to write compressed output.

Parameters:

errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord

chemfp.text_toolkit.open_smi_writer(destination: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict')

Open a SMILES file to write instances of TextRecord

This is mostly equivalent to calling:

open_molecule_writer(destination, "smi", writer_args={...}, errors=errors)

along with compression based on the destination filename’s extension.

Parameters:
  • destination (None, a filename string, or a file-like object) – where to write the molecules

  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id

  • cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord

chemfp.text_toolkit.open_smi_writer_to_string(*, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = None, cxsmiles: bool = False, errors: str = 'strict')

Open a SMILES file to write instances of TextRecord to an in-memory string

This is equivalent to calling:

open_molecule_writer_to_string("smi", writer_args={...}, errors=errors)

Use write_molecules_to_string() to write compressed output.

Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: None)) – The separator between the SMILES and the id

  • cxsmiles (Boolean (default: False)) – If true, include any CXSmiles term

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeWriter expecting instances of TextRecord

chemfp.text_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors='strict')

Parse the first structure record from content and return the (id, TextRecord) pair.

content is a string containing a single structure record in format format. (Additional records are ignored.) See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_molecule() if you just want the TextRecord and not the (id, TextRecord) pair.

Parameters:
  • content (a string) – the string containing a structure record

  • format (a format name string, or Format object) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

an (id, TextRecord molecule) pair

chemfp.text_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')

Parse the first structure record from the content string and return a TextRecord.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_id_and_molecule() if you want the (id, TextRecord) pair instead of just the text record.

Parameters:
  • content (a string) – the string containing a structure record

  • format (a format name string, or Format object) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

a TextRecord

chemfp.text_toolkit.parse_sdf(content: str | bytes, *, errors: str = 'strict')

Parse an SDF record using the text toolkit

This is equivalent to calling:

parse_molecule(content, "sdf", reader_args={...}, errors=errors)
Parameters:

errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a TextRecord instance object

chemfp.text_toolkit.parse_smi(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Parse a SMILES string and its id using the text toolkit

This is equivalent to calling:

parse_molecule(content, "smi", reader_args={...}, errors=errors)
Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a TextRecord instance object
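The delimiter option controls where the SMILES ends and the id begins. The sketch below shows one plausible reading of the named styles. The exact rules are assumptions of this sketch: "to-eol" is taken to mean that the id is the rest of the line after the first whitespace, None is treated the same way, and the "comma" and "native" styles are left out. split_smiles_line is a hypothetical helper, not a chemfp function.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES-file line into (smiles, id); id is None if absent."""
    line = line.rstrip("\r\n")
    if delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    elif delimiter == "whitespace":
        fields = line.split()
    elif delimiter in ("to-eol", None):
        # SMILES is the first whitespace-delimited term; the id is the rest.
        fields = line.split(None, 1)
    else:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    smiles = fields[0] if fields else ""
    record_id = fields[1] if len(fields) > 1 else None
    return smiles, record_id


print(split_smiles_line("c1ccccc1 benzene ring\n"))
print(split_smiles_line("c1ccccc1 benzene ring\n", delimiter="space"))
print(split_smiles_line("c1ccccc1\tbenzene ring\textra\n", delimiter="tab"))
```

Under these assumptions the same line gives the id "benzene ring" with "to-eol" but only "benzene" with "space", which is why the delimiter matters for ids that contain spaces.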

chemfp.text_toolkit.parse_smiles(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Parse a SMILES string and its id using the text toolkit

This is equivalent to calling:

parse_molecule(content, "smi", reader_args={...}, errors=errors)
Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a TextRecord instance object

chemfp.text_toolkit.parse_smistring(content: str | bytes, *, cxsmiles: bool = True, errors: str = 'strict')

Parse a SMILES string using the text toolkit

This is equivalent to calling:

parse_molecule(content, "smistring", reader_args={...}, errors=errors)
Parameters:
  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a TextRecord instance object

chemfp.text_toolkit.read_csv_ids_and_molecules(source, *, id_column=1, mol_column=2, dialect=None, has_header=True, compression='auto', format='smi', id_tag=None, reader_args=None, errors='report', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Read ids and TextRecords from column(s) of a CSV file

Read from source, which may be a filename, a file-like object, or None to read from stdin.

Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

Use format to specify the structure format for how to parse the molecule column. The default of ‘smi’ will parse it as a SMILES string and, if id_column=None, will also parse any identifier.

The id_tag and reader_args arguments contain additional format configuration parameters.

The errors and csv_errors describe how to handle failures in molecule parsing and CSV parsing, respectively. The default is to report molecule parse failures to stderr, and to stop parsing if a CSV row does not contain enough columns.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source

  • id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier

  • mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record

  • dialect (None, a string name, or a Dialect instance) – the CSV dialect

  • has_header (bool) – True if the first record contains titles, False if it does not

  • compression (string or None) – file compression format

  • format (a format name string, or Format object) – the molecule structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle molecule parse errors

  • csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string) – the name of the file’s character encoding

  • encoding_errors (string) – the method used to handle decoding errors

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs
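The column-selection rules (1-based integer positions, or titles taken from the header line) can be sketched with Python's standard csv module. This is an illustration under simplifying assumptions, not chemfp's reader: iter_id_and_record is a hypothetical helper that reads from a string, yields the raw column text where chemfp would yield a TextRecord, and leaves out the id_column=None case, compression, and error handling.

```python
import csv
import io


def iter_id_and_record(text, id_column=1, mol_column=2, dialect="excel", has_header=True):
    """Yield (id, molecule text) pairs from CSV text."""
    reader = csv.reader(io.StringIO(text), dialect=dialect)
    header = next(reader, None) if has_header else None

    def to_index(column):
        if isinstance(column, int):
            return column - 1            # column positions start at 1
        if header is None:
            raise ValueError("column titles need a header line")
        return header.index(column)      # ValueError for an unknown title

    id_index = to_index(id_column)
    mol_index = to_index(mol_column)
    for row in reader:
        yield row[id_index], row[mol_index]


text = "name,smiles\nbenzene,c1ccccc1\nethanol,CCO\n"
pairs = list(iter_id_and_record(text, id_column="name", mol_column="smiles"))
print(pairs)
```

The "excel" and "excel-tab" names used here are the standard Python csv dialects that the text above equates with "csv" and "tsv".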

chemfp.text_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Return an iterator that reads (id, TextRecord) pairs from a structure file

See chemfp.text_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source

  • format (a format name string, or Format object, or None to auto-detect) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

chemfp.text_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)

Return an iterator that reads (id, TextRecord) pairs from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.read_molecules_from_string() if you just want to read the TextRecord molecules instead of (id, TextRecord) pairs.

Parameters:
  • content (a string) – the string containing structure records

  • format (a format name string, or Format object) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

chemfp.text_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Return an iterator that reads TextRecord instances from a structure file

Iterate through the structure records in source, which are in the format given by format. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)

Only the SMILES formats use the reader_args dictionary. The supported parameters are:

  • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None

  • has_header - True or False

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See read_ids_and_molecules() if you want (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source

  • format (a format name string, or Format object, or None to auto-detect) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules
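
The three errors settings described above can be pictured with a small stand-alone loop. This is only a sketch of the policy, not chemfp's implementation: iter_parsed() and parse_smiles_line() are hypothetical helpers, and the wording of the stderr message is an assumption.

```python
import sys

def iter_parsed(lines, parse, errors="strict"):
    """Yield parse(line) for each line, handling failures according to `errors`."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    for lineno, line in enumerate(lines, 1):
        try:
            yield parse(line)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                # assumed message format; the real message may differ
                print(f"ERROR: {err}, line {lineno}. Skipping.", file=sys.stderr)
            # "report" and "ignore" both continue with the next record

def parse_smiles_line(line):
    # Hypothetical parser: a record must start with a SMILES field
    fields = line.split()
    if not fields:
        raise ValueError("missing SMILES")
    return fields[0]

lines = ["CCO ethanol", "", "c1ccccc1 benzene"]
kept = list(iter_parsed(lines, parse_smiles_line, errors="ignore"))
```

With "ignore" or "report" the empty second line is skipped and kept holds the two remaining SMILES strings. With "strict" the same input raises ValueError when the second line is reached.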

chemfp.text_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)

Return an iterator that reads TextRecord instances from a string containing structure records

content is a string containing 0 or more records in the format format. See read_molecules() for details about the other parameters. See read_ids_and_molecules_from_string() if you want to read (id, TextRecord) pairs instead of just molecules.

Parameters:
  • content (a string) – the string containing structure records

  • format (a format name string, or Format object) – the input structure format

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding

  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure

Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules

chemfp.text_toolkit.read_sdf_ids_and_molecules(source: None | str | BinaryIO, *, id_tag: None | str = None, errors: str = 'strict')

Read ids and molecules from an SDF file using the text toolkit

This is mostly equivalent to calling:

read_ids_and_molecules(source, "sdf", id_tag=id_tag, reader_args={...}, errors=errors)

along with decompression based on the source filename’s extension.

Parameters:
  • id_tag (a string, or None to use the title) – get the id from the named data item instead of using the record title

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_sdf_ids_and_molecules_from_string(content: str | bytes, *, id_tag: None | str = None, errors: str = 'strict')

Read ids and molecules from a string containing an SDF file using the text toolkit

This is equivalent to calling:

read_ids_and_molecules_from_string(content, "sdf", id_tag=id_tag, reader_args={...}, errors=errors)

Use read_ids_and_molecules_from_string() if the content is compressed.

Parameters:
  • id_tag (a string, or None to use the title) – get the id from the named data item instead of using the record title

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_sdf_ids_and_records(source=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, record string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use id_tag to get the record id from the given SD tag instead.

See read_sdf_ids_and_values() if you want to read an identifier and tag value, or two tag values, instead of returning the full record.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating (id, record string) pairs

chemfp.text_toolkit.read_sdf_ids_and_records_from_string(content=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, record) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use id_tag to use a given tag value instead. See read_sdf_records() for details about the other parameters.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, record string) pairs
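
The difference between text and byte-string content can be illustrated for the title-line id. The helper below is hypothetical and is not part of chemfp. It assumes that the id of a byte record is decoded with encoding and encoding_errors, and that text input is checked for ASCII-only content, as the description above states.

```python
def get_record_id(record, encoding="utf8", encoding_errors="strict"):
    """Return the title-line id of one SD record given as str or bytes."""
    if isinstance(record, bytes):
        # Byte input: decode only the title line to get the id
        title = record.split(b"\n", 1)[0].rstrip(b"\r")
        return title.decode(encoding, encoding_errors)
    # Text input: the encoding parameters are ignored, content must be ASCII
    record.encode("ascii")  # raises UnicodeEncodeError for non-ASCII text
    return record.split("\n", 1)[0].rstrip("\r")

latin1_record = b"caf\xe9\n  sketch\n\n"
as_latin1 = get_record_id(latin1_record, encoding="latin1")
as_replaced = get_record_id(latin1_record, encoding_errors="replace")
```

Here the same bytes give "café" when decoded as latin1, and a string ending in the U+FFFD replacement character when decoded as utf8 with encoding_errors="replace". With the default "strict" handler the utf8 decode raises UnicodeDecodeError.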

chemfp.text_toolkit.read_sdf_ids_and_values(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, tag value string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs.

By default this uses the title line for both the id and tag value strings. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • value_tag (string, or None to use the record title) – SD tag containing the value

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating (id, value string) pairs
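
As a rough picture of what an (id, value) pair contains, the following stand-alone sketch pulls the title line and tag values out of one SD record string. The functions get_sdf_tag() and get_id_and_value() are hypothetical and much simpler than chemfp's parser: they assume well-formed data items that end with a blank line, and they return only the first matching tag.

```python
import re

# A data header line starts with ">" and contains the tag name in angle brackets
_TAG_LINE = re.compile(r">.*<([^>]+)>")

def get_sdf_tag(record, tag):
    """Return the first value for `tag` in an SD record string, or None."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        m = _TAG_LINE.match(line)
        if m and m.group(1) == tag:
            value_lines = []
            for value_line in lines[i + 1:]:
                if not value_line.strip():
                    break  # a blank line ends the data item
                value_lines.append(value_line)
            return "\n".join(value_lines)
    return None  # the tag does not exist

def get_id_and_value(record, id_tag=None, value_tag=None):
    """Use the title line unless a tag name is given, as described above."""
    lines = record.splitlines()
    title = lines[0] if lines else ""
    id = title if id_tag is None else get_sdf_tag(record, id_tag)
    value = title if value_tag is None else get_sdf_tag(record, value_tag)
    return id, value

record = (
    "aspirin\n"
    "  sketch\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <CAS>\n"
    "50-78-2\n"
    "\n"
    "> <MW>\n"
    "180.16\n"
    "\n"
    "$$$$\n"
)
default_pair = get_id_and_value(record)
tag_pair = get_id_and_value(record, id_tag="CAS", value_tag="MW")
missing_pair = get_id_and_value(record, value_tag="LogP")
```

The last call shows the missing-tag case: the id is still the title line, while the value is None because the record has no LogP tag.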

chemfp.text_toolkit.read_sdf_ids_and_values_from_string(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, value) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value.

See read_sdf_records() for details about the other parameters.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records

  • id_tag (string, or None to use the record title) – SD tag containing the record id

  • value_tag (string, or None to use the record title) – SD tag containing the value

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, value) pairs

chemfp.text_toolkit.read_sdf_molecules(source: None | str | BinaryIO, *, errors: str = 'strict')

Read molecules from an SDF file using the text toolkit

This is mostly equivalent to calling:

read_molecules(source, "sdf", reader_args={...}, errors=errors)

along with decompression based on the source filename’s extension.

Parameters:
  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_sdf_molecules_from_string(content: str | bytes, *, errors: str = 'strict')

Read molecules from a string containing an SDF file using the text toolkit

This is equivalent to calling:

read_molecules_from_string(content, "sdf", reader_args={...}, errors=errors)

Use read_molecules_from_string() if the content is compressed.

Parameters:
  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_sdf_records(source=None, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)

Return an iterator that reads each record from an SD file as a string.

Iterate through the records in source, which must be in SD format. If compression is None or “auto” then auto-detect the compression type based on source, and default to uncompressed when it can’t be determined. Use “gz” when the input is gzip compressed, and “none” or “” if uncompressed.
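
The compression rules above can be sketched as a small file-opening helper. This is an illustration only: open_sdf_source() is a hypothetical name, and the detection rule (a ".gz" filename suffix means gzip) is an assumption about what "auto-detect based on source" means.

```python
import gzip

def open_sdf_source(source, compression=None):
    """Open a filename for binary reading, honoring the compression setting."""
    if compression is None or compression == "auto":
        # Assumed rule: ".gz" suffix means gzip, anything else is uncompressed
        compression = "gz" if str(source).lower().endswith(".gz") else "none"
    if compression == "gz":
        return gzip.open(source, "rb")
    if compression in ("none", ""):
        return open(source, "rb")
    raise ValueError(f"unsupported compression: {compression!r}")
```

Passing compression="gz" explicitly is how a gzip-compressed file with an unrelated extension would be read in this sketch.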

The reader_args parameter is currently unused. It exists for future compatibility.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The block_size parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn’t need to change this parameter, but if you do, please let me know.

Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
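
The block-oriented reading strategy described above can be sketched as follows. This is not chemfp's parser. The hypothetical iter_sdf_records() treats any "$$$$" line as the end of a record, so it would mishandle tag data whose value is "$$$$", and it applies the size guard to every pending chunk rather than only to the first record.

```python
import io

_TERMINATOR = b"\n$$$$\n"

def iter_sdf_records(infile, block_size=327680):
    """Yield each SD record (as bytes) from a binary file object."""
    max_pending = max(327680, 5 * block_size)  # guard from the note above
    pending = b""
    while True:
        block = infile.read(block_size)
        if not block:
            break
        # Prepend the text left over from the previous block
        pending += block
        start = 0
        while True:
            end = pending.find(_TERMINATOR, start)
            if end == -1:
                break
            end += len(_TERMINATOR)
            yield pending[start:end]
            start = end
        pending = pending[start:]
        if len(pending) > max_pending:
            raise ValueError("no complete SD record found; wrong format?")
    if pending.strip():
        raise ValueError("incomplete final record")

data = (
    b"mol1\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    b"mol2\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
)
# A tiny block size forces records to span several blocks
records = list(iter_sdf_records(io.BytesIO(data), block_size=16))
```

The deliberately small block_size in the usage lines only exercises the leftover handling; it is not a recommended setting.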

The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value ‘$$$$’. It is not a full validator and it does not know chemistry.

See read_sdf_ids_and_records() if you want (id, record) pairs, and read_sdf_ids_and_values() if you want (id, tag data) pairs. See read_sdf_records_from_string() to read from a string instead of a file or file-like object.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

chemfp.text_toolkit.read_sdf_records_from_string(content, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)

Return an iterator that reads each record from a string containing SD records

See read_sdf_records() for the parameter details. The main difference is that this function reads from content, which is a string containing 0 or more SDF records.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported.

See read_sdf_ids_and_records_from_string() to read (id, record) pairs and read_sdf_ids_and_values_from_string() to read (id, tag value) pairs.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records

  • reader_args (currently ignored) – currently ignored

  • compression (one of "auto", "none", "", or "gz") – the data content compression method

  • errors (one of "strict", "report", or "ignore") – specify how to handle errors

  • location (a chemfp.io.Location object, or None) – object used to track parser state information

Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

chemfp.text_toolkit.read_smi_ids_and_molecules(source: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Read ids and molecules from a SMILES file using the text toolkit

This is mostly equivalent to calling:

read_ids_and_molecules(source, "smi", reader_args={...}, errors=errors)

along with decompression based on the source filename’s extension.

Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord
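
To make the delimiter and has_header parameters more concrete, here is a stand-alone sketch of how a SMILES line might be split into a SMILES string and an id. The functions are hypothetical, and the meaning given to each delimiter name is an assumption based on the name alone. In particular, None is treated like "to-eol", and the "native" delimiter and CXSMILES handling are left out.

```python
_SEPARATORS = {"tab": "\t", "\t": "\t", "space": " ", " ": " ", "comma": ","}

def parse_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES line into (smiles, id); id is None if absent."""
    line = line.rstrip("\r\n")
    if delimiter is None or delimiter == "to-eol":
        # Assumed: SMILES ends at the first whitespace, id is the rest of the line
        fields = line.split(None, 1)
    elif delimiter == "whitespace":
        # Assumed: fields are separated by any run of whitespace
        fields = line.split()
    elif delimiter in _SEPARATORS:
        fields = line.split(_SEPARATORS[delimiter])
    else:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    if not fields or not fields[0]:
        raise ValueError("missing SMILES")
    return fields[0], (fields[1] if len(fields) > 1 else None)

def read_smiles_text(text, delimiter="to-eol", has_header=False):
    """Parse every line of `text`, optionally skipping a header line."""
    lines = text.splitlines()
    if has_header:
        lines = lines[1:]
    return [parse_smiles_line(line, delimiter) for line in lines]

text = "SMILES\tName\nCCO\tethyl alcohol\nc1ccccc1\tbenzene\n"
pairs = read_smiles_text(text, delimiter="tab", has_header=True)
```

Under these assumptions the choice of delimiter decides whether an id may contain spaces: "to-eol" and "tab" keep "ethyl alcohol" whole, while "space" and "whitespace" stop at "ethyl".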

chemfp.text_toolkit.read_smi_ids_and_molecules_from_string(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Read ids and molecules from a string containing a SMILES file using the text toolkit

This is equivalent to calling:

read_ids_and_molecules_from_string(content, "smi", reader_args={...}, errors=errors)

Use read_ids_and_molecules_from_string() if the content is compressed.

Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_smi_molecules(source: None | str | BinaryIO, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Read molecules from a SMILES file using the text toolkit

This is mostly equivalent to calling:

read_molecules(source, "smi", reader_args={...}, errors=errors)

along with decompression based on the source filename’s extension.

Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.read_smi_molecules_from_string(content: str | bytes, *, delimiter: Literal['to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', ' ', '\t'] | None = 'to-eol', has_header: bool = False, cxsmiles: bool = True, errors: str = 'strict')

Read molecules from a string containing a SMILES file using the text toolkit

This is equivalent to calling:

read_molecules_from_string(content, "smi", reader_args={...}, errors=errors)

Use read_molecules_from_string() if the content is compressed.

Parameters:
  • delimiter (One of None, 'to-eol', 'space', 'tab', 'comma', 'whitespace', 'native', or the space or tab characters (default: "to-eol")) – The separator between the SMILES and the id

  • has_header (Boolean (default: False)) – If true, treat the first line of the SMILES file as a header

  • cxsmiles (Boolean (default: True)) – If true, look for ChemAxon CXSMILES extensions after the SMILES string

  • errors (one of "strict", "ignore", or "log") – specify how to handle errors

Returns:

a chemfp.base_toolkit.MoleculeReader iterating instances of TextRecord

chemfp.text_toolkit.set_id(mol, id)

Set the TextRecord’s id to the new id

This is the toolkit-portable way to write mol.id = id.

Note: this does not modify mol.record. Use chemfp.text_toolkit.create_string() or similar text_toolkit functions to get the record text with a new identifier.

Parameters:
  • mol (a TextRecord) – the molecule

  • id (string) – the new id

Returns:

None

chemfp.text_toolkit.set_sdf_id(sdf_record, id)

Set the id of the SDF record string to a new value

Set the first line of sdf_record to the new id, which must not contain a newline.

The sdf_record and the id must have the same string type.
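
The two constraints above (no newline in the id, matching string types) can be sketched in a few lines. The function set_sdf_id_sketch() is a hypothetical stand-in, not chemfp's code, and it assumes the result is returned as a new record, since Python strings are immutable.

```python
def set_sdf_id_sketch(sdf_record, id):
    """Return a copy of sdf_record whose first line is replaced by id."""
    if type(sdf_record) is not type(id):
        raise TypeError("sdf_record and id must have the same string type")
    newline = "\n" if isinstance(id, str) else b"\n"
    if newline in id:
        raise ValueError("id must not contain a newline")
    # Keep everything after the first newline unchanged
    _, sep, rest = sdf_record.partition(newline)
    return id + sep + rest

text_result = set_sdf_id_sketch("old title\n  sketch\n\n", "new title")
bytes_result = set_sdf_id_sketch(b"old title\n  sketch\n\n", b"new title")
```

Mixing a text record with a byte-string id, or the reverse, raises TypeError in this sketch.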

Parameters:
  • sdf_record (string) – an SDF record

  • id (string) – the new id