data_points
Document types used to represent datatypes of individual documents in an IterableCorpus or subtype.
class DataPointType(*args, **kwargs)
Bases: object

Base data-point type for iterable corpora. All iterable corpora should have data-point types that are subclasses of this.
Every data point type has a corresponding document class, which can be accessed as MyDataPointType.Document. When overriding data point types, you can define a nested Document class, with no base class, to override parts of the document class’ functionality or add new methods, etc. This will be used to automatically create the Document class for the data point type.
Some data-point types may specify options, using the data_point_type_options field. This works in the same way as PimlicoDatatype's datatype_options. Values for the options can be specified on initialization as args or kwargs of the data-point type.

Note: Data-point type options are now implemented, just like datatype options. However, they cannot yet be specified in a config file when loading a stored corpus. An additional datatype option should be added to iterable corpora to allow data point type options to be specified when a datatype is loaded from a config file.
formatters = []
List of (name, cls_path) pairs specifying a standard set of formatters that the user might want to choose from to view a dataset of this type. The user is not restricted to this set, but can easily choose these by name, instead of specifying a class path themselves. The first in the list is the default used if no formatter is specified. Falls back to DefaultFormatter if empty.
metadata_defaults = {}
Metadata keys that should be written for this data point type, with default values and strings documenting the meaning of the parameter. Used for writers for this data point type. See Writer.
data_point_type_options = {}
Options specified in the same way as module options that control the nature of the document type. These are not things to do with the reading of specific datasets, for which the dataset's metadata should be used. These are things that have an impact on typechecking, such that options on the two checked datatypes are required to match for the datatypes to be considered compatible.

This corresponds exactly to a PimlicoDatatype's datatype_options and is processed in the same way.

They should always be an ordered dict, so that they can be specified using positional arguments as well as kwargs and config parameters.
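Because the dict is ordered, positional args can be matched to option names in definition order. A minimal, self-contained sketch of that mapping (the option names and the `process_args` helper are illustrative, not part of Pimlico's API):

```python
from collections import OrderedDict

# Hypothetical option definitions, in the ordered-dict form described above
data_point_type_options = OrderedDict([
    ("encoding", {"default": "utf-8", "help": "Character encoding of the raw data"}),
    ("lowercase", {"default": False, "help": "Lowercase all text"}),
])

def process_args(options, *args, **kwargs):
    # Start from defaults, fill positional args in definition order,
    # then let kwargs override by name
    values = {name: spec["default"] for name, spec in options.items()}
    for name, arg in zip(options, args):
        values[name] = arg
    values.update(kwargs)
    return values

print(process_args(data_point_type_options, "latin-1", lowercase=True))
# {'encoding': 'latin-1', 'lowercase': True}
```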
data_point_type_supports_python2 = True
Most core Pimlico datatypes support use in Python 2 and 3. Datatypes that do should set this to True. If it is False, the datatype is assumed to work only in Python 3.

Python 2 compatibility requires extra work from the programmer. Datatypes should generally declare whether or not they provide this support by overriding this explicitly.

Use supports_python2() to check whether a data-point type instance supports Python 2. (There may be reasons for a datatype's instance to override this class-level setting.)
name
check_type(supplied_type)
Type checking for an iterable corpus calls this to check that the supplied data point type matches the required one (i.e. this instance). By default, the supplied type is simply required to be an instance of the required type (or one of its subclasses).

This may be overridden to introduce other type checks.
is_type_for_doc(doc)
Check whether the given document is of this type, or a subclass of this one.

If the object is not a document instance (or, more precisely, doesn't have a data_point_type attr), this will always return False.
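The default behaviour of these two checks can be sketched with a plain-Python stand-in (a simplified illustration, not Pimlico's actual implementation):

```python
class DataPointType:
    def check_type(self, supplied_type):
        # Default check: the supplied type must be an instance of this
        # instance's class, i.e. the required type or one of its subclasses
        return isinstance(supplied_type, type(self))

    def is_type_for_doc(self, doc):
        # Anything without a data_point_type attr is not a document at all
        if not hasattr(doc, "data_point_type"):
            return False
        return self.check_type(doc.data_point_type)

class TextType(DataPointType):
    pass

class RawTextType(TextType):
    pass

# A subclass instance satisfies a check against its superclass, not vice versa
print(TextType().check_type(RawTextType()))          # True
print(RawTextType().check_type(TextType()))          # False
print(TextType().is_type_for_doc("not a document"))  # False
```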
reader_init(reader)
Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
writer_init(writer)
Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer's metadata from anything in the instance's metadata attribute, for any keys given in the data point type's metadata_defaults.
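That metadata update can be sketched as follows. The shape of the metadata_defaults entries (a (default, documentation-string) pair per key) follows the description of metadata_defaults above, but the helper function and the `encoding` key are hypothetical:

```python
def init_writer_metadata(writer_metadata, instance_metadata, metadata_defaults):
    # For each key declared in metadata_defaults, copy the instance's value
    # into the writer's metadata, falling back to the declared default
    for key, (default, _help_text) in metadata_defaults.items():
        writer_metadata[key] = instance_metadata.get(key, default)
    return writer_metadata

# Hypothetical key: (default value, documentation string)
metadata_defaults = {"encoding": ("utf-8", "Encoding used to store the data")}

print(init_writer_metadata({}, {"encoding": "latin-1"}, metadata_defaults))
# {'encoding': 'latin-1'}
```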
classmethod full_class_name()
The fully qualified name of the class for this data point type, by which it is referenced in config files. Used in docs.
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: object

The abstract superclass of all documents.
You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each document type. You can add functionality to a datapoint type’s document by creating a nested Document class. This will inherit from the parent datapoint type’s document. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:
    class MyDataPointType(DataPointType):
        class Document:
            # Override document things here
            # Add your own methods, properties, etc. for getting data from the document
A data point type’s constructed document class is available as MyDataPointType.Document.
Each document type should provide a method to convert from raw data (a bytes object in Py3, or future's backport of bytes in Py2) to the internal representation (an arbitrary dictionary) called raw_to_internal(), and another to convert the other way called internal_to_raw(). Both forms of the data are available using the properties raw_data and internal_data, and these methods are called as necessary to convert back and forth.

This is to avoid unnecessary conversions. For example, if the raw data is supplied and then only the raw data is ever used (e.g. passing the document straight through and writing out to disk), we want to avoid converting back and forth.
A subtype should then supply methods or properties (typically using the cached_property decorator) to provide access to different parts of the data. See the many built-in document types for examples of doing this.
You should not generally need to override the __init__ method. You may, however, wish to override internal_available() or raw_available(). These are called as soon as the internal data or raw data, respectively, become available, which may be at instantiation or after conversion. This can be useful if there are bits of computation that you want to do on the basis of one of these and then store to avoid repeated computation.
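The lazy two-way conversion can be sketched with a self-contained stand-in (assuming, for illustration, a document whose raw form is just one UTF-8 text field; the real Document class handles this generically):

```python
class SketchDocument:
    # Stores whichever form it was given and converts only on demand,
    # caching the result so repeated access costs nothing extra
    def __init__(self, raw_data=None, internal_data=None):
        self._raw = raw_data
        self._internal = internal_data

    def raw_to_internal(self, raw_data):
        # Illustrative format: the whole document is one UTF-8 text field
        return {"text": raw_data.decode("utf-8")}

    def internal_to_raw(self, internal_data):
        return internal_data["text"].encode("utf-8")

    @property
    def raw_data(self):
        if self._raw is None:
            self._raw = self.internal_to_raw(self._internal)
        return self._raw

    @property
    def internal_data(self):
        if self._internal is None:
            self._internal = self.raw_to_internal(self._raw)
        return self._internal

doc = SketchDocument(raw_data=b"some text")
print(doc.internal_data)  # {'text': 'some text'} -- converted on first access
print(doc.raw_data)       # b'some text' -- the original, no conversion needed
```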
keys = []
Specifies the keys that a document has in its internal data. Subclasses should specify their keys. The internal data fields corresponding to these can be accessed as attributes of the document.
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
raw_available()
Called as soon as the raw data becomes available, either at instantiation or conversion.
internal_available()
Called as soon as the internal data becomes available, either at instantiation or conversion.
raw_data
internal_data
class InvalidDocument(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.DataPointType

Widely used in Pimlico to represent a document that is empty not because the original input document was empty, but because a module along the way had an error processing it. Document readers/writers should generally be robust to this and simply pass the whole thing through where possible, so that wherever one of these pops up it is always possible to work out where the error occurred.
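The idea can be sketched with a standalone pair of conversion functions; the raw serialization below is purely illustrative (Pimlico's actual on-disk format for invalid documents may differ), but it shows how module_name and error_info survive a round trip through raw bytes:

```python
def internal_to_raw(internal_data):
    # Illustrative raw format: a labelled first line naming the failing
    # module, followed by the error details
    return "Error in module: {}\n{}".format(
        internal_data["module_name"], internal_data["error_info"]
    ).encode("utf-8")

def raw_to_internal(raw_data):
    first_line, rest = raw_data.decode("utf-8").split("\n", 1)
    return {
        "module_name": first_line[len("Error in module: "):],
        "error_info": rest,
    }

internal = {"module_name": "tokenize", "error_info": "ValueError: bad input"}
# Round-tripping through the raw form preserves both fields
assert raw_to_internal(internal_to_raw(internal)) == internal
```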
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for InvalidDocument.
keys = ['module_name', 'error_info']
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
module_name
error_info
class RawDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.DataPointType

Base document type. All document types for grouped corpora should be subclasses of this.

It may be used itself as well, where documents are just treated as raw data, though most of the time it will be appropriate to use subclasses to provide more information and processing operations specific to the datatype.
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for RawDocumentType.
keys = ['raw_data']
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
class TextDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Documents that contain text, most often human-readable documents from a textual corpus. Most often used as a superclass for other, more specific, document types.

This type does no special processing, since the storage format is already a unicode string, which is fine for raw text. However, it serves to indicate that the document represents text (not just any old raw data).

The property text provides the text, which is, for this base type, just the raw data. However, subclasses will override this, since their raw data will contain information other than the raw text.
data_point_type_supports_python2 = True
formatters = [('text', 'pimlico.datatypes.corpora.formatters.text.TextDocumentFormatter')]
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for TextDocumentType.
keys = ['text']
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
class RawTextDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.TextDocumentType

Subclass of TextDocumentType used to indicate that the text hasn't been processed (tokenized, etc.). Note that text that has been tokenized, parsed, etc. does not use subclasses of this type, so those types will not be considered compatible if this type is used as a requirement.
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for RawTextDocumentType.