data_points
Document types used to represent datatypes of individual documents in an IterableCorpus or subtype.
class DataPointType(*args, **kwargs)
Bases: object

Base data-point type for iterable corpora. All iterable corpora should have data-point types that are subclasses of this.
Every data point type has a corresponding document class, which can be accessed as MyDataPointType.Document. When overriding data point types, you can define a nested Document class, with no base class, to override parts of the document class’ functionality or add new methods, etc. This will be used to automatically create the Document class for the data point type.
Some data-point types may specify options, using the data_point_type_options field. This works in the same way as PimlicoDatatype's datatype_options. Values for the options can be specified on initialization as args or kwargs of the data-point type.

Note: Data-point type options are now implemented, just like datatype options. However, they cannot yet be specified in a config file when loading a stored corpus. An additional datatype option should be added to iterable corpora to allow data point type options to be specified when a datatype is loaded from a config file.
formatters = []
List of (name, cls_path) pairs specifying a standard set of formatters that the user might want to choose from to view a dataset of this type. The user is not restricted to this set, but can easily choose these by name, instead of specifying a class path themselves. The first in the list is the default used if no formatter is specified. Falls back to DefaultFormatter if empty.
metadata_defaults = {}
Metadata keys that should be written for this data point type, with default values and strings documenting the meaning of the parameter. Used for writers for this data point type. See Writer.
data_point_type_options = {}
Options specified in the same way as module options that control the nature of the document type. These are not things to do with the reading of specific datasets, for which the dataset's metadata should be used. These are things that have an impact on typechecking, such that options on the two checked datatypes are required to match for the datatypes to be considered compatible.

This corresponds exactly to a PimlicoDatatype's datatype_options and is processed in the same way.

They should always be an ordered dict, so that they can be specified using positional arguments as well as kwargs and config parameters.
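Because the dict is ordered, positional args can be matched to option names in definition order. A minimal, self-contained sketch of that mapping (the option names and the `process_args` helper are illustrative, not part of Pimlico's API):

```python
from collections import OrderedDict

# Hypothetical option definitions, in the ordered-dict form described above
data_point_type_options = OrderedDict([
    ("encoding", {"default": "utf-8", "help": "Character encoding of the raw data"}),
    ("lowercase", {"default": False, "help": "Lowercase all text"}),
])

def process_args(options, *args, **kwargs):
    # Start from defaults, fill positional args in definition order,
    # then let kwargs override by name
    values = {name: spec["default"] for name, spec in options.items()}
    for name, arg in zip(options, args):
        values[name] = arg
    values.update(kwargs)
    return values

print(process_args(data_point_type_options, "latin-1", lowercase=True))
# {'encoding': 'latin-1', 'lowercase': True}
```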
data_point_type_supports_python2 = True
Most core Pimlico datatypes support use in Python 2 and 3. Datatypes that do should set this to True. If it is False, the datatype is assumed to work only in Python 3.

Python 2 compatibility requires extra work from the programmer. Datatypes should generally declare whether or not they provide this support by overriding this explicitly.

Use supports_python2() to check whether a data-point type instance supports Python 2. (There may be reasons for a datatype's instance to override this class-level setting.)
name
check_type(supplied_type)
Type checking for an iterable corpus calls this to check that the supplied data point type matches the required one (i.e. this instance). By default, the supplied type is simply required to be an instance of the required type (or one of its subclasses).

This may be overridden to introduce other type checks.
is_type_for_doc(doc)
Check whether the given document is of this type, or a subclass of this one.

If the object is not a document instance (or, more precisely, doesn't have a data_point_type attr), this will always return False.
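The default behaviour of these two checks can be sketched with a plain-Python stand-in (a simplified illustration, not Pimlico's actual implementation):

```python
class DataPointType:
    def check_type(self, supplied_type):
        # Default check: the supplied type must be an instance of this
        # instance's class, i.e. the required type or one of its subclasses
        return isinstance(supplied_type, type(self))

    def is_type_for_doc(self, doc):
        # Anything without a data_point_type attr is not a document at all
        if not hasattr(doc, "data_point_type"):
            return False
        return self.check_type(doc.data_point_type)

class TextType(DataPointType):
    pass

class RawTextType(TextType):
    pass

# A subclass instance satisfies a check against its superclass, not vice versa
print(TextType().check_type(RawTextType()))          # True
print(RawTextType().check_type(TextType()))          # False
print(TextType().is_type_for_doc("not a document"))  # False
```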
reader_init(reader)
Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
writer_init(writer)
Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer's metadata from anything in the instance's metadata attribute, for any keys given in the data point type's metadata_defaults.
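That metadata update can be sketched as follows. The shape of the metadata_defaults entries (a (default, documentation-string) pair per key) follows the description of metadata_defaults above, but the helper function and the `encoding` key are hypothetical:

```python
def init_writer_metadata(writer_metadata, instance_metadata, metadata_defaults):
    # For each key declared in metadata_defaults, copy the instance's value
    # into the writer's metadata, falling back to the declared default
    for key, (default, _help_text) in metadata_defaults.items():
        writer_metadata[key] = instance_metadata.get(key, default)
    return writer_metadata

# Hypothetical key: (default value, documentation string)
metadata_defaults = {"encoding": ("utf-8", "Encoding used to store the data")}

print(init_writer_metadata({}, {"encoding": "latin-1"}, metadata_defaults))
# {'encoding': 'latin-1'}
```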
classmethod full_class_name()
The fully qualified name of the class for this data point type, by which it is referenced in config files. Used in docs.
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: object

The abstract superclass of all documents.
You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each document type. You can add functionality to a datapoint type’s document by creating a nested Document class. This will inherit from the parent datapoint type’s document. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:
    class MyDataPointType(DataPointType):
        class Document:
            # Override document things here
            # Add your own methods, properties, etc. for getting data from the document
A data point type’s constructed document class is available as MyDataPointType.Document.
Each document type should provide a method to convert from raw data (a bytes object in Py3, or future's backport of bytes in Py2) to the internal representation (an arbitrary dictionary) called raw_to_internal(), and another to convert the other way called internal_to_raw(). Both forms of the data are available using the properties raw_data and internal_data, and these methods are called as necessary to convert back and forth.

This is to avoid unnecessary conversions. For example, if the raw data is supplied and then only the raw data is ever used (e.g. passing the document straight through and writing out to disk), we want to avoid converting back and forth.
A subtype should then supply methods or properties (typically using the cached_property decorator) to provide access to different parts of the data. See the many built-in document types for examples of doing this.
You should not generally need to override the __init__ method. You may, however, wish to override internal_available() or raw_available(). These are called as soon as the internal data or raw data, respectively, become available, which may be at instantiation or after conversion. This can be useful if there are bits of computation that you want to do on the basis of one of these and then store to avoid repeated computation.
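The lazy two-way conversion can be sketched with a self-contained stand-in (assuming, for illustration, a document whose raw form is just one UTF-8 text field; the real Document class handles this generically):

```python
class SketchDocument:
    # Stores whichever form it was given and converts only on demand,
    # caching the result so repeated access costs nothing extra
    def __init__(self, raw_data=None, internal_data=None):
        self._raw = raw_data
        self._internal = internal_data

    def raw_to_internal(self, raw_data):
        # Illustrative format: the whole document is one UTF-8 text field
        return {"text": raw_data.decode("utf-8")}

    def internal_to_raw(self, internal_data):
        return internal_data["text"].encode("utf-8")

    @property
    def raw_data(self):
        if self._raw is None:
            self._raw = self.internal_to_raw(self._internal)
        return self._raw

    @property
    def internal_data(self):
        if self._internal is None:
            self._internal = self.raw_to_internal(self._raw)
        return self._internal

doc = SketchDocument(raw_data=b"some text")
print(doc.internal_data)  # {'text': 'some text'} -- converted on first access
print(doc.raw_data)       # b'some text' -- the original, no conversion needed
```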
keys = []
Specifies the keys that a document has in its internal data. Subclasses should specify their keys. The internal data fields corresponding to these can be accessed as attributes of the document.
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
raw_available()
Called as soon as the raw data becomes available, either at instantiation or conversion.
internal_available()
Called as soon as the internal data becomes available, either at instantiation or conversion.
raw_data
internal_data
class InvalidDocument(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.DataPointType

Widely used in Pimlico to represent a document that is empty not because the original input document was empty, but because a module along the way had an error processing it. Document readers/writers should generally be robust to this and simply pass the whole thing through where possible, so that wherever one of these pops up it is always possible to work out where the error occurred.
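The idea can be sketched with a standalone pair of conversion functions; the raw serialization below is purely illustrative (Pimlico's actual on-disk format for invalid documents may differ), but it shows how module_name and error_info survive a round trip through raw bytes:

```python
def internal_to_raw(internal_data):
    # Illustrative raw format: a labelled first line naming the failing
    # module, followed by the error details
    return "Error in module: {}\n{}".format(
        internal_data["module_name"], internal_data["error_info"]
    ).encode("utf-8")

def raw_to_internal(raw_data):
    first_line, rest = raw_data.decode("utf-8").split("\n", 1)
    return {
        "module_name": first_line[len("Error in module: "):],
        "error_info": rest,
    }

internal = {"module_name": "tokenize", "error_info": "ValueError: bad input"}
# Round-tripping through the raw form preserves both fields
assert raw_to_internal(internal_to_raw(internal)) == internal
```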
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for InvalidDocument.
keys = ['module_name', 'error_info']
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
module_name
error_info
class RawDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.DataPointType

Base document type. All document types for grouped corpora should be subclasses of this.

It may be used itself as well, where documents are just treated as raw data, though most of the time it will be appropriate to use subclasses to provide more information and processing operations specific to the datatype.
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for RawDocumentType.
keys = ['raw_data']
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
class TextDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Documents that contain text, most often human-readable documents from a textual corpus. Most often used as a superclass for other, more specific, document types.

This type does no special processing, since the storage format is already a unicode string, which is fine for raw text. However, it serves to indicate that the document represents text (not just any old raw data).

The property text provides the text, which is, for this base type, just the raw data. However, subclasses will override this, since their raw data will contain information other than the raw text.
data_point_type_supports_python2 = True
formatters = [('text', 'pimlico.datatypes.corpora.formatters.text.TextDocumentFormatter')]
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for TextDocumentType.
keys = ['text']
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
class RawTextDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.TextDocumentType

Subclass of TextDocumentType used to indicate that the text hasn't been processed (tokenized, etc.). Note that text that has been tokenized, parsed, etc. does not use subclasses of this type, so those types will not be considered compatible if this type is used as a requirement.
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for RawTextDocumentType.