strings

Documents consisting of strings.

See also

TextDocumentType and RawTextDocumentType: basic text (i.e. Unicode string) document types for ordinary textual documents.

class LabelDocumentType(*args, **kwargs)

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Simple document type for storing a short label associated with a document.

Behaves identically to TextDocumentType, but is a distinct type for the purposes of type checking, so that only corpora intended to serve as short labels are accepted as input where a label corpus is required.

The string label is stored in the label attribute.
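The idea of a type that carries the same data but is kept distinct for type checking can be sketched in plain Python. This is only an illustration: the three classes and the `accepts_label_corpus` helper below are hypothetical stand-ins, not Pimlico's classes or its actual type-checking mechanism.

```python
class RawDocumentTypeSketch:
    """Hypothetical stand-in for a base document type."""


class TextDocumentTypeSketch(RawDocumentTypeSketch):
    """Hypothetical stand-in for a general text document type."""


class LabelDocumentTypeSketch(RawDocumentTypeSketch):
    """Hypothetical stand-in for a label type: same data, distinct class."""


def accepts_label_corpus(document_type):
    # Assumed check: only the label type (or a subclass of it) qualifies,
    # even though a text type stores the same kind of data.
    return isinstance(document_type, LabelDocumentTypeSketch)


label_ok = accepts_label_corpus(LabelDocumentTypeSketch())
text_ok = accepts_label_corpus(TextDocumentTypeSketch())
```

Because the label type does not inherit from the text type in this sketch, a general text corpus is rejected where labels are required, which is the distinction the description above draws.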

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for LabelDocumentType.

keys = ['label']

internal_to_raw(internal_data)

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
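As a minimal sketch of this direction of the conversion, the standalone function below turns an internal dictionary with a single `label` key into bytes. It is not the library's code: the free-function form and the choice of UTF-8 encoding are assumptions made for illustration.

```python
def internal_to_raw(internal_data):
    # Assumption: the label is serialized as UTF-8 text and nothing else
    # is written; the real implementation may differ.
    return internal_data["label"].encode("utf-8")


raw = internal_to_raw({"label": "sports"})
```

The result is a `bytes` object that could be written to disk as it stands.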

raw_to_internal(raw_data)

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
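The pattern of calling the super method and then adding to its dictionary might look like the following sketch. `BaseDocumentSketch` and `LabelDocumentSketch` are hypothetical stand-ins rather than Pimlico classes, and both the `raw_data` key provided by the parent and the UTF-8 encoding are assumptions.

```python
class BaseDocumentSketch:
    """Hypothetical parent document class (not a Pimlico class)."""

    def raw_to_internal(self, raw_data):
        # Assumed parent behaviour: keep the raw bytes under one key.
        return {"raw_data": raw_data}


class LabelDocumentSketch(BaseDocumentSketch):
    keys = ["label"]

    def raw_to_internal(self, raw_data):
        # Start from everything the parent provides, then add to it,
        # so the parent's properties and methods keep working.
        internal = super().raw_to_internal(raw_data)
        internal["label"] = raw_data.decode("utf-8")  # UTF-8 is assumed
        return internal

    def internal_to_raw(self, internal_data):
        # Inverse of the decoding step above.
        return internal_data["label"].encode("utf-8")


doc = LabelDocumentSketch()
internal = doc.raw_to_internal(b"sports")
round_trip = doc.internal_to_raw(internal)
```

Extending the parent's dictionary instead of building a new one keeps every key the parent relies on, and the two methods stay inverses of each other so a document survives a write and a read unchanged.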