strings
Documents consisting of strings.
See also: TextDocumentType and RawTextDocumentType, the basic text (i.e. unicode string) document types for normal textual documents.
class LabelDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType
Simple document type for storing a short label associated with a document.
Identical to TextDocumentType, but distinguished for typechecking, so that only corpora designed to be used as short labels can be used as input where a label corpus is required.
The string label is stored in the label attribute.
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document
Document class for LabelDocumentType
keys = ['label']
internal_to_raw(internal_data)
Take a dictionary containing all the document's data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
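The round trip between raw bytes and the internal dictionary can be illustrated with a minimal standalone sketch. It deliberately does not import Pimlico: SketchLabelDocument and StrippedLabelDocument are hypothetical stand-ins, and the assumption that the raw form is just the UTF-8 encoded label is ours, not something this page states. The subclass shows the "call the super method, then replace values" pattern described above.

```python
class SketchLabelDocument:
    """Hypothetical stand-in for a label document; not Pimlico's actual class."""

    keys = ["label"]

    def raw_to_internal(self, raw_data: bytes) -> dict:
        # Assumption: the raw form is simply the UTF-8 encoded label
        return {"label": raw_data.decode("utf-8")}

    def internal_to_raw(self, internal_data: dict) -> bytes:
        # Inverse of raw_to_internal: encode the label back to bytes
        return internal_data["label"].encode("utf-8")


class StrippedLabelDocument(SketchLabelDocument):
    """Hypothetical subclass following the 'call super, then adjust' advice."""

    def raw_to_internal(self, raw_data: bytes) -> dict:
        internal = super().raw_to_internal(raw_data)
        # Replace one value while keeping every key the super type provides
        internal["label"] = internal["label"].strip()
        return internal


doc = StrippedLabelDocument()
internal = doc.raw_to_internal(b" positive\n")
raw = doc.internal_to_raw(internal)
```

Because the subclass only modifies the dictionary returned by the super method, every key listed in keys is still present, so anything that relies on the label key keeps working.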
-
-
class