tokenized

class TokenizedDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.data_points.TextDocumentType

Specialized data point type for documents that have had tokenization applied. It does very little processing; the main reason for its existence is to allow modules to require that a corpus has been tokenized before it is given as input.

Each document is a list of sentences. Each sentence is a list of words.
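For instance, the internal data for a short two-sentence document might look like the following (a sketch of the structure described above; the reader API itself is not shown, and the exact joining behaviour of the text property is an assumption):

```python
# Sketch of the internal format: a document is a list of sentences,
# each sentence a list of word strings.
doc_internal = {
    "sentences": [
        ["The", "cat", "sat", "."],
        ["It", "purred", "."],
    ]
}

# Reconstructing plain text (assumed convention: spaces between words,
# newlines between sentences).
text = "\n".join(" ".join(sent) for sent in doc_internal["sentences"])
# text == "The cat sat .\nIt purred ."
```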

formatters = [('tokenized_doc', 'pimlico.datatypes.corpora.tokenized.TokenizedDocumentFormatter')]
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for TokenizedDocumentType

keys = ['sentences']
text
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
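As a sketch of how these two methods mirror each other, the following assumes a plausible raw format of UTF-8 text with one sentence per line and space-separated words. These are illustrative standalone functions, not Pimlico's actual implementation:

```python
def raw_to_internal(raw_data: bytes) -> dict:
    # Decode the raw bytes and split into sentences (lines) and words.
    text = raw_data.decode("utf-8")
    return {"sentences": [line.split(" ") for line in text.split("\n")]}

def internal_to_raw(internal_data: dict) -> bytes:
    # Reverse the process: join words with spaces, sentences with newlines.
    return "\n".join(
        " ".join(sentence) for sentence in internal_data["sentences"]
    ).encode("utf-8")

raw = b"The cat sat .\nIt purred ."
internal = raw_to_internal(raw)
assert internal_to_raw(internal) == raw  # round trip preserves the data
```

Whatever the concrete raw format, the two methods should be inverses of each other in this way, so that documents survive a write/read cycle unchanged.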

class CharacterTokenizedDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType

Simple character-level tokenized corpus. The text isn't stored in any special way, but when read it is represented internally simply as a sequence of characters in each sentence.

If you need a more sophisticated way to handle character-type (or any non-word) units within each sequence, see SegmentedLinesDocumentType.
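The difference from the word-level type is only in what each sentence's elements are. A sketch (hypothetical helper, not the actual reader code):

```python
def chars_of_sentence(sentence_text: str) -> list:
    # At the character level, each sentence is a sequence of single
    # characters rather than space-separated words.
    return list(sentence_text)

# A sentence's internal representation under this type:
sentence = chars_of_sentence("cat sat")
# sentence == ["c", "a", "t", " ", "s", "a", "t"]
```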

data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.tokenized.Document

Document class for CharacterTokenizedDocumentType

sentences
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.

class SegmentedLinesDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType

Document consisting of lines, each split into elements, which may be characters, words, or anything else. Much like a tokenized corpus, but without the assumption that the elements (words, in the case of a tokenized corpus) contain no spaces.

You might use this, for example, if you want to train character-level models on a text corpus, but don't want to use strictly single-character units, perhaps grouping certain short character sequences together.

Uses the character / to separate elements in the raw data. If a / occurs within an element, it is stored as @slash@, so this string is assumed not to appear in any element (which is generally a reasonable assumption).
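The escaping convention just described can be sketched as follows (illustrative helpers, not the actual Pimlico methods). Note that elements may themselves contain spaces:

```python
def encode_line(elements):
    # Join elements with "/", escaping any literal "/" as "@slash@".
    return "/".join(el.replace("/", "@slash@") for el in elements)

def decode_line(line):
    # Split on "/" and restore any escaped slashes.
    return [el.replace("@slash@", "/") for el in line.split("/")]

line = encode_line(["th", "e ", "ca", "t/", "dog"])
# line == "th/e /ca/t@slash@/dog"
assert decode_line(line) == ["th", "e ", "ca", "t/", "dog"]
```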

data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.tokenized.Document

Document class for SegmentedLinesDocumentType

text
sentences
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.