pimlico.datatypes.tokenized module

class TokenizedDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.TextDocumentType

formatters = [('tokenized_doc', 'pimlico.datatypes.formatters.tokenized.TokenizedDocumentFormatter')]
process_document(doc, as_type=None)[source]
class TokenizedCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Specialized datatype for a tarred corpus to which tokenization has been applied. The datatype itself does very little: its main reason for existing is to allow modules to require that a corpus has been tokenized before it is given as input.

Each document is a list of sentences. Each sentence is a list of words.

datatype_name = 'tokenized'
data_point_type

alias of TokenizedDocumentType
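
The structured form of a tokenized document, as described above, is a list of sentences, each of which is a list of word tokens. Purely as an illustration (plain Python data, not a Pimlico API call)::

    # One document of a tokenized corpus:
    # a list of sentences, each sentence a list of word tokens
    doc = [
        ["The", "cat", "sat", "on", "the", "mat", "."],
        ["It", "purred", "."],
    ]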

class TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Simple writer that takes documents as lists of sentences, each a list of tokens, and outputs them with one sentence per line and tokens separated by spaces.

document_to_raw_data(doc)[source]

Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
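
As an informal sketch only (not the module's actual code), the mapping this writer performs can be thought of as joining tokens with spaces and sentences with newlines::

    def tokenized_doc_to_raw(doc):
        # doc: list of sentences, each a list of token strings
        return "\n".join(" ".join(sentence) for sentence in doc)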

class CharacterTokenizedDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.tokenized.TokenizedDocumentType

Simple character-level tokenized corpus. The text isn't stored in any special way, but when read it is represented internally simply as a sequence of characters for each sentence.

If you need a more sophisticated way to handle character-type (or any non-word) units within each sequence, see SegmentedLinesDocumentType.

formatters = [('char_tokenized_doc', 'pimlico.datatypes.formatters.tokenized.CharacterTokenizedDocumentFormatter')]
process_document(doc, as_type=None)[source]
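
For illustration only, and assuming the stored text keeps one sentence per line, reading a character-tokenized document back into its internal form amounts to treating each line as a sequence of characters::

    def raw_to_char_tokenized(raw_text):
        # Each sentence becomes a list of single-character units
        return [list(line) for line in raw_text.split("\n")]
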
class CharacterTokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Simple writer that takes lists of character tokens and outputs them with one sentence per line. It simply joins the characters together to store each sentence, since they can be split up again when read.

document_to_raw_data(doc)[source]

Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
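
A rough sketch of the mapping described above (an illustration of the format, not the library's exact code)::

    def char_tokenized_doc_to_raw(doc):
        # doc: list of sentences, each a sequence of single characters;
        # concatenating loses nothing, since they can be split apart again on reading
        return "\n".join("".join(sentence) for sentence in doc)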

class SegmentedLinesDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.tokenized.TokenizedDocumentType

Document consisting of lines, each split into elements, which may be characters, words, or any other units. Rather like a tokenized corpus, but does not assume that the elements (words, in the case of a tokenized corpus) contain no spaces.

You might use this, for example, if you want to train character-level models on a text corpus, but don’t use strictly single-character units, perhaps grouping together certain short character sequences.

Uses the character / to separate elements. If a / is found in an element, it is stored as @slash@, so this string is assumed not to be used in any element (which seems reasonable enough, generally).

formatters = [('segmented_lines', 'pimlico.datatypes.formatters.tokenized.SegmentedLinesFormatter')]
process_document(doc, as_type=None)[source]
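
The storage format described above can be illustrated with a small decoding sketch (an assumption about how the /-separated lines are interpreted, not the module's implementation)::

    def raw_to_segmented_lines(raw_text):
        # Split each line on '/', restoring any escaped slashes
        return [
            [element.replace("@slash@", "/") for element in line.split("/")]
            for line in raw_text.split("\n")
        ]
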
class SegmentedLinesCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(doc)[source]

Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
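
A corresponding encoding sketch for this writer, again only illustrative of the format described for SegmentedLinesDocumentType::

    def segmented_lines_doc_to_raw(doc):
        # Escape '/' within elements, then join elements with '/' and lines with newlines
        return "\n".join(
            "/".join(element.replace("/", "@slash@") for element in line)
            for line in doc
        )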