pimlico.datatypes.tokenized module

class pimlico.datatypes.tokenized.TokenizedCorpus(base_dir, pipeline, raw_data=False)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Specialized datatype for a tarred corpus that has had tokenization applied. The datatype itself does very little: its main purpose is to allow modules to require that a corpus has been tokenized before it is given as input.

Each document is a list of sentences. Each sentence is a list of words.
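
As an illustration of that structure (a made-up two-sentence document, not output from any particular corpus), a document read through this datatype looks like:

    doc = [
        ["This", "sentence", "has", "been", "tokenized", "."],
        ["Each", "inner", "list", "is", "one", "sentence", "."],
    ]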

data_point_type

alias of TokenizedDocumentType

datatype_name = 'tokenized'
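
As a rough sketch of the intended use, a module can require tokenized input by naming this datatype in its input list. The module below is hypothetical and the ModuleInfo conventions are assumed from the usual Pimlico module-definition pattern rather than taken from this page:

    from pimlico.core.modules.base import BaseModuleInfo
    from pimlico.datatypes.tokenized import TokenizedCorpus

    class ModuleInfo(BaseModuleInfo):
        # Hypothetical module that consumes a tokenized corpus
        module_type_name = "my_tokenized_consumer"
        # Requiring TokenizedCorpus means the pipeline will reject an
        # input corpus that has not been through a tokenizer module
        module_inputs = [("corpus", TokenizedCorpus)]
        module_outputs = []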

class pimlico.datatypes.tokenized.TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Simple writer that takes a document as a list of sentences, each a list of tokens, and writes it out with one sentence per line and the tokens of each sentence separated by spaces.

document_to_raw_data(doc)[source]
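
The method is not documented here, but from the class description above it presumably joins each sentence's tokens with spaces and the sentences with newlines. A minimal sketch of that behaviour, assuming doc is a list of token lists as described for TokenizedCorpus (an illustration, not the actual implementation):

    def document_to_raw_data(doc):
        # doc: list of sentences, each a list of token strings
        # Produce one sentence per line, tokens separated by spaces
        return u"\n".join(u" ".join(sentence) for sentence in doc)

Given the example document above, this yields two lines of space-separated tokens, which the parent TarredCorpusWriter then stores (optionally gzipped, per the gzip argument) as the document's raw data.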