pimlico.datatypes.tokenized module
class pimlico.datatypes.tokenized.TokenizedCorpus(base_dir, pipeline, raw_data=False)
Bases: pimlico.datatypes.tar.TarredCorpus
Specialized datatype for a tarred corpus that has had tokenization applied. The datatype does very little: it exists mainly so that modules can require that a corpus has been tokenized before it is given as input.
Each document is a list of sentences. Each sentence is a list of words.
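As an illustration of that structure, a single document can be pictured as a nested Python list. This is only a sketch: the sample sentences and the variable names are made up for the example and do not come from Pimlico.

```python
# Hypothetical example document: a list of sentences,
# where each sentence is a list of word tokens.
document = [
    ["The", "cat", "sat", "."],
    ["It", "purred", "."],
]

# Counting over the nested structure
num_sentences = len(document)
num_tokens = sum(len(sentence) for sentence in document)
```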
datatype_name = 'tokenized'

class pimlico.datatypes.tokenized.TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8')
Bases: pimlico.datatypes.tar.TarredCorpusWriter
Simple writer that takes lists of tokens and outputs them with a sentence per line and tokens separated by spaces.
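The output format described above can be sketched in plain Python. This is not the writer's actual implementation: `format_document` and `parse_document` are hypothetical helper names, and the sketch assumes that no token contains a space or a newline.

```python
def format_document(sentences):
    """Join tokens with spaces and sentences with newlines."""
    return "\n".join(" ".join(tokens) for tokens in sentences)


def parse_document(text):
    """Inverse of format_document: split lines, then split on spaces."""
    return [line.split(" ") for line in text.split("\n") if line]


# Hypothetical sample data
sentences = [["Hello", "world", "!"], ["Second", "sentence", "."]]
text = format_document(sentences)
restored = parse_document(text)
```

Under the stated assumption, parsing the formatted text gives back the original list of sentences, which is why a sentence-per-line layout is a convenient storage format for tokenized text.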