pimlico.datatypes.tokenized module
class TokenizedDocumentType(options, metadata)[source]
Bases: pimlico.datatypes.documents.TextDocumentType

formatters = [('tokenized_doc', 'pimlico.datatypes.formatters.tokenized.TokenizedDocumentFormatter')]
class TokenizedCorpus(base_dir, pipeline, **kwargs)[source]
Bases: pimlico.datatypes.tar.TarredCorpus
Specialized datatype for a tarred corpus that has had tokenization applied. The datatype itself does very little; the main reason for its existence is to allow modules to require that a corpus has been tokenized before it is given as input.
Each document is a list of sentences. Each sentence is a list of words.
datatype_name = 'tokenized'

data_point_type
alias of TokenizedDocumentType
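The nested-list structure described above can be illustrated with a plain-Python sketch (this shows only the shape of the data, not Pimlico's API):

```python
# A tokenized document: a list of sentences, each of which is a
# list of word tokens (plain Python lists of strings).
doc = [
    ["The", "cat", "sat", "."],
    ["It", "purred", "."],
]

# Iterating over the document gives one sentence at a time; each
# sentence iterates over its words.
first_sentence = doc[0]
assert first_sentence[1] == "cat"
assert len(doc) == 2
```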
class TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]
Bases: pimlico.datatypes.tar.TarredCorpusWriter
Simple writer that takes lists of tokens and outputs them with one sentence per line and tokens separated by spaces.
document_to_raw_data(doc)[source]
Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
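As a hedged sketch (a stand-in for illustration, not the actual Pimlico implementation), the mapping this writer's document_to_raw_data performs looks like:

```python
def tokenized_doc_to_raw(doc):
    # One sentence per line, tokens separated by spaces -- the
    # format described in the class docstring. Illustrative
    # stand-in for document_to_raw_data(), not the real method.
    return "\n".join(" ".join(sentence) for sentence in doc)

raw = tokenized_doc_to_raw([["Hello", "world", "!"], ["Bye", "."]])
# The stored form is plain text:
assert raw == "Hello world !\nBye ."
```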
class CharacterTokenizedDocumentType(options, metadata)[source]
Bases: pimlico.datatypes.tokenized.TokenizedDocumentType
Simple character-level tokenized corpus. The text isn't stored in any special way, but when read it is represented internally as a sequence of characters in each sentence.
If you need a more sophisticated way to handle character-type (or any non-word) units within each sequence, see SegmentedLinesDocumentType.
formatters = [('char_tokenized_doc', 'pimlico.datatypes.formatters.tokenized.CharacterTokenizedDocumentFormatter')]
class CharacterTokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]
Bases: pimlico.datatypes.tar.TarredCorpusWriter
Simple writer that takes lists of char-tokens and outputs them with one sentence per line. It simply joins together all the characters to store each sentence, since they can be divided up again when read.
document_to_raw_data(doc)[source]
Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
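A minimal sketch of the join-and-split scheme the class docstring describes (illustrative only, not the real document_to_raw_data()):

```python
def char_doc_to_raw(doc):
    # Each sentence is a list of single-character tokens; storing
    # a sentence just joins its characters, since the reader can
    # split the line back into characters.
    return "\n".join("".join(chars) for chars in doc)

doc = [list("Hi!"), list("Bye")]
raw = char_doc_to_raw(doc)
assert raw == "Hi!\nBye"
# Reading recovers the character tokens:
assert [list(line) for line in raw.split("\n")] == doc
```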
class SegmentedLinesDocumentType(options, metadata)[source]
Bases: pimlico.datatypes.tokenized.TokenizedDocumentType
Document consisting of lines, each split into elements, which may be characters, words, or anything else. Rather like a tokenized corpus, but does not assume that the elements (words, in the case of a tokenized corpus) contain no spaces.
You might use this, for example, if you want to train character-level models on a text corpus, but don’t use strictly single-character units, perhaps grouping together certain short character sequences.
Uses the character / to separate elements. If a / is found in an element, it is stored as @slash@, so this string is assumed not to be used in any element (which seems reasonable enough, generally).
formatters = [('segmented_lines', 'pimlico.datatypes.formatters.tokenized.SegmentedLinesFormatter')]
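The slash-separator scheme described above can be sketched as a pair of helper functions (the helper names are assumptions for illustration; this is not Pimlico's API):

```python
def encode_line(elements):
    # Join elements with '/', escaping any literal '/' inside an
    # element as '@slash@', as described above.
    return "/".join(el.replace("/", "@slash@") for el in elements)

def decode_line(line):
    # Reverse the mapping; relies on '@slash@' never appearing as
    # part of a genuine element.
    return [el.replace("@slash@", "/") for el in line.split("/")]

elements = ["and", "/", "multi word", "or"]
line = encode_line(elements)
assert line == "and/@slash@/multi word/or"
assert decode_line(line) == elements
```

Note that elements containing spaces round-trip unchanged, which is the point of this datatype compared with a plain tokenized corpus.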
class SegmentedLinesCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]
Bases: pimlico.datatypes.tar.TarredCorpusWriter
document_to_raw_data(doc)[source]
Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.