tokenized

class TokenizedDocumentType(*args, **kwargs)[source]
Bases: pimlico.datatypes.corpora.data_points.TextDocumentType

Specialized data point type for documents that have had tokenization applied. It does very little processing: the main reason for its existence is to allow modules to require that a corpus has been tokenized before it is given as input.

Each document is a list of sentences. Each sentence is a list of words.
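As an illustration of the internal format just described, here is a minimal sketch. The list-of-lists structure follows the description above; the plain-text serialization shown is only an assumption for illustration, not necessarily Pimlico's exact on-disk raw format.

```python
# Internal format described above: a document is a list of sentences,
# and each sentence is a list of words.
doc = [
    ["The", "cat", "sat", "."],
    ["It", "purred", "."],
]

# One plausible plain-text serialization: one sentence per line, words
# separated by spaces. This is an illustrative assumption only.
raw = "\n".join(" ".join(sentence) for sentence in doc)

# Splitting lines and whitespace recovers the internal structure.
recovered = [line.split() for line in raw.splitlines()]
```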
formatters = [('tokenized_doc', 'pimlico.datatypes.corpora.tokenized.TokenizedDocumentFormatter')]

data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]
Bases: pimlico.datatypes.corpora.data_points.Document

Document class for TokenizedDocumentType

keys = ['sentences']
text

raw_to_internal(raw_data)[source]
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
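The advice above can be sketched as follows. This is a hedged, self-contained illustration: `ParentDocument`, `MyDocument`, and the `num_tokens` key are hypothetical stand-ins, not Pimlico's actual classes or API.

```python
class ParentDocument:
    """Stand-in for a Document supertype (hypothetical, for illustration)."""

    def raw_to_internal(self, raw_data):
        # The parent builds its internal dict from the raw bytes.
        text = raw_data.decode("utf-8")
        return {"sentences": [line.split() for line in text.splitlines()]}


class MyDocument(ParentDocument):
    def raw_to_internal(self, raw_data):
        # Call the super method first, then add to the dictionary,
        # keeping every key the supertype provided so that all of its
        # properties and methods still work.
        internal = super().raw_to_internal(raw_data)
        internal["num_tokens"] = sum(len(s) for s in internal["sentences"])
        return internal


result = MyDocument().raw_to_internal(b"a b c\nd e")
```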
class CharacterTokenizedDocumentType(*args, **kwargs)[source]
Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType

Simple character-level tokenized corpus. The text is not stored in any special way, but when read it is represented internally simply as a sequence of characters in each sentence.

If you need a more sophisticated way to handle character-level (or any other non-word) units within each sequence, see SegmentedLinesDocumentType.
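A quick sketch of what a character-level sentence might look like when read, under the assumption that each sentence is represented as a list of single-character strings:

```python
# Under the representation described above, a sentence read from a
# character-tokenized corpus is just its sequence of characters.
sentence_text = "hi there"
char_sentence = list(sentence_text)

# Joining the characters back together recovers the original text.
rejoined = "".join(char_sentence)
```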
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]
Bases: pimlico.datatypes.corpora.tokenized.Document

Document class for CharacterTokenizedDocumentType

sentences
raw_to_internal(raw_data)[source]
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
class SegmentedLinesDocumentType(*args, **kwargs)[source]
Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType

Document consisting of lines, each split into elements, which may be characters, words, or anything else. Rather like a tokenized corpus, but does not assume that the elements (words, in the case of a tokenized corpus) contain no spaces.

You might use this, for example, if you want to train character-level models on a text corpus but do not use strictly single-character units, perhaps grouping together certain short character sequences.

Uses the character / to separate elements in the raw data. If a / occurs within an element, it is stored as @slash@, so this string is assumed never to appear in an element (which seems reasonable enough, generally).
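The /-separator and @slash@ escape described above can be sketched as a simple encode/decode round trip. The helper functions here are illustrative, not Pimlico's actual API:

```python
def encode_line(elements):
    # Join elements with '/', escaping any literal '/' as '@slash@'.
    return "/".join(el.replace("/", "@slash@") for el in elements)


def decode_line(raw_line):
    # Split on '/' and undo the escape.
    return [el.replace("@slash@", "/") for el in raw_line.split("/")]


# Elements may contain spaces or even slashes; both survive the round trip.
line = ["th", "e ", "ca", "t/d", "og"]
raw = encode_line(line)        # "th/e /ca/t@slash@d/og"
roundtrip = decode_line(raw)
```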
data_point_type_supports_python2 = True
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]
Bases: pimlico.datatypes.corpora.tokenized.Document

Document class for SegmentedLinesDocumentType

text

sentences
raw_to_internal(raw_data)[source]
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document's internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.