pimlico.datatypes.word_annotations module
-
exception pimlico.datatypes.word_annotations.AnnotationParseError
Bases: exceptions.Exception
-
class pimlico.datatypes.word_annotations.WordAnnotationCorpus(base_dir, pipeline)
Bases: pimlico.datatypes.tar.TarredCorpus
-
read_annotation_fields()
Get the available annotation fields from the dataset’s configuration. These are the fields that will actually be present in the dictionary produced for each word.
-
annotation_fields = None
-
datatype_name = 'word_annotations'
-
sentence_boundary_re
-
word_boundary
-
word_re
-
-
class pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpusWriter
Ensures that the correct metadata is provided for a word annotation corpus. It does not format the data itself: that must be done by the writing code or by a subclass.
-
class pimlico.datatypes.word_annotations.SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)
Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
Writes word annotations in a simple format: each line contains one sentence, words are separated by spaces, and the annotation fields of each word are separated by | (or a given separator). This corresponds to the standard tag format for C&C.
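The line format described above can be illustrated with a minimal standalone sketch. It does not use Pimlico: the function name format_document and the input structure (a list of sentences, each a list of per-word dictionaries keyed by field name) are assumptions made for illustration, and the input actually expected by document_to_raw_data is not documented here.

```python
def format_document(sentences, field_names, field_sep=u"|"):
    """Format sentences as one line each, in the simple word-annotation style.

    Hypothetical helper, not part of Pimlico. `sentences` is assumed to be a
    list of sentences, each a list of dicts mapping field names to values.
    """
    lines = []
    for sentence in sentences:
        # Join each word's fields with the separator, in the given field order
        words = [
            field_sep.join(word[name] for name in field_names)
            for word in sentence
        ]
        # Words within a sentence are separated by single spaces
        lines.append(u" ".join(words))
    # One sentence per line
    return u"\n".join(lines)


doc = [
    [{"word": u"The", "pos": u"DT"}, {"word": u"cat", "pos": u"NN"}],
    [{"word": u"Sleeps", "pos": u"VBZ"}],
]
text = format_document(doc, ["word", "pos"])
```

With the default separator, each word in this example is rendered as word|pos, so a field value that itself contains the separator or a space would need escaping before being written in this format.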
-
document_to_raw_data(data)
-
-
class pimlico.datatypes.word_annotations.WordAnnotationCorpusWithRequiredFields(required_fields)
Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement
Dynamic (functional) type that can be used in place of a module’s input type. During typechecking, it checks whether the input is a WordAnnotationCorpus (or a subtype) and whether its fields include all of those required.
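The field check described above amounts to a subset test. The following sketch shows only that idea and is not Pimlico's implementation: the class RequiredFieldsCheck and its methods are hypothetical names, and the real requirement class also checks the datatype itself, which is omitted here.

```python
class RequiredFieldsCheck(object):
    """Hypothetical stand-in illustrating a required-fields check."""

    def __init__(self, required_fields):
        self.required_fields = list(required_fields)

    def missing_fields(self, available_fields):
        # Return the required fields that the corpus does not provide,
        # preserving the order in which they were required
        available = set(available_fields)
        return [f for f in self.required_fields if f not in available]

    def check(self, available_fields):
        # The check passes only if no required field is missing
        return not self.missing_fields(available_fields)


requirement = RequiredFieldsCheck(["word", "pos"])
ok = requirement.check(["word", "pos", "lemma"])
missing = requirement.missing_fields(["word", "lemma"])
```

Reporting the missing fields, rather than only a boolean, makes it easier to produce a useful typechecking error message.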