pimlico.datatypes.word_annotations module

exception pimlico.datatypes.word_annotations.AnnotationParseError

Bases: exceptions.Exception

class pimlico.datatypes.word_annotations.WordAnnotationCorpus(base_dir, pipeline)

Bases: pimlico.datatypes.tar.TarredCorpus

data_point_type

alias of WordAnnotationsDocumentType

data_ready()

read_annotation_fields()

Get the available annotation fields from the dataset’s configuration. These are the fields that will actually be present in the dictionary produced for each word.

annotation_fields = None
datatype_name = 'word_annotations'

class pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Ensures that the correct metadata is provided for a word annotation corpus. It does not take care of formatting the data: that must be done by the writing code or by a subclass.

class pimlico.datatypes.word_annotations.SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

Takes care of writing word annotations in a simple format: each line contains one sentence, words are separated by spaces, and the annotation fields of each word are separated by | (or by a given separator). This corresponds to the standard tag format for C&C.
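The line format described above can be illustrated with a minimal, standalone sketch. It does not use Pimlico itself: the helpers `format_word`, `format_sentence` and `format_document`, and the representation of each word as a dictionary of field values, are assumptions made for illustration and are not the writer's actual implementation.

```python
# Hypothetical helpers illustrating the simple annotation format.
# None of these names are part of pimlico.


def format_word(word, field_names, field_sep=u"|"):
    # Join one word's field values in the order given by field_names
    return field_sep.join(word[name] for name in field_names)


def format_sentence(words, field_names, field_sep=u"|"):
    # Words within a sentence are separated by single spaces
    return u" ".join(format_word(w, field_names, field_sep) for w in words)


def format_document(sentences, field_names, field_sep=u"|"):
    # One sentence per line
    return u"\n".join(
        format_sentence(s, field_names, field_sep) for s in sentences
    )


doc = [
    [{"word": u"The", "pos": u"DT"}, {"word": u"cat", "pos": u"NN"}],
    [{"word": u"It", "pos": u"PRP"}, {"word": u"sleeps", "pos": u"VBZ"}],
]
raw = format_document(doc, ["word", "pos"])
print(raw)
# The|DT cat|NN
# It|PRP sleeps|VBZ
```

Note that this sketch does no escaping: a field value containing the separator or a space would make the output ambiguous, so real data would need to be checked or cleaned first.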

document_to_raw_data(data)

class pimlico.datatypes.word_annotations.AddAnnotationField(input_name, add_fields)

Bases: pimlico.datatypes.base.DynamicOutputDatatype

classmethod get_base_datatype_class()

get_datatype(module_info)

class pimlico.datatypes.word_annotations.WordAnnotationCorpusWithRequiredFields(required_fields)

Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

Dynamic (functional) type that can be used in place of a module’s input type. In typechecking, checks whether the input module is a WordAnnotationCorpus (or subtype) and whether its fields include all of those required.
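The two-part check described above (correct base type, then all required fields present) might be sketched as follows. This is a standalone illustration under stated assumptions, not Pimlico's code: `FakeAnnotationCorpus` and `FieldRequirement` are hypothetical stand-ins, and the available fields are passed in explicitly because how the real `check_type(supplied_type)` obtains them is not shown here.

```python
# Hypothetical stand-ins; these classes are not part of pimlico.


class FakeAnnotationCorpus(object):
    """Stand-in for a word annotation corpus datatype."""


class FakeSubCorpus(FakeAnnotationCorpus):
    """Stand-in for a subtype of the corpus datatype."""


class FieldRequirement(object):
    def __init__(self, required_fields):
        self.required_fields = list(required_fields)

    def check_type(self, supplied_type, available_fields):
        # Step 1: the supplied type must be the corpus type or a subtype
        if not (isinstance(supplied_type, type)
                and issubclass(supplied_type, FakeAnnotationCorpus)):
            return False
        # Step 2: every required field must be among those available
        return all(f in available_fields for f in self.required_fields)


req = FieldRequirement(["word", "pos"])
ok = req.check_type(FakeSubCorpus, ["word", "pos", "lemma"])
missing = req.check_type(FakeAnnotationCorpus, ["word"])
```

Here `ok` is true because the subtype supplies both required fields, while `missing` is false because `pos` is absent.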

check_type(supplied_type)

class pimlico.datatypes.word_annotations.WordAnnotationsDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(raw_data)
sentence_boundary_re
word_boundary
word_re