pimlico.datatypes.word_annotations module

class WordAnnotationsDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

sentence_boundary_re
word_boundary
word_re
process_document(raw_data)

class WordAnnotationCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

datatype_name = 'word_annotations'
data_point_type

alias of WordAnnotationsDocumentType

annotation_fields = None
read_annotation_fields()

Get the available annotation fields from the dataset’s configuration. These are the fields that will appear as keys in the dictionary produced for each word.

data_ready()

class WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Ensures that the correct metadata is provided for a word annotation corpus. It does not format the data itself: that must be done by the writing code or by a subclass.

class SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

Writes word annotations in a simple format: each line contains one sentence, words are separated by spaces, and the annotation fields of each word are separated by | (or a given separator). This corresponds to the standard tag format for C&C.
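
The line format described above can be sketched in a few lines of plain Python. This is only an illustration of the format, not Pimlico's implementation: the helper names format_sentence and parse_sentence, and the example field names "word" and "pos", are hypothetical.

```python
# Illustrative sketch only: these helpers are hypothetical and not part of Pimlico.

def format_sentence(words, field_names, field_sep=u"|"):
    """Render one sentence as a single line.

    `words` is a list of dicts, one per word, mapping field name to value.
    Fields are joined by `field_sep`, words by a single space.
    """
    return u" ".join(
        field_sep.join(word[name] for name in field_names) for word in words
    )


def parse_sentence(line, field_names, field_sep=u"|"):
    """Invert format_sentence: return one dict of fields per word."""
    return [
        dict(zip(field_names, token.split(field_sep)))
        for token in line.split(u" ")
    ]


fields = ["word", "pos"]
sentence = [
    {"word": "The", "pos": "DT"},
    {"word": "cat", "pos": "NN"},
    {"word": "sleeps", "pos": "VBZ"},
]

line = format_sentence(sentence, fields)
print(line)  # The|DT cat|NN sleeps|VBZ

# The format round-trips as long as no value contains a space or the separator
assert parse_sentence(line, fields) == sentence
```

Note that the sketch assumes no field value contains a space or the separator character; a real writer would need to escape or reject such values.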

document_to_raw_data(data)

class AddAnnotationField(input_name, add_fields)

Bases: pimlico.datatypes.base.DynamicOutputDatatype

get_datatype(module_info)
classmethod get_base_datatype_class()

class WordAnnotationCorpusWithRequiredFields(required_fields)

Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

Dynamic (functional) type that can be used in place of a module’s input type. During typechecking, it checks whether the supplied input is a WordAnnotationCorpus (or a subtype) and whether its annotation fields include all of the required ones.
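
The two-part check described above (subtype test, then field coverage) might look roughly like the following. This is a minimal sketch under stated assumptions, not Pimlico's check_type() code: the classes Corpus, PosTaggedCorpus and RequiredFields are hypothetical stand-ins, and the sketch assumes the available fields are stored as a class attribute.

```python
# Hypothetical stand-ins; not Pimlico's classes or its check_type() implementation.

class Corpus(object):
    # Assumption: available annotation fields are known at class level
    annotation_fields = ()


class PosTaggedCorpus(Corpus):
    annotation_fields = ("word", "pos")


class RequiredFields(object):
    """Input requirement: a Corpus (sub)type providing all required fields."""

    def __init__(self, required_fields):
        self.required_fields = set(required_fields)

    def check_type(self, supplied_type):
        # Part 1: the supplied type must be Corpus or one of its subtypes
        if not (isinstance(supplied_type, type) and issubclass(supplied_type, Corpus)):
            return False
        # Part 2: every required field must be among the supplied type's fields
        return self.required_fields.issubset(supplied_type.annotation_fields)


print(RequiredFields(["word"]).check_type(PosTaggedCorpus))           # True
print(RequiredFields(["word", "lemma"]).check_type(PosTaggedCorpus))  # False
print(RequiredFields(["word"]).check_type(str))                       # False
```

Expressing the requirement as an object with a check_type method, rather than a fixed class, lets each module state its own list of required fields.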

check_type(supplied_type)

exception AnnotationParseError

Bases: exceptions.Exception