pimlico.datatypes.word_annotations module

class WordAnnotationsDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

sentence_boundary_re
word_boundary
word_re
process_document(raw_data)[source]
class WordAnnotationCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

datatype_name = 'word_annotations'
data_point_type

alias of WordAnnotationsDocumentType

annotation_fields = None
read_annotation_fields()[source]

Get the available annotation fields from the dataset’s configuration. These are the fields that will appear as keys in the dictionary produced for each word.
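As a rough illustration of what a per-word dictionary keyed by annotation fields could look like, the sketch below parses one sentence line into a list of dictionaries. It does not use Pimlico's own reader: the function name `parse_sentence`, the field names and the space/`|` layout are assumptions made for this example only.

```python
def parse_sentence(line, field_names, field_sep="|"):
    """Split one sentence line into a list of per-word dictionaries.

    Hypothetical helper, not part of Pimlico: assumes words are separated
    by whitespace and fields within a word by field_sep.
    """
    words = []
    for token in line.split():
        values = token.split(field_sep)
        if len(values) != len(field_names):
            raise ValueError(
                "expected %d fields in %r, found %d"
                % (len(field_names), token, len(values))
            )
        # Pair each field name with the value found at the same position
        words.append(dict(zip(field_names, values)))
    return words


# Assumed field names, standing in for what the dataset configuration provides
fields = ["word", "pos"]
parsed = parse_sentence("The|DT cat|NN sat|VBD", fields)
```

Each entry of `parsed` then maps every field name to that word's value, so a consumer can look up `parsed[1]["pos"]` without knowing how the line was laid out.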

data_ready()[source]

Check whether the data corresponding to this datatype instance exists and is ready to be read.

The default implementation only checks whether the data dir exists. Subclasses may add their own checks, or override this entirely if the data dir isn’t needed.
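The pattern described here, a default directory check that subclasses extend, can be sketched as follows. The class names, the `data_dir` attribute, the `data` subdirectory and the `fields.txt` file are all hypothetical stand-ins for this example, not Pimlico's actual implementation.

```python
import os
import shutil
import tempfile


class DataDirSketch:
    """Hypothetical stand-in for a datatype with a default readiness check."""

    def __init__(self, base_dir):
        # Assumed layout: the data lives in a "data" subdirectory
        self.data_dir = os.path.join(base_dir, "data")

    def data_ready(self):
        # Default check: the data directory exists
        return os.path.isdir(self.data_dir)


class FieldsFileSketch(DataDirSketch):
    """Subclass that adds its own check on top of the default one."""

    def data_ready(self):
        # "fields.txt" is an invented file name used only for illustration
        fields_file = os.path.join(self.data_dir, "fields.txt")
        return super().data_ready() and os.path.isfile(fields_file)


base_dir = tempfile.mkdtemp()
corpus = FieldsFileSketch(base_dir)
ready_before = corpus.data_ready()        # no data dir yet
os.makedirs(corpus.data_dir)
ready_dir_only = corpus.data_ready()      # dir exists, extra file missing
open(os.path.join(corpus.data_dir, "fields.txt"), "w").close()
ready_after = corpus.data_ready()         # both checks pass
shutil.rmtree(base_dir)
```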

class WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Ensures that the correct metadata is provided for a word annotation corpus. Doesn’t take care of the formatting of the data: that needs to be done by the writing code, or by a subclass.

class SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

Takes care of writing word annotations in a simple format, where each line contains a sentence, words are separated by spaces, and the annotation fields of each word are separated by | (or a given separator). This corresponds to the standard tag format for C&C.
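A minimal sketch of the format described above, written independently of Pimlico: `format_document` is a hypothetical helper, and the input structure (a list of sentences, each a list of per-word dictionaries) is an assumption for this example rather than the writer's documented input.

```python
def format_document(sentences, field_names, field_sep="|"):
    """Render sentences as one line each, with space-separated words.

    Hypothetical helper: each sentence is assumed to be a list of
    dictionaries mapping field names to string values.
    """
    lines = []
    for sentence in sentences:
        # Join the fields of each word in the order given by field_names
        tokens = [
            field_sep.join(word[name] for name in field_names)
            for word in sentence
        ]
        lines.append(" ".join(tokens))
    return "\n".join(lines)


sentences = [
    [{"word": "The", "pos": "DT"}, {"word": "cat", "pos": "NN"}],
    [{"word": "It", "pos": "PRP"}, {"word": "sat", "pos": "VBD"}],
]
raw = format_document(sentences, ["word", "pos"])
```

Note that this sketch does no escaping, so a word that itself contains a space or the separator would make the output ambiguous.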

document_to_raw_data(data)
class AddAnnotationField(input_name, add_fields)[source]

Bases: pimlico.datatypes.base.DynamicOutputDatatype

get_datatype(module_info)[source]
classmethod get_base_datatype_class()[source]

If the base datatype that will be produced can be determined before a ModuleInfo instance is available, implement this to return that class. By default, it returns None.

If this information is available, it will be used in documentation.

class WordAnnotationCorpusWithRequiredFields(required_fields)[source]

Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

Dynamic (functional) type that can be used in place of a module’s input type. During typechecking, it checks whether the supplied input type is a WordAnnotationCorpus (or a subtype) and whether its fields include all of the required ones.
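The field part of this check amounts to a subset test. The sketch below shows only that idea with plain lists; `missing_fields` and `has_required_fields` are hypothetical names, and the subtype check against WordAnnotationCorpus is left out.

```python
def missing_fields(available_fields, required_fields):
    """Return the required fields that the supplied corpus does not provide."""
    available = set(available_fields)
    # Keep the order of required_fields so error messages are predictable
    return [name for name in required_fields if name not in available]


def has_required_fields(available_fields, required_fields):
    """True if every required field is among the available ones."""
    return not missing_fields(available_fields, required_fields)


# Assumed field names, for illustration only
available = ["word", "pos", "lemma"]
ok = has_required_fields(available, ["word", "pos"])
absent = missing_fields(available, ["word", "ner", "chunk"])
```

Returning the list of missing names, rather than only a boolean, makes it possible to report which fields an input lacks when typechecking fails.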

check_type(supplied_type)[source]
exception AnnotationParseError[source]

Bases: exceptions.Exception