word_annotations

Textual corpus type where each word is accompanied by some annotations.

class WordAnnotationsDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType

List of sentences, each consisting of a list of words, each consisting of a tuple of the token and its annotations.

The document type needs to know what fields will be provided, so that it’s possible for a module to require a particular set of fields. The field list also tells the reader in which position to find each field.

E.g. the field list “word,lemma,pos” will store values like “walks|walk|VB” for each token. You could also provide “word,pos,lemma” with “walks|VB|walk” and the reader would know where to find the fields it needs.
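As an illustration of how the field list fixes the position of each value, the following minimal sketch splits the pipe-separated form shown above according to a field list. The helper name `parse_token` is hypothetical, and the pipe-separated string is only the notation used in this description; it is not necessarily the format Pimlico writes to disk.

```python
def parse_token(token, fields):
    """Map each value in a pipe-separated token to its field name.

    Hypothetical helper for illustration only; not part of Pimlico.
    """
    names = fields.split(",")
    values = token.split("|")
    if len(names) != len(values):
        raise ValueError(
            "expected %d values, got %d" % (len(names), len(values))
        )
    return dict(zip(names, values))


# The same annotations stored under two different field orders
a = parse_token("walks|walk|VB", "word,lemma,pos")
b = parse_token("walks|VB|walk", "word,pos,lemma")
print(a["pos"], b["pos"])
```

Whichever order is used, looking a value up by field name gives the same result, which is why a reader only needs the field list to locate the fields it requires.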

When a WordAnnotationsDocumentType is used as an input type requirement, it will accept any input corpus that also has a WordAnnotationsDocumentType as its data-point type and includes at least all of the fields specified for the requirement.

So, a requirement of GroupedCorpus(WordAnnotationsDocumentType(["word", "pos"])) will match a supplied type of GroupedCorpus(WordAnnotationsDocumentType(["word", "pos"])), or GroupedCorpus(WordAnnotationsDocumentType(["word", "pos", "lemma"])), but not GroupedCorpus(WordAnnotationsDocumentType(["word", "lemma"])).
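The field part of this matching rule amounts to a subset test. The sketch below shows only that logic on plain lists of field names; `fields_satisfy` is a hypothetical name, and the real check in Pimlico also compares the datatypes themselves.

```python
def fields_satisfy(required_fields, supplied_fields):
    """Return True if every required field is among the supplied fields.

    Hypothetical helper for illustration only; not part of Pimlico.
    """
    return set(required_fields) <= set(supplied_fields)


required = ["word", "pos"]
print(fields_satisfy(required, ["word", "pos"]))
print(fields_satisfy(required, ["word", "pos", "lemma"]))
print(fields_satisfy(required, ["word", "lemma"]))
```

The first two calls correspond to the supplied types that match in the example above, and the third to the one that does not.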

Annotations are given as strings, not other types (like ints). If you want to store e.g. int or float annotations, you need to do the conversion separately, as the encoding and decoding assume that only strings are used.

Annotations may, however, be None. This, as well as any linebreaks and tabs in the strings, will be encoded/decoded by the writer/reader.
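Since only strings (or None) are stored, numeric annotations have to be converted before writing and after reading. A minimal sketch of that round trip follows, assuming a hypothetical integer annotation stored alongside each word; the helper names are illustrative and not part of Pimlico.

```python
def to_annotation(value):
    """Convert a value to a string annotation, leaving None unchanged."""
    return None if value is None else str(value)


def from_annotation(annotation, convert=int):
    """Convert a string annotation back to a number, leaving None unchanged."""
    return None if annotation is None else convert(annotation)


# One sentence of (word, count) pairs; the second word has no count
sentence = [("walks", 5), ("home", None)]

# Before writing: every annotation becomes a string or stays None
encoded = [(word, to_annotation(count)) for word, count in sentence]

# After reading: convert the strings back to ints
decoded = [(word, from_annotation(text)) for word, text in encoded]
```

Passing `convert=float` to `from_annotation` would handle float annotations in the same way.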

data_point_type_options = {'fields': {'help': "Names of the annotation fields. These include the word itself. Typically the first field is therefore called 'word', but this is not required. However, there must be a field called 'word', since this datatype overrides tokenized documents, so need to be able to provide the original text. When used as a module type requirement, the field list gives all the fields that must (at least) be provided by the supplied type. Specified as a comma-separated list. Required", 'required': True, 'type': <function comma_separated_list.<locals>._fn>}}
data_point_type_supports_python2 = True
check_type(supplied_type)[source]

Type checking for an iterable corpus calls this to check that the supplied data point type matches the required one (i.e. this instance). By default, the supplied type is simply required to be an instance of the required type (or one of its subclasses).

This may be overridden to introduce other type checks.

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.tokenized.Document

Document class for WordAnnotationsDocumentType

keys = ['word_annotations']
text
sentences
get_field(field)[source]

Get the given field for every word in every sentence.

Must be one of the fields available in this datatype.
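Conceptually, this is a lookup by position in the field list. The sketch below reproduces that idea on plain Python data, assuming the result keeps one list per sentence; `field_values` is a hypothetical stand-in and may differ from the actual method in its return shape and error handling.

```python
def field_values(sentences, fields, field):
    """Collect one field for every word in every sentence.

    sentences: list of sentences, each a list of per-word tuples
    fields: field names, in the order used within each tuple
    Hypothetical helper for illustration only; not part of Pimlico.
    """
    if field not in fields:
        raise KeyError("field '%s' is not available in %s" % (field, fields))
    index = fields.index(field)
    return [[word[index] for word in sentence] for sentence in sentences]


fields = ["word", "lemma", "pos"]
sentences = [
    [("She", "she", "PRP"), ("walks", "walk", "VB")],
    [("Stop", "stop", "VB")],
]
pos_tags = field_values(sentences, fields, "pos")
```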

raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.

class AddAnnotationField(input_name, add_fields)[source]

Bases: pimlico.datatypes.base.DynamicOutputDatatype

Dynamic type constructor that can be used in place of a module’s output type. When called (when the output type is needed), dynamically creates a new type that is a corpus with WordAnnotationsDocumentType with the same fields as the named input to the module, with the addition of one or more new ones.

Parameters:
  • input_name – input to the module whose fields we extend
  • add_fields – field or fields to add, string names
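The effect on the field list can be sketched on plain lists as follows. This assumes that new fields are appended after the input's fields and that a single string is treated as one field name; `extend_fields` is a hypothetical name, and neither assumption is confirmed by this description.

```python
def extend_fields(input_fields, add_fields):
    """Return the input's field list with the new field or fields appended.

    Hypothetical helper for illustration only; not part of Pimlico.
    """
    # Assumption: a single string means one field name
    if isinstance(add_fields, str):
        add_fields = [add_fields]
    return list(input_fields) + list(add_fields)


one_added = extend_fields(["word", "pos"], "lemma")
two_added = extend_fields(["word"], ["pos", "lemma"])
```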
get_datatype(module_info)[source]
get_base_datatype()[source]

If it’s possible to say before the instance of a ModuleInfo is available what base datatype will be produced, implement this to return a datatype instance. By default, it returns None.

If this information is available, it will be used in documentation.

AddAnnotationFields

alias of pimlico.datatypes.corpora.word_annotations.AddAnnotationField

class DependencyParsedDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.word_annotations.WordAnnotationsDocumentType

WordAnnotationsDocumentType with fields word, pos, head, deprel for each token.

Convenience wrapper for use as an input requirement where parsed text is needed.

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.word_annotations.Document

Document class for DependencyParsedDocumentType