word_annotations
Textual corpus type where each word is accompanied by some annotations.
-
class WordAnnotationsDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.tokenized.TokenizedDocumentType
List of sentences, each consisting of a list of words, each consisting of a tuple of the token and its annotations.
The document type needs to know what fields will be provided, so that it’s possible for a module to require a particular set of fields. The field list also tells the reader in which position to find each field.
E.g. the field list “word,lemma,pos” will store values like “walks|walk|VB” for each token. You could also provide “word,pos,lemma” with “walks|VB|walk” and the reader would know where to find the fields it needs.
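As a rough illustration of how a field list fixes the position of each value, the sketch below splits one encoded token into a dictionary keyed by field name. The helper `parse_token` is hypothetical and not part of Pimlico, and the `|` separator simply follows the example in the text; the encoding Pimlico actually uses on disk may differ.

```python
def parse_token(encoded, field_list):
    """Map each value of an encoded token to its field name.

    Hypothetical helper for illustration only: it assumes values are
    separated by "|" and field names by ",", as in the example above.
    """
    fields = field_list.split(",")
    values = encoded.split("|")
    if len(values) != len(fields):
        raise ValueError(
            "expected %d values, got %d" % (len(fields), len(values))
        )
    return dict(zip(fields, values))


# The same token stored under two different field orders
token_a = parse_token("walks|walk|VB", "word,lemma,pos")
token_b = parse_token("walks|VB|walk", "word,pos,lemma")
```

Both calls recover the same field values, since the field list tells the reader which position holds which field.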
When a WordAnnotationsDocumentType is used as an input type requirement, it will accept any input corpus that also has a WordAnnotationsDocumentType as its data-point type and includes at least all of the fields specified for the requirement.
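The field part of this rule amounts to a subset test: every required field must appear among the supplied fields. The sketch below shows only that comparison. The function `fields_satisfy` is a hypothetical name, not Pimlico's own type-checking code, and it assumes both sides have already been confirmed to be word-annotation corpora.

```python
def fields_satisfy(required_fields, supplied_fields):
    """Return True if every required field is present in the supplied ones.

    Illustrative only: field order is ignored, and extra supplied
    fields are allowed.
    """
    return set(required_fields) <= set(supplied_fields)


exact_match = fields_satisfy(["word", "pos"], ["word", "pos"])
extra_field = fields_satisfy(["word", "pos"], ["word", "pos", "lemma"])
missing_field = fields_satisfy(["word", "pos"], ["word", "lemma"])
```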
So, a requirement of GroupedCorpus(WordAnnotationsDocumentType(["word", "pos"])) will match a supplied type of GroupedCorpus(WordAnnotationsDocumentType(["word", "pos"])), or GroupedCorpus(WordAnnotationsDocumentType(["word", "pos", "lemma"])), but not GroupedCorpus(WordAnnotationsDocumentType(["word", "lemma"])).
Annotations are given as strings, not other types (like ints). If you want to store e.g. int or float annotations, you need to do the conversion separately, as the encoding and decoding assumes only strings are used.
Annotations may, however, be None. This, as well as any linebreaks and tabs in the strings, will be encoded/decoded by the writer/reader.
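Because only strings (or None) are stored, numeric annotations have to be converted on the way in and out. A minimal sketch of such a conversion is given below. The helper names and the choice of an integer count are assumptions made for illustration and are not part of Pimlico.

```python
def encode_count(value):
    """Convert an int annotation to a string before writing; keep None."""
    return None if value is None else str(value)


def decode_count(annotation):
    """Convert a stored string annotation back to an int; keep None."""
    return None if annotation is None else int(annotation)


counts = [3, None, 12]
stored = [encode_count(c) for c in counts]      # what would be written
restored = [decode_count(s) for s in stored]    # what a consumer recovers
```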
data_point_type_options = {'fields': {'help': "Names of the annotation fields. These include the word itself. Typically the first field is therefore called 'word', but this is not required. However, there must be a field called 'word', since this datatype overrides tokenized documents, so need to be able to provide the original text. When used as a module type requirement, the field list gives all the fields that must (at least) be provided by the supplied type. Specified as a comma-separated list. Required", 'required': True, 'type': <function comma_separated_list.<locals>._fn>}}
-
data_point_type_supports_python2 = True
-
check_type(supplied_type)
Type checking for an iterable corpus calls this to check that the supplied data point type matches the required one (i.e. this instance). By default, the supplied type is simply required to be an instance of the required type (or one of its subclasses).
This may be overridden to introduce other type checks.
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.tokenized.Document
Document class for WordAnnotationsDocumentType
-
keys = ['word_annotations']
-
text
-
sentences
-
get_field(field)
Get the given field for every word in every sentence.
Must be one of the fields available in this datatype.
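To make the behaviour concrete, here is a plain-Python sketch of extracting one field from every word in every sentence. It does not use the Pimlico API: `get_field_values` is a hypothetical stand-in, the sentence data is invented for the example, and returning one list per sentence is an assumption about the output shape.

```python
def get_field_values(sentences, fields, field):
    """Collect one annotation field for every word, grouped by sentence.

    `sentences` is a list of sentences, each a list of tuples whose
    positions follow the order of `fields`.
    """
    if field not in fields:
        raise KeyError("field %r is not available; have %r" % (field, fields))
    idx = fields.index(field)
    return [[token[idx] for token in sentence] for sentence in sentences]


fields = ["word", "lemma", "pos"]
sentences = [
    [("She", "she", "PRP"), ("walks", "walk", "VB")],
    [("Dogs", "dog", "NNS"), ("bark", "bark", "VB")],
]
pos_tags = get_field_values(sentences, fields, "pos")
```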
-
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
-
class AddAnnotationField(input_name, add_fields)
Bases: pimlico.datatypes.base.DynamicOutputDatatype
Dynamic type constructor that can be used in place of a module’s output type. When called (when the output type is needed), it dynamically creates a new corpus type whose documents are of WordAnnotationsDocumentType, with the same fields as the named input to the module plus one or more new ones.
Parameters:
- input_name – input to the module whose fields we extend
- add_fields – field or fields to add, string names
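The effect on the field list can be pictured as taking the input's fields and appending the new ones. The sketch below shows only that list computation. `extend_fields` is a hypothetical helper, not the Pimlico implementation, and accepting either a single string or a list of strings is an assumption based on the parameter description.

```python
def extend_fields(input_fields, add_fields):
    """Return the input's field list with the new field(s) appended.

    Illustrative only: a single field name given as a string is
    treated as a one-element list.
    """
    if isinstance(add_fields, str):
        add_fields = [add_fields]
    return list(input_fields) + list(add_fields)


single = extend_fields(["word", "pos"], "lemma")
several = extend_fields(["word"], ["pos", "lemma"])
```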
-
AddAnnotationFields
alias of pimlico.datatypes.corpora.word_annotations.AddAnnotationField
-
class DependencyParsedDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.word_annotations.WordAnnotationsDocumentType
WordAnnotationsDocumentType with fields word, pos, head, deprel for each token.
Convenience wrapper for use as an input requirement where parsed text is needed.
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.word_annotations.Document
Document class for DependencyParsedDocumentType
-
class