pimlico.datatypes.vrt module

class VRTWord(word, *attributes)

Bases: object

Word with all its annotations.

The Korp docs give the following example list of positional attributes (columns):

word form, the number of the token within the sentence, lemma, lemma with compound boundaries marked, part of speech, morphological analysis, dependency head number and dependency relation

However, these attributes are not fixed: different files may have different numbers of attributes with different meanings, and the data file itself does not record what each column means.
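
As an illustration of the word-plus-attributes layout described above, here is a minimal sketch that splits one VRT token line into a word form and its remaining columns. It assumes the usual VRT convention of one token per line with tab-separated positional attributes. `SketchWord` and `parse_vrt_line` are hypothetical names for this example, not part of Pimlico, and the sample line is invented.

```python
from typing import NamedTuple, Tuple


class SketchWord(NamedTuple):
    """Hypothetical stand-in for a word with its annotations (not Pimlico's VRTWord)."""
    word: str
    attributes: Tuple[str, ...]


def parse_vrt_line(line: str) -> SketchWord:
    # Assumption: columns are tab-separated and the first column is the word form.
    fields = line.rstrip("\n").split("\t")
    # Everything after the first column is kept as an uninterpreted attribute,
    # since the file does not say what each column means.
    return SketchWord(fields[0], tuple(fields[1:]))


# Invented sample line following the Korp example column order.
token = parse_vrt_line("dogs\t2\tdog\tdog\tN\tNUM_Pl\t3\tnsubj\n")
```

Because the column meanings are not stored in the file, the sketch keeps the attributes as a plain tuple and leaves their interpretation to the caller.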

class VRTText(words, paragraph_ranges=[], sentence_ranges=[], opening_tag=None)

Bases: object

Contains a single VRT text (i.e. document).

Note that VRT’s structures are not hierarchical: they can overlap. See the VRT docs.

We don’t currently process structural attributes. This can easily be added later if necessary.
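
The following is a rough sketch of how a single VRT text could be read into words, paragraph ranges, sentence ranges and an opening tag, mirroring the constructor arguments above. It is not Pimlico's `from_string` implementation. The function name `sketch_parse_text`, the tag names `paragraph` and `sentence`, and the end-exclusive `(start, end)` token ranges are all assumptions made for illustration. Each structure is tracked independently, so overlapping (non-nested) structures are handled, and structural attributes on the tags are ignored.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]  # (start, end) token indices, end exclusive (assumed convention)


def sketch_parse_text(data: str):
    """Hypothetical parser for one VRT text; returns (words, paragraphs, sentences, opening_tag)."""
    words: List[List[str]] = []
    # Assumed structure names; real files may use other tags.
    ranges: Dict[str, List[Range]] = {"paragraph": [], "sentence": []}
    open_at: Dict[str, int] = {}
    opening_tag: Optional[str] = None

    for line in data.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("<") and stripped.endswith(">"):
            closing = stripped.startswith("</")
            parts = stripped.strip("<>/").split()
            name = parts[0] if parts else ""
            if name == "text" and not closing:
                opening_tag = stripped
            elif name in ranges:
                if closing:
                    start = open_at.pop(name, None)
                    if start is not None:
                        ranges[name].append((start, len(words)))
                else:
                    # Remember the index of the next token as the structure's start.
                    open_at[name] = len(words)
            continue
        # Token line: tab-separated positional attributes, word form first.
        words.append(line.split("\t"))

    return words, ranges["paragraph"], ranges["sentence"], opening_tag


sample = (
    '<text id="d1">\n<paragraph>\n<sentence>\nHello\t1\nworld\t2\n</sentence>\n'
    "<sentence>\nBye\t1\n</sentence>\n</paragraph>\n</text>\n"
)
words, paragraph_ranges, sentence_ranges, opening_tag = sketch_parse_text(sample)
```

Storing ranges over a flat token list, rather than nesting tokens inside sentence and paragraph objects, is what lets overlapping structures be represented without conflict.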

static from_string(data)
paragraphs
sentences
word_strings
class VRTDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.DataPointType

Document type for annotated text documents read in from VRT files (VeRticalized Text, as used by Korp).

formatters = [('vrt', 'pimlico.datatypes.vrt.VRTFormatter')]
process_document(doc)
class VRTFormatter(corpus)

Bases: pimlico.cli.browser.formatter.DocumentBrowserFormatter

DATATYPE

alias of VRTDocumentType

format_document(doc)

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Must be overridden by subclasses.