CAEVO output convertor

Path pimlico.modules.caevo.output
Executable yes

Tool to split up the output from Caevo and convert it to other datatypes.

Caevo output includes the output from a number of NLP tools that it runs as prerequisites to event extraction, etc. The individual parts of the output can easily be retrieved from the output corpus via the output datatype. To be used as input to other modules, they need to be converted to compatible standard datatypes.

For example, tokenization output is stored in Caevo’s XML output using a special format. Instead of writing other modules so that they can pull this information out of the :class:`~pimlico.datatypes.CaevoCorpus`, you can filter the output using this module to provide a :class:`~pimlico.datatypes.TokenizedCorpus`, which is a standard format for input to other module types.

As with other document map modules, you can use this as a filter (filter=T), so you don't actually need to commit the converted data to disk.
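
As a rough illustration, a pipeline config section that uses this module as a filter might look like the fragment embedded below. Only the module path and ``filter=T`` come from this page. The section name ``tokens``, the upstream module name ``caevo`` and the ``type``/``input_documents`` key names are assumptions based on the usual Pimlico config layout, so check them against your own pipeline. The fragment is parsed with Python's standard ``configparser`` purely to show which keys it sets.

```python
import configparser

# Hypothetical pipeline fragment. "tokens" and "caevo" are illustrative names:
# "caevo" stands for an earlier module (not shown) that outputs a CaevoCorpus.
# The "type" and "input_documents" keys are assumed, not taken from this page.
PIPELINE_FRAGMENT = """
[tokens]
type=pimlico.modules.caevo.output
input_documents=caevo
filter=T
"""

parser = configparser.ConfigParser()
parser.read_string(PIPELINE_FRAGMENT)

# Collect the options of the converter section into a plain dict
module = dict(parser["tokens"])
for key, value in module.items():
    print(f"{key} = {value}")
```

With ``filter=T`` set, a later module that takes the ``tokenized`` output as its input would receive the converted documents on the fly rather than reading them from disk.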

Todo

Add more output convertors: currently only provides tokenization

Inputs

Name Type(s)
documents CaevoCorpus

Outputs

No non-optional outputs

Optional

Name Type(s)
tokenized TokenizedCorpus
parse ConstituencyParseTreeCorpus
pos WordAnnotationCorpusWithWordAndPos

Options

Name Description Type
gzip If True, each output, except annotations, for each document is gzipped. This can help reduce the storage occupied by e.g. parser or coref output. Default: False bool
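
To give a feel for why gzipping can help with parser output, the sketch below compresses a made-up, highly repetitive bracketed parse with Python's standard ``gzip`` module and compares sizes. It does not reproduce how Pimlico stores documents internally; the sample text and variable names are invented for illustration.

```python
import gzip

# Made-up parser output for one document: bracketed trees repeat the same
# labels over and over, which is what makes them compress well.
parse_output = "\n".join(
    "(ROOT (S (NP (DT The) (NN dog)) (VP (VBD barked)) (. .)))"
    for _ in range(200)
)

raw = parse_output.encode("utf-8")
compressed = gzip.compress(raw)

# Report sizes in bytes before and after compression
print(len(raw), len(compressed))

# Compression is lossless: the original bytes come back unchanged
restored = gzip.decompress(compressed)
```

Real parser or coref output is less uniform than this sample, so the saving will be smaller in practice.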