Constituency parser

Path pimlico.modules.opennlp.parse
Executable yes

Constituency parsing using OpenNLP’s tools.

We run OpenNLP in the background using a Py4J wrapper, just as with the other OpenNLP wrappers.

The output format is not yet ideal: currently we produce documents consisting of a list of strings, each giving the OpenNLP tree output for a sentence. It would be better to use a standard constituency tree datatype that can be used generically as input to any module that requires tree input. For now, if you write a module that takes input from the parser, it will need to process the strings from the OpenNLP parser output itself.
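As a rough illustration of the string processing a downstream module might do, here is a minimal sketch that turns one tree string into nested (label, children) tuples. It assumes the strings use Penn Treebank-style bracketing, e.g. (TOP (S (NP ...))), and that tokens do not contain literal bracket characters. The function names parse_bracketed and leaves are hypothetical and are not part of Pimlico or OpenNLP.

```python
def parse_bracketed(tree_string):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Assumption: Penn Treebank-style bracketing with no literal "(" or ")"
    inside tokens. Leaves are plain strings in the children list.
    """
    # Pad brackets with spaces so a plain split() tokenizes the string
    tokens = tree_string.replace("(", " ( ").replace(")", " ) ").split()
    stack = []            # nodes that have been opened but not yet closed
    root = None
    expect_label = False  # True right after an opening bracket

    for token in tokens:
        if token == "(":
            if expect_label:
                # Two opening brackets in a row: the outer node has no label
                stack.append(("", []))
            expect_label = True
        elif token == ")":
            if expect_label or not stack:
                raise ValueError("malformed tree string: %r" % tree_string)
            node = stack.pop()
            if stack:
                # Attach the finished node to its parent
                stack[-1][1].append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("more than one root in: %r" % tree_string)
        elif expect_label:
            # First token after "(" is the constituent or POS label
            stack.append((token, []))
            expect_label = False
        else:
            if not stack:
                raise ValueError("token outside any bracket: %r" % token)
            # Any other token is a word under the current node
            stack[-1][1].append(token)

    if stack or expect_label or root is None:
        raise ValueError("unbalanced brackets in: %r" % tree_string)
    return root


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    label, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


example = "(TOP (S (NP (DT The) (NN cat)) (VP (VBD sat))))"
tree = parse_bracketed(example)
print(tree[0])       # label of the root node
print(leaves(tree))  # words of the sentence
```

A module consuming the trees output would apply something like this to each string in a document's list, one string per sentence. An iterative stack is used rather than recursion in the parser so that deeply nested trees do not hit the recursion limit.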

Inputs

Name Type(s)
documents grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
trees grouped_corpus <OpenNLPTreeStringsDocumentType>

Options

Name Description Type
model Parser model, full path or directory name. If a filename is given, it is expected to be in the OpenNLP model directory (models/opennlp/) string

Example config

This is an example of how this module can be used in a pipeline config file.

[my_opennlp_parser_module]
type=pimlico.modules.opennlp.parse
input_documents=module_a.some_output

This example usage includes more options.

[my_opennlp_parser_module]
type=pimlico.modules.opennlp.parse
input_documents=module_a.some_output
model=en-parser-chunking.bin

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.