OpenNLP tokenizer
| Property | Value |
|---|---|
| Path | pimlico.modules.opennlp.tokenize |
| Executable | yes |
Sentence splitting and tokenization using OpenNLP’s tools.
Sentence splitting may be skipped by setting the option tokenize_only=T. The tokenizer then assumes that each line of the input file is a single sentence and tokenizes each line separately.
Inputs
| Name | Type(s) |
|---|---|
| text | grouped_corpus <TextDocumentType> |
Outputs
| Name | Type(s) |
|---|---|
| documents | grouped_corpus <TokenizedDocumentType> |
Options
| Name | Description | Type |
|---|---|---|
| sentence_model | Sentence segmentation model. Specify either a full path or just a filename. A bare filename is expected to be found in the OpenNLP model directory (models/opennlp/). | string |
| token_model | Tokenization model. Specify either a full path or just a filename. A bare filename is expected to be found in the OpenNLP model directory (models/opennlp/). | string |
| tokenize_only | By default, sentence splitting is performed before tokenization. If tokenize_only is set, only the tokenization step is executed. | bool |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
This example usage includes more options.
[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=en-sent.bin
token_model=en-token.bin
tokenize_only=F
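As a further illustration, the following sketch shows how the tokenize_only option described above might be used when the input already contains one sentence per line. The module name, the input reference and the model filename are placeholders carried over from the examples above, not values required by the module.

```ini
[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
token_model=en-token.bin
# Skip sentence splitting: each input line is treated as one sentence
tokenize_only=T
```

Since sentence splitting is not performed in this mode, sentence_model is left out here, on the assumption that it is only needed for the sentence splitting step.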
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.