Tokenizer
Path | pimlico.modules.opennlp.tokenize |
---|---|
Executable | yes |
Sentence splitting and tokenization using OpenNLP’s tools.
Sentence splitting may be skipped by setting the option tokenize_only=T. The tokenizer then assumes that each line of the input document is a single sentence and tokenizes within lines.
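For instance, input that already has one sentence per line could be tokenized without sentence splitting using a config like the following (the module name and input are illustrative):

```ini
[my_line_tokenizer]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
tokenize_only=T
```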
Inputs
Name | Type(s) |
---|---|
text | grouped_corpus <TextDocumentType> |
Outputs
Name | Type(s) |
---|---|
documents | grouped_corpus <TokenizedDocumentType> |
Options
Name | Description | Type |
---|---|---|
sentence_model | Sentence segmentation model. Specify a full path, or just a filename. If a filename is given, it is expected to be in the OpenNLP model directory (models/opennlp/). | string |
token_model | Tokenization model. Specify a full path, or just a filename. If a filename is given, it is expected to be in the OpenNLP model directory (models/opennlp/). | string |
tokenize_only | By default, sentence splitting is performed prior to tokenization. If tokenize_only is set, only the tokenization step is executed. | bool |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
This example usage includes more options.
[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=en-sent.bin
token_model=en-token.bin
tokenize_only=F
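To clarify the difference the tokenize_only option makes, here is a simplified Python sketch of the two modes. It is illustrative only: the function name is hypothetical, and plain whitespace splitting stands in for OpenNLP's trained sentence and tokenizer models.

```python
def tokenize_document(text, tokenize_only=False):
    """Illustrative stand-in for the module's two modes.

    tokenize_only=False: the whole text is treated as running prose
    that must first be split into sentences (here, naively on '. ').
    tokenize_only=True: each input line is assumed to already be one
    sentence, so only tokenization is applied within each line.
    Real OpenNLP models, not whitespace splitting, do this in Pimlico.
    """
    if tokenize_only:
        sentences = [line for line in text.splitlines() if line.strip()]
    else:
        running = text.replace("\n", " ")
        sentences = [s for s in running.split(". ") if s.strip()]
    # Tokenize within each sentence (naive whitespace tokenization)
    return [sentence.split() for sentence in sentences]

doc = "One sentence per line here\nAnother pre-split sentence"
print(tokenize_document(doc, tokenize_only=True))
```

With tokenize_only=True the two lines above become two token lists; with the default setting the same text would first be re-joined and passed through sentence splitting.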
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.