Tokenizer

Path: pimlico.modules.opennlp.tokenize
Executable: yes

Sentence splitting and tokenization using OpenNLP’s tools.

Sentence splitting may be skipped by setting the option tokenize_only=T. The tokenizer then assumes that each line of the input document is a single sentence and tokenizes each line separately.
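
For instance, a pipeline whose input already contains one sentence per line could configure the module as follows (the module and input names are illustrative, mirroring the example config further down):

[my_line_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
tokenize_only=T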

Inputs

Name Type(s)
text grouped_corpus <TextDocumentType>

Outputs

Name Type(s)
documents grouped_corpus <TokenizedDocumentType>

Options

Name Description Type
sentence_model Sentence segmentation model. Specify a full path, or just a filename. If only a filename is given, it is expected to be found in the OpenNLP model directory (models/opennlp/); a full-path example is given under Example config below. string
token_model Tokenization model. Specify a full path, or just a filename. If only a filename is given, it is expected to be found in the OpenNLP model directory (models/opennlp/). string
tokenize_only By default, sentence splitting is performed before tokenization. If tokenize_only is set, only the tokenization step is executed. bool

Example config

This is an example of how this module can be used in a pipeline config file.

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output

This example includes more of the available options.

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=en-sent.bin
token_model=en-token.bin
tokenize_only=F
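
If the models are not stored in the default models/opennlp/ directory, full paths can be given in place of the bare filenames. A minimal sketch, assuming hypothetical model locations:

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=/data/opennlp-models/en-sent.bin
token_model=/data/opennlp-models/en-token.bin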

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.