Tokenizer

Path: pimlico.modules.opennlp.tokenize
Executable: yes

Sentence splitting and tokenization using OpenNLP’s tools.

Sentence splitting may be skipped by setting the option tokenize_only=T. The tokenizer then assumes that each line of the input document is a single sentence and tokenizes each line separately.
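
For instance, a pipeline whose input already contains one sentence per line could configure the module as follows (the module and input names are illustrative, mirroring the example config further down):

[my_line_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
tokenize_only=T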

Inputs

Name Type(s)
text grouped_corpus <TextDocumentType>

Outputs

Name Type(s)
documents grouped_corpus <TokenizedDocumentType>

Options

Name Description Type
sentence_model Sentence segmentation model. Specify a full path, or just a filename. If only a filename is given, it is expected to be found in the OpenNLP model directory (models/opennlp/); a full-path example is given under Example config below. string
token_model Tokenization model. Specify a full path, or just a filename. If only a filename is given, it is expected to be found in the OpenNLP model directory (models/opennlp/). string
tokenize_only By default, sentence splitting is performed before tokenization. If tokenize_only is set, only the tokenization step is executed. bool

Example config

This is an example of how this module can be used in a pipeline config file.

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output

This example includes more of the available options.

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=en-sent.bin
token_model=en-token.bin
tokenize_only=F
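
If the models are not stored in the default models/opennlp/ directory, full paths can be given in place of the bare filenames. A minimal sketch, assuming hypothetical model locations:

[my_opennlp_tokenizer_module]
type=pimlico.modules.opennlp.tokenize
input_text=module_a.some_output
sentence_model=/data/opennlp-models/en-sent.bin
token_model=/data/opennlp-models/en-token.bin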

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.