OpenNLP tokenizer

Path pimlico.modules.opennlp.tokenize
Executable yes

Sentence splitting and tokenization using OpenNLP’s tools.

Inputs

Name Type(s)
text TarredCorpus<RawTextDocumentType>

Outputs

Name Type(s)
documents TokenizedCorpus

Options

Name Description Type
token_model Tokenization model. Specify a full path, or just a filename. If a filename is given it is expected to be in the opennlp model directory (models/opennlp/) string
tokenize_only By default, sentence splitting is performed prior to tokenization. If tokenize_only is set, only the tokenization step is executed bool
sentence_model Sentence segmentation model. Specify a full path, or just a filename. If a filename is given it is expected to be in the opennlp model directory (models/opennlp/) string