NLTK NIST tokenizer

Path pimlico.modules.nltk.nist_tokenize
Executable yes

Sentence splitting and tokenization using the NLTK NIST tokenizer.

Inputs

Name Type(s)
text TarredCorpus<RawTextDocumentType>

Outputs

Name Type(s)
documents TokenizedCorpus

Options

Name          Type  Description
lowercase     bool  Lowercase all output. Default: False
non_european  bool  Use the tokenizer’s international_tokenize() method instead of tokenize(). Default: False
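
As a rough illustration of how the inputs and options above might be set, here is a minimal pipeline config sketch. It assumes Pimlico's usual section-per-module config layout; the section name tokenize and the upstream module name my_corpus are hypothetical, and the T spelling for boolean values is an assumption that should be checked against your Pimlico version.

```ini
# Hypothetical pipeline fragment. "my_corpus" stands in for an upstream
# module that outputs a TarredCorpus<RawTextDocumentType>.
[tokenize]
type=pimlico.modules.nltk.nist_tokenize
# Connect this module's "text" input to the upstream module's output
input_text=my_corpus
# Both options default to False, so set only the ones you want to change
lowercase=T
non_european=T
```

With non_european enabled, the module calls international_tokenize() rather than tokenize(), which is likely the better choice when the corpus contains a lot of non-ASCII text.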