NLTK NIST tokenizer
Path | pimlico.modules.nltk.nist_tokenize |
Executable | yes |
Sentence splitting and tokenization using the NLTK NIST tokenizer.
Inputs
Name | Type(s) |
---|---|
text | TarredCorpus<RawTextDocumentType> |
Outputs
Name | Type(s) |
---|---|
documents | TokenizedCorpus |
Options
Name | Description | Type |
---|---|---|
lowercase | Lowercase all output. Default: False | bool |
non_european | Use the tokenizer’s international_tokenize() method instead of tokenize(). Default: False | bool |
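As a rough illustration of how the two options relate to the underlying tokenizer, the sketch below dispatches between tokenize() and international_tokenize() and passes lowercase through. It is not the module's actual implementation: the helper names tokenize_sentences and make_nist_tokenizer are hypothetical, the input is assumed to be already split into sentences, and the tokenizer object is assumed to behave like nltk.tokenize.nist.NISTTokenizer, whose two methods both accept a lowercase keyword argument.

```python
def tokenize_sentences(tokenizer, sentences, lowercase=False, non_european=False):
    """Tokenize pre-split sentences, mirroring the module's two options.

    `tokenizer` is assumed to provide tokenize() and international_tokenize(),
    both accepting a `lowercase` keyword, as NLTK's NISTTokenizer does.
    This helper is hypothetical and only illustrates the option semantics.
    """
    # non_european=True selects international_tokenize() instead of tokenize()
    if non_european:
        method = tokenizer.international_tokenize
    else:
        method = tokenizer.tokenize
    # lowercase is handed straight to the chosen tokenizer method
    return [method(sentence, lowercase=lowercase) for sentence in sentences]


def make_nist_tokenizer():
    """Build the real NLTK tokenizer (hypothetical convenience wrapper)."""
    # Imported lazily so the helper above can be used without NLTK installed.
    # Assumption: NLTK is installed, along with any data it needs
    # (the NIST tokenizer may require the 'perluniprops' data package).
    from nltk.tokenize.nist import NISTTokenizer
    return NISTTokenizer()
```

With NLTK available, a call such as tokenize_sentences(make_nist_tokenizer(), ["Good muffins cost $3.88 in New York."], lowercase=True) would return one token list per sentence; the exact tokens depend on the installed NLTK version.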