NIST tokenizer

Path: pimlico.modules.nltk.nist_tokenize
Executable: yes

Sentence splitting and tokenization using the NLTK NIST tokenizer.

A very simple tokenizer that is fairly language-independent and does not need a trained model. Use it if you just need rudimentary tokenization (though it is more sophisticated than simple_tokenize).
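
Under the hood this corresponds to NLTK's NISTTokenizer. The following is a minimal sketch of the underlying tokenization of a single string, assuming NLTK (and its perluniprops data package) is installed; the Pimlico module itself takes care of iterating over the corpus and storing the tokenized output.

# Minimal sketch of the underlying NLTK call (not Pimlico code).
# May require the NLTK data package: nltk.download("perluniprops")
from nltk.tokenize.nist import NISTTokenizer

tokenizer = NISTTokenizer()
text = "This tokenizer needs no trained model, so it works out of the box."
tokens = tokenizer.tokenize(text)   # returns a list of token strings
print(tokens)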

Inputs

Name   Type(s)
text   grouped_corpus <RawTextDocumentType>

Outputs

Name        Type(s)
documents   grouped_corpus <TokenizedDocumentType>

Options

Name           Description                                                                                   Type
lowercase      Lowercase all output. Default: False                                                          bool
non_european   Use the tokenizer's international_tokenize() method instead of tokenize(). Default: False     bool
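
These options map onto arguments of the underlying NLTK tokenizer: lowercase corresponds to the lowercase keyword argument, and non_european switches from tokenize() to international_tokenize(). A sketch of the equivalent NLTK calls, assuming the standard NISTTokenizer API:

from nltk.tokenize.nist import NISTTokenizer

tokenizer = NISTTokenizer()
text = "Example text, possibly in a non-European script."

# lowercase=T in the module config roughly corresponds to:
tokens = tokenizer.tokenize(text, lowercase=True)

# non_european=T uses international_tokenize() instead:
intl_tokens = tokenizer.international_tokenize(text, lowercase=True)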

Example config

This is an example of how this module can be used in a pipeline config file.

[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output

This example usage includes more options.

[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output
lowercase=F
non_european=F

Test pipelines

This module is used by the following test pipelines, which are a further source of examples of the module's usage.