NIST tokenizer

Path: pimlico.modules.nltk.nist_tokenize
Executable: yes

Sentence splitting and tokenization using the NLTK NIST tokenizer.

A very simple tokenizer that is fairly language-independent and does not need a trained model. Use it if you just need rudimentary tokenization (though it is more sophisticated than simple_tokenize).
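
Under the hood this corresponds to NLTK's NISTTokenizer. The following is a minimal sketch of the underlying tokenization of a single string, assuming NLTK (and its perluniprops data package) is installed; the Pimlico module itself takes care of iterating over the corpus and storing the tokenized output.

# Minimal sketch of the underlying NLTK call (not Pimlico code).
# May require the NLTK data package: nltk.download("perluniprops")
from nltk.tokenize.nist import NISTTokenizer

tokenizer = NISTTokenizer()
text = "This tokenizer needs no trained model, so it works out of the box."
tokens = tokenizer.tokenize(text)   # returns a list of token strings
print(tokens)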

Inputs

Name   Type(s)
text   grouped_corpus <RawTextDocumentType>

Outputs

Name        Type(s)
documents   grouped_corpus <TokenizedDocumentType>

Options

Name           Description                                                                                   Type
lowercase      Lowercase all output. Default: False                                                          bool
non_european   Use the tokenizer's international_tokenize() method instead of tokenize(). Default: False     bool
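
These options map onto arguments of the underlying NLTK tokenizer: lowercase corresponds to the lowercase keyword argument, and non_european switches from tokenize() to international_tokenize(). A sketch of the equivalent NLTK calls, assuming the standard NISTTokenizer API:

from nltk.tokenize.nist import NISTTokenizer

tokenizer = NISTTokenizer()
text = "Example text, possibly in a non-European script."

# lowercase=T in the module config roughly corresponds to:
tokens = tokenizer.tokenize(text, lowercase=True)

# non_european=T uses international_tokenize() instead:
intl_tokens = tokenizer.international_tokenize(text, lowercase=True)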

Example config

This is an example of how this module can be used in a pipeline config file.

[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output

This example usage includes more options.

[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output
lowercase=F
non_european=F

Test pipelines

This module is used by the following test pipelines, which are a further source of examples of the module's usage.