NIST tokenizer
| Path | pimlico.modules.nltk.nist_tokenize |
| Executable | yes |
Sentence splitting and tokenization using the NLTK NIST tokenizer.
A very simple tokenizer that’s fairly language-independent and doesn’t need
a trained model. Use this if you only need rudimentary tokenization
(it is still more sophisticated than simple_tokenize).
Inputs
| Name | Type(s) |
|---|---|
| text | grouped_corpus <RawTextDocumentType> |
Outputs
| Name | Type(s) |
|---|---|
| documents | grouped_corpus <TokenizedDocumentType> |
Options
| Name | Description | Type |
|---|---|---|
| lowercase | Lowercase all output. Default: False | bool |
| non_european | Use the tokenizer’s international_tokenize() method instead of tokenize(). Default: False | bool |
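To illustrate how the two options relate to the underlying NLTK calls, here is a minimal sketch. It is not the module's actual implementation: the helper name `nist_tokenize` and its `tokenizer` parameter are hypothetical, introduced only for illustration. It assumes NLTK's `NISTTokenizer`, whose `tokenize()` and `international_tokenize()` methods both accept a `lowercase` argument, and it tokenizes a single string without any sentence splitting.

```python
def nist_tokenize(text, lowercase=False, non_european=False, tokenizer=None):
    """Tokenize one string, mirroring the module's two options.

    Hypothetical helper for illustration only; not part of Pimlico.
    """
    if tokenizer is None:
        # Imported lazily so the helper can be defined without NLTK installed.
        # NLTK's NIST tokenizer also needs the "perluniprops" data package,
        # which can be fetched with nltk.download("perluniprops").
        from nltk.tokenize.nist import NISTTokenizer
        tokenizer = NISTTokenizer()

    # non_european switches from tokenize() to international_tokenize()
    if non_european:
        method = tokenizer.international_tokenize
    else:
        method = tokenizer.tokenize

    # lowercase is passed straight through to the chosen method
    return method(text, lowercase=lowercase)
```

With NLTK and its data installed, `nist_tokenize("Good muffins cost $3.88 in New York.", lowercase=True)` would return a list of lowercased token strings. Passing `non_european=True` routes the same text through `international_tokenize()` instead.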
Example config
This is an example of how this module can be used in a pipeline config file.
[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output
This example usage includes more options.
[my_nltk_nist_tokenizer_module]
type=pimlico.modules.nltk.nist_tokenize
input_text=module_a.some_output
lowercase=F
non_european=F
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.