Simple tokenization

Path	pimlico.modules.text.simple_tokenize
Executable	yes

Tokenize raw text using simple splitting.

This is useful either when the quality of the tokenization does not matter and you just want to test something quickly, or when the text is already tokenized but stored as a raw text datatype.

If you want to do proper tokenization, consider either the CoreNLP or OpenNLP core modules.
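To make the trade-off concrete, here is a minimal sketch of what splitting-based tokenization amounts to. It is not Pimlico's implementation: the function name `simple_tokenize` and the line-by-line treatment of the text are assumptions made purely for illustration.

```python
def simple_tokenize(text, splitter=" "):
    """Split each line of raw text on `splitter`.

    Hypothetical helper for illustration only, not part of Pimlico's API.
    """
    # One list of tokens per input line; punctuation is not separated from words
    return [line.split(splitter) for line in text.splitlines()]


sentences = simple_tokenize("Hello, world!\nIt's already tokenized .")
print(sentences)
# [['Hello,', 'world!'], ["It's", 'already', 'tokenized', '.']]
```

In the first line the punctuation stays attached to the words (`Hello,` and `world!`), which is the kind of result a dedicated tokenizer avoids. The second line was already space-separated, so plain splitting recovers its tokens unchanged.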

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`TextDocumentType`>

Outputs

Name	Type(s)
corpus	`grouped_corpus` <`TokenizedDocumentType`>

Options

Name	Description	Type
splitter	Character or string to split on. Default: space	string
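The effect of choosing a different `splitter` can be sketched with Python's built-in `str.split`. The helper `split_tokens` below is hypothetical, and it is an assumption that the module treats repeated separators the same way as `str.split` does.

```python
def split_tokens(line, splitter=" "):
    # Hypothetical helper mirroring the `splitter` option; default is a single space
    return line.split(splitter)


line = "the\tcat  sat"

by_space = split_tokens(line)            # split on the default, a single space
by_tab = split_tokens(line, "\t")        # split on a tab character instead
by_string = split_tokens("a||b||c", "||")  # a multi-character string also works

print(by_space)   # ['the\tcat', '', 'sat']
print(by_tab)     # ['the', 'cat  sat']
print(by_string)  # ['a', 'b', 'c']
```

Note that splitting on an explicit separator produces an empty token wherever two separators are adjacent, as with the double space above. If the input may contain such runs, it is worth normalizing the whitespace before tokenizing.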

Example config

This is an example of how this module can be used in a pipeline config file.

[my_simple_tokenize_module]
type=pimlico.modules.text.simple_tokenize
input_corpus=module_a.some_output

This example usage includes more options. The `splitter` value shown appears to be the default, a single space, which is why it looks empty.

[my_simple_tokenize_module]
type=pimlico.modules.text.simple_tokenize
input_corpus=module_a.some_output
splitter=

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

simple_tokenize