Simple tokenization

Path: pimlico.modules.text.simple_tokenize
Executable: yes

Tokenize raw text using simple splitting.

This is useful either when you don't care about the quality of the tokenization and just want to test something quickly, or when the text is actually already tokenized but is stored as a raw text datatype.

If you want to do proper tokenization, consider either the CoreNLP or OpenNLP core modules.
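To make the behaviour concrete, the snippet below is a minimal sketch of what this kind of simple splitting amounts to, assuming one sentence per line of raw text. The function name and structure are illustrative only and are not the module's actual implementation or API.

# Illustrative sketch, not the module's own code: split each line of a
# document's raw text on a separator string, giving one token list per line.
def simple_tokenize(text, splitter=" "):
    return [line.split(splitter) for line in text.splitlines()]

print(simple_tokenize("The cat sat on the mat.\nSo did the dog."))
# [['The', 'cat', 'sat', 'on', 'the', 'mat.'], ['So', 'did', 'the', 'dog.']]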

Inputs

Name Type(s)
corpus grouped_corpus <TextDocumentType>

Outputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Options

Name       Description                                        Type
splitter   Character or string to split on. Default: space    string
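The splitter option simply changes the separator passed to the split. The hedged snippet below shows the effect of a non-default separator (a pipe character, chosen purely for illustration).

line = "already|tokenized|text"

# Default behaviour: split on a single space (no spaces here, so one "token")
print(line.split(" "))    # ['already|tokenized|text']

# With the splitter set to "|"
print(line.split("|"))    # ['already', 'tokenized', 'text']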

Example config

This is an example of how this module can be used in a pipeline config file.

[my_simple_tokenize_module]
type=pimlico.modules.text.simple_tokenize
input_corpus=module_a.some_output

This example usage includes more options.

[my_simple_tokenize_module]
type=pimlico.modules.text.simple_tokenize
input_corpus=module_a.some_output
splitter=

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.