Simple tokenization

Path	pimlico.modules.text.simple_tokenize
Executable	yes

Tokenize raw text using simple splitting.

This is useful in two cases: when you are not concerned about the quality of the tokenization and just want to test something quickly, or when the text is already tokenized but is stored as a raw text datatype.

If you want to do proper tokenization, consider either the CoreNLP or OpenNLP core modules.
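
To illustrate the two use cases above, here is a minimal sketch of simple splitting in plain Python. The helper name `simple_split` is hypothetical and not part of Pimlico, and splitting line by line is an assumption; the module's actual implementation may differ.

```python
def simple_split(text, splitter=" "):
    """Split each line of raw text on `splitter`.

    Hypothetical helper for illustration only, not Pimlico's API.
    """
    return [line.split(splitter) for line in text.splitlines()]


# Raw, untokenized text: punctuation stays attached to neighbouring words,
# which is why this approach is only suitable for quick tests.
raw = "Hello, world! This is a test."
raw_tokens = simple_split(raw)

# Text that is already tokenized but stored as raw text:
# splitting on spaces recovers the original tokens.
pretokenized = "Hello , world !\nThis is a test ."
pre_tokens = simple_split(pretokenized)
```

In the first case the result contains tokens such as `Hello,` and `test.`, with punctuation still attached. In the second case every punctuation mark comes out as its own token, because the input was already separated by spaces.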

Inputs

Name	Type(s)
corpus	TarredCorpus<TextDocumentType>

Outputs

Name	Type(s)
corpus	`TokenizedDocumentTypeTarredCorpus`

Options

Name	Description	Type
splitter	Character or string to split on. Default: a single space	unicode string
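
The effect of choosing a different splitter can be sketched as follows. This assumes the option behaves like Python's `str.split` with an explicit separator, which the documentation above does not confirm; `split_on` is a hypothetical stand-in, not a Pimlico function.

```python
def split_on(text, splitter=" "):
    # Hypothetical stand-in for the module's behaviour (an assumption):
    # plain str.split with an explicit separator string.
    return text.split(splitter)


# A single-character splitter other than space
tab_tokens = split_on("the\tcat\tsat", splitter="\t")

# A multi-character string also works as a splitter
multi_tokens = split_on("the||cat||sat", splitter="||")

# With an explicit separator, consecutive separators produce empty tokens
double_space_tokens = split_on("the  cat", splitter=" ")
```

If the module does split this way, input containing repeated spaces would yield empty tokens, so it may be worth normalizing whitespace beforehand.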