Simple tokenization

Path: pimlico.modules.text.simple_tokenize
Executable: yes

Tokenize raw text using simple splitting.

This is useful either when the quality of the tokenization does not matter and you just want to test something quickly, or when the text is already tokenized but stored as a raw text datatype.

If you want to do proper tokenization, consider either the CoreNLP or OpenNLP core modules.


Inputs

Name    Type(s)
corpus  TarredCorpus<TextDocumentType>


Outputs

Name    Type(s)
corpus  TokenizedDocumentTypeTarredCorpus


Options

Name      Description                                       Type
splitter  Character or string to split on. Default: space  <type 'unicode'>
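
The effect of the splitter option can be pictured with a short plain-Python sketch. This is not the module's actual code: the function name simple_tokenize, the treatment of each input line as one sentence, and the dropping of empty tokens are all assumptions made for illustration.

```python
def simple_tokenize(text, splitter=" "):
    """Split each line of `text` on `splitter`.

    Hypothetical helper for illustration only. Assumptions: one sentence
    per line, and empty tokens (from repeated splitters) are discarded.
    """
    return [
        [token for token in line.split(splitter) if token]
        for line in text.splitlines()
    ]


# Text that is already tokenized, but stored as a raw string
doc = "The cat sat .\nIt purred ."
print(simple_tokenize(doc))
# [['The', 'cat', 'sat', '.'], ['It', 'purred', '.']]
```

Passing a different splitter, for example simple_tokenize("a,b,c", splitter=","), splits on that string instead of on spaces. Splitting like this does not separate punctuation from words, which is why it only suits quick tests or text that is already tokenized.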