Tokenized text to text

Path pimlico.modules.text.untokenize
Executable yes

Filter to take tokenized text and join it together to make raw text.

This module shouldn’t be necessary and will be removed later. For the time being, it’s here as a workaround for the problem described below, until that is solved in the datatype redesign.

Tokenized text is a subtype of text, so in theory it should be acceptable to modules that expect plain text, and typechecking does treat it as acceptable. In practice it provides an incompatible data structure, so modules that consume it as plain text fail or misbehave.
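The mismatch can be illustrated with a minimal sketch. It assumes that a tokenized document is held as a list of sentences, each a list of token strings, while a plain-text document is a single string. That representation is an assumption made for illustration, and the `count_characters` function is hypothetical rather than part of Pimlico.

```python
def count_characters(text):
    """Hypothetical consumer that expects a plain-text document (one string)."""
    return len(text.replace("\n", ""))


# A plain-text document: a single string.
plain_doc = "Hello world .\nSecond sentence ."

# Assumed shape of a tokenized document: a list of sentences,
# each of which is a list of tokens.
tokenized_doc = [["Hello", "world", "."], ["Second", "sentence", "."]]

# Works as intended on plain text.
plain_length = count_characters(plain_doc)

# The tokenized document may pass a type check, but it is not a string,
# so string methods are unavailable and the call fails.
try:
    count_characters(tokenized_doc)
    failure = None
except AttributeError as err:
    failure = str(err)
```

Joining the tokens back into a string before passing the document on avoids this kind of failure, which is what this module is for.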


Inputs

| Name | Type(s) |
|------|---------|
| corpus | `TarredCorpus<TokenizedDocumentType>` |


Outputs

| Name | Type(s) |
|------|---------|
| corpus | `TextDocumentTypeTarredCorpus` |


Options

| Name | Description | Type |
|------|-------------|------|
| sentence_joiner | String to join lines/sentences on. (Default: linebreak) | `<type 'unicode'>` |
| joiner | String to join words on. (Default: space) | `<type 'unicode'>` |
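The effect of the two joiner options can be sketched as follows. This is not the module's actual implementation: the `untokenize` function is a hypothetical stand-in, and it assumes a tokenized document is a list of sentences, each a list of token strings. The defaults mirror the ones listed above (space between words, linebreak between sentences).

```python
def untokenize(sentences, joiner=" ", sentence_joiner="\n"):
    """Join tokens into sentences, then sentences into one text string.

    Hypothetical helper: `sentences` is assumed to be a list of lists of
    token strings.
    """
    return sentence_joiner.join(joiner.join(tokens) for tokens in sentences)


doc = [["Hello", "world", "."], ["Second", "sentence", "."]]

# Defaults: words joined on a space, sentences on a linebreak.
default_text = untokenize(doc)

# Custom joiners, e.g. one sentence per paragraph with underscore-joined words.
custom_text = untokenize(doc, joiner="_", sentence_joiner="\n\n")
```

Note that joining on a space does not undo tokenization exactly: punctuation stays separated from the preceding word, as in `Hello world .`.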