Normalize tokenized text

Path pimlico.modules.text.normalize
Executable yes

Perform text normalization on tokenized documents.

Currently, this includes only case normalization (to upper or lower case). In the future, more normalization operations may be added.

Inputs

Name Type(s)
corpus TarredCorpus<TokenizedDocumentType>

Outputs

Name Type(s)
corpus TokenizedCorpus

Options

Name Description Type
case Transform all text to upper or lower case. Choose from ‘upper’ or ‘lower’, or leave blank to not perform transformation ‘upper’, ‘lower’ or ‘’