Normalize raw text

Path pimlico.modules.text.text_normalize
Executable yes

Text normalization for raw text documents.

Inputs

Name Type(s)
corpus TarredCorpus<TextDocumentType>

Outputs

Name Type(s)
corpus RawTextTarredCorpus

Options

Name          Description                                          Type
case          Transform all text to upper or lower case. Choose    ‘upper’, ‘lower’ or ‘’
              ‘upper’ or ‘lower’, or leave blank to perform no
              case transformation.
blank_lines   Remove all blank lines (after whitespace stripping,  bool
              if requested).
strip         Strip whitespace from the start and end of lines.    bool
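
How the three options combine can be illustrated with a minimal Python sketch. This is not Pimlico's own implementation: the function name normalize_text and its signature are hypothetical, and the order of the case and strip steps is an assumption. The only ordering taken from the option descriptions is that blank lines are removed after whitespace stripping. The sketch also assumes that, when strip is off, only completely empty lines count as blank.

```python
def normalize_text(text, case="", strip=False, blank_lines=False):
    """Hypothetical helper mirroring the module's options (not Pimlico code)."""
    if case not in ("upper", "lower", ""):
        raise ValueError("case must be 'upper', 'lower' or ''")

    # Case transformation; an empty string leaves the text unchanged.
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()

    lines = text.split("\n")

    # Strip whitespace from the start and end of every line.
    if strip:
        lines = [line.strip() for line in lines]

    # Drop blank lines after stripping, so lines that held only
    # whitespace are removed too when strip is enabled.
    # Assumption: without strip, whitespace-only lines are kept.
    if blank_lines:
        lines = [line for line in lines if line]

    return "\n".join(lines)


raw = "  Hello World  \n\n   \nSecond LINE\n"
print(normalize_text(raw, case="lower", strip=True, blank_lines=True))
```

With all three options enabled, the example input reduces to two lower-cased, stripped lines. With blank_lines alone, the whitespace-only third line would survive under the assumption stated above.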