Normalize raw text
Path | pimlico.modules.text.text_normalize |
Executable | yes |
Text normalization for raw text documents.
Inputs
Name | Type(s) |
---|---|
corpus | TarredCorpus<TextDocumentType> |
Outputs
Name | Type(s) |
---|---|
corpus | RawTextTarredCorpus |
Options
Name | Description | Type |
---|---|---|
case | Transform all text to upper or lower case. Choose ‘upper’ or ‘lower’, or leave blank to skip case transformation | ‘upper’, ‘lower’ or ‘’ |
blank_lines | Remove all blank lines (after whitespace stripping, if requested) | bool |
strip | Strip whitespace from the start and end of lines | bool |
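
To illustrate how the three options combine, here is a minimal standalone sketch in Python. It is not the module's actual implementation: the function name `normalize_text` and its signature are hypothetical, and the point at which case conversion is applied is an assumption. The only ordering taken from the table above is that blank-line removal happens after whitespace stripping. The sketch also assumes that, without `strip`, a line containing only whitespace does not count as blank.

```python
def normalize_text(text, case="", strip=False, blank_lines=False):
    """Hypothetical illustration of the case, strip and blank_lines options."""
    if case not in ("upper", "lower", ""):
        raise ValueError("case must be 'upper', 'lower' or ''")

    # Case transformation (its position in the pipeline is an assumption)
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()

    lines = text.split("\n")

    # Strip whitespace from the start and end of every line
    if strip:
        lines = [line.strip() for line in lines]

    # Remove blank lines; this runs after stripping, so with strip=True
    # whitespace-only lines are removed as well
    if blank_lines:
        lines = [line for line in lines if line]

    return "\n".join(lines)


if __name__ == "__main__":
    raw = "  Hello \n\n  World  "
    print(normalize_text(raw, case="lower", strip=True, blank_lines=True))
    # prints:
    # hello
    # world
```

With `strip` left off, the same call would keep the leading and trailing spaces and remove only the empty middle line.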