Normalize raw text

Path pimlico.modules.text.text_normalize
Executable yes

Text normalization for raw text documents.

Similar to the normalize module, but operates on raw text rather than pre-tokenized text, so it provides a slightly different set of tools.

Inputs

Name Type(s)
corpus grouped_corpus <TextDocumentType>

Outputs

Name Type(s)
corpus grouped_corpus <RawTextDocumentType>

Options

Name Description Type
blank_lines Remove all blank lines (after whitespace stripping, if requested) bool
case Transform all text to upper or lower case. Choose ‘upper’ or ‘lower’, or leave blank to perform no transformation ‘upper’, ‘lower’ or ‘’
strip Strip whitespace from the start and end of lines bool
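
The combined effect of the three options can be illustrated with a minimal Python sketch. This is not the module's own code: the function name normalize_raw_text and its keyword arguments are hypothetical and simply mirror the option names above. The only ordering taken from the descriptions is that blank lines are removed after whitespace stripping. Applying the case transformation first, and treating only completely empty lines as blank, are assumptions made for illustration.

```python
def normalize_raw_text(text, case="", strip=False, blank_lines=False):
    """Illustrative sketch only; not Pimlico's implementation.

    The parameters mirror the documented options: case, strip, blank_lines.
    """
    # Case transformation (assumed to run first; the docs do not state an order)
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()
    elif case != "":
        raise ValueError("case must be 'upper', 'lower' or ''")

    lines = text.split("\n")

    # Strip whitespace from the start and end of each line
    if strip:
        lines = [line.strip() for line in lines]

    # Remove blank lines after stripping, as the option description specifies.
    # Assumption: a line counts as blank only if it is completely empty, so
    # whitespace-only lines survive unless strip is also enabled.
    if blank_lines:
        lines = [line for line in lines if line != ""]

    return "\n".join(lines)


sample = "  Hello World  \n\n   \nSecond line\n"
print(normalize_raw_text(sample, case="lower", strip=True, blank_lines=True))
```

Under these assumptions, enabling blank_lines without strip leaves whitespace-only lines in place, which is why the two options are often set together.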

Example config

This is an example of how this module can be used in a pipeline config file.

[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output
blank_lines=T
case=
strip=T

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.