Normalize raw text

Path	pimlico.modules.text.text_normalize
Executable	yes

Text normalization for raw text documents.

Similar to the normalize module, but operates on raw text rather than pre-tokenized text, so it provides a slightly different set of tools.

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`TextDocumentType`>

Outputs

Name	Type(s)
corpus	`grouped_corpus` <`RawTextDocumentType`>

Options

Name	Description	Type
blank_lines	Remove all blank lines (after whitespace stripping, if requested)	bool
case	Transform all text to upper or lower case. Choose ‘upper’ or ‘lower’, or leave blank to perform no transformation	‘upper’, ‘lower’ or ‘’
strip	Strip whitespace from the start and end of lines	bool
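
As a rough illustration of how the three options might combine on a single document, here is a minimal Python sketch. It is not the module's actual implementation: the function name `normalize_raw_text` is hypothetical, and the order in which case transformation is applied relative to the other steps is an assumption. The only ordering taken from the descriptions above is that blank-line removal happens after whitespace stripping.

```python
def normalize_raw_text(text, case="", strip=False, blank_lines=False):
    """Hypothetical helper mirroring the documented options.

    An illustrative sketch only, not Pimlico's own code.
    """
    # case: 'upper', 'lower' or '' (empty means no transformation)
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()
    elif case != "":
        raise ValueError("case must be 'upper', 'lower' or ''")

    lines = text.split("\n")

    # strip: remove whitespace from the start and end of every line
    if strip:
        lines = [line.strip() for line in lines]

    # blank_lines: drop empty lines. This runs after stripping, so
    # whitespace-only lines are removed only when strip is also set
    # (an assumption based on the option description).
    if blank_lines:
        lines = [line for line in lines if line]

    return "\n".join(lines)


sample = "  Hello World  \n\n   \nSecond line\n"
result = normalize_raw_text(sample, case="lower", strip=True, blank_lines=True)
# result == "hello world\nsecond line"
```

Under this sketch's reading of the option, setting `blank_lines` without `strip` would leave the whitespace-only third line of `sample` in place, because it is not empty until it has been stripped.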

This is an example of how this module can be used in a pipeline config file.

[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output
blank_lines=T
case=
strip=T

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

normalize