Normalize tokenized text

Path	pimlico.modules.text.normalize
Executable	yes

Perform text normalization on tokenized documents.

Currently, this includes only the following:

case normalization (to upper or lower case)

blank line removal

empty sentence removal

In the future, more normalization operations may be added.

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`TokenizedDocumentType`>

Outputs

Name	Type(s)
corpus	`grouped_corpus` <`TokenizedDocumentType`>

Options

Name	Description	Type
case	Transform all text to upper or lower case. Choose from ‘upper’ or ‘lower’, or leave blank to not perform transformation	‘upper’, ‘lower’ or ‘’
remove_empty	Skip over any empty sentences (i.e. blank lines)	bool
remove_only_punct	Skip over any sentences that are empty if punctuation is ignored	bool
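
To make the effect of these three options concrete, here is a minimal standalone sketch in plain Python. It is not Pimlico's implementation and does not use the Pimlico API: the function names `normalize_sentences` and `is_punct_token` are hypothetical, a document is assumed to be a list of sentences with each sentence a list of token strings, and "punctuation" is assumed to mean any character in a Unicode punctuation category. The module's own definition of punctuation may differ.

```python
import unicodedata


def is_punct_token(token):
    # Assumption: a token counts as punctuation if every character in it
    # belongs to a Unicode punctuation category (Pc, Pd, Po, ...).
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def normalize_sentences(sentences, case="", remove_empty=False,
                        remove_only_punct=False):
    """Illustrative stand-in for the options above (hypothetical helper).

    sentences: list of sentences, each a list of token strings.
    """
    result = []
    for tokens in sentences:
        # case: 'upper', 'lower', or '' for no transformation
        if case == "upper":
            tokens = [t.upper() for t in tokens]
        elif case == "lower":
            tokens = [t.lower() for t in tokens]

        # remove_empty: skip sentences that contain no tokens
        if remove_empty and not tokens:
            continue

        # remove_only_punct: skip sentences with nothing left once
        # punctuation is ignored (this also covers empty sentences)
        if remove_only_punct and all(is_punct_token(t) for t in tokens):
            continue

        result.append(tokens)
    return result


doc = [["Hello", ",", "World", "!"], [], ["...", "!"], ["ok"]]
print(normalize_sentences(doc, case="lower", remove_empty=True,
                          remove_only_punct=True))
# [['hello', ',', 'world', '!'], ['ok']]
```

With all options left at the defaults used here (`case=""`, both flags false), the sketch returns the sentences unchanged.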

This is an example of how this module can be used in a pipeline config file.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output
case=
remove_empty=F
remove_only_punct=F

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

normalize