Normalize tokenized text

Path pimlico.modules.text.normalize
Executable yes

Perform text normalization on tokenized documents.

Currently, this includes only the following:

  • case normalization (to upper or lower case)
  • blank line removal
  • empty sentence removal

In the future, more normalization operations may be added.

Inputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Options

  • case (‘upper’, ‘lower’ or ‘’): Transform all text to upper or lower case. Choose ‘upper’ or ‘lower’, or leave blank to perform no transformation.
  • remove_empty (bool): Skip over any empty sentences (i.e. blank lines).
  • remove_only_punct (bool): Skip over any sentences that are empty once punctuation is ignored, i.e. sentences that contain only punctuation.
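
The effect of these options can be illustrated with a minimal Python sketch. This is not the module's actual implementation: the function names normalize_document and is_only_punct are hypothetical, a tokenized document is assumed to be a list of sentences (each a list of token strings), and punctuation is assumed to mean the ASCII characters in string.punctuation.

```python
import string

# Assumption: "punctuation" means ASCII punctuation only.
PUNCT = set(string.punctuation)


def is_only_punct(tokens):
    """Return True if nothing is left once punctuation characters are ignored.

    Note that a sentence with no tokens also counts as empty here.
    """
    return all(ch in PUNCT for token in tokens for ch in token)


def normalize_document(sentences, case="", remove_empty=False,
                       remove_only_punct=False):
    """Hypothetical sketch of the three options applied to one document.

    sentences: list of sentences, each a list of token strings.
    case: 'upper', 'lower' or '' (no case transformation).
    """
    if case not in ("", "upper", "lower"):
        raise ValueError("case must be 'upper', 'lower' or ''")

    normalized = []
    for tokens in sentences:
        # Case normalization, applied token by token
        if case == "upper":
            tokens = [t.upper() for t in tokens]
        elif case == "lower":
            tokens = [t.lower() for t in tokens]
        else:
            tokens = list(tokens)

        # Skip sentences with no tokens at all (blank lines)
        if remove_empty and not tokens:
            continue
        # Skip sentences made up solely of punctuation
        if remove_only_punct and is_only_punct(tokens):
            continue
        normalized.append(tokens)
    return normalized


if __name__ == "__main__":
    doc = [["Hello", ",", "World", "!"], [], ["...", "!"], ["ok"]]
    print(normalize_document(doc, case="lower", remove_empty=True))
    print(normalize_document(doc, remove_only_punct=True))
```

With all options left at the values shown in the fuller example config (case blank, both flags false), this sketch returns the document unchanged.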

Example config

This is an example of how this module can be used in a pipeline config file.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output
case=
remove_empty=F
remove_only_punct=F

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.