Normalize tokenized text

Path pimlico.modules.text.normalize
Executable yes

Perform text normalization on tokenized documents.

Currently, this includes the following:

  • case normalization (to upper or lower case)
  • blank line removal
  • empty sentence removal
  • punctuation removal
  • removal of words that contain only punctuation
  • numerical character removal
  • minimum word length filter

In the future, more normalization operations may be added.

Inputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Options

Name Description Type
case Transform all text to upper or lower case. Choose from ‘upper’ or ‘lower’, or leave blank to not perform transformation ‘upper’, ‘lower’ or ‘’
min_word_length Remove any words shorter than this. Default: 0 (don’t do anything) int
remove_empty Skip over any empty sentences (i.e. blank lines). Applied after other processing, so this will remove sentences that are left empty by other filters bool
remove_nums Remove numeric characters bool
remove_only_punct Skip over any sentences that are empty if punctuation is ignored bool
remove_punct Remove punctuation from all tokens and then remove the whole token if nothing’s left bool

Example config

This is an example of how this module can be used in a pipeline config file.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output
case=
min_word_length=0
remove_empty=F
remove_nums=F
remove_only_punct=F
remove_punct=F

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.