Normalize tokenized text

Path	pimlico.modules.text.normalize
Executable	yes

Perform text normalization on tokenized documents.

Currently, this includes the following:

case normalization (to upper or lower case)

blank line removal

empty sentence removal

punctuation removal

removal of words that contain only punctuation

numerical character removal

minimum word length filter

In the future, more normalization operations may be added.

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`TokenizedDocumentType`>

Outputs

Name	Type(s)
corpus	`grouped_corpus` <`TokenizedDocumentType`>

Options

Name	Description	Type
case	Transform all text to upper or lower case. Choose ‘upper’ or ‘lower’, or leave blank to perform no transformation	‘upper’, ‘lower’ or ‘’
min_word_length	Remove any words shorter than this. Default: 0 (don’t do anything)	int
remove_empty	Skip over any empty sentences (i.e. blank lines). Applied after other processing, so this will remove sentences that are left empty by other filters	bool
remove_nums	Remove numeric characters	bool
remove_only_punct	Skip over any sentences that would be empty if punctuation were ignored	bool
remove_punct	Remove punctuation from all tokens and then remove the whole token if nothing’s left	bool
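
The following Python function is a rough sketch of how the options above could combine. It is not Pimlico's implementation. The function name `normalize_sentences`, the order in which the filters run (other than `remove_empty`, which the table says is applied last), the use of `string.punctuation` as the punctuation set, and the dropping of tokens left empty after numeric characters are removed are all assumptions made for illustration. The sketch follows the option description of `remove_only_punct`, treating it as a sentence-level filter.

```python
import string

# Assumption: ASCII punctuation only; the real module may use a broader set.
PUNCT = set(string.punctuation)


def normalize_sentences(sentences, case="", min_word_length=0,
                        remove_empty=False, remove_nums=False,
                        remove_only_punct=False, remove_punct=False):
    """Illustrative sketch of the documented options (hypothetical helper).

    `sentences` is a list of sentences, each a list of token strings.
    """
    result = []
    for sentence in sentences:
        tokens = list(sentence)

        # remove_only_punct: skip sentences that would be empty
        # if punctuation were ignored
        if remove_only_punct and all(
            all(ch in PUNCT for ch in tok) for tok in tokens
        ):
            continue

        # case: 'upper', 'lower', or '' for no change
        if case == "upper":
            tokens = [tok.upper() for tok in tokens]
        elif case == "lower":
            tokens = [tok.lower() for tok in tokens]

        # remove_nums: strip numeric characters from every token
        # (assumption: tokens left empty are dropped)
        if remove_nums:
            tokens = ["".join(ch for ch in tok if not ch.isdigit())
                      for tok in tokens]
            tokens = [tok for tok in tokens if tok]

        # remove_punct: strip punctuation, then drop tokens with nothing left
        if remove_punct:
            tokens = ["".join(ch for ch in tok if ch not in PUNCT)
                      for tok in tokens]
            tokens = [tok for tok in tokens if tok]

        # min_word_length: 0 leaves every token in place
        if min_word_length > 0:
            tokens = [tok for tok in tokens if len(tok) >= min_word_length]

        # remove_empty: applied last, so it also catches sentences
        # emptied by the filters above
        if remove_empty and not tokens:
            continue

        result.append(tokens)
    return result
```

Under these assumptions, calling the function with `case="lower"`, `remove_punct=True` and `remove_empty=True` lowercases every token, discards tokens made up entirely of punctuation, and then drops any sentence that ends up with no tokens.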

Example config

This is an example of how this module can be used in a pipeline config file.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output

This example usage includes more options.

[my_normalize_module]
type=pimlico.modules.text.normalize
input_corpus=module_a.some_output
case=
min_word_length=0
remove_empty=F
remove_nums=F
remove_only_punct=F
remove_punct=F

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

train_tms_example

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

normalize