Normalize raw text
Path | pimlico.modules.text.text_normalize |
Executable | yes |
Text normalization for raw text documents.
Similar to the normalize module, but operates on raw text rather than pre-tokenized text, so it provides a slightly different set of tools.
Inputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <TextDocumentType> |
Outputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <RawTextDocumentType> |
Options
Name | Description | Type |
---|---|---|
blank_lines | Remove all blank lines (after whitespace stripping, if requested) | bool |
case | Transform all text to upper or lower case. Choose from ‘upper’ or ‘lower’, or leave blank to not perform transformation | ‘upper’, ‘lower’ or ‘’ |
strip | Strip whitespace from the start and end of lines | bool |
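The combined effect of the three options can be sketched in plain Python. This is only an illustration of the behaviour described in the table, not Pimlico's own implementation: the function name `normalize_raw_text` is hypothetical, and it assumes that a line counts as blank only when it is empty at the point where blank lines are removed (so whitespace-only lines are dropped only if `strip` is also enabled).

```python
def normalize_raw_text(text, case="", strip=False, blank_lines=False):
    """Hypothetical helper mirroring the three options; not Pimlico's code."""
    # Case transformation: '' means leave the text unchanged
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()
    elif case != "":
        raise ValueError("case must be 'upper', 'lower' or ''")

    lines = text.split("\n")
    if strip:
        # Strip whitespace from the start and end of every line
        lines = [line.strip() for line in lines]
    if blank_lines:
        # Assumption: "blank" means empty after any stripping requested above
        lines = [line for line in lines if line != ""]
    return "\n".join(lines)


sample = "  Hello World  \n   \n\nSecond line\t"
result = normalize_raw_text(sample, case="lower", strip=True, blank_lines=True)
print(repr(result))
```

Stripping happens before blank-line removal here because the option description says blank lines are removed "after whitespace stripping, if requested".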
Example config
This is an example of how this module can be used in a pipeline config file.
[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output
This example usage includes more options.
[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output
blank_lines=T
case=
strip=T
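To make the structure of the fuller example concrete, the snippet below reads it with Python's standard `configparser`. This is a sketch for inspection only: Pimlico loads pipeline files with its own config reader, and how it interprets values such as `T` is not shown here. The point is simply that `case=` yields an empty string, which corresponds to the "no transformation" setting.

```python
import configparser

# The second example section from above, embedded as a string
CONFIG_TEXT = """\
[my_text_normalize_module]
type=pimlico.modules.text.text_normalize
input_corpus=module_a.some_output
blank_lines=T
case=
strip=T
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

section = parser["my_text_normalize_module"]
# Collect the module options as raw strings, exactly as written in the file
options = {key: section[key] for key in ("blank_lines", "case", "strip")}
print(section["type"])
print(options)
```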
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.