tokenized_formatter
This is one of the test pipelines included in Pimlico’s repository. See Module test pipelines for more details.
Config file
The complete config file for this test pipeline:
# Test the tokenized text formatter
[pipeline]
name=tokenized_formatter
release=latest

# Take input from a prepared tokenized dataset
[europarl]
type=pimlico.datatypes.corpora.GroupedCorpus
data_point_type=TokenizedDocumentType
dir=%(test_data_dir)s/datasets/corpora/tokenized

# Format the tokenized data using the default formatter,
# which is declared for the tokenized datatype
[format]
type=pimlico.modules.corpora.format
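
To make the structure of the config easier to see, the following is a minimal sketch that reads it with Python's standard `configparser` module. This is not how Pimlico itself loads pipelines; it only illustrates that the file has one `[pipeline]` section followed by one section per module, and that `%(test_data_dir)s` is a substitution variable. The value `/path/to/test/data` is a placeholder chosen for illustration, not a path taken from Pimlico.

```python
import configparser

# The config from above, embedded as a string so the sketch is self-contained.
CONFIG_TEXT = """\
# Test the tokenized text formatter
[pipeline]
name=tokenized_formatter
release=latest

# Take input from a prepared tokenized dataset
[europarl]
type=pimlico.datatypes.corpora.GroupedCorpus
data_point_type=TokenizedDocumentType
dir=%(test_data_dir)s/datasets/corpora/tokenized

# Format the tokenized data using the default formatter,
# which is declared for the tokenized datatype
[format]
type=pimlico.modules.corpora.format
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Every section other than [pipeline] describes one module in the pipeline.
modules = [name for name in parser.sections() if name != "pipeline"]

# Supply a value for test_data_dir at lookup time.
# ASSUMPTION: "/path/to/test/data" is a hypothetical placeholder.
resolved_dir = parser.get(
    "europarl", "dir", vars={"test_data_dir": "/path/to/test/data"}
)

print("Pipeline:", parser.get("pipeline", "name"))
print("Modules:", modules)
print("Input dir:", resolved_dir)
```

Passing the variable through `vars` at lookup time, rather than as a parser-wide default, keeps `test_data_dir` from appearing as an option in every section.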