Human-readable formatting

Path pimlico.modules.corpora.format
Executable yes

Corpus formatter

Pimlico provides a data browser that makes it easy to view the documents in a tarred document corpus. Some datatypes provide a single way to format their data for display in the browser, whilst others provide several formatters that display the data in different ways.

This module lets you apply that formatting functionality to a whole corpus and store the formatted documents as a new corpus. Since the formatting operations are designed for display, the output is generally only useful for human consumption.

Inputs

Name Type(s)
corpus grouped_corpus

Outputs

Name Type(s)
formatted grouped_corpus <RawTextDocumentType>

Options

Name: formatter
Type: string
Description: Fully qualified class name of a formatter to use to format the data. If not specified, the default formatter is used, which uses the datatype’s browser_display attribute if available, or falls back to just converting each document to a unicode string.
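
The following is a minimal sketch of the two ideas in the formatter option: resolving a fully qualified class name, and the fallback order of the default formatter. It is not Pimlico's implementation. The helper names load_class and default_format are hypothetical, and the sketch assumes that browser_display, when present, can be called with a document and returns text.

```python
import importlib


def load_class(qualified_name):
    """Import and return the object named by a dotted path like 'pkg.module.ClassName'."""
    module_path, _, class_name = qualified_name.rpartition(".")
    if not module_path:
        raise ValueError("expected a fully qualified name, got %r" % qualified_name)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def default_format(doc, datatype):
    """Hypothetical stand-in for the default formatter's fallback order."""
    # Assumption: browser_display, if the datatype has one, is a callable taking a document
    display = getattr(datatype, "browser_display", None)
    if callable(display):
        return display(doc)
    # Otherwise fall back to a plain string conversion of the document
    return str(doc)
```

A value such as path.to.formatter.FormatterClass would be passed to load_class as a string. How Pimlico then constructs and calls the formatter class is not covered by this sketch.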

Example config

This is an example of how this module can be used in a pipeline config file.

[my_format_module]
type=pimlico.modules.corpora.format
input_corpus=module_a.some_output

This example usage includes more options.

[my_format_module]
type=pimlico.modules.corpora.format
input_corpus=module_a.some_output
formatter=path.to.formatter.FormatterClass

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.