formatter

The command-line iterable corpus browser displays one document at a time. It can display the raw data from the corpus files, which is sometimes human-readable enough to need no special formatting. It can also parse the data using its datatype and output text either from the datatype’s standard Unicode representation or, if the document datatype provides one, from a special browser formatting of the data.

When viewing output data, particularly during debugging of modules, it can be useful to provide special formatting routines to the browser, rather than using or overriding the datatype’s standard formatting methods. For example, you might want to pull out specific attributes for each document to get an overview of what’s coming out.

The browser command accepts a command-line option that specifies a Python class to format the data. This class should be a subclass of pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter that accepts a datatype compatible with the datatype being browsed and provides a method to format each document. You can write such classes in your own code and refer to them by their fully qualified class names.
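As a rough sketch of what such a class might look like, the following formatter pulls a couple of attributes out of each document. To keep it runnable without Pimlico installed, it subclasses a minimal stand-in that mirrors only the constructor and method names documented below; in real code you would subclass DocumentBrowserFormatter itself. The document shape (a dict with `title` and `tokens` keys) and the names `StandInFormatter` and `OverviewFormatter` are hypothetical.

```python
class StandInFormatter(object):
    """Stand-in for DocumentBrowserFormatter (assumption: it mirrors the
    documented constructor and method names only)."""

    def __init__(self, corpus_datatype):
        self.corpus_datatype = corpus_datatype

    def format_document(self, doc):
        raise NotImplementedError("must be overridden by subclasses")

    def filter_document(self, doc):
        # Pass every document through unchanged
        return doc


class OverviewFormatter(StandInFormatter):
    """Hypothetical formatter: a one-line overview of each document."""

    def format_document(self, doc):
        # Assumed document shape: a dict with 'title' and 'tokens' keys
        title = doc.get("title", "<untitled>")
        num_tokens = len(doc.get("tokens", []))
        return "{}: {} tokens".format(title, num_tokens)


formatter = OverviewFormatter(corpus_datatype=None)
summary = formatter.format_document({"title": "Doc 1", "tokens": ["a", "b", "c"]})
print(summary)
```

Such a class would then be passed to the browser by its fully qualified name, for example `mypackage.formatters.OverviewFormatter` (a hypothetical module path).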

class DocumentBrowserFormatter(corpus_datatype)

Bases: object

Base class for formatters used to post-process documents for display in the iterable corpus browser.

DATATYPE = DataPointType()

format_document(doc)

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Must be overridden by subclasses.

filter_document(doc)

Each doc is passed through this function directly after being read from the corpus. If None is returned, the doc is skipped. Otherwise, the result is used in place of the doc data. The default implementation returns the doc unchanged.
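The interplay of the two methods can be sketched as a simple loop. This illustrates the contract described above and is not the browser's actual code; `browse`, `ShortDocsFormatter` and the list-of-strings corpus are hypothetical.

```python
class ShortDocsFormatter(object):
    """Hypothetical formatter following the documented method contract."""

    def filter_document(self, doc):
        # Returning None skips the document; any other value replaces it
        if len(doc) > 20:
            return None
        return doc.strip()

    def format_document(self, doc):
        return "[{} chars] {}".format(len(doc), doc)


def browse(docs, formatter):
    """Yield display strings for the documents that survive filtering."""
    for doc in docs:
        filtered = formatter.filter_document(doc)
        if filtered is None:
            continue  # skipped by the filter
        yield formatter.format_document(filtered)


corpus = ["  short doc ", "a much longer document that gets skipped", "ok"]
shown = list(browse(corpus, ShortDocsFormatter()))
print(shown)
```

Note that the formatter receives the filtered value, not the original doc, so any trimming or restructuring done in the filter carries over to the formatted output.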

class DefaultFormatter(corpus_datatype)

Bases: pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter

Generic implementation of a browser formatter that’s used if no other formatter is given.

DATATYPE = DataPointType()

format_document(doc)

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Overrides the base-class method to provide the generic formatting used when no other formatter is given.

class InvalidDocumentFormatter(corpus_datatype)

Bases: pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter

Formatter that skips over all docs other than invalid results. Uses standard formatting for InvalidDocument information.

format_document(doc)

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Overrides the base-class method, using the standard formatting for InvalidDocument information.

filter_document(doc)

Each doc is passed through this function directly after being read from the corpus. If None is returned, the doc is skipped. Otherwise, the result is used in place of the doc data. This formatter skips every doc other than invalid results.

typecheck_formatter(formatted_doc_type, formatter_cls)

Check that a document type is compatible with a particular formatter.

load_formatter(datatype, formatter_name=None)

Load a formatter specified by its fully qualified Python class name. If formatter_name is None, the default formatter is loaded. You may also specify a formatter by name, choosing one of the standard formatters that the datatype being formatted provides.

Parameters:
  • datatype – datatype instance representing the datatype that will be formatted
  • formatter_name – class name, or class

Returns:
  instantiated formatter
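
Resolving a fully qualified class name generally comes down to splitting off the module path, importing it, and looking up the class. The helper below is a generic sketch of that technique using `importlib`, not Pimlico's implementation; `load_class` is a hypothetical name, and the example resolves a standard-library class so that it runs anywhere.

```python
import importlib


def load_class(qualified_name):
    """Import and return the class named by a dotted path."""
    module_path, _, class_name = qualified_name.rpartition(".")
    if not module_path:
        raise ValueError("expected a fully qualified name, got %r" % qualified_name)
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError("module %r has no class %r" % (module_path, class_name))


cls = load_class("collections.OrderedDict")
print(cls.__name__)
```

A real loader would additionally instantiate the class with the datatype and check that the two are compatible, as typecheck_formatter is documented to do.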