pimlico.datatypes.corpora.base

class CountInvalidCmd[source]

Bases: pimlico.cli.shell.base.ShellCommand

Data shell command to count the number of invalid documents in a corpus. Applies to any iterable corpus.

commands = ['invalid']
help_text = 'Count the number of invalid documents in this dataset'
execute(shell, *args, **kwargs)[source]

Execute the command. Get the dataset reader as shell.data.

Parameters:
  • shell – DataShell instance. Reader available as shell.data
  • args – Args given by the user
  • kwargs – Named args given by the user as key=val
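
A minimal sketch of how this command fits together, based on the attributes above. The is_invalid_doc helper and its import location are assumptions, standing in for whatever test identifies invalid documents:

from pimlico.cli.shell.base import ShellCommand
from pimlico.datatypes.corpora import is_invalid_doc  # assumed location of the helper

class CountInvalidCmd(ShellCommand):
    commands = ["invalid"]
    help_text = "Count the number of invalid documents in this dataset"

    def execute(self, shell, *args, **kwargs):
        # shell.data is the dataset reader: iterate over (doc name, doc) pairs
        invalid = sum(1 for doc_name, doc in shell.data if is_invalid_doc(doc))
        print("{} invalid docs in corpus".format(invalid))
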
data_point_type_opt(text)[source]

Option-processing function for the data_point_type option: takes the string specification of a data point type (e.g. from a config file), including any data-point type options given in brackets, and returns an instance of the type.
class IterableCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Superclass of all datatypes which represent a dataset that can be iterated over document by document (or data point by data point: exactly what we iterate over may vary, though documents are most common).

This is an abstract base class and doesn’t provide any mechanisms for storing documents or organising them on disk in any way. Many input modules will override this to provide a reader that iterates over the documents directly, according to IterableCorpus’ interface. The main subclass of this used within pipelines is GroupedCorpus, which provides an interface for iterating over groups of documents and a storage mechanism for grouping together documents in archives on disk.

May be used as a type requirement, but remember that it is not possible to create a reader from this type directly: use a subtype, like GroupedCorpus, instead.
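
For instance, a module input that accepts any iterable corpus of raw text documents might be declared like this (a hedged sketch: the module class and import paths are assumptions, not taken from this page):

from pimlico.core.modules.base import BaseModuleInfo
from pimlico.datatypes import IterableCorpus
from pimlico.datatypes.corpora.data_points import RawTextDocumentType

class ModuleInfo(BaseModuleInfo):
    module_type_name = "my_module"  # hypothetical
    # Any iterable corpus subtype with a raw-text document type satisfies this
    module_inputs = [("corpus", IterableCorpus(RawTextDocumentType()))]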

The actual type of the data depends on the type given as the first argument, which should be an instance of DataPointType or of a subclass: it could be, for example, coref output. Information about the type of individual documents is provided by data_point_type and is used in type checking.

Note that the data point type is the first datatype option, so can be given as the first positional arg when instantiating an iterable corpus subtype:

# Give the data point type as the first positional arg
corpus_type = GroupedCorpus(RawTextDocumentType())
# Create a reader for the dataset stored at the given base dir
corpus_reader = corpus_type("... base dir path ...")

At creation time, length should be provided in the metadata, denoting how many documents are in the dataset.

datatype_name = 'iterable_corpus'
shell_commands = [<pimlico.datatypes.corpora.base.CountInvalidCmd object>]
datatype_options = {'data_point_type': {'default': DataPointType(), 'help': 'Data point type for the iterable corpus. This is used to process each document in the corpus in an appropriate way. Should be a subclass of DataPointType. This should almost always be given, typically as the first positional arg when instantiating the datatype. Defaults to the generic data point type at the top of the hierarchy. When specifying as a string (e.g. loading from a config file), you can specify data-point type options in brackets after the class name, separated by semicolons (;). These are processed in the same way as other options. E.g. WordAnnotationsDocumentType(fields=xyz,abc; some_key=52)', 'type': <function data_point_type_opt>}}
datatype_supports_python2 = True
supports_python2()[source]

Whether a corpus type supports Python 2 depends on its document type. The corpus datatype itself introduces no reason not to, but specific document types might.

run_browser(reader, opts)[source]

Launches a browser interface for reading this datatype, browsing the data provided by the given reader.

Not all datatypes provide a browser. For those that don’t, this method should raise a NotImplementedError.

opts provides the parsed command-line options from argparse.

This tool used to be only available for iterable corpora, but now it’s possible for any datatype to provide a browser. IterableCorpus provides its own browser, as before, which uses one of the data point type’s formatters to format documents.

class Reader(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for IterableCorpus

get_detailed_status()[source]

Returns a list of strings containing detailed information about the data.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.
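
For example, an overriding reader might add to the lines returned by the super method (a sketch: MyDatatype is hypothetical, and len() on the reader is assumed to give the number of documents from the stored metadata):

def get_detailed_status(self):
    # Extend the super method's status lines with datatype-specific info
    status = super(MyDatatype.Reader, self).get_detailed_status()
    status.append("Length: {} docs".format(len(self)))
    return status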

list_iter()[source]

Iterate over the list of document names, without yielding the doc contents.

While this could be considerably faster than iterating over all the docs, the default implementation, if not overridden by subclasses of IterableCorpus, simply calls the document iterator and throws away the contents, as in the sketch below.
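
The default behaviour corresponds roughly to:

def list_iter(self):
    # Iterate over full documents, yielding only the names
    for doc_name, doc in self:
        yield doc_name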

data_to_document(data, metadata=None)[source]

Applies the processing of the corpus' data point type to the raw data, given as a bytes object, and produces a document instance.

Parameters:
  • data – bytes raw data
  • metadata – dict containing doc metadata (optional)
Returns: document instance
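
For example, given a reader for this corpus (a sketch; the raw data shown is illustrative):

# Produce a document instance of the corpus' data point type
doc = reader.data_to_document(b"Raw content of one document")
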

class Setup(datatype, data_paths)

Bases: pimlico.datatypes.base.Setup

Setup class for IterableCorpus.Reader

data_ready(path)

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don't have to) defer to the setup's parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.
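
A typical override, following the super-call convention shown above, might look like this sketch (MyDatatype and the checked filename are hypothetical):

import os

class Setup:  # nested inside MyDatatype.Reader
    def data_ready(self, path):
        # Base check: the data dir exists at this base dir
        if not super(MyDatatype.Reader.Setup, self).data_ready(path):
            return False
        # Datatype-specific check: require a particular file too
        return os.path.exists(os.path.join(path, "data", "main.bin"))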

get_base_dir()

Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we've already confirmed that at least one is available.
get_data_dir()

Returns: the path to the data dir within the base dir (typically a dir called "data")
get_reader(pipeline, module=None)

Instantiate a reader using this setup.

Parameters:
  • pipeline – currently loaded pipeline
  • module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output
get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.
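
For instance (a sketch; the filename is hypothetical):

def get_required_paths(self):
    # Relative to the data dir: these must exist for the data to be ready
    return ["vocab.json"]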

read_metadata(base_dir)

Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type

alias of IterableCorpus.Reader

ready_to_read()

Check whether we’re ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won’t need to. See data_ready() instead.

Returns: True if the reader's ready to be instantiated, False otherwise
class Writer(datatype, *args, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Stores the length of the corpus.

NB: IterableCorpus itself has no particular way of storing files, so this is only here to ensure that all subclasses (e.g. GroupedCorpus) store a length in the same way.

metadata_defaults = {'length': (None, 'Number of documents in the corpus. Must be set by the writer, otherwise an exception will be raised at the end of writing')}
writer_param_defaults = {}
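
Since length must be set before writing completes, usage looks roughly like this sketch, assuming a writer is obtained from the datatype via get_writer() (whose exact arguments depend on the pipeline context) and that it exposes its metadata as a dict, as metadata_defaults above suggests:

with datatype.get_writer(base_dir, pipeline) as writer:
    # ... write documents ...
    # Required: set the corpus length before the writer exits,
    # otherwise an exception is raised at the end of writing
    writer.metadata["length"] = num_docs
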
check_type(supplied_type)[source]

Override type checking to require that the supplied type have a document type that is compatible with (i.e. a subclass of) the document type of this class.

The data point types can also introduce their own checks, other than simple isinstance checks.
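
For example, a requirement of an iterable corpus of general text documents would be satisfied by a grouped corpus of raw text, since RawTextDocumentType is a subclass of TextDocumentType (a sketch; the import locations are assumed):

from pimlico.datatypes import GroupedCorpus, IterableCorpus
from pimlico.datatypes.corpora.data_points import RawTextDocumentType, TextDocumentType

required = IterableCorpus(TextDocumentType())
supplied = GroupedCorpus(RawTextDocumentType())
assert required.check_type(supplied)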

type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. The default implementation just provides the class name. Classes that override check_type() may want to override this too.

full_datatype_name()[source]

Returns a string/unicode name for the datatype that includes relevant sub-type information. The default implementation just uses the attribute datatype_name, but subclasses may have more detailed information to add. For example, iterable corpus types also supply information about the data-point type.