pimlico.datatypes.base module

Datatypes provide interfaces for reading (and in some cases writing) datasets. At their most basic, they define a way to iterate over a dataset linearly. Some datatypes may also provide other functionality, such as random access or compression.

As much as possible, Pimlico pipelines should use standard datatypes to connect up the output of modules with the input of others. Most datatypes have a lot in common, which is reflected in their shared base classes. Custom datatypes will be needed for most datasets when they're used as inputs, but as far as possible these should be converted into standard datatypes, like TarredCorpus, early in the pipeline.

Note

The following classes were moved to core in version 0.6rc

class PimlicoDatatype(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)[source]

Bases: object

The abstract superclass of all datatypes. Provides basic functionality for identifying where data should be stored and such.

Datatypes are used to specify the routines for reading the output from modules. They’re also used to specify how to read pipeline inputs. Most datatypes that have data simply read it in when required. Some (in particular those used as inputs) need a preparation phase to be run, where the raw data itself isn’t sufficient to implement the reading interfaces required. In this case, they should override prepare_data().

Datatypes may require/allow options to be set when they’re used to read pipeline inputs. These are specified, in the same way as module options, by input_module_options on the datatype class.
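For example, an input datatype might declare options like this (a minimal sketch; the option name and fields are illustrative, following the module-option format):

    from pimlico.datatypes.base import PimlicoDatatype

    class MyInputDatatype(PimlicoDatatype):
        datatype_name = "my_input_datatype"
        # Options that can be set in the config file where this datatype
        # is used to read a pipeline input (same format as module options)
        input_module_options = {
            "encoding": {
                "help": "Character encoding of the input files. Default: utf-8",
                "default": "utf-8",
            },
        }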

Datatypes may supply a set of additional datatypes. These should be guaranteed to be available if the main datatype is available. They must require no extra processing to be made available, unless that is done on the fly while reading the datatype (like a filter) or while the main datatype is being written.

Additional datatypes can be accessed in config files by specifying the main datatype (as a previous module, optionally with an output name) and the additional datatype name in the form main_datatype->additional_name. Multiple additional names may be given, causing the next name to be looked up as an additional datatype of the initially loaded additional datatype. E.g. main_datatype->additional0->additional1.
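For example, in a config file (the module, input and additional-datatype names here are hypothetical):

    [my_module]
    type=some.module.type
    input_corpus=tokenizer->vocab

This connects the module's corpus input to the additional datatype named vocab supplied by the output of the tokenizer module.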

To avoid conflicts in the metadata between datatypes using the same directory, datatypes loaded as additional datatypes have their additional name available to them and use it as a prefix to the metadata filename.

If use_main_metadata=True on an additional datatype, the same metadata will be read as for the main datatype to which this is an additional datatype.

module is the ModuleInfo instance for the pipeline module that this datatype was produced by. It may be None, if the datatype wasn’t instantiated by a module. It is not required to be set if you’re instantiating a datatype in some context other than module output. It should generally be set for input datatypes, though, since they are treated as being created by a special input module.

requires_data_preparation = False
input_module_options = {}
shell_commands = []

Override to provide shell commands specific to this datatype. Should include the superclass's list.

supplied_additional = []

List of additional datatypes provided by this one, given as (name, datatype class) pairs. For each of these, a call to get_additional_datatype(name) (once the main datatype is ready) should return a datatype instance that is also ready.

emulated_datatype = None

Most datatype classes define their own type of corpus, which is often a subtype of some other. Some, however, emulate another type, and it is that type that should be considered the type of the dataset, not the class itself.

For example, TarredCorpusFilter dynamically produces something that looks like a TarredCorpus, and further down the pipeline, if its type is needed, it should be considered to be a TarredCorpus.

Most of the time, this can be left empty, but occasionally it needs to be set.
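A filter datatype might therefore declare something like the following (a sketch; the class name is illustrative and TarredCorpus is assumed to live in pimlico.datatypes.tar):

    from pimlico.datatypes.base import PimlicoDatatype
    from pimlico.datatypes.tar import TarredCorpus

    class MyTarredCorpusFilter(PimlicoDatatype):
        datatype_name = "my_tarred_corpus_filter"
        # Type checking treats this dataset as a TarredCorpus, even though
        # the class itself is not a subclass of TarredCorpus
        emulated_datatype = TarredCorpus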
datatype_name = 'base_datatype'
metadata

Read in metadata from a file in the corpus directory.

Note that this is no longer cached in memory. We need to be sure that the metadata values returned are always up to date with what is on disk, so we always re-read the file when we need to get a value from the metadata. Since the file is typically small, this is unlikely to cause a problem. If we decide to return to caching the metadata dictionary in future, we will need to make sure that we can never run into problems with out-of-date metadata being returned.

get_required_paths()[source]

Returns a list of paths to files that should be available for the data to be read. The base data_ready() implementation checks that these are all available. If the datatype is used as an input to a pipeline and requires a data preparation routine to be run, data preparation will not be executed until these files are available.

Paths may be absolute or relative. If relative, they refer to files within the data directory and data_ready() will fail if the data dir doesn’t exist.

Returns: list of absolute or relative paths
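A subclass that needs particular files within its data directory might override this as follows (a sketch; the filename is hypothetical):

    from pimlico.datatypes.base import PimlicoDatatype

    class MyDatatype(PimlicoDatatype):
        def get_required_paths(self):
            # Relative paths are taken to refer to files in the data dir
            return super(MyDatatype, self).get_required_paths() + ["vocab.json"]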
get_software_dependencies()[source]

Check that all software required to read this datatype is installed and locatable. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.
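A typical override looks something like this (a sketch, assuming the PythonPackageOnPip dependency class from pimlico.core.dependencies.python):

    from pimlico.datatypes.base import PimlicoDatatype

    class MyNltkDatatype(PimlicoDatatype):
        def get_software_dependencies(self):
            # Import the dependency class inside the method, so that merely
            # loading this datatype doesn't depend on runtime dependencies
            from pimlico.core.dependencies.python import PythonPackageOnPip
            return super(MyNltkDatatype, self).get_software_dependencies() + \
                [PythonPackageOnPip("nltk")]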

prepare_data(output_dir, log)[source]
classmethod create_from_options(base_dir, pipeline, options={}, module=None)[source]
data_ready()[source]

Check whether the data corresponding to this datatype instance exists and is ready to be read.

Default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn’t needed.

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the data. Only called if data_ready() == True.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.

classmethod datatype_full_class_name()[source]

The fully qualified name of the class for this datatype, by which it is referenced in config files. Generally, datatypes don't need to override this, but type requirements that take the place of datatypes for type checking need to provide it.

instantiate_additional_datatype(name, additional_name)[source]

Default implementation just assumes the datatype class can be instantiated using the default constructor, with the same base dir and pipeline as the main datatype. Options given to the main datatype are passed down to the additional datatype.

classmethod check_type(supplied_type)[source]

Method used by datatype type-checking algorithm to determine whether a supplied datatype (given as a class, which is a subclass of PimlicoDatatype) is compatible with the present datatype, which is being treated as a type requirement.

Typically, the present class is a type requirement on a module input and supplied_type is the type provided by a previous module’s output.

The default implementation simply checks whether supplied_type is a subclass of the present class. Subclasses may wish to impose different or additional checks.

Parameters: supplied_type – type provided where the present class is required, or a datatype instance
Returns: True if the check is successful, False otherwise
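The default behaviour is roughly equivalent to this sketch (written here as a plain function over the requirement class and the supplied type):

    def check_type(cls, supplied_type):
        # cls: the required datatype class
        # supplied_type: a datatype class or a datatype instance
        if isinstance(supplied_type, type):
            # Supplied as a class: check it's a subclass of the requirement
            return issubclass(supplied_type, cls)
        # Supplied as a datatype instance
        return isinstance(supplied_type, cls)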
classmethod type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. Default implementation just provides the class name. Classes that override check_type() may want to override this too.

classmethod full_datatype_name()[source]

Returns a string/unicode name for the datatype that includes relevant sub-type information. The default implementation just uses the attribute datatype_name, but subclasses may have more detailed information to add. For example, iterable corpus types also supply information about the data-point type.

class DynamicOutputDatatype[source]

Bases: object

Types of module outputs may be specified as a subclass of PimlicoDatatype, or alternatively as an instance of DynamicOutputDatatype. In this case, get_datatype() is called when the output datatype is needed, passing in the module info instance for the module, so that a specialized datatype can be produced on the basis of options, input types, etc.

The dynamic type must provide certain pieces of information needed for typechecking.

datatype_name = None
get_datatype(module_info)[source]
get_base_datatype_class()[source]

If it’s possible to say before the instance of a ModuleInfo is available what base datatype will be produced, implement this to return the class. By default, it returns None.

If this information is available, it will be used in documentation.
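A dynamic output type might, for instance, reuse the type of one of the module's inputs (a sketch; the input name and the get_input_datatype() lookup are assumptions for illustration):

    from pimlico.datatypes.base import DynamicOutputDatatype

    class SameAsInputCorpus(DynamicOutputDatatype):
        datatype_name = "same as input corpus"

        def get_datatype(self, module_info):
            # Output whatever datatype the module's 'corpus' input has
            return module_info.get_input_datatype("corpus")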

class DynamicInputDatatypeRequirement[source]

Bases: object

Types of module inputs may be given as a subclass of PimlicoDatatype, a tuple of datatypes, or an instance of a DynamicInputDatatypeRequirement subclass. In this case, check_type(supplied_type) is called during typechecking to check whether the type that we've got conforms to the input type requirements.

Additionally, if datatype_doc_info is provided, it is used to represent the input type constraints in documentation.

datatype_doc_info = None
check_type(supplied_type)[source]
type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. Default implementation just provides the class name. Subclasses may want to override this too.
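For example, a requirement might accept any iterable corpus, whatever its document type (a minimal sketch; the class name is illustrative):

    from pimlico.datatypes.base import DynamicInputDatatypeRequirement, IterableCorpus

    class AnyIterableCorpus(DynamicInputDatatypeRequirement):
        datatype_doc_info = "any iterable corpus"

        def check_type(self, supplied_type):
            # Accept any subclass of IterableCorpus
            return issubclass(supplied_type, IterableCorpus)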

class MultipleInputs(datatype_requirements)[source]

Bases: object

An input datatype that can be used as an item in a module’s inputs, which lets the module accept an unbounded number of inputs, all satisfying the same datatype requirements. When writing the inputs in a config file, they can be specified as a comma-separated list of the usual type of specification (module name, with optional output name). Each item in the list must point to a datatype that satisfies the type-checking.

The list may also include (or entirely consist of) a base module name from the pipeline that has been expanded into multiple modules according to alternative parameters (the kind specified with vertical bars; see Multiple parameter values). Use the notation *name, where name is the base module name, to denote all of the expanded module names as inputs. These are treated as if you'd written out all of the expanded module names separated by commas.

In a config file, if you need the same input specification to be repeated multiple times in a list, instead of writing it out explicitly you can use a multiplier to repeat it N times by putting *N after it. This is particularly useful when N is the result of expanding module variables, allowing the number of times an input is repeated to depend on some modvar expression.

When get_input() is called on the module, instead of returning a single datatype, a list of datatypes is returned.
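For example, in a config file (module names here are hypothetical):

    [score_combiner]
    type=some.module.type
    input_scores=scorer_a,scorer_b.scores,*expanded_scorer,baseline*3

This supplies the default output of scorer_a, the scores output of scorer_b, all modules expanded from the base name expanded_scorer, and the baseline input repeated three times; get_input() on the module then returns the corresponding list of datatypes.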

class PimlicoDatatypeWriter(base_dir, additional_name=None)[source]

Bases: object

Abstract base class for data writers associated with Pimlico datatypes.

require_tasks(*tasks)[source]

Add a name or multiple names to the list of output tasks that must be completed before writing is finished.

task_complete(task)[source]
incomplete_tasks
write_metadata()[source]
subordinate_additional_name(name)[source]
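A writer subclass might use the task mechanism like this (a sketch; the task names and helper functions are hypothetical):

    from pimlico.datatypes.base import PimlicoDatatypeWriter

    class MyWriter(PimlicoDatatypeWriter):
        def __init__(self, base_dir, **kwargs):
            super(MyWriter, self).__init__(base_dir, **kwargs)
            # Writing isn't complete until both tasks have been marked done
            self.require_tasks("vocab", "documents")

    writer = MyWriter(output_dir)     # output_dir: path to write to
    write_vocab(writer)               # hypothetical helper
    writer.task_complete("vocab")
    write_documents(writer)           # hypothetical helper
    writer.task_complete("documents")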
class IterableCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Superclass of all datatypes which represent a dataset that can be iterated over document by document (or datapoint by datapoint - what exactly we’re iterating over may vary, though documents are most common).

The actual type of the data depends on the subclass: it could be, for example, coref output. Information about the type of individual documents is provided by document_type and this is used in type checking.

At creation time, length should be provided in the metadata, denoting how many documents are in the dataset.

datatype_name = 'iterable_corpus'
data_point_type

alias of pimlico.datatypes.documents.RawDocumentType

shell_commands = [<pimlico.datatypes.base.CountInvalidCmd object>]
get_detailed_status()[source]

Returns a list of strings, containing detailed information about the data. Only called if data_ready() == True.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.

classmethod check_type(supplied_type)[source]

Override type checking to require that the supplied type have a document type that is compatible with (i.e. a subclass of) the document type of this class.

classmethod type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. Default implementation just provides the class name. Classes that override check_type() may want to override this too.

classmethod full_datatype_name()[source]

Returns a string/unicode name for the datatype that includes relevant sub-type information. The default implementation just uses the attribute datatype_name, but subclasses may have more detailed information to add. For example, iterable corpus types also supply information about the data-point type.

process_document_data_with_datatype(data)[source]

Applies the corpus datatype's process_document() method to the raw data.

class IterableCorpusWriter(base_dir, additional_name=None)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

class InvalidDocument(module_name, error_info=None)[source]

Bases: object

Widely used in Pimlico to represent a document that is empty not because the original input document was empty, but because a module along the way had an error processing it. Document readers/writers should generally be robust to this and simply pass through the whole thing where possible, so that, wherever one of these pops up, it's always possible to work out where the error occurred.

static load(text)[source]
static invalid_document_or_text(text)[source]

If the text represents an invalid document, parse it and return an InvalidDocument object. Otherwise, return the text as is.
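A typical processing loop passes these markers straight through (a sketch; iteration over (name, data) pairs and the process() helper are assumptions):

    from pimlico.datatypes.base import InvalidDocument

    for doc_name, data in corpus:
        if isinstance(data, InvalidDocument):
            # An earlier module failed on this document: pass it through
            result = data
        else:
            result = process(data)  # hypothetical processing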

exception DatatypeLoadError[source]

Bases: exceptions.Exception

exception DatatypeWriteError[source]

Bases: exceptions.Exception

load_datatype(path)[source]

Try loading a datatype class for a given path. Raises a DatatypeLoadError if it’s not a valid datatype path.
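For example:

    from pimlico.datatypes.base import load_datatype, DatatypeLoadError

    try:
        datatype_cls = load_datatype("pimlico.datatypes.base.IterableCorpus")
    except DatatypeLoadError:
        # The path doesn't point to a valid datatype class
        datatype_cls = None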