base

Datatypes provide interfaces for reading and writing datasets. They provide different ways of reading in or iterating over datasets and different ways to write out datasets, as appropriate to the datatype. They are used by Pimlico to typecheck connections between modules to make sure that the output from one module provides a suitable type of data for the input to another. They are then also used by the modules to read in their input data coming from earlier in a pipeline and to write out their output data, to be passed to later modules.

See Datatypes for a guide to how Pimlico datatypes work.

This module defines the base classes for all datatypes.

class PimlicoDatatype(*args, **kwargs)[source]

Bases: object

The abstract superclass of all datatypes. Provides basic functionality for identifying where data should be stored and such.

Datatypes are used to specify the routines for reading the output from modules, via their reader class.

module is the ModuleInfo instance for the pipeline module that this datatype was produced by. It may be None, if the datatype wasn’t instantiated by a module. It is not required to be set if you’re instantiating a datatype in some context other than module output. It should generally be set for input datatypes, though, since they are treated as being created by a special input module.

If you’re creating a new datatype, refer to the datatype documentation.

datatype_options = {}

Options specified in the same way as module options that control the nature of the datatype. These are not things to do with reading of specific datasets, for which the dataset’s metadata should be used. These are things that have an impact on typechecking, such that options on the two checked datatypes are required to match for the datatypes to be considered compatible.

They should always be an ordered dict, so that they can be specified using positional arguments as well as kwargs and config parameters.
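
For example, a subclass might declare an option like this (a minimal sketch: the datatype, its name and the length option are hypothetical, and the option-definition format shown here mirrors that of module options):

from collections import OrderedDict

from pimlico.datatypes.base import PimlicoDatatype

class RankedList(PimlicoDatatype):
    # Hypothetical datatype whose type identity depends on a "length" option
    datatype_name = "ranked_list"

    datatype_options = OrderedDict([
        ("length", {
            "help": "Number of items stored in each list",
            "type": int,
            "default": 100,
        }),
    ])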

shell_commands = []

Override to provide shell commands specific to this datatype. Should include the superclass’ list.

datatype_supports_python2 = True

Most core Pimlico datatypes support use in Python 2 and 3. Datatypes that do should set this to True. If it is False, the datatype is assumed to work only in Python 3.

Python 2 compatibility requires extra work from the programmer. Datatypes should generally declare whether or not they provide this support by overriding this explicitly.

Use supports_python2() to check whether a datatype instance supports Python 2. (There may be reasons for a datatype’s instance to override this class-level setting.)

datatype_name = 'base_datatype'

Identifier (without spaces) to distinguish this datatype

supports_python2()[source]

By default, just returns cls.datatype_supports_python2. Subclasses might override this.

get_software_dependencies()[source]

Get a list of all software required to read this datatype. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of SoftwareDependency (from pimlico.core.dependencies.base), representing the libraries that this module depends on.

When providing dependency classes, take care not to put import statements at the top of the Python module that would make loading the dependency type itself dependent on runtime dependencies. Run any import checks by putting the import statements within this method.

You should call the super method for checking superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().
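
For example, a subclass might add a dependency like this (a sketch only: the datatype is hypothetical and it assumes the PythonPackageOnPip dependency class from pimlico.core.dependencies.python):

from pimlico.datatypes.base import PimlicoDatatype

class NumpyBackedData(PimlicoDatatype):
    # Hypothetical datatype whose reader needs Numpy
    datatype_name = "numpy_backed_data"

    def get_software_dependencies(self):
        # Import inside the method, so that merely loading this datatype
        # does not require the dependency to be importable
        from pimlico.core.dependencies.python import PythonPackageOnPip
        return super(NumpyBackedData, self).get_software_dependencies() + [
            PythonPackageOnPip("numpy"),
        ]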

get_writer_software_dependencies()[source]

Get a list of all software required to write this datatype using its Writer. This works in a similar way to get_software_dependencies() (for the Reader) and the dependencies will be checked before the writer is instantiated.

It is assumed that all the reader’s dependencies also apply to the writer, so this method only needs to specify any additional dependencies the writer has.

You should call the super method for checking superclass dependencies.

get_writer(base_dir, pipeline, module=None, **kwargs)[source]

Instantiate a writer to write data to the given base dir.

Kwargs are passed through to the writer and used to specify initial metadata and writer params.

Parameters:
  • base_dir – output dir to write dataset to
  • pipeline – current pipeline
  • module – module name (optional, for debugging only)
Returns:

instance of the writer subclass corresponding to this datatype
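
Usage might look something like the following sketch, assuming a pipeline instance is already loaded (the datatype, the output path, the metadata kwarg and the writer method are hypothetical):

datatype = MyDatatype()
with datatype.get_writer("/path/to/output/dir", pipeline, encoding="utf-8") as writer:
    # Call the writer's methods and set its attributes, as defined by
    # the datatype's Writer class (write_something is a placeholder)
    writer.write_something(my_data)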

classmethod instantiate_from_options(options={})[source]

Given string options e.g. from a config file, perform option processing and instantiate datatype

classmethod datatype_full_class_name()[source]

The fully qualified name of the class for this datatype, by which it is referenced in config files. Generally, datatypes don’t need to override this, but type requirements that take the place of datatypes for type checking need to provide it.

check_type(supplied_type)[source]

Method used by datatype type-checking algorithm to determine whether a supplied datatype (given as an instance of a subclass of PimlicoDatatype) is compatible with the present datatype, which is being treated as a type requirement.

Typically, the present class is a type requirement on a module input and supplied_type is the type provided by a previous module’s output.

The default implementation simply checks whether supplied_type is a subclass of the present class. Subclasses may wish to impose different or additional checks.

Parameters:supplied_type – type provided where the present class is required, or datatype instance
Returns:True if the check is successful, False otherwise
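
A subclass acting as a type requirement might add a check like this (a sketch: the requirement class and the dims attribute it compares are purely hypothetical):

from pimlico.datatypes.base import PimlicoDatatype

class EmbeddingsRequirement(PimlicoDatatype):
    # Hypothetical type requirement that also compares a "dims" attribute
    datatype_name = "embeddings_requirement"

    def check_type(self, supplied_type):
        # Keep the default subclass check
        if not super(EmbeddingsRequirement, self).check_type(supplied_type):
            return False
        # Additional, hypothetical constraint on the supplied type
        return getattr(supplied_type, "dims", None) == getattr(self, "dims", None)
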
type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. The default implementation just provides the class name. Classes that override check_type() may want to override this too.

full_datatype_name()[source]

Returns a string/unicode name for the datatype that includes relevant sub-type information. The default implementation just uses the attribute datatype_name, but subclasses may have more detailed information to add. For example, iterable corpus types also supply information about the data-point type.

run_browser(reader, opts)[source]

Launches a browser interface for reading this datatype, browsing the data provided by the given reader.

Not all datatypes provide a browser. For those that don’t, this method should raise a NotImplementedError.

opts provides the parsed command-line (argparse) options.

This tool used to be only available for iterable corpora, but now it’s possible for any datatype to provide a browser. IterableCorpus provides its own browser, as before, which uses one of the data point type’s formatters to format documents.

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: object

The abstract superclass of all dataset readers.

You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each datatype. You can add functionality to a datatype’s reader by creating a nested Reader class. This will inherit from the parent datatype’s reader. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:

class MyDatatype(PimlicoDatatype):
    class Reader:
        # Override reader things here
        pass

process_setup()[source]

Do any processing of the setup object (e.g. retrieving values and setting attributes on the reader) that should be done when the reader is instantiated.

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the data.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.

class Setup(datatype, data_paths)[source]

Bases: object

Abstract superclass of all dataset reader setup classes.

See Datatypes for information about how this class is used.

These classes provide any functionality relating to a reader that is needed before the reader is instantiated and ready to read. Most importantly, they provide the ready_to_read() method, which indicates whether the reader is ready to be instantiated.

The standard implementation, which can be used in almost all cases, takes a list of possible paths to the dataset at initialization and checks whether the dataset is ready to be read from any of them. You generally don’t need to override ready_to_read() with this, but just data_ready(), which checks whether the data is ready to be read in a specific location. You can call the parent class’ data-ready checks using super: super(MyDatatype.Reader.Setup, self).data_ready().

The whole Setup object will be passed to the corresponding Reader’s init, so that it has access to data locations, etc.

Subclasses may take different init args/kwargs and store whatever attributes are relevant for preparing their corresponding Reader. In such cases, you will usually override a ModuleInfo’s get_output_reader_setup() method for a specific output’s reader preparation, to provide it with the appropriate arguments. Do this by calling the Reader class’ get_setup(*args, **kwargs) class method, which passes args and kwargs through to the Setup’s init.

You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each reader type. You can add functionality to a reader’s setup by creating a nested Setup class. This will inherit from the parent reader’s setup. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:

class MyDatatype(PimlicoDatatype):
    class Reader:
        # Override reader things here

        class Setup:
            # Override setup things here
            # E.g.:
            def data_ready(self, path):
                # Parent checks: usually you want to do this
                if not super(MyDatatype.Reader.Setup, self).data_ready(path):
                    return False
                # Check whether the data's ready according to our own criteria
                # ...
                return True

The first arg to the init should always be the datatype instance.

reader_type

alias of PimlicoDatatype.Reader

data_ready(path)[source]

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don’t have to) check the setup’s parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.

ready_to_read()[source]

Check whether we’re ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won’t need to. See data_ready() instead.

Returns:True if the reader’s ready to be instantiated, False otherwise
get_required_paths()[source]

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.
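
For example (a sketch; the file names are hypothetical):

from pimlico.datatypes.base import PimlicoDatatype

class MyDatatype(PimlicoDatatype):
    class Reader:
        class Setup:
            def get_required_paths(self):
                # Relative to the data dir: both files must exist for the
                # base data_ready() check to succeed
                return ["vocab.json", "vectors.bin"]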

get_base_dir()[source]
Returns:the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we’ve already confirmed that at least one is available.
get_data_dir()[source]
Returns:the path to the data dir within the base dir (typically a dir called “data”)
read_metadata(base_dir)[source]

Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

get_reader(pipeline, module=None)[source]

Instantiate a reader using this setup.

Parameters:
  • pipeline – currently loaded pipeline
  • module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output
classmethod get_setup(datatype, *args, **kwargs)[source]

Instantiate a reader setup object for this reader. The args and kwargs are those of the reader’s corresponding setup class and will be passed straight through to the init.

metadata

Read in metadata from a file in the corpus directory.

Note that this is no longer cached in memory. We need to be sure that the metadata values returned are always up to date with what is on disk, so we always re-read the file when we need to get a value from the metadata. Since the file is typically small, this is unlikely to cause a problem. If we decide to return to caching the metadata dictionary in future, we will need to make sure that we can never run into problems with out-of-date metadata being returned.

class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: object

The abstract superclass of all dataset writers.

You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each datatype. You can add functionality to a datatype’s writer by creating a nested Writer class. This will inherit from the parent datatype’s writer. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:

class MyDatatype(PimlicoDatatype):
    class Writer:
        # Override writer things here
        pass

Writers should be used as context managers. Typically, you will get hold of a writer for a module’s output directly from the module-info instance:

with module.get_output_writer("output_name") as writer:
    # Call the writer's methods, set its attributes, etc
    writer.do_something(my_data)
    writer.some_attr = "This data"

Any additional kwargs passed into the writer (which you can do by passing kwargs to get_output_writer() on the module) will set values in the dataset’s metadata. Available parameters are given, along with their default values, in the dictionary metadata_defaults on a Writer class. They also include all values from ancestor writers.

It is important that parameters affecting the writing of the data are passed in as kwargs, to ensure that the correct values are available as soon as the writing process starts.

All metadata values, including those passed in as kwargs, should be serializable as simple JSON types.

Another set of parameters, writer params, is used to specify things that affect the writing process, but do not need to be stored in the metadata. This could be, for example, the number of CPUs to use for some part of the writing process. Unlike, for example, the format of the stored data, this is not needed later when the data is read.

Available writer params are given, along with their default values, in the dictionary writer_param_defaults on a Writer class. (They do not need to be JSON serializable.) Their values are also specified as kwargs in the same way as metadata.
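
A writer declaring both kinds of parameter might look like this (a sketch: the parameter names are hypothetical, and the (default, documentation) pair format shown for each entry is an assumption to be checked against the datatype documentation):

from pimlico.datatypes.base import PimlicoDatatype

class MyDatatype(PimlicoDatatype):
    class Writer:
        # Stored in the dataset's metadata: needed again when reading
        metadata_defaults = {
            "gzip": (False, "Hypothetical: whether to gzip the stored data"),
        }
        # Only affects the writing process: not stored
        writer_param_defaults = {
            "processes": (1, "Hypothetical: number of CPUs to use while writing"),
        }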

metadata_defaults = {}
writer_param_defaults = {}
required_tasks = []

This can be overridden on writer classes to provide a list of tasks that are added to the writer’s required tasks when it is initialized.

require_tasks(*tasks)[source]

Add a name or multiple names to the list of output tasks that must be completed before writing is finished

task_complete(task)[source]

Mark the named task as completed

incomplete_tasks

List of required tasks that have not yet been completed
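
Together these might be used like this (a sketch; the task names and writing methods are hypothetical):

from pimlico.datatypes.base import PimlicoDatatype

class MyDatatype(PimlicoDatatype):
    class Writer:
        # Writing is not considered complete until both tasks are marked done
        required_tasks = ["vocab", "vectors"]

        def write_vocab(self, vocab):
            # ... hypothetical code that stores the vocabulary ...
            self.task_complete("vocab")

        def write_vectors(self, vectors):
            # ... hypothetical code that stores the vectors ...
            self.task_complete("vectors")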

write_metadata()[source]
class DynamicOutputDatatype[source]

Bases: object

Types of module outputs may be specified as an instance of a subclass of PimlicoDatatype, or alternatively as an instance of a DynamicOutputDatatype subclass. In this case, get_datatype() is called when the output datatype is needed, passing in the module info instance for the module, so that a specialized datatype can be produced on the basis of options, input types, etc.

The dynamic type must provide certain pieces of information needed for typechecking.

If a base datatype is available (i.e. an indication of the datatype before the module is instantiated), we take the information regarding whether the datatype supports Python 2 from there. If not, we assume it does. This may seem the opposite of the convention used elsewhere: for example, the base datatype says it does not support Python 2 and subclasses must declare if they do. However, dynamic output datatypes are often used with modules that work with a broad range of input datatypes. It would therefore be wrong to say that they do not support Python 2, since they will, provided the input module does.

datatype_name = None
get_datatype(module_info)[source]
get_base_datatype()[source]

If it’s possible to say before the instance of a ModuleInfo is available what base datatype will be produced, implement this to return a datatype instance. By default, it returns None.

If this information is available, it will be used in documentation.
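
A dynamic output type might look something like this (a sketch: the class is hypothetical and it assumes the module-info’s get_input_datatype() method and an input called "corpus"):

from pimlico.datatypes.base import DynamicOutputDatatype

class SameAsInput(DynamicOutputDatatype):
    # Hypothetical: the output datatype is whatever the "corpus" input provides
    datatype_name = "same as input"

    def get_datatype(self, module_info):
        return module_info.get_input_datatype("corpus")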

supports_python2()[source]
class DynamicInputDatatypeRequirement[source]

Bases: object

Types of module inputs may be given as an instance of a subclass of PimlicoDatatype, a tuple of datatypes, or an instance of a DynamicInputDatatypeRequirement subclass. In this case, check_type(supplied_type) is called during typechecking to check whether the type that we’ve got conforms to the input type requirements.

Additionally, if datatype_doc_info is provided, it is used to represent the input type constraints in documentation.

datatype_doc_info = None
check_type(supplied_type)[source]
type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. The default implementation just provides the class name. Subclasses that override check_type() may want to override this too.
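
A custom input requirement might look like this (a sketch; the class and the attribute it checks are hypothetical):

from pimlico.datatypes.base import DynamicInputDatatypeRequirement

class HasCharacterEncoding(DynamicInputDatatypeRequirement):
    # Hypothetical requirement: accept any datatype declaring a character encoding
    datatype_doc_info = "Any datatype that declares a character_encoding attribute"

    def check_type(self, supplied_type):
        return hasattr(supplied_type, "character_encoding")

    def type_checking_name(self):
        return "datatype with character encoding"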

class MultipleInputs(datatype_requirements)[source]

Bases: object

A wrapper around an input datatype that can be used as an item in a module’s inputs, which lets the module accept an unbounded number of inputs, all satisfying the same datatype requirements.

When writing the inputs in a config file, they can be specified as a comma-separated list of the usual type of specification (module name, with optional output name). Each item in the list must point to a dataset (module output) that satisfies the type-checking for the wrapped datatype.

[module3]
type=pimlico.modules.some_module
input_datasets=module1.the_output,module2.the_output

Here module1’s output the_output and module2’s output the_output must both be of valid types for the multiple-input datasets to this module.

The list may also include (or entirely consist of) a base module name from the pipeline that has been expanded into multiple modules according to alternative parameters (alternative values separated by vertical bars; see Multiple parameter values). You can use the notation *name, where name is the base module name, to denote all of the expanded module names as inputs. These are treated as if you’d written out all of the expanded module names separated by commas.

[module1]
type=pimlico.modules.any_module
param={case1}first value for param|{case2}second value

[module3]
type=pimlico.modules.some_module
input_datasets=*module1.the_output

Here module1 will be expanded into module1[case1] and module1[case2], each having a different value for option param. The *-notation is a shorthand to say that the input datasets should get the output the_output from both of these alternatives, as if you had written module1[case1].the_output,module1[case2].the_output.

If a module provides multiple outputs, all of a suitable type, that you want to feed into the same (multiple-input) input, you can specify a list of all of the module’s outputs using the notation module_name.*.

# This module provides two outputs, output1 and output2
[module2]
type=pimlico.modules.multi_output_module

[module3]
type=pimlico.modules.some_module
input_datasets=module2.*

is equivalent to:

[module3]
type=pimlico.modules.some_module
input_datasets=module2.output1,module2.output2

If you need the same input specification to be repeated multiple times in a list, instead of writing it out explicitly you can use a multiplier to repeat it N times by putting *N after it. This is particularly useful when N is the result of expanding module variables, allowing the number of times an input is repeated to depend on some modvar expression.

[module3]
type=pimlico.modules.some_module
input_datasets=module1.the_output*3

is equivalent to:

[module3]
type=pimlico.modules.some_module
input_datasets=module1.the_output,module1.the_output,module1.the_output

When get_input() is called on the module info, if multiple inputs have been provided, instead of returning a single dataset reader, a list of readers is returned. You can use get_input(input_name, always_list=True) to always return a list of readers, even if only a single dataset was given as input. This is usually the best way to handle multiple inputs in module code.
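
In a module executor, this might look like the following sketch (assuming the input name input_datasets from the examples above):

# Inside a module executor's execute() method
readers = self.info.get_input("input_datasets", always_list=True)
for reader in readers:
    # Each item is an ordinary dataset reader satisfying the wrapped datatype
    process(reader)  # hypothetical processing function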

supports_python2()[source]
exception DatatypeLoadError[source]

Bases: Exception

exception DatatypeWriteError[source]

Bases: Exception