base

This module provides base classes for Pimlico modules.

The procedure for creating a new module is the same whether you’re contributing a module to the core set in the Pimlico codebase, writing a standalone module in your own codebase, or writing one for a specific pipeline.

A Pimlico module is identified by the full Python-path to the Python package that contains it. This package should be laid out as follows:

  • The module’s metadata is defined by a class in info.py called ModuleInfo, which should inherit from BaseModuleInfo or one of its subclasses.
  • The module’s functionality is provided by a class in execute.py called ModuleExecutor, which should inherit from BaseModuleExecutor.

The execute Python module will not be imported until an instance of the module is to be run. This means that you can import dependencies and do any necessary initialization at the point where it’s executed, without worrying about incurring the associated costs (and dependencies) every time a pipeline using the module is loaded.
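The deferred import described above can be illustrated with a minimal, self-contained sketch. It does not use Pimlico itself: the package name pimlico_sketch_module and the contents of its files are hypothetical, written to a temporary directory only so that the example runs on its own.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package with the layout described above.
# The package name and file contents are hypothetical stand-ins.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "pimlico_sketch_module")
os.makedirs(pkg_dir)
files = {
    "__init__.py": "",
    "info.py": "class ModuleInfo(object):\n    module_type_name = 'sketch'\n",
    "execute.py": "class ModuleExecutor(object):\n    def execute(self):\n        return None\n",
}
for filename, source in files.items():
    with open(os.path.join(pkg_dir, filename), "w") as f:
        f.write(source)
sys.path.insert(0, pkg_root)
importlib.invalidate_caches()

# Loading a pipeline only needs the metadata in info.py
info_module = importlib.import_module("pimlico_sketch_module.info")
executor_loaded_early = "pimlico_sketch_module.execute" in sys.modules

# The executor (and anything it imports) is loaded only when the module is run
execute_module = importlib.import_module("pimlico_sketch_module.execute")
executor_cls = execute_module.ModuleExecutor
```

Here executor_loaded_early is False: importing info.py does not pull in execute.py, so heavy imports at the top of execute.py cost nothing at pipeline load time.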

class BaseModuleInfo(module_name, pipeline, inputs={}, options={}, optional_outputs=[], docstring='', include_outputs=[], alt_expanded_from=None, alt_param_settings=[], module_variables={})[source]

Bases: object

Abstract base class for all pipeline modules’ metadata.

module_type_name = None
module_readable_name = None
module_options = {}
module_inputs = []

Specifies a list of (name, datatype instance) pairs for inputs that are always required

module_optional_inputs = []

Specifies a list of (name, datatype instance) pairs for optional inputs. The module’s execution may vary depending on what is provided. If these are not given, None is returned from get_input()

module_optional_outputs = []

Specifies a list of (name, datatype instance) pairs for outputs that are written only if they’re specified in the “output” option or used by another module

module_output_groups = []

List of output groups: (group_name, [output_name1, …]). Further groups may be added by build_output_groups().

module_executable = True

Whether the module should be executed. Typically True for almost all modules, except input modules (though some of them may also require execution) and filters.

module_executor_override = None

If specified, this ModuleExecutor class will be used instead of looking one up in the execute Python module

main_module = None

Usually None. In the case of stages of a multi-stage module, stores a pointer to the main module.

module_supports_python2 = False

Most core Pimlico modules support use in Python 2 and 3. Modules that do should set this to True. If it is False, the module is assumed to work only in Python 3.

Since Python 2 compatibility requires extra work from the programmer, this is False by default.

To check whether a module can be used in Python 2, call supports_python2(), which will check this and also input and output datatypes.

module_outputs = []

Specifies a list of (name, datatype instance) pairs for outputs that are always written
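As a rough illustration of how these class attributes fit together, here is a sketch of a ModuleInfo declaration. The BaseModuleInfo and datatype classes below are simplified stand-ins defined only so the example runs by itself; in a real module you would inherit from Pimlico's BaseModuleInfo and use real datatypes. The module type name and the input and output names are hypothetical.

```python
class BaseModuleInfo(object):
    """Simplified stand-in for the real base class (illustration only)."""
    module_type_name = None
    module_readable_name = None
    module_inputs = []
    module_optional_inputs = []
    module_outputs = []
    module_optional_outputs = []
    module_supports_python2 = False


class TextCorpus(object):
    """Hypothetical datatype standing in for a real Pimlico datatype."""


class WordCounts(object):
    """Hypothetical datatype standing in for a real Pimlico datatype."""


class ModuleInfo(BaseModuleInfo):
    module_type_name = "word_count"
    module_readable_name = "Word counter"
    # Always required
    module_inputs = [("corpus", TextCorpus())]
    # May be left unconnected
    module_optional_inputs = [("stopwords", TextCorpus())]
    # Always written
    module_outputs = [("counts", WordCounts())]
    # Written only if requested or used by another module
    module_optional_outputs = [("rare_words", WordCounts())]


required_input_names = [name for name, _ in ModuleInfo.module_inputs]
always_written = [name for name, _ in ModuleInfo.module_outputs]
```

Each entry is a (name, datatype instance) pair, so the names can be read off directly, as the last two lines do.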

classmethod supports_python2()[source]
Returns:True if the module can be run in Python 2 and 3, False if it only supports Python 3.
load_executor()[source]

Loads a ModuleExecutor for this Pimlico module. Usually, this just involves calling load_module_executor(), but the default executor loading may be overridden for a particular module type by overriding this function. It should always return a subclass of ModuleExecutor, unless there’s an error.

classmethod get_key_info_table()[source]

When generating module docs, the table at the top of the page is produced by calling this method. It should return a list of two-item lists (title + value). Make sure to include the super-class call if you override this to add in extra module-specific info.

metadata_filename
get_metadata()[source]
set_metadata_value(attr, val)[source]
set_metadata_values(val_dict)[source]
status
execution_history_path
add_execution_history_record(line)[source]

Output a single line to the file that stores the history of module execution, so we can trace what we’ve done.

execution_history

Get the entire recorded execution history for this module. Returns an empty string if no history has been recorded.

input_names

All required inputs first, then all supplied optional inputs

output_names
classmethod process_module_options(opt_dict)[source]

Parse the options in a dictionary (probably from a config file), checking that they’re valid for this module type.

Parameters:opt_dict – dict of options, keyed by option name
Returns:dict of options
classmethod extract_input_options(opt_dict, module_name=None, previous_module_name=None, module_expansions={})[source]

Given the config options for a module instance, pull out the ones that specify where the inputs come from and match them up with the appropriate input names.

The inputs returned are just names as they come from the config file. They are split into module name and output name, but they are not in any way matched up with the modules they connect to or type checked.

Parameters:
  • module_name – name of the module being processed, for error output. If not given, the name isn’t included in the error.
  • previous_module_name – name of the previous module in the order given in the config file, allowing a single-input module to default to connecting to this if the input connection wasn’t given
  • module_expansions – dictionary mapping module names to a list of expanded module names, where expansion has been performed as a result of alternatives in the parameters. Provided here so that the unexpanded names may be used to refer to the whole list of module names, where a module takes multiple inputs on one input parameter
Returns:

dictionary of inputs

static choose_optional_outputs_from_options(options, inputs)[source]

Normally, which optional outputs get produced by a module depend on the ‘output’ option given in the config file, plus any outputs that get used by subsequent modules. By overriding this method, module types can add extra outputs into the list of those to be included, conditional on other options.

It also receives the processed dictionary of inputs, so that the additional outputs can depend on what is fed into the input.

E.g. the corenlp module includes the ‘annotations’ output if annotators are specified, so that the user doesn’t need to give both options.

Note that this does not provide additional output definitions, just a list of the optional outputs (already defined) that should be included among the outputs produced.
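A minimal sketch of such an override, loosely following the corenlp example above. It assumes that the method returns a list of output names and that the option is called annotators; both are assumptions made for illustration, and the class is a stand-in that does not inherit from the real BaseModuleInfo.

```python
class ModuleInfo(object):
    """Stand-in; a real module would inherit from BaseModuleInfo."""
    # The optional output must already be defined here
    module_optional_outputs = [("annotations", object())]

    @staticmethod
    def choose_optional_outputs_from_options(options, inputs):
        # Assumption: the return value is a list of optional output names
        selected = []
        if options.get("annotators"):
            # Annotators were requested, so produce the annotations output too
            selected.append("annotations")
        return selected


with_annotators = ModuleInfo.choose_optional_outputs_from_options(
    {"annotators": "pos,lemma"}, {}
)
without_annotators = ModuleInfo.choose_optional_outputs_from_options({}, {})
```

With this in place, a user who sets the annotators option would not also need to list annotations in the ‘output’ option.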

static get_extra_outputs_from_options(options, inputs)

Normally, which optional outputs get produced by a module depend on the ‘output’ option given in the config file, plus any outputs that get used by subsequent modules. By overriding this method, module types can add extra outputs into the list of those to be included, conditional on other options.

It also receives the processed dictionary of inputs, so that the additional outputs can depend on what is fed into the input.

E.g. the corenlp module includes the ‘annotations’ output if annotators are specified, so that the user doesn’t need to give both options.

Note that this does not provide additional output definitions, just a list of the optional outputs (already defined) that should be included among the outputs produced.

provide_further_outputs()[source]

Called during instantiation, once inputs and options are available, to add a further list of module outputs that are dependent on inputs or options.

When overriding this, you can provide a new docstring, which will be used in the module docs to describe the extra conditional outputs that are added.

build_output_groups()[source]

Called during instantiation to produce a list of named groups of outputs. The list extends the statically defined output groups in module_output_groups. You should use the static list unless you need to override this for conditionally added outputs.

Called after all input, options and output processing has been done, so the outputs in the attribute available_outputs are the final list of outputs that this module instance has.

Returns a list of groups, each specified as: (group_name, [output_name1, ...]).

May contain as many groups as necessary. They are not required to cover all the outputs and outputs may feature in multiple groups.

Should not include group “all”, which is always included by default.

If you override this, use the docstring to specify what output groups will get added and how they are named. The text will be used in the generated module docs.

is_output_group_name(group_name)[source]
get_output_group(group_name)[source]

Get the list of output names corresponding to the given output group name.

Raises a KeyError if the output group does not exist.
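The group lookup can be pictured roughly as follows. This is a sketch of the behaviour described above, not the library's implementation: the functions are free-standing rather than methods, outputs are represented by their names only, and the group and output names are hypothetical.

```python
# Hypothetical static groups, in the (group_name, [output_name1, ...]) format
module_output_groups = [("models", ["word_model", "char_model"])]
# Hypothetical final list of output names for a module instance
output_names = ["word_model", "char_model", "vocab"]


def is_output_group_name(group_name):
    # "all" is always available, without being declared
    return group_name == "all" or any(
        name == group_name for name, _ in module_output_groups
    )


def get_output_group(group_name):
    if group_name == "all":
        return list(output_names)
    for name, members in module_output_groups:
        if name == group_name:
            return list(members)
    raise KeyError("no output group named '%s'" % group_name)


models = get_output_group("models")
everything = get_output_group("all")
```

Groups need not cover every output (vocab is in no declared group here), and the same output could appear in several groups.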

get_module_output_dir(absolute=False, short_term_store=None)[source]

Gets the path to the base output dir to be used by this module, relative to the storage base dir. When outputting data, the storage base dir will always be the short term store path, but when looking for the output data other base paths might be explored, including the long term store.

Kwarg short_term_store is included for backward compatibility, but outputs a deprecation warning.

Parameters:absolute – if True, return absolute path to output dir in output store
Returns:path, relative to store base path, or if absolute=True absolute path to output dir
get_absolute_output_dir(output_name)[source]

The simplest way to get hold of the directory to use to output data to for a given output. This is the usual way to get an output directory for an output writer.

The directory is an absolute path to a location in the Pimlico output storage location.

Parameters:output_name – the name of an output
Returns:the absolute path to the output directory to use for the named output
get_output_dir(output_name, absolute=False, short_term_store=None)[source]

Kwarg short_term_store is included for backward compatibility, but outputs a deprecation warning.

Parameters:
  • absolute – return an absolute path in the storage location used for output. If False (default), return a relative path, specified relative to the root of the Pimlico store used. This allows multiple stores to be searched for output
  • output_name – the name of an output
Returns:

the path to the output directory to use for the named output, which may be relative to the root of the Pimlico store in use (default) or an absolute path in the output store, depending on absolute

get_output_datatype(output_name=None)[source]

Get the datatype of a named output, or the default output. Returns an instance of the relevant PimlicoDatatype subclass. This can be used for typechecking and also for getting a reader for the output data, once it’s ready, by supplying it with the path to the data.

To get a reader for the output data, use get_output().

Parameters:output_name – output whose datatype to retrieve. Default output if not specified
Returns:
output_ready(output_name=None)[source]

Check whether the named output is ready to be read from one of its possible storage locations.

Parameters:output_name – output to check, or default output if not given
Returns:False if data is not ready to be read
instantiate_output_reader_setup(output_name, datatype)[source]

Produce a reader setup instance that will be used to prepare this reader. This provides functionality like checking that the data is ready to be read before the reader is instantiated.

The standard implementation uses the datatype’s methods to get its standard reader setup and reader, but some modules may need to override this to provide other readers.

output_name is provided so that overriding methods’ behaviour can be conditioned on which output is being fetched.

instantiate_output_reader(output_name, datatype, pipeline, module=None)[source]

Prepare a reader for a particular output. The default implementation is very simple, but subclasses may override this for cases where the normal process of creating readers has to be modified.

Parameters:
  • output_name – output to produce a reader for
  • datatype – the datatype for this output, already inferred
get_output_reader_setup(output_name=None)[source]
get_output(output_name=None)[source]

Get a reader corresponding to one of the outputs of the module. The reader will be that which corresponds to the output’s declared datatype and will read the data from any of the possible locations where it can be found.

If the data is not available in any location, raises a DataNotReadyError.

To check whether the data is ready without calling this, call output_ready().

get_output_writer(output_name=None, **kwargs)[source]

Get a writer instance for the given output. Kwargs will be passed through to the writer and used to specify metadata and writer params.

Parameters:
  • output_name – output to get writer for, or default output if not given
  • kwargs
Returns:

is_multiple_input(input_name=None)[source]

Returns True if the named input (or default input if no name is given) is a MultipleInputs input, False otherwise. If it is, get_input() will return a list, otherwise it will return a single datatype.

get_input_module_connection(input_name=None, always_list=False)[source]

Get the ModuleInfo instance and output name for the output that connects up with a named input (or the first input) on this module instance. Used by get_input() – most of the time you probably want to use that to get the instantiated datatype for an input.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single (module, output_name) pair. If always_list=True, in this latter case we return a single-item list.

get_input_datatype(input_name=None, always_list=False)[source]

Get a list of datatype instances corresponding to one of the inputs to the module. If an input name is not given, the first input is returned.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single datatype.

get_input_reader_setup(input_name=None, always_list=False)[source]

Get reader setup for one of the inputs to the module. Looks up the corresponding output from another module and uses that module’s metadata to get that output’s instance. If an input name is not given, the first input is returned.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single datatype instance. If always_list=True, in this latter case we return a single-item list.

If the requested input name is an optional input and it has not been supplied, returns None.

You can get a reader for the input, once the data is ready to be read, by calling get_reader() on the setup object. Or use get_input() on the module.

get_input(input_name=None, always_list=False)[source]

Get a reader for one of the inputs to the module. Should only be called once the input data is ready to read. It’s therefore fine to call this from a module executor, since data availability has already been checked by this point.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single datatype instance. If always_list=True, in this latter case we return a single-item list.

If the requested input name is an optional input and it has not been supplied, returns None.

Similarly, if you run in preliminary mode, multiple inputs might produce None for some of their inputs if the data is not ready.

input_ready(input_name=None)[source]

Check whether the data is ready to go corresponding to the named input.

Parameters:input_name – input to check
Returns:True if input is ready
all_inputs_ready()[source]

Check input_ready() on all inputs.

Returns:True if all input datatypes are ready to be used
classmethod is_filter()[source]
missing_module_data()[source]

Reports missing data not associated with an input dataset.

Calling missing_data() reports any problems with input data associated with a particular input to this module. However, modules may also rely on data that does not come from one of their inputs. This happens primarily (perhaps solely) when a module option points to a data source. This might be the case with any module, but is particularly common among input reader modules, which have no inputs, but read data according to their options.

Returns:list of problems
missing_data(input_names=None, assume_executed=[], assume_failed=[], allow_preliminary=False)[source]

Check whether all the input data for this module is available. If not, return a list of strings indicating which outputs of which modules are not available. If it’s all ready, returns an empty list.

To check specific inputs, give a list of input names. To check all inputs, don’t specify input_names. To check the default input, give input_names=[None]. If not checking a specific input, also checks non-input data (see missing_module_data()).

If assume_executed is given, it should be a list of module names which may be assumed to have been executed at the point when this module is executed. Any outputs from those modules will be excluded from the input checks for this module, on the assumption that they will have become available, even if they’re not currently available, by the time they’re needed.

If assume_failed is given, it should be a list of module names which should be assumed to have failed. If we rely on data from the output of one of them, instead of checking whether it’s available we simply assume it’s not.

Why do this? When running multiple modules in sequence, if one fails it is possible that its output datasets look like complete datasets. For example, a partially written iterable corpus may look like a perfectly valid corpus, which happens to be smaller than it should be. After the execution failure, we may check other modules to see whether it’s possible to run them. Then we need to know not to trust the output data from the failed module, even if it looks valid.

If allow_preliminary=True, for any inputs that are multiple inputs and have multiple connections to previous modules, consider them to be satisfied if at least one of their inputs is ready. The normal behaviour is to require all of them to be ready, but in a preliminary run this requirement is relaxed.
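The interaction of these arguments can be sketched as a free-standing function. This is an illustration under stated assumptions, not Pimlico's code: connections and readiness are passed in as plain data structures, the message format is invented, and giving assume_failed precedence over assume_executed is a guess.

```python
def missing_data(connections, ready, assume_executed=(), assume_failed=(),
                 allow_preliminary=False):
    """
    connections: dict mapping input name -> list of (module_name, output_name)
    ready: set of (module_name, output_name) pairs whose data looks complete
    """
    missing = []
    for input_name, sources in connections.items():
        unavailable = []
        for module_name, output_name in sources:
            if module_name in assume_failed:
                # Never trust output from a failed module, even if it looks valid
                unavailable.append((module_name, output_name))
            elif module_name in assume_executed:
                # Will have been produced by the time it's needed
                continue
            elif (module_name, output_name) not in ready:
                unavailable.append((module_name, output_name))
        if allow_preliminary and len(sources) > 1 and len(unavailable) < len(sources):
            # Multiple input with at least one source ready: good enough
            continue
        missing.extend("%s output '%s'" % pair for pair in unavailable)
    return missing


connections = {
    "corpus": [("tokenize", "documents")],
    "vectors": [("embed_a", "vectors"), ("embed_b", "vectors")],
}
ready = {("tokenize", "documents"), ("embed_a", "vectors")}

normal = missing_data(connections, ready)
preliminary = missing_data(connections, ready, allow_preliminary=True)
after_failure = missing_data(connections, ready, assume_executed=["embed_b"],
                             assume_failed=["tokenize"])
```

In the last call, the output of tokenize is reported as missing even though it appears in ready, which is the point of assume_failed.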

classmethod is_input()[source]
dependencies
Returns:list of names of modules that this one depends on for its inputs.
get_transitive_dependencies()[source]

Transitive closure of dependencies.

Returns:list of names of modules that this one recursively (transitively) depends on for its inputs.
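A transitive closure over direct dependencies can be computed with a simple worklist, as in the sketch below. The dictionary of direct dependencies and the module names are hypothetical; the real method works from the module's own input connections.

```python
def transitive_dependencies(module_name, direct_dependencies):
    """Collect every module reachable by following dependencies repeatedly."""
    collected = []
    to_visit = list(direct_dependencies.get(module_name, []))
    while to_visit:
        dep = to_visit.pop()
        if dep not in collected:
            collected.append(dep)
            # Follow this dependency's own dependencies too
            to_visit.extend(direct_dependencies.get(dep, []))
    return collected


# Hypothetical pipeline: read -> tokenize -> vectorize -> train
direct = {
    "train": ["vectorize"],
    "vectorize": ["tokenize"],
    "tokenize": ["read"],
    "read": [],
}
train_deps = sorted(transitive_dependencies("train", direct))
```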
typecheck_inputs()[source]
typecheck_input(input_name)[source]

Typecheck a single input. typecheck_inputs() calls this and is used for typechecking of a pipeline. This method returns the (or the first) satisfied input requirement, or raises an exception if typechecking failed, so it can also be useful to call it separately to establish which requirement was met.

The result is always a list, but will contain only one item unless the input is a multiple input.

get_software_dependencies()[source]

Check that all software required to execute this module is installed and locatable. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.
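The pattern of deferring imports to check time and chaining to the superclass might look like the following sketch. Every class here is a simplified stand-in defined for the example: ImportableDependency and its available() method are hypothetical, not Pimlico's dependency API, and the package names are chosen only to show one satisfied and one unsatisfied case.

```python
import importlib


class SoftwareDependency(object):
    """Stand-in for the real dependency base class (illustration only)."""
    def __init__(self, name):
        self.name = name

    def available(self):
        raise NotImplementedError


class ImportableDependency(SoftwareDependency):
    """Hypothetical dependency type: satisfied if a Python package imports."""
    def __init__(self, name, package):
        super(ImportableDependency, self).__init__(name)
        self.package = package

    def available(self):
        # The import is attempted here, at check time, not at module load time
        try:
            importlib.import_module(self.package)
        except ImportError:
            return False
        return True


class BaseModuleInfo(object):
    """Stand-in base class with no dependencies of its own."""
    def get_software_dependencies(self):
        return []


class ModuleInfo(BaseModuleInfo):
    def get_software_dependencies(self):
        # Keep the superclass dependencies, then add this module's own
        return super(ModuleInfo, self).get_software_dependencies() + [
            ImportableDependency("JSON support", "json"),
            ImportableDependency("Missing library", "no_such_package_for_this_sketch"),
        ]


unsatisfied = [
    dep.name for dep in ModuleInfo().get_software_dependencies()
    if not dep.available()
]
```

Because nothing is imported until available() is called, defining the dependency list never fails on a machine where the libraries are absent.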

get_input_software_dependencies()[source]

Collects library dependencies from the input datatypes to this module, which will need to be satisfied for the module to be run.

Unlike get_software_dependencies(), it shouldn’t need to be overridden by subclasses, since it just collects the results of getting dependencies from the datatypes.

get_output_software_dependencies()[source]

Collects library dependencies from the output datatypes to this module, which will need to be satisfied for the module to be run.

Unlike get_input_software_dependencies(), it may not be the case that all of these dependencies strictly need to be satisfied before the module can be run. It could be that a datatype can be written without satisfying all the dependencies needed to read it. However, we assume that dependencies of all output datatypes must be satisfied in order to run the module that writes them, since this is usually the case, and these are checked before running the module.

Unlike get_software_dependencies(), it shouldn’t need to be overridden by subclasses, since it just collects the results of getting dependencies from the datatypes.

check_ready_to_run()[source]

Called before a module is run, or if the ‘check’ command is called. This will only be called after all library dependencies have been confirmed ready (see get_software_dependencies()).

Essentially, this covers any module-specific checks that used to be in check_runtime_dependencies() other than library installation (e.g. checking models exist).

Always call the super class’ method if you override.

Returns a list of (name, description) pairs, where the name identifies the problem briefly and the description explains what’s missing and (ideally) how to fix it.
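A sketch of an override that checks for a model file. The base class is a stand-in that only stores options, and the model_path option and the wording of the problem are hypothetical.

```python
import os


class BaseModuleInfo(object):
    """Stand-in base class (illustration only)."""
    def __init__(self, options):
        self.options = options

    def check_ready_to_run(self):
        return []


class ModuleInfo(BaseModuleInfo):
    def check_ready_to_run(self):
        # Always start from the superclass's list of problems
        problems = super(ModuleInfo, self).check_ready_to_run()
        model_path = self.options["model_path"]  # hypothetical option name
        if not os.path.exists(model_path):
            problems.append((
                "missing model",
                "No model found at %s. Train or download it before running"
                % model_path,
            ))
        return problems


not_ready = ModuleInfo(
    {"model_path": os.path.join(os.getcwd(), "no_such_model_for_this_sketch")}
).check_ready_to_run()
ready = ModuleInfo({"model_path": os.getcwd()}).check_ready_to_run()
```

An empty list means the module-specific checks passed; each pair otherwise names the problem and says how to fix it.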

reset_execution()[source]

Remove all output data and metadata from this module to make a fresh start, as if it’s never been executed.

May be overridden if a module has some side effect other than creating/modifying things in its output directory(/ies), but overridden methods should always call the super method. Occasionally this is necessary, but most of the time the base implementation is enough.

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the module’s status that is specific to the module type. This may include module-specific information about execution status, for example.

Subclasses may override this to supply useful (human-readable) information specific to the module type. They should call the super method.

classmethod module_package_name()[source]

The package name for the module, which is used to identify it in config files. This is the package containing the info.py in which the ModuleInfo is defined.

get_execution_dependency_tree()[source]

Tree of modules that will be executed when this one is executed. Where this module depends on filters, the tree goes back through them to find what they depend on (since they will be executed simultaneously)

get_all_executed_modules()[source]

Returns a list of all the modules that will be executed when this one is (including itself). This is the current module (if executable), plus any filters used to produce its inputs.

lock_path
lock()[source]

Mark the module as locked, so that it cannot be executed. Called when execution begins, to ensure that you don’t end up executing the same module twice simultaneously.

unlock()[source]

Remove the execution lock on this module.

is_locked()[source]
Returns:True if the module is currently locked from execution
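One common way to implement this kind of lock is a marker file in the module's output directory. The sketch below assumes that approach purely for illustration: the class, the lock file name and the use of a temporary directory are all hypothetical, and the source does not say how Pimlico stores its lock.

```python
import os
import tempfile


class LockableModule(object):
    """Hypothetical stand-in showing a file-based execution lock."""
    def __init__(self, output_dir):
        self.output_dir = output_dir

    @property
    def lock_path(self):
        # Hypothetical file name
        return os.path.join(self.output_dir, ".execution_lock")

    def lock(self):
        with open(self.lock_path, "w") as f:
            f.write("locked")

    def unlock(self):
        if os.path.exists(self.lock_path):
            os.remove(self.lock_path)

    def is_locked(self):
        return os.path.exists(self.lock_path)


module = LockableModule(tempfile.mkdtemp())
module.lock()
locked_during_run = module.is_locked()
module.unlock()
locked_after_run = module.is_locked()
```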
get_log_filenames(name='error')[source]

Get a list of all the log filenames of the given prefix that exist in the module’s output dir. They will be ordered according to their numerical suffixes (i.e. the order in which they were created).

Returns a list of (filename, num) tuples, where num is the numerical suffix as an int.

get_new_log_filename(name='error')[source]

Returns an absolute path that can be used to output a log file for this module. This is used for outputting error logs. It will always return a filename that doesn’t currently exist, so can be used multiple times to output multiple logs.

get_last_log_filename(name='error')[source]

Get the most recent error log that was created by a call to get_new_log_filename(). Returns an absolute path, or None if no matching files are found.
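The three log filename helpers can be sketched together as free-standing functions. The naming scheme used here (prefix, three-digit number, .log extension) is an assumption made so the example is concrete; the source only states that filenames carry a numerical suffix.

```python
import os
import re
import tempfile


def get_log_filenames(output_dir, name="error"):
    # Hypothetical naming scheme: <name><number>.log
    pattern = re.compile(r"^%s(\d+)\.log$" % re.escape(name))
    found = []
    for filename in os.listdir(output_dir):
        match = pattern.match(filename)
        if match:
            found.append((filename, int(match.group(1))))
    # Order by numerical suffix, i.e. by creation order
    return sorted(found, key=lambda pair: pair[1])


def get_new_log_filename(output_dir, name="error"):
    existing = get_log_filenames(output_dir, name)
    next_num = existing[-1][1] + 1 if existing else 0
    return os.path.join(os.path.abspath(output_dir), "%s%03d.log" % (name, next_num))


def get_last_log_filename(output_dir, name="error"):
    existing = get_log_filenames(output_dir, name)
    if not existing:
        return None
    return os.path.join(os.path.abspath(output_dir), existing[-1][0])


log_dir = tempfile.mkdtemp()
nothing_yet = get_last_log_filename(log_dir)
first = get_new_log_filename(log_dir)
open(first, "w").close()
second = get_new_log_filename(log_dir)
open(second, "w").close()
existing_logs = get_log_filenames(log_dir)
latest = get_last_log_filename(log_dir)
```

Each call to get_new_log_filename() returns a name that does not yet exist, so repeated failures produce separate logs rather than overwriting one another.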

collect_unexecuted_dependencies(modules)[source]

Given a list of modules, checks through all the modules that they depend on to put together a list of modules that need to be executed so that the given list will be left in an executed state. The list includes the modules themselves, if they’re not fully executed, and unexecuted dependencies of any unexecuted modules (recursively).

Parameters:modules – list of ModuleInfo instances
Returns:list of ModuleInfo instances that need to be executed
collect_runnable_modules(pipeline, preliminary=False)[source]

Look for all unexecuted modules in the pipeline to find any that are ready to be executed. Keep collecting runnable modules, including those that will become runnable once we’ve run earlier ones in the list, to produce a list of a sequence of modules that could be set running now.

Parameters:pipeline – pipeline config
Returns:ordered list of runnable modules. Note that they must be run in this order, as some might depend on earlier ones in the list
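The idea of repeatedly collecting modules that become runnable once earlier ones have run can be sketched as below. This is a simplified illustration, not the library function: modules are given as (name, dependency names) pairs rather than a pipeline config, readiness is reduced to "all dependencies executed", and the module names are hypothetical.

```python
def collect_runnable(modules, executed):
    """
    modules: list of (name, [dependency names]) in pipeline order
    executed: set of names of modules that have already been executed
    """
    available = set(executed)
    runnable = []
    progress = True
    while progress:
        progress = False
        for name, deps in modules:
            if name in available:
                continue
            if all(dep in available for dep in deps):
                # Runnable now, or once the earlier entries in the list have run
                runnable.append(name)
                available.add(name)
                progress = True
    return runnable


pipeline = [
    ("read", []),
    ("tokenize", ["read"]),
    ("train", ["tokenize", "external_data"]),  # external_data is never produced
    ("count", ["tokenize"]),
]
order = collect_runnable(pipeline, executed={"read"})
```

The result is ordered so that each module appears after everything it depends on; train is left out because one of its dependencies can never be satisfied.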
satisfies_typecheck(provided_type, type_requirements)[source]

Interface to Pimlico’s standard type checking (see check_type) that returns a boolean to say whether type checking succeeded or not.

check_type(provided_type, type_requirements)[source]

Type-checking algorithm for making sure outputs from modules connect up with inputs that they satisfy the requirements for.

type_checking_name(typ)[source]
class BaseModuleExecutor(module_instance_info, stage=None, debug=False, force_rerun=False)[source]

Bases: object

Abstract base class for executors for Pimlico modules. These are classes that actually do the work of executing the module on given inputs, writing to given output locations.

execute()[source]

Run the actual module execution.

May return None, in which case it’s assumed to have fully completed. If a string is returned, it’s used as an alternative module execution status. Used, e.g., by multi-stage modules that need to be run multiple times.
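A sketch of how the return value of execute() might be interpreted. The base class is a stand-in that only mirrors the constructor signature shown above, and the status strings ("COMPLETE" and the stage status) are hypothetical.

```python
class BaseModuleExecutor(object):
    """Stand-in mirroring the constructor signature (illustration only)."""
    def __init__(self, module_instance_info, stage=None, debug=False, force_rerun=False):
        self.info = module_instance_info
        self.stage = stage
        self.debug = debug
        self.force_rerun = force_rerun

    def execute(self):
        raise NotImplementedError


class SingleStageExecutor(BaseModuleExecutor):
    def execute(self):
        # ... do the work, writing to the module's outputs ...
        return None  # fully completed


class TwoStageExecutor(BaseModuleExecutor):
    def execute(self):
        if self.stage == 1:
            # More stages remain, so report an alternative status
            return "COMPLETED STAGE 1"  # hypothetical status string
        return None


def final_status(executor):
    result = executor.execute()
    # None means the module ran to completion
    return "COMPLETE" if result is None else result


single = final_status(SingleStageExecutor(module_instance_info=None))
first_stage = final_status(TwoStageExecutor(module_instance_info=None, stage=1))
last_stage = final_status(TwoStageExecutor(module_instance_info=None, stage=2))
```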

exception ModuleInfoLoadError(*args, **kwargs)[source]

Bases: Exception

exception ModuleExecutorLoadError[source]

Bases: Exception

exception ModuleTypeError[source]

Bases: Exception

exception TypeCheckError(*args, **kwargs)[source]

Bases: Exception

Pipeline type-check mismatch.

Full description of problem provided in error message. May optionally provide more detailed information about the input and output (source) that failed to match, the expected type and the received type, all as strings. Specify using kwargs input, source, required_type and provided_type.

format()[source]

Provide a nice visual format of the mismatch to help the user.

exception DependencyError(message, stderr=None, stdout=None)[source]

Bases: Exception

Raised when a module’s dependencies are not satisfied. Generally, this means a dependency library needs to be installed, either on the local system or (more often) by calling the appropriate make target in the lib directory.

load_module_executor(path_or_info)[source]

Utility for loading the executor class for a module from its full path. More or less just a wrapper around an import, with some error checking. Locates the executor by a standard procedure that involves checking for an “execute” Python module alongside the info’s module.

Note that you shouldn’t generally use this directly, but instead call the load_executor() method on a module info (which will call this, unless special behaviour has been defined).

Parameters:path_or_info – path to Python package containing the module
Returns:class
load_module_info(path)[source]

Utility to load the metadata for a Pimlico pipeline module from its package Python path.

Parameters:path
Returns: