pimlico.core.modules.base module

This module provides base classes for Pimlico modules.

The procedure for creating a new module is the same whether you’re contributing it to the core set in the Pimlico codebase, keeping it as a standalone module in your own codebase, or writing it for a specific pipeline.

A Pimlico module is identified by the full Python-path to the Python package that contains it. This package should be laid out as follows:

  • The module’s metadata is defined by a class in info.py called ModuleInfo, which should inherit from BaseModuleInfo or one of its subclasses.
  • The module’s functionality is provided by a class in execute.py called ModuleExecutor, which should inherit from BaseModuleExecutor.

The execute Python module will not be imported until an instance of the module is to be run. This means that you can import dependencies and do any necessary initialization at the point where it’s executed, without worrying about incurring the associated costs (and dependencies) every time a pipeline using the module is loaded.
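
As a minimal sketch of the layout described above, the two classes might look as follows. The base classes here are local stand-ins so that the example runs without Pimlico installed; in a real module they would be imported from pimlico.core.modules.base. The package name my_package.my_module, the module type name and the class bodies are hypothetical.

```python
# Stand-ins for the real base classes, so the sketch runs without Pimlico.
# In a real module these would come from pimlico.core.modules.base.
class BaseModuleInfo(object):
    module_type_name = None


class BaseModuleExecutor(object):
    def __init__(self, module_instance_info):
        self.info = module_instance_info

    def execute(self):
        raise NotImplementedError


# --- would live in my_package/my_module/info.py (hypothetical package) ---
class ModuleInfo(BaseModuleInfo):
    module_type_name = "my_module"


# --- would live in my_package/my_module/execute.py ---
class ModuleExecutor(BaseModuleExecutor):
    def execute(self):
        # Costly imports belong in execute.py (or here), because that file
        # is only imported when the module is actually run
        import json
        self.output = json.dumps({"module": self.info.module_type_name})
        # Implicitly returns None, i.e. execution fully completed


executor = ModuleExecutor(ModuleInfo())
executor.execute()
```

Keeping info.py free of heavy imports is what lets a pipeline be loaded and inspected on a machine where the module’s runtime dependencies are not installed.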

exception pimlico.core.modules.base.DependencyError(message, stderr=None, stdout=None)[source]

Bases: exceptions.Exception

Raised when a module’s dependencies are not satisfied. Generally, this means a dependency library needs to be installed, either on the local system or (more often) by calling the appropriate make target in the lib directory.

exception pimlico.core.modules.base.ModuleExecutorLoadError[source]

Bases: exceptions.Exception

exception pimlico.core.modules.base.ModuleInfoLoadError(*args, **kwargs)[source]

Bases: exceptions.Exception

exception pimlico.core.modules.base.ModuleTypeError[source]

Bases: exceptions.Exception

exception pimlico.core.modules.base.TypeCheckError[source]

Bases: exceptions.Exception

class pimlico.core.modules.base.BaseModuleExecutor(module_instance_info, stage=None, debug=False, force_rerun=False)[source]

Bases: object

Abstract base class for executors for Pimlico modules. These are classes that actually do the work of executing the module on given inputs, writing to given output locations.

execute()[source]

Run the actual module execution.

May return None, in which case it’s assumed to have fully completed. If a string is returned, it’s used as an alternative module execution status. Used, e.g., by multi-stage modules that need to be run multiple times.
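
The return-value convention can be sketched as below. The helper final_status(), the status string "COMPLETE" and the two executor classes are hypothetical names chosen for illustration; they are not taken from the Pimlico API.

```python
def final_status(executor):
    """Hypothetical helper: turn execute()'s return value into a status."""
    result = executor.execute()
    # None means the module ran to completion; a string is used as an
    # alternative execution status
    return "COMPLETE" if result is None else result


class OneShotExecutor(object):
    def execute(self):
        return None


class TwoStageExecutor(object):
    """Needs two runs: reports a custom status after the first one."""

    def __init__(self):
        self.runs = 0

    def execute(self):
        self.runs += 1
        return "STAGE 1 DONE" if self.runs == 1 else None


two_stage = TwoStageExecutor()
statuses = [final_status(two_stage), final_status(two_stage)]
```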

class pimlico.core.modules.base.BaseModuleInfo(module_name, pipeline, inputs={}, options={}, optional_outputs=[], docstring='')[source]

Bases: object

Abstract base class for all pipeline modules’ metadata.

add_execution_history_record(line)[source]

Output a single line to the file that stores the history of module execution, so we can trace what we’ve done.

all_inputs_ready()[source]

Check input_ready() on all inputs.

Returns: True if all input datatypes are ready to be used
check_ready_to_run()[source]

Called before a module is run, or if the ‘check’ command is called. This will only be called after all library dependencies have been confirmed ready (see get_software_dependencies()).

Essentially, this covers any module-specific checks that used to be in check_runtime_dependencies() other than library installation (e.g. checking models exist).

Always call the super class’ method if you override.

Returns a list of (name, description) pairs, where the name identifies the problem briefly and the description explains what’s missing and (ideally) how to fix it.
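
An override might look roughly like this. The base class is a stand-in that returns an empty list, and the model_path attribute is a hypothetical module-specific setting; only the (name, description) return shape and the super call come from the description above.

```python
import os


class BaseModuleInfo(object):
    # Stand-in for the real base class, which may report its own problems
    def check_ready_to_run(self):
        return []


class ModuleInfo(BaseModuleInfo):
    def __init__(self, model_path):
        # Hypothetical setting: path to a model the module needs at runtime
        self.model_path = model_path

    def check_ready_to_run(self):
        # Always start from the superclass's problems
        problems = super(ModuleInfo, self).check_ready_to_run()
        if not os.path.exists(self.model_path):
            problems.append((
                "missing model",
                "Model file %s does not exist: train or download it first"
                % self.model_path,
            ))
        return problems


problems = ModuleInfo("/nonexistent/path/model.bin").check_ready_to_run()
```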

classmethod extract_input_options(opt_dict, module_name=None, previous_module_name=None)[source]

Given the config options for a module instance, pull out the ones that specify where the inputs come from and match them up with the appropriate input names.

The inputs returned are just names as they come from the config file. They are split into module name and output name, but they are not in any way matched up with the modules they connect to or type checked.

Parameters:
  • module_name – name of the module being processed, for error output. If not given, the name isn’t included in the error.
  • previous_module_name – name of the previous module in the order given in the config file, allowing a single-input module to default to connecting to this if the input connection wasn’t given
Returns:

dictionary of inputs
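
The kind of splitting described above can be sketched as follows. This is not the actual implementation: the option names (input for the default input, input_<name> for named inputs) and the module.output value syntax are assumptions made for illustration, as is the function name split_input_options().

```python
def split_input_options(opt_dict, previous_module_name=None):
    """Hypothetical sketch: pull input connections out of config options."""
    inputs = {}
    for key, value in opt_dict.items():
        if key == "input":
            input_name = None  # the default input
        elif key.startswith("input_"):
            input_name = key[len("input_"):]
        else:
            continue  # not an input option
        # Split into module name and output name, without any type checking
        module_name, _, output_name = value.partition(".")
        inputs[input_name] = (module_name, output_name or None)
    if not inputs and previous_module_name is not None:
        # No connection given: default to the previous module in the config
        inputs[None] = (previous_module_name, None)
    return inputs
```

Note that nothing here checks that the named modules or outputs exist; as in the description above, the result only records names as they appear in the config.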

get_absolute_output_dir(output_name)[source]
get_all_executed_modules()[source]

Returns a list of all the modules that will be executed when this one is (including itself). This is the current module (if executable), plus any filters used to produce its inputs.

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the module’s status that is specific to the module type. This may include module-specific information about execution status, for example.

Subclasses may override this to supply useful (human-readable) information specific to the module type. They should call the super method.

get_execution_dependency_tree()[source]

Tree of modules that will be executed when this one is executed. Where this module depends on filters, the tree goes back through them to find what they depend on (since they will be executed simultaneously).

static get_extra_outputs_from_options(options)[source]

Normally, which optional outputs get produced by a module depends on the ‘output’ option given in the config file, plus any outputs that get used by subsequent modules. By overriding this method, module types can add extra outputs into the list of those to be included, conditional on other options.

E.g. the corenlp module includes the ‘annotations’ output if annotators are specified, so that the user doesn’t need to give both options.

get_input(input_name=None, always_list=False)[source]

Get a datatype instance corresponding to one of the inputs to the module. Looks up the corresponding output from another module and uses that module’s metadata to get that output’s instance. If an input name is not given, the first input is returned.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single datatype instance. If always_list=True, in this latter case we return a single-item list.

get_input_datatype(input_name=None, always_list=False)[source]

Get the datatype class, or list of datatype classes, corresponding to one of the inputs to the module. If an input name is not given, the first input is returned.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single datatype.

get_input_module_connection(input_name=None, always_list=False)[source]

Get the ModuleInfo instance and output name for the output that connects up with a named input (or the first input) on this module instance. Used by get_input() – most of the time you probably want to use that to get the instantiated datatype for an input.

If the input type was specified with MultipleInputs, meaning that we’re expecting an unbounded number of inputs, this is a list. Otherwise, it’s a single (module, output_name) pair. If always_list=True, in this latter case we return a single-item list.

get_input_software_dependencies()[source]

Collects library dependencies from the input datatypes to this module, which will need to be satisfied for the module to be run.

Unlike get_software_dependencies(), it shouldn’t need to be overridden by subclasses, since it just collects the results of getting dependencies from the datatypes.

classmethod get_key_info_table()[source]

When generating module docs, the table at the top of the page is produced by calling this method. It should return a list of two-item lists (title + value). Make sure to include the super-class call if you override this to add in extra module-specific info.
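
A sketch of such an override is shown below. The base class is a stand-in whose single row is invented for the example, and the extra "Model format" row is hypothetical; the point is only the two-item-list shape and the super-class call.

```python
class BaseModuleInfo(object):
    module_type_name = None

    @classmethod
    def get_key_info_table(cls):
        # Stand-in: the real base class builds its own set of rows
        return [["Type", cls.module_type_name]]


class ModuleInfo(BaseModuleInfo):
    module_type_name = "my_module"

    @classmethod
    def get_key_info_table(cls):
        # Keep the superclass rows and append module-specific ones
        rows = super(ModuleInfo, cls).get_key_info_table()
        return rows + [["Model format", "binary"]]


table = ModuleInfo.get_key_info_table()
```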

get_metadata()[source]
get_module_output_dir(short_term_store=False)[source]

Gets the path to the base output dir to be used by this module, relative to the storage base dir. When outputting data, the storage base dir will always be the short term store path, but when looking for the output data other base paths might be explored, including the long term store.

Parameters: short_term_store – if True, return absolute path to output dir in short-term store (used for output)
Returns: path, relative to store base path, or if short_term_store=True absolute path to output dir
get_output(output_name=None, additional_names=None)[source]

Get a datatype instance corresponding to one of the outputs of the module.

get_output_datatype(output_name=None, additional_names=[])[source]
get_output_dir(output_name, short_term_store=False)[source]
get_software_dependencies()[source]

Check that all software required to execute this module is installed and locatable. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.
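
The advice about deferring imports can be illustrated with the sketch below. SoftwareDependency and BaseModuleInfo are simplified stand-ins, and ImportableDependency with its available() method is a hypothetical dependency type, not part of the Pimlico API. The point is that nothing is imported until the dependency is actually checked.

```python
import importlib


class SoftwareDependency(object):
    """Stand-in for pimlico.core.dependencies.base.SoftwareDependency."""

    def __init__(self, name):
        self.name = name


class ImportableDependency(SoftwareDependency):
    """Hypothetical dependency: satisfied if a Python package imports."""

    def __init__(self, name, package):
        super(ImportableDependency, self).__init__(name)
        self.package = package

    def available(self):
        # The import is attempted only when the check runs, never when
        # the file defining this class is loaded
        try:
            importlib.import_module(self.package)
        except ImportError:
            return False
        return True


class BaseModuleInfo(object):
    # Stand-in for the real base class
    def get_software_dependencies(self):
        return []


class ModuleInfo(BaseModuleInfo):
    def get_software_dependencies(self):
        # Include the superclass dependencies, then add our own
        return super(ModuleInfo, self).get_software_dependencies() + [
            ImportableDependency("JSON support", "json"),
            ImportableDependency("Imaginary library", "no_such_library_xyz"),
        ]


deps = ModuleInfo().get_software_dependencies()
availability = [dep.available() for dep in deps]
```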

input_ready(input_name=None)[source]

Check whether the datatype is (or datatypes are) ready to go, corresponding to the named input.

Parameters: input_name – input to check
Returns: True if input is ready
instantiate_output_datatype(output_name, output_datatype)[source]

Subclasses may want to override this to provide special behaviour for instantiating particular outputs’ datatypes.

classmethod is_filter()[source]
classmethod is_input()[source]
is_locked()[source]
Returns: True if the module is currently locked from execution
is_multiple_input(input_name=None)[source]

Returns True if the named input (or default input if no name is given) is a MultipleInputs input, False otherwise. If it is, get_input() will return a list, otherwise it will return a single datatype.

load_executor()[source]

Loads a ModuleExecutor for this Pimlico module. Usually, this just involves calling load_module_executor(), but the default executor loading may be overridden for a particular module type by overriding this function. It should always return a subclass of ModuleExecutor, unless there’s an error.

lock()[source]

Mark the module as locked, so that it cannot be executed. Called when execution begins, to ensure that you don’t end up executing the same module twice simultaneously.

missing_data(input_names=None)[source]

Check whether all the input data for this module is available. If not, return a list of strings indicating which outputs of which modules are not available. If it’s all ready, returns an empty list.

To check specific inputs, give a list of input names. To check all inputs, don’t specify input_names. To check the default input, give input_names=[None].
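
The three calling conventions can be sketched with a stub. StubModuleInfo is a hypothetical stand-in that only mimics the input_names handling described above; the message format and the readiness bookkeeping are invented for the example.

```python
class StubModuleInfo(object):
    """Hypothetical stub mimicking the input_names convention."""

    def __init__(self, inputs):
        # inputs: ordered list of (input_name, is_ready) pairs;
        # the first entry plays the role of the default input
        self.inputs = inputs

    def missing_data(self, input_names=None):
        all_names = [name for name, _ in self.inputs]
        if input_names is None:
            selected = all_names  # check every input
        else:
            # None inside the list stands for the default (first) input
            selected = [all_names[0] if n is None else n for n in input_names]
        ready = dict(self.inputs)
        return ["input '%s' is not ready" % n for n in selected if not ready[n]]


info = StubModuleInfo([("text", True), ("model", False)])
all_missing = info.missing_data()                  # all inputs
default_missing = info.missing_data([None])        # default input only
model_missing = info.missing_data(["model"])       # one named input
```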

classmethod module_package_name()[source]

The package name for the module, which is used to identify it in config files. This is the package containing the info.py in which the ModuleInfo is defined.

classmethod process_config(config_dict, module_name=None, previous_module_name=None)[source]

Convenience wrapper to do all config processing from a dictionary of module config.

classmethod process_module_options(opt_dict)[source]

Parse the options in a dictionary (probably from a config file), checking that they’re valid for this module type.

Parameters: opt_dict – dict of options, keyed by option name
Returns: dict of options
reset_execution()[source]

Remove all output data and metadata from this module to make a fresh start, as if it’s never been executed.

May be overridden if a module has some side effect other than creating/modifying things in its output directory(/ies), but overridden methods should always call the super method. Occasionally this is necessary, but most of the time the base implementation is enough.

set_metadata_value(attr, val)[source]
set_metadata_values(val_dict)[source]
typecheck_inputs()[source]
unlock()[source]

Remove the execution lock on this module.
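
The locking protocol formed by lock(), is_locked() and unlock() could be realised with a lock file, roughly as below. This is a sketch under the assumption that the lock is a plain file at lock_path; the class LockableModule and the file contents are hypothetical, and the real implementation may differ.

```python
import os
import tempfile


class LockableModule(object):
    """Hypothetical sketch of a file-based execution lock."""

    def __init__(self, lock_path):
        self.lock_path = lock_path

    def is_locked(self):
        return os.path.exists(self.lock_path)

    def lock(self):
        # Creating the file marks the module as being executed
        with open(self.lock_path, "w") as lock_file:
            lock_file.write("locked")

    def unlock(self):
        if os.path.exists(self.lock_path):
            os.remove(self.lock_path)


module = LockableModule(os.path.join(tempfile.mkdtemp(), "execution_lock"))
states = [module.is_locked()]
module.lock()
states.append(module.is_locked())
module.unlock()
states.append(module.is_locked())
```

A simple existence check like this is not atomic, so it guards against accidental double execution rather than against two processes racing to take the lock at the same instant.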

dependencies
Returns: list of names of modules that this one depends on for its inputs.
execution_history

Get the entire recorded execution history for this module. Returns an empty string if no history has been recorded.

execution_history_path
input_names
lock_path
main_module = None

Usually None. In the case of stages of a multi-stage module, stores a pointer to the main module.

metadata_filename
module_executable = True

Whether the module should be executed. Typically True for almost all modules, except input modules (though some of them may also require execution) and filters.

module_executor_override = None

If specified, this ModuleExecutor class will be used instead of looking one up in the execute Python module.

module_inputs = []
module_optional_outputs = []

Specifies a list of (name, datatype class) pairs for outputs that are written only if they’re specified in the “output” option or used by another module.

module_options = {}
module_outputs = []

Specifies a list of (name, datatype class) pairs for outputs that are always written.

module_readable_name = None
module_type_name = None
output_names
status
pimlico.core.modules.base.check_type(provided_type, type_requirements)[source]

Type-checking algorithm that makes sure each module output is connected only to inputs whose type requirements it satisfies.

pimlico.core.modules.base.load_module_executor(path_or_info)[source]

Utility for loading the executor class for a module from its full path. More or less just a wrapper around an import, with some error checking. Locates the executor by a standard procedure that involves checking for an “execute” python module alongside the info’s module.

Note that you shouldn’t generally use this directly, but instead call the load_executor() method on a module info (which will call this, unless special behaviour has been defined).

Parameters: path_or_info – path to Python package containing the module
Returns: class
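
The lookup described above amounts to something like the following sketch. The function load_executor_class() is a hypothetical re-implementation that only handles a package path, the exception class is a stand-in, and raising it on import failure is an assumption. A fake package is registered in sys.modules so that the example runs without any files on disk.

```python
import importlib
import sys
import types


class ModuleExecutorLoadError(Exception):
    """Stand-in for the exception of the same name documented above."""


def load_executor_class(package_path):
    """Hypothetical sketch: find ModuleExecutor in <package>.execute."""
    execute_path = "%s.execute" % package_path
    try:
        execute_module = importlib.import_module(execute_path)
    except ImportError as err:
        raise ModuleExecutorLoadError(
            "could not import %s: %s" % (execute_path, err))
    if not hasattr(execute_module, "ModuleExecutor"):
        raise ModuleExecutorLoadError(
            "%s does not define ModuleExecutor" % execute_path)
    return execute_module.ModuleExecutor


# Demo setup only: a fake package "demo_module" with an "execute" submodule
class ModuleExecutor(object):
    pass


package = types.ModuleType("demo_module")
package.__path__ = []
execute_mod = types.ModuleType("demo_module.execute")
execute_mod.ModuleExecutor = ModuleExecutor
sys.modules["demo_module"] = package
sys.modules["demo_module.execute"] = execute_mod

loaded = load_executor_class("demo_module")
```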
pimlico.core.modules.base.load_module_info(path)[source]

Utility to load the metadata for a Pimlico pipeline module from its package Python path.

Parameters:path
Returns: