pimlico.core.config module

Reading of various types of config files, in particular a pipeline config.

exception pimlico.core.config.PipelineCheckError(cause, *args, **kwargs)[source]

Bases: exceptions.Exception

exception pimlico.core.config.PipelineConfigParseError(*args, **kwargs)[source]

Bases: exceptions.Exception

exception pimlico.core.config.PipelineStructureError[source]

Bases: exceptions.Exception

class pimlico.core.config.PipelineConfig(name, pipeline_config, local_config, raw_module_configs, module_order, filename=None, variant='main', available_variants=[], log=None, all_filenames=None, module_docstrings={}, module_aliases={})[source]

Bases: object

Main configuration for a pipeline, read in from a config file.

Each section, except for vars and pipeline, defines a module instance in the pipeline. Some of these can be executed, others act as filters on the outputs of other modules, or input readers.

Each section that defines a module has a type parameter. Usually, this is a fully-qualified Python package name that leads to the module type’s Python code (the package must contain a Python module called info). A special type is alias. This simply defines a module alias – an alternative name for an already defined module. It should have exactly one other parameter, input, specifying the name of the module we’re aliasing.

Special sections:

  • vars:

    May contain any variable definitions, to be used later on in the pipeline. Further down, expressions like %(varname)s will be expanded into the value assigned to varname in the vars section.

  • pipeline:

    Main pipeline-wide configuration. The following options are required for every pipeline:

    • name: a single-word name for the pipeline, used to determine where files are stored
    • release: the release of Pimlico for which the config file was written. It is considered compatible with later minor versions of the same major release, but not with later major releases. Typically, a user receiving the pipeline config will get hold of an appropriate version of the Pimlico codebase to run it with.

    Other optional settings:

    • python_path: a path or paths, relative to the directory containing the config file, in which Python modules/packages used by the pipeline can be found. Typically, a config file is distributed with a directory of Python code providing extra modules, datatypes, etc. Multiple paths are separated by colons (:).
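
As an illustration of the layout described above, the following sketch parses a small config with Python's standard configparser and separates the special sections from the module sections. It does not use Pimlico's own loader. The section names reader and reader_alias, the package name mypackage.modules.reader, the option path and all values are hypothetical, chosen only for the example.

```python
import configparser

# A made-up config: [vars] and [pipeline] are special, every other
# section defines a module instance (or an alias for one).
EXAMPLE_CONFIG = """
[vars]
corpus_dir=/data/corpus

[pipeline]
name=demo
release=0.1
python_path=src/python

[reader]
type=mypackage.modules.reader
path=%(corpus_dir)s/raw

[reader_alias]
type=alias
input=reader
"""

# RawConfigParser leaves %(...)s expressions untouched, so the expansion
# can be done explicitly below.
parser = configparser.RawConfigParser()
parser.read_string(EXAMPLE_CONFIG)

special_sections = {"vars", "pipeline"}
module_sections = [s for s in parser.sections() if s not in special_sections]

# Expand %(varname)s using the values from the vars section
variables = dict(parser.items("vars"))
expanded_path = parser.get("reader", "path") % variables

print(module_sections)
print(expanded_path)
```

Pimlico's real config reader also handles directives, variants and multiple parameter values, which this sketch ignores.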

Special variable substitutions

Certain variable substitutions are always available, in addition to those defined in vars sections.

  • pimlico_root:

    Root directory of Pimlico, usually the directory pimlico/ within the project directory.

  • project_root:

    Root directory of the whole project. Currently assumed to always be the parent directory of pimlico_root.

  • output_dir:

    Path to output dir (usually output in Pimlico root).

  • long_term_store:

    Long-term store base directory being used under the current config. Can be used to link to data from other pipelines run on the same system.

  • short_term_store:

    Short-term store base directory being used under the current config. Can be used to link to data from other pipelines run on the same system.
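
The relationships between these variables can be sketched as a plain dictionary. This is only an assumption about how the values relate, based on the descriptions above (project root as the parent of the Pimlico root, output inside the Pimlico root). The function name special_vars and the example paths are hypothetical.

```python
import os


def special_vars(pimlico_root, long_term_store, short_term_store):
    """Build a dict of the always-available substitutions (illustrative only)."""
    pimlico_root = os.path.abspath(pimlico_root)
    return {
        "pimlico_root": pimlico_root,
        # Assumed to be the parent directory of pimlico_root
        "project_root": os.path.dirname(pimlico_root),
        # Usually a directory called output in the Pimlico root
        "output_dir": os.path.join(pimlico_root, "output"),
        "long_term_store": long_term_store,
        "short_term_store": short_term_store,
    }


substitutions = special_vars("/home/user/project/pimlico", "/mnt/long", "/tmp/short")
# The dict can be applied to an option value with Python's % operator
print("%(long_term_store)s/other_pipeline/corpus" % substitutions)
```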

Directives:

Certain special directives are processed when reading config files. They are lines that begin with %%, followed by the directive name and any arguments.

  • variant:

    Allows a line to be included only when loading a particular variant of a pipeline. The variant name is specified as part of the directive in the form: variant:variant_name. You may include the line in more than one variant by specifying multiple names, separated by commas (and no spaces). You can also name the default variant “main”, in which case the line will be left out of all other variants. The rest of the line, after the directive and variant name(s), is the content that will be included in those variants.

  • novariant:

    A line to be included only when not loading a variant of the pipeline. Equivalent to variant:main.

  • include:

    Include the entire contents of another file. The filename, specified relative to the config file in which the directive is found, is given after a space.

  • abstract:

    Marks a config file as being abstract. This means that Pimlico will not allow it to be loaded as a top-level config file, but only allow it to be included in another config file.

  • copy:

    Copies all config settings from another module, whose name is given as the sole argument. May be used multiple times in the same module and later copies will override earlier. Settings given explicitly in the module’s config override any copied settings. The following settings are not copied: input(s), filter, outputs, type.
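
The variant and novariant directives can be pictured as a line filter applied before the config is parsed. The sketch below is an assumption about the line syntax, based only on the description above (%% followed by the directive, a space, then the content). It is not Pimlico's preprocessing code, and the function name select_variant_lines is hypothetical.

```python
def select_variant_lines(lines, variant="main"):
    """Keep plain lines, plus directive lines that apply to the given variant."""
    selected = []
    for line in lines:
        if line.startswith("%%variant:"):
            # Form assumed: %%variant:name1,name2 <content>
            names, _, content = line[len("%%variant:"):].partition(" ")
            if variant in names.split(","):
                selected.append(content)
        elif line.startswith("%%novariant"):
            # Equivalent to variant:main
            _, _, content = line.partition(" ")
            if variant == "main":
                selected.append(content)
        else:
            selected.append(line)
    return selected


config_lines = [
    "[reader]",
    "%%novariant path=/data/full",
    "%%variant:small,tiny path=/data/sample",
]
print(select_variant_lines(config_lines))
print(select_variant_lines(config_lines, variant="small"))
```

The include, abstract and copy directives are not handled here.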

Multiple parameter values:

Sometimes you want to write a whole load of modules that are almost identical, varying in just one or two parameters. You can give a parameter multiple values by writing them separated by vertical bars (|). The module definition will be expanded to produce a separate module for each value, with all the other parameters being identical.

You can even do this with multiple parameters of the same module and the expanded modules will cover all combinations of the parameter assignments.

Each module will be given a distinct name, based on the varied parameters. If just one is varied, the names will be of the form module_name{param_value}. If multiple parameters are varied at once, the names will be module_name{param_name0=param_value0~param_name1=param_value1~...}.
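
The expansion can be sketched with itertools.product. This is a minimal illustration of the naming scheme described above, not the library's implementation. The function name expand_module is hypothetical, and the order of parameters within the generated name is assumed to follow the order in which they appear in the module's section.

```python
import itertools


def expand_module(name, params):
    """Expand |-separated parameter values into one module per combination."""
    split = {key: value.split("|") for key, value in params.items()}
    varied = [key for key, values in split.items() if len(values) > 1]
    if not varied:
        return {name: dict(params)}

    expanded = {}
    for combo in itertools.product(*(split[key] for key in varied)):
        assignment = dict(zip(varied, combo))
        if len(varied) == 1:
            suffix = combo[0]
        else:
            suffix = "~".join("%s=%s" % (key, assignment[key]) for key in varied)
        module_params = dict(params)
        module_params.update(assignment)
        expanded["%s{%s}" % (name, suffix)] = module_params
    return expanded


one = expand_module("train", {"type": "mypackage.train", "size": "100|200"})
two = expand_module("train", {"size": "100|200", "alpha": "0.1|0.5"})
print(list(one))
print(list(two))
```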

static empty(local_config=None, override_local_config={}, override_pipeline_config={})[source]

Used to programmatically create an empty pipeline. It will contain no modules, but provides a gateway to system info, etc., and can be used in place of a real Pimlico pipeline.

Parameters:
  • local_config
  • override_local_config
Returns:

find_all_data_paths(path)[source]
find_data_path(path, default=None)[source]

Given a path to a data dir/file relative to a data store, tries resolving it against each store’s base directory in turn. If it exists in a store, that absolute path is returned. If it exists in no store, None is returned. If the path is already an absolute path, it is returned unchanged.

The stores searched are the long-term store and the short-term store, though in the future more valid data storage locations may be added.

Parameters:
  • path – path to data, relative to store base
  • default – by default, None is returned if no data is found. If default="short", the path relative to the short-term store is returned in this case; if default="long", the path relative to the long-term store.
Returns:

absolute path to data, or None if not found in any store
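
The lookup described above can be sketched as a standalone function. This is not the method's actual code: the store locations are passed in explicitly here, whereas the real method takes them from the pipeline's local config, and the order in which the stores are searched is an assumption.

```python
import os
import tempfile


def find_data_path(path, long_term_store, short_term_store, default=None):
    """Resolve a store-relative path against the two store base directories."""
    if os.path.isabs(path):
        # Absolute paths are returned unchanged
        return path
    # Search order (long-term first) is an assumption of this sketch
    for store in (long_term_store, short_term_store):
        candidate = os.path.join(store, path)
        if os.path.exists(candidate):
            return candidate
    if default == "short":
        return os.path.join(short_term_store, path)
    if default == "long":
        return os.path.join(long_term_store, path)
    return None


# Usage with two temporary directories standing in for the stores
long_store = tempfile.mkdtemp()
short_store = tempfile.mkdtemp()
os.makedirs(os.path.join(short_store, "demo", "corpus"))

found = find_data_path(os.path.join("demo", "corpus"), long_store, short_store)
missing = find_data_path("nothing_here", long_store, short_store)
fallback = find_data_path("nothing_here", long_store, short_store, default="long")
```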

get_dependent_modules(module_name, recurse=False)[source]

Return a list of the names of modules that depend on the named module for their inputs.

Parameters: recurse – include all transitive dependents, not just those that immediately depend on the module.
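
The difference between immediate and transitive dependents can be illustrated on a plain dictionary of the same shape as module_dependencies (module name mapped to the names of the modules it depends on). The function below is a hypothetical standalone sketch, not the method's implementation, and the module names are made up.

```python
def dependent_modules(module_dependencies, module_name, recurse=False):
    """List the modules that take input from module_name."""
    dependents = []
    queue = [module_name]
    while queue:
        current = queue.pop(0)
        for name, deps in module_dependencies.items():
            if current in deps and name not in dependents:
                dependents.append(name)
                if recurse:
                    # Follow the chain to dependents of dependents
                    queue.append(name)
    return dependents


deps = {"reader": [], "tokenize": ["reader"], "train": ["tokenize"]}
print(dependent_modules(deps, "reader"))
print(dependent_modules(deps, "reader", recurse=True))
```
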
get_module_schedule()[source]

Work out the order in which modules should be executed. This is an ordering that respects dependencies, so that modules are executed after their dependencies, but otherwise follows the order in which modules were specified in the config.

Returns: list of module names
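
One way to produce such an ordering is a depth-first walk over the modules in config order, emitting each module only after its dependencies. The sketch below illustrates that idea on plain data. It is not necessarily how the method is implemented, it assumes the dependency graph has no cycles, and the function and module names are hypothetical.

```python
def module_schedule(module_order, module_dependencies):
    """Order modules so that each one comes after everything it depends on."""
    scheduled = []

    def visit(name):
        if name in scheduled:
            return
        for dep in module_dependencies.get(name, []):
            visit(dep)
        scheduled.append(name)

    # Walking in config order keeps that order wherever dependencies allow
    for name in module_order:
        visit(name)
    return scheduled


order = ["train", "reader", "evaluate"]
deps = {"train": ["reader"], "evaluate": ["train"]}
schedule = module_schedule(order, deps)
print(schedule)
```
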
insert_module(module_info)[source]

Usually, all modules in the pipeline are loaded, based on config, by this class. However, occasionally, we may want to make modules available as part of the pipeline from elsewhere. In particular, this is necessary when building multi-stage modules – each stage is added (with special module name prefixes) into the main pipeline.

static load(filename, local_config=None, variant='main', override_local_config={})[source]
static load_local_config(filename=None, override={})[source]
load_module_info(module_name)[source]

Load the module metadata for a named module in the pipeline. Loads only this module’s data and nothing more.

Parameters:module_name
Returns:
path_relative_to_config(path)[source]

Get an absolute path to a file/directory that’s been specified relative to a config file (usually within the config file).

Parameters: path – relative path
Returns: absolute path
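
In terms of os.path, the resolution amounts to joining the relative path onto the directory containing the config file. The following is a sketch under that assumption, with the config filename passed in explicitly (the real method uses the filename stored on the pipeline). The paths are made up.

```python
import os


def path_relative_to_config(config_filename, path):
    """Resolve path against the directory that contains the config file."""
    config_dir = os.path.dirname(os.path.abspath(config_filename))
    return os.path.abspath(os.path.join(config_dir, path))


resolved = path_relative_to_config(
    os.path.join("conf", "pipeline.conf"), os.path.join("data", "corpus")
)
print(resolved)
```
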
reset_all_modules()[source]

Resets the execution states of all modules, restoring the output dirs as if nothing’s been run.

module_dependencies

Dictionary mapping a module name to a list of the names of modules that it depends on for its inputs.

modules
pimlico.core.config.check_for_cycles(pipeline)[source]
pimlico.core.config.check_pipeline(pipeline)[source]

Checks a pipeline over for metadata errors, cycles and other problems. Called every time a pipeline is loaded, to check the whole pipeline’s metadata is in order.
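
The cycle check can be illustrated with a standard depth-first search over a dictionary shaped like module_dependencies. This is a generic sketch of cycle detection, not the code of check_for_cycles, and it returns a boolean rather than raising an exception. The function name has_cycle is hypothetical.

```python
def has_cycle(module_dependencies):
    """Return True if any module depends, directly or indirectly, on itself."""
    visiting = set()  # modules on the current DFS path
    done = set()      # modules already fully explored

    def visit(name):
        if name in done:
            return False
        if name in visiting:
            return True
        visiting.add(name)
        for dep in module_dependencies.get(name, []):
            if visit(dep):
                return True
        visiting.discard(name)
        done.add(name)
        return False

    return any(visit(name) for name in module_dependencies)


print(has_cycle({"a": ["b"], "b": []}))
print(has_cycle({"a": ["b"], "b": ["a"]}))
```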

pimlico.core.config.check_release(release_str)[source]
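
The compatibility rule for the release option (same major release, same or later minor version) can be sketched as follows. The function name and the assumption that release strings have the plain numeric form major.minor are hypothetical. How check_release actually parses release strings, and how it reports an incompatibility, is not documented here.

```python
def release_is_compatible(config_release, running_release):
    """Check a config's release against the release of the running code."""
    # Assumes plain "major.minor" strings such as "0.2"
    config_major, config_minor = (int(p) for p in config_release.split(".")[:2])
    running_major, running_minor = (int(p) for p in running_release.split(".")[:2])
    # Later minor versions of the same major release are accepted;
    # treating earlier minor versions as incompatible is an assumption.
    return running_major == config_major and running_minor >= config_minor


print(release_is_compatible("0.2", "0.5"))
print(release_is_compatible("0.2", "1.0"))
```
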
pimlico.core.config.get_dependencies(pipeline, modules, recursive=False)[source]

Get a list of software dependencies required by the subset of modules given.

If recursive=True, dependencies’ dependencies are added to the list too.

Parameters:
  • pipeline
  • modules – list of modules to check. If None, checks all modules
pimlico.core.config.multiply_alternatives(alternative_params)[source]
pimlico.core.config.preprocess_config_file(filename, variant='main', initial_vars={})[source]
pimlico.core.config.print_dependency_leaf_problems(dep)[source]
pimlico.core.config.print_missing_dependencies(pipeline, modules)[source]

Check runtime dependencies for a subset of modules and output a table of missing dependencies.

Parameters:
  • pipeline
  • modules – list of modules to check. If None, checks all modules
Returns:

True if no missing dependencies, False otherwise

pimlico.core.config.var_substitute(option_val, vars)[source]