Pipeline variants
You can create several different versions of a pipeline, called pipeline variants, in a single config file. The data corresponding to each variant is kept completely separate. This is useful when you want multiple versions of a pipeline that are almost identical but differ in a few small ways.
The most common use, though by no means the only one, is to create a variant that runs faster than the main pipeline, so that you can quickly test the whole pipeline during development.
Every pipeline has one variant by default, called main. You define other variants simply by using special directives to mark particular lines as belonging to a particular variant. Lines with no variant marking will appear in all variants.
Loading variants
If you don’t specify otherwise when loading a pipeline, the main variant will be loaded. Use the --variant parameter (or -v) to specify another variant by name:
./pimlico.sh mypipeline.conf -v smaller status
To see a list of all available variants of a particular pipeline, use the variants command:
./pimlico.sh mypipeline.conf variants
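To make the idea of "available variants" concrete, here is a minimal Python sketch that scans the lines of a config file for the variant directives described under Variant directives and collects the names they mention. It is only an illustration and not Pimlico's own implementation: the function list_variants and the pattern VARIANT_NAMES_RE are hypothetical names introduced here, and the sketch assumes that directives start at the beginning of a line.

```python
import re

# Hypothetical pattern, not Pimlico's parser: captures the comma-separated
# names in "%%variant:a,b ..." and "%%(a,b) ..." at the start of a line.
VARIANT_NAMES_RE = re.compile(r"^%%(?:variant:(\S+)|\(([^)\s]+)\))")


def list_variants(lines):
    """Collect variant names in order of first appearance, with main first."""
    names = ["main"]  # every pipeline has the main variant by default
    for line in lines:
        match = VARIANT_NAMES_RE.match(line)
        if match is None:
            # Unmarked lines and other directives name no variant
            continue
        for name in (match.group(1) or match.group(2)).split(","):
            if name not in names:
                names.append(name)
    return names


example = [
    "[my_module]",
    "type=path.to.module",
    "%%variant:main size=52",
    "%%(smaller) size=7",
]
print(list_variants(example))
# ['main', 'smaller']
```

In practice, use the variants command shown above; the sketch only shows where the names come from.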
Variant directives
Directives are processed when a pipeline config file is read in, before the file is parsed to build a pipeline. They are lines that begin with %%, followed by the directive name and any arguments. See Directives for details of other directives.
variant
This line will be included only when loading a particular variant of a pipeline. The variant name is specified in the form variant:variant_name. You may include the line in more than one variant by specifying multiple names, separated by commas (and no spaces). You can use the default variant “main”, so that the line will be left out of other variants. The rest of the line, after the directive and variant name(s), is the content that will be included in those variants.

[my_module]
type=path.to.module
%%variant:main size=52
%%variant:smaller size=7

An alternative notation makes config files more readable. Instead of %%variant:variant_name, write %%(variant_name). So the above example becomes:

[my_module]
type=path.to.module
%%(main) size=52
%%(smaller) size=7

novariant
A line to be included only when no other variant is being loaded, that is, when the default main variant is loaded. Equivalent to variant:main.

[my_module]
type=path.to.module
%%novariant size=52
%%variant:smaller size=7

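The effect of these directives can be illustrated with a short Python sketch that filters config lines for a chosen variant. This is a simplified illustration of the behaviour described above, not Pimlico's actual implementation: select_variant and DIRECTIVE_RE are hypothetical names, and the sketch assumes that a directive starts at the beginning of a line and is separated from its content by whitespace. Lines carrying other directives are passed through untouched.

```python
import re

# Hypothetical helper pattern, not part of Pimlico. It matches
# "%%variant:a,b rest", "%%(a,b) rest" and "%%novariant rest".
DIRECTIVE_RE = re.compile(
    r"^%%(?:variant:(?P<long>\S+)"
    r"|\((?P<short>[^)\s]+)\)"
    r"|(?P<novariant>novariant))\s+(?P<rest>.*)$"
)


def select_variant(lines, variant="main"):
    """Return the config lines that belong to the given variant."""
    selected = []
    for line in lines:
        match = DIRECTIVE_RE.match(line)
        if match is None:
            # Lines with no variant marking appear in every variant
            selected.append(line)
            continue
        if match.group("novariant"):
            # novariant is equivalent to variant:main
            names = ["main"]
        else:
            names = (match.group("long") or match.group("short")).split(",")
        if variant in names:
            # Keep only the content after the directive
            selected.append(match.group("rest"))
    return selected


config = [
    "[my_module]",
    "type=path.to.module",
    "%%variant:main size=52",
    "%%(smaller) size=7",
]
print(select_variant(config, "smaller"))
# ['[my_module]', 'type=path.to.module', 'size=7']
```

Calling select_variant(config) with no variant name keeps the size=52 line instead, mirroring what happens when the main variant is loaded by default.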
Example
The following example config file defines one variant, small, in addition to the default main variant.
[pipeline]
name=myvariants
release=0.8
python_path=%(project_root)s/src/python
# Load a dataset
[input_data]
type=pimlico.modules.input.text.raw_text_files
files=%(home)s/data/*
# For the small version, we cut down the dataset to just 10 documents
# We don't need this module at all in the main variant
%%(small) [small_data]
%%(small) type=pimlico.modules.corpora.subset
%%(small) size=10
# Tokenize the text
# Control where the input data comes from in the different variants
# The main variant simply uses the full, uncut corpus
[tokenize]
type=pimlico.modules.text.simple_tokenize
%%(small) input=small_data
%%(main) input=input_data
The main variant will be loaded if you don’t specify otherwise. In this version the module small_data doesn’t exist at all and tokenize takes its input from input_data.
./pimlico.sh myvariants.conf status
You can load the small variant by giving its name on the command line. This variant includes the small_data module, and tokenize gets its input from there, so the pipeline runs on just 10 documents and is much faster to test.
./pimlico.sh myvariants.conf -v small status