Setting up a new project using Pimlico
Note: much of this setup guide still needs to be updated for the new datatypes system. So far, it has been updated up to the section on getting input.
You’ve decided to use Pimlico to implement a data processing pipeline. So, where do you start?
This guide steps through the basic setup of your project. You don’t have to do everything exactly as suggested here, but it’s a good starting point and follows Pimlico’s recommended procedures. It steps through the setup for a very basic pipeline.
A shorter version of this guide that zooms through the essential setup steps is also available.
Pimlico needs you to specify certain parameters regarding your local system. Typically, this is just a file in your home directory called .pimlico. More details are available in the Pimlico documentation.
It needs to know where to put output files as it executes. These settings apply to all Pimlico pipelines you run. Pimlico will make sure that different pipelines don’t interfere with each other’s output (provided you give them different names).
Most of the time, you only need to specify one storage location, using the store parameter in your local config file. (You can specify multiple locations; see the documentation for more details.)
Create a file ~/.pimlico containing a store setting that points to the directory you want to use as your storage root. All pipelines will use different subdirectories of this one.
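As a rough illustration, a local config of this kind might consist of a single line such as store=/path/to/storage/root, where the path is only a placeholder. The Python sketch below parses such a file and derives a per-pipeline output directory. Both helper functions are hypothetical, and the idea that each subdirectory is simply named after the pipeline is an assumption made for the example, not a description of Pimlico's internals.

```python
from pathlib import PurePosixPath

# Placeholder contents for ~/.pimlico; the path is purely illustrative
EXAMPLE_LOCAL_CONFIG = "store=/path/to/storage/root\n"


def parse_local_config(text):
    """Parse simple key=value lines, skipping blank lines (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def pipeline_store_dir(settings, pipeline_name):
    """Assumed layout: one subdirectory of the storage root per pipeline name."""
    return PurePosixPath(settings["store"]) / pipeline_name


settings = parse_local_config(EXAMPLE_LOCAL_CONFIG)
print(pipeline_store_dir(settings, "myproject"))
```

Under this assumption, two pipelines with different names end up in different directories, which is why pipeline names need to be distinct.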
Getting started with Pimlico
The procedure for starting a new Pimlico project, using the latest release, is very simple.
Create a new, empty directory to put your project in. Download newproject.py into the project directory.
Choose a name for your project (e.g. myproject) and run:
python newproject.py myproject
This fetches the latest version of Pimlico (now in a subdirectory of your project directory) and creates a basic config file, which will define your pipeline.
It also retrieves libraries that Pimlico needs to run. Other libraries required by specific pipeline modules will be installed as necessary when you use the modules.
Building the pipeline
You’ve now got a config file in myproject.conf. This already includes a pipeline section, which gives the basic pipeline setup. It will look something like this:
[pipeline]
name=myproject
release=<release number>
python_path=%(project_root)s/src/python
name needs to be distinct from any other pipelines that you run: it’s what distinguishes the storage locations.
release is the release of Pimlico that you’re using: it’s automatically set to the latest one, which has been downloaded.
If you later try running the same pipeline with an updated version of Pimlico, it will work fine as long as it’s the same major version (the first digit). Otherwise, there may be backwards incompatible changes, so you’d need to update your config file, ensuring it plays nicely with the later Pimlico version.
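The compatibility rule can be sketched as a simple comparison of the leading component of two release strings. This is only an illustration of the rule as stated above: the function names and the example release numbers are made up, and Pimlico's own check may work differently.

```python
def major_version(release):
    """Return the leading component of a release string such as '0.9.1'."""
    return int(release.split(".")[0])


def same_major_version(config_release, installed_release):
    """True if both releases share the same major version."""
    return major_version(config_release) == major_version(installed_release)


# Hypothetical release numbers, used only to exercise the rule
print(same_major_version("0.9.1", "0.10"))
print(same_major_version("0.9.1", "1.0"))
```

The first call compares two releases with major version 0, so they count as compatible; the second crosses a major version boundary, which is the case where the config file may need updating.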
Now we add our first module to the pipeline. This reads input from a collection of text files. We use a small subset of the Europarl corpus as an example here. This can be simply adapted to reading the real Europarl corpus or any other corpus stored in this straightforward way.
In the example below, we have extracted the files to a directory data/europarl_demo in the home directory.
[input-text]
type=pimlico.modules.input.text.raw_text_files
files=%(home)s/data/europarl_demo/*
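The %(home)s and %(project_root)s placeholders follow the same substitution syntax as Python's standard configparser module. The sketch below uses configparser to show how such values expand. It is not Pimlico's own config loader, and the directory values supplied for home and project_root are invented for the example.

```python
import configparser

# The two sections defined so far in myproject.conf
PIPELINE_CONFIG = """\
[pipeline]
name=myproject
release=<release number>
python_path=%(project_root)s/src/python

[input-text]
type=pimlico.modules.input.text.raw_text_files
files=%(home)s/data/europarl_demo/*
"""

# Assumed values: in a real run these would come from the environment
parser = configparser.ConfigParser(
    defaults={"project_root": "/home/me/myproject", "home": "/home/me"}
)
parser.read_string(PIPELINE_CONFIG)

# Reading an option applies the %(name)s substitutions
print(parser["pipeline"]["python_path"])
print(parser["input-text"]["files"])
```

Because the paths are expressed relative to these variables, the same config file can be used unchanged on another machine or by another user.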
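Note: the guide has only been updated for the new datatypes system up to this point, so the sections that follow may be out of date.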
Doing something: tokenization
Now, some actual linguistic processing, albeit somewhat uninteresting. Many NLP tools assume that their input has been divided into sentences and tokenized. The OpenNLP-based tokenization module does both of these things at once, calling OpenNLP tools.
Notice that the output from the previous module feeds into the input for this one, which we specify simply by naming the module.
[tokenize]
type=pimlico.modules.opennlp.tokenize
input=tar-grouper
Doing something more interesting: POS tagging
Many NLP tools rely on part-of-speech (POS) tagging. Again, we use OpenNLP, and a standard Pimlico module wraps the OpenNLP tool.
[pos-tag]
type=pimlico.modules.opennlp.pos
input=tokenize
Now we’ve got our basic config file ready to go. It’s a simple linear pipeline that goes like this:
read input docs -> group into batches -> tokenize -> POS tag
Before we can run it, there’s one thing missing: three of these modules have their own dependencies, so we need to get hold of the libraries they use. The input reader uses the Beautiful Soup Python library, and the tokenization and POS tagging modules use OpenNLP.
Checking everything’s dandy
Now you can run the status command to check that the pipeline can be loaded and see the list of modules.
./pimlico.sh myproject.conf status
To check that specific modules are ready to run, with all software dependencies installed, use the run command with the --dry-run option (written as --dry in the command here):
./pimlico.sh myproject.conf run tokenize --dry
With any luck, all the checks will be successful. There might be some missing software dependencies.
All the standard modules provide easy ways to get hold of their dependencies automatically, or as close as possible. Most of the time, all you need to do is tell Pimlico to install them.
Use the run command, with a module name and --dry-run, to check whether a module is ready to run.
./pimlico.sh myproject.conf run tokenize --dry
In this case, it will tell you that some libraries are missing, but they can be installed automatically. Simply issue the install command for the module.
./pimlico.sh myproject.conf install tokenize
Simple as that.
There’s one more thing to do: the tools we’re using require statistical models. We can simply download the pre-trained English models from the OpenNLP website.
At present, Pimlico doesn’t yet provide a built-in way for the modules to do this, as it does with software libraries, but it does include a GNU Makefile to make it easy to do:
cd ~/myproject/pimlico/models
make opennlp
Note that the modules we’re using default to these standard, pre-trained models, which you’re now in a position to use. However, if you want to use different models, e.g. for other languages or domains, you can specify them using extra options in the module definition in your config file.
If there are any other library problems shown up by the dry run, you’ll need to address them before going any further.
Running the pipeline
What modules to run?
Pimlico suggests an order in which to run your modules. In our case, this is pretty obvious, seeing as our pipeline is entirely linear – it’s clear which ones need to be run before others.
./pimlico.sh myproject.conf status
The output also tells you the current status of each module. At the moment, none of the modules has been executed.
You’ll notice that the tar-grouper module doesn’t feature in the list. This is because it’s a filter: it’s run on the fly while reading output from the previous module (i.e. the input), so it doesn’t have anything to run itself.
You might be surprised to see that input-text does feature in the list. This is because, although it just reads the data out of a corpus on disk, there’s not quite enough information in the corpus, so we need to run the module to collect a little bit of metadata from an initial pass over the corpus. Some input types need this, others do not. In this case, all we’re lacking is a count of the total number of documents in the corpus.
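The relationship between execution order and filters can be sketched in a few lines of Python. This is a toy model, not how Pimlico computes its schedule: the dictionary of input links is written by hand from the pipeline described in this guide (the link from tar-grouper to input-text is inferred from the text), and both function names are hypothetical.

```python
# Each module mapped to the module it reads from (None for the corpus reader)
MODULE_INPUTS = {
    "pos-tag": "tokenize",
    "tokenize": "tar-grouper",
    "tar-grouper": "input-text",
    "input-text": None,
}

# Filters are applied on the fly, so they are never run on their own
FILTERS = {"tar-grouper"}


def execution_order(module_inputs):
    """Order modules so each comes after its input (assumes no cycles)."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        upstream = module_inputs[name]
        if upstream is not None:
            visit(upstream)
        ordered.append(name)

    for name in module_inputs:
        visit(name)
    return ordered


def modules_to_run(module_inputs, filters):
    """Drop filters from the ordered list, leaving only executable modules."""
    return [name for name in execution_order(module_inputs) if name not in filters]


print(execution_order(MODULE_INPUTS))
print(modules_to_run(MODULE_INPUTS, FILTERS))
```

The second list contains input-text, tokenize and pos-tag, matching the modules that appear in the status output, while tar-grouper is left out because it is a filter.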
To make running your pipeline even simpler, you can abbreviate the command by using a shebang in the config file. Add a shebang line at the top of myproject.conf that points to the pimlico.sh script.
Then make the conf file executable by running (on Linux):
chmod ug+x myproject.conf
Now you can run Pimlico for your pipeline by using the config file as an executable command, in place of ./pimlico.sh myproject.conf.
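The two steps, adding the shebang and setting the executable bits, can also be scripted. The Python sketch below does this on a throwaway copy of a config file. The exact shebang line #!./pimlico.sh is an assumption based on the commands used elsewhere in this guide, and the helper function is hypothetical.

```python
import stat
import tempfile
from pathlib import Path


def make_config_executable(conf_path, launcher="./pimlico.sh"):
    """Prepend a shebang (if absent) and set user/group execute bits."""
    path = Path(conf_path)
    text = path.read_text()
    if not text.startswith("#!"):
        path.write_text(f"#!{launcher}\n{text}")
    # Equivalent to: chmod ug+x <conf_path>
    path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP)


# Demonstrate on a temporary file rather than a real project
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "myproject.conf"
    conf.write_text("[pipeline]\nname=myproject\n")
    make_config_executable(conf)
    first_line = conf.read_text().splitlines()[0]
    owner_can_execute = bool(conf.stat().st_mode & stat.S_IXUSR)

print(first_line)
print(owner_can_execute)
```

With a relative launcher path like this one, the config file would need to be invoked from the directory that contains pimlico.sh.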
Running the modules
The modules can be run using the run command and specifying the module by name. We do this manually for each module.
./pimlico.sh myproject.conf run input-text
./pimlico.sh myproject.conf run tokenize
./pimlico.sh myproject.conf run pos-tag
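If you would rather not type each command by hand, the same sequence can be driven from a small Python script. This is a sketch under assumptions: the helper names are hypothetical, and stopping on the first error relies on pimlico.sh returning a non-zero exit status when a module fails, which is not confirmed here.

```python
import subprocess


def build_run_commands(conf_file, module_names, launcher="./pimlico.sh"):
    """Build one 'run' command per module, in the order given."""
    return [[launcher, conf_file, "run", name] for name in module_names]


def run_modules(conf_file, module_names):
    """Run each module in turn, stopping at the first failure."""
    for command in build_run_commands(conf_file, module_names):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)


# Print the commands without executing anything
commands = build_run_commands("myproject.conf", ["input-text", "tokenize", "pos-tag"])
for command in commands:
    print(" ".join(command))
```

Calling run_modules("myproject.conf", ["input-text", "tokenize", "pos-tag"]) from the project directory would then execute the three commands one after another.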
Adding custom modules
Most likely, for your project you need to do some processing not covered by the built-in Pimlico modules. At this point, you can start implementing your own modules, which you can distribute along with the config file so that people can replicate what you did.
The newproject.py script has already created a directory where our custom source code will live: src/python, with some subdirectories according to the standard code layout, keeping module types and datatypes separate.
The template pipeline also already has an option python_path pointing to this directory, so that Pimlico knows where to find your code. Note that the code’s in a subdirectory of that containing the pipeline config and we specify the custom code path relative to the config file, so it’s easy to distribute the two together.
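The general mechanism behind an option like python_path is Python's module search path: once a directory is on sys.path, packages inside it can be imported by name. The sketch below demonstrates this with a temporary directory laid out like a project. It shows standard Python behaviour only, not Pimlico's own handling of python_path, and the package name myproject_demo is invented for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_root:
    # Mimic the layout <project_root>/src/python/<package>/
    code_dir = Path(project_root) / "src" / "python"
    package_dir = code_dir / "myproject_demo"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("GREETING = 'hello from custom code'\n")

    # Make the custom code directory importable, then clean up afterwards
    sys.path.insert(0, str(code_dir))
    try:
        importlib.invalidate_caches()
        demo = importlib.import_module("myproject_demo")
        greeting = demo.GREETING
    finally:
        sys.path.remove(str(code_dir))

print(greeting)
```

Keeping the code directory inside the project, next to the config file, means the relative path stays valid wherever the project is copied.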
Now you can create Python modules or packages in src/python, following the same conventions as the built-in modules and overriding the standard base classes, as they do. The Pimlico documentation includes further articles on how to do this.
Your custom modules and datatypes can then simply be used in the config file as module types.