Setting up a new project using Pimlico¶
You’ve decided to use Pimlico to implement a data processing pipeline. So, where do you start?
This guide steps through the basic setup of your project. You don’t have to do everything exactly as suggested here, but it’s a good starting point and follows Pimlico’s recommended procedures. It steps through the setup for a very basic pipeline.
A shorter version of this guide that zooms through the essential setup steps is also available.
System-wide configuration¶
Note
If you’ve used Pimlico before, you can skip this step.
Pimlico needs you to specify certain parameters regarding your local system.
Typically this is just
a file in your home directory called .pimlico
. More details.
It needs to know where to put output files as it executes. These settings apply to all Pimlico pipelines you run. Pimlico will make sure that different pipelines don’t interfere with each other’s output (provided you give them different names).
Most of the time, you only need to specify one storage location,
using the store
parameter in your local
config file. (You can specify multiple: more details.)
Create a file ~/.pimlico
that looks like this:
store=/path/to/storage/directory
All pipelines will use different subdirectories of this one.
Getting started with Pimlico¶
The procedure for starting a new Pimlico project, using the latest release, is very simple.
Create a new, empty directory to put your project in. Download newproject.py into the project directory.
Make sure you’ve got Python installed. Pimlico currently supports Python 2 and 3, but we strongly recommend using Python 3 unless you have old Python 2 code you need to run.
Choose a name for your project (e.g. myproject
) and run:
python newproject.py myproject
This fetches the latest version of Pimlico (now in the pimlico/
subdirectory)
and creates a basic config file, which will define your pipeline.
It also retrieves libraries that Pimlico needs to run. Other libraries required by specific pipeline modules will be installed as necessary when you use the modules.
Building the pipeline¶
You’ve now got a config file in myproject.conf
. This already includes a
pipeline
section, which gives the basic pipeline setup.
It will look something like this:
[pipeline]
name=myproject
release=<release number>
python_path=%(project_root)s/src/python
The name
needs to be distinct from any other pipelines that you run –
it’s what distinguishes the storage locations.
release
is the release of Pimlico that you’re using: it’s automatically
set to the latest one, which has been downloaded.
If you later try running the same pipeline with an updated version of Pimlico, it will work fine as long as it’s the same minor version (the second part). The minor-minor third part can be updated and may bring some improvements. If you use a higher minor version (e.g. 0.10.x when you started with 0.9.24), there may be backwards incompatible changes, so you’d need to update your config file, ensuring it plays nicely with the later Pimlico version.
Getting input¶
Now we add our first module to the pipeline. This reads input from a collection of text files. We use a small subset of the Europarl corpus as an example here. This can be simply adapted to reading the real Europarl corpus or any other corpus stored in this straightforward way.
Download and extract the small corpus from here
In the example below, we have extracted the files to a directory data/europarl_demo
in
the home directory.
[input_text]
type=pimlico.modules.input.text.raw_text_files
files=%(home)s/data/europarl_demo/*
Doing something: tokenization¶
Now, some actual linguistic processing, albeit somewhat uninteresting. Many NLP tools assume that their input has been divided into sentences and tokenized. To keep things simple, we use a very basic, regular expression-based tokenizer.
Notice that the output from the previous module feeds into the input for this one, which we specify simply by naming the module.
[tokenize]
type=pimlico.modules.text.simple_tokenize
input=input_text
Doing something more interesting: POS tagging¶
Many NLP tools rely on part-of-speech (POS) tagging. Again, we use OpenNLP, and a standard Pimlico module wraps the OpenNLP tool.
[pos-tag]
type=pimlico.modules.opennlp.pos
input=tokenize
Running Pimlico¶
Now we’ve got our basic config file ready to go. It’s a simple linear pipeline that goes like this:
read input docs -> group into batches -> tokenize -> POS tag
It’s now ready to load and inspect using Pimlico’s command-line interface.
Before we can run it, there’s one thing missing: the OpenNLP tokenizer module needs access to the OpenNLP tool. We’ll see below how Pimlico sorts that out for you.
Checking everything’s dandy¶
Now you can run the status
command to check that the pipeline can be loaded and see the list of modules.
./pimlico.sh myproject.conf status
To check that specific modules are ready to run, with all software dependencies installed, use the
run
command with --dry-run
(or --dry
) switch:
./pimlico.sh myproject.conf run tokenize --dry
Fetching dependencies¶
All the standard modules provide easy ways to get hold of their dependencies automatically, or as close as possible. Most of the time, all you need to do is tell Pimlico to install them.
You use the run
command, with a module name and --dry-run
, to check whether a module is ready to run.
./pimlico.sh myproject.conf run tokenize --dry
This will find that things aren’t quite ready yet, as the OpenNLP Java packages are not available. These are not distributed with Pimlico, since they’re only needed if you use an OpenNLP module.
When you run the run
command, Pimlico will offer to install the necessary
software for you. In this case, this involves downloading OpenNLP’s jar files
from its web repository to somewhere where the OpenNLP tokenizer module can find it.
Say yes and Pimlico will get everything ready. Simple as that!
There’s one more thing to do: the tools we’re using require statistical models. We can simply download the pre-trained English models from the OpenNLP website.
At present, Pimlico doesn’t yet provide a built-in way for the modules to do this, as it does with software libraries, but it does include a GNU Makefile to make it easy to do:
cd ~/myproject/pimlico/models
make opennlp
Note that the modules we’re using default to these standard, pre-trained models, which you’re now in a position to use. However, if you want to use different models, e.g. for other languages or domains, you can specify them using extra options in the module definition in your config file.
If there are any other library problems shown up by the dry run, you’ll need to address them before going any further.
Running the pipeline¶
What modules to run?¶
Pimlico suggests an order in which to run your modules. In our case, this is pretty obvious, seeing as our pipeline is entirely linear – it’s clear which ones need to be run before others.
./pimlico.sh myproject.conf status
The output also tells you the current status of each module. At the moment, all the modules are UNEXECUTED
.
You might be surprised to see that input-text
features in the list. This is because, although it just
reads the data out of a corpus on disk, there’s not quite enough information in the corpus, so we need to run the
module to collect a little bit of metadata from an initial pass over the corpus. Some input types need this, others
not. In this case, all we’re lacking is a count of the total number of documents in the corpus.
Note
To make running your pipeline even simpler, you can abbreviate the command by using a shebang in the
config file. Add a line at the top of myproject.conf
like this:
#!./pimlico.sh
Then make the conf file executable by running (on Linux):
chmod ug+x myproject.conf
Now you can run Pimlico for your pipeline by using the config file as an executable command:
./myproject.conf status
Running the modules¶
The modules can be run using the run
command and specifying the module by name. We do this manually for each module.
./pimlico.sh myproject.conf run input-text
./pimlico.sh myproject.conf run tokenize
./pimlico.sh myproject.conf run pos-tag
Adding custom modules¶
Most likely, for your project you need to do some processing not covered by the built-in Pimlico modules. At this point, you can start implementing your own modules, which you can distribute along with the config file so that people can replicate what you did.
The newproject.py
script has already created a directory where our custom source code will live: src/python
,
with some subdirectories according to the standard code layout, with module types and datatypes in separate
packages.
The template pipeline also already has an option python_path
pointing to this directory, so that Pimlico knows where to
find your code. Note that
the code’s in a subdirectory of that containing the pipeline config and we specify the custom code path relative to
the config file, so it’s easy to distribute the two together.
Now you can create Python modules or packages in src/python
, following the same conventions as the built-in modules
and overriding the standard base classes, as they do. The following articles tell you more about how to do this:
Your custom modules and datatypes can then simply be used in the config file as module types.