Writing Pimlico module types

Pimlico comes with a fairly large number of module types that you can use to run many standard NLP, data processing and ML tools over your datasets.

For some projects, this is all you need to do. However, often you’ll want to mix standard tools with your own code, for example, using the output from the tools. And, of course, there are many more tools you might want to run that aren’t built into Pimlico: you can still benefit from Pimlico’s framework for data handling, config files and so on.

For a detailed description of the structure of a Pimlico module, see Pimlico module structure. This guide takes you through building a simple module.

Note

In any case where a module will process a corpus one document at a time, you should write a document map module, which takes care of a lot of things for you, so you only need to say what to do with each document.

Todo

Module writing guide needs to be updated for new datatypes.

In particular, the executor example and datatypes in the module definition need to be updated.

Code layout

If you’ve followed the basic project setup guide, you’ll have a project with a directory structure like this:

myproject/
    pipeline.conf
    pimlico/
        bin/
        lib/
        src/
        ...
    src/
        python/

If you’ve not already created the src/python directory, do that now.

This is where your custom Python code will live. You can put all of your custom module types and datatypes in there and use them in the same way as you use the Pimlico core modules and datatypes.

Add this option to the [pipeline] section of your config file, so Pimlico knows where to find your code:

python_path=src/python

To follow the conventions used in Pimlico’s codebase, we’ll create the following package structure in src/python:

src/python/myproject/
    __init__.py
    modules/
        __init__.py
    datatypes/
        __init__.py

Write a module

A Pimlico module consists of a Python package with a special layout. Every module has a file info.py. This contains the definition of the module’s metadata: its inputs, outputs, options, etc.

Most modules also have a file execute.py, which defines the routine that’s called when it’s run. You should take care when writing info.py not to import any non-standard Python libraries or have any time-consuming operations that get run when it gets imported.

execute.py, on the other hand, will only get imported when the module is to be run, after dependency checks.

For the example below, let’s assume we’re writing a module called nmf and create the following directory structure for it:

src/python/myproject/modules/
    __init__.py
    nmf/
        __init__.py
        info.py
        execute.py

Easy start

To help you get started, Pimlico provides a wizard in the newmodule command.

This will ask you a series of questions, guiding you through the most common tasks in creating a new module. At the end, it will generate a template to get you started with your module’s code. You then just need to fill in the gaps and write the code for what the module actually does.

Read on to learn more about the structure of modules, including things not covered by the wizard.

Metadata

Module metadata (everything apart from what happens when it’s actually run) is defined in info.py as a class called ModuleInfo.

Here’s a sample basic ModuleInfo, which we’ll step through. (It’s based on the Scikit-learn matrix_factorization module.)

from pimlico.core.dependencies.python import PythonPackageOnPip
from pimlico.core.modules.base import BaseModuleInfo
from pimlico.datatypes.arrays import ScipySparseMatrix, NumpyArray


class ModuleInfo(BaseModuleInfo):
    module_type_name = "nmf"
    module_readable_name = "Sklearn non-negative matrix factorization"
    module_inputs = [("matrix", ScipySparseMatrix)]
    module_outputs = [("w", NumpyArray), ("h", NumpyArray)]
    module_options = {
        "components": {
            "help": "Number of components to use for hidden representation",
            "type": int,
            "default": 200,
        },
    }

    def get_software_dependencies(self):
        return super(ModuleInfo, self).get_software_dependencies() + \
               [PythonPackageOnPip("sklearn", "Scikit-learn")]

The ModuleInfo should always be a subclass of BaseModuleInfo. There are some subclasses that you might want to use instead (e.g., see Writing document map modules), but here we just use the basic one.

Certain class-level attributes should pretty much always be overridden:

  • module_type_name: A name used to identify the module internally
  • module_readable_name: A human-readable short description of the module
  • module_inputs: Most modules need to take input from another module (though not all)
  • module_outputs: Describes the outputs that the module will produce, which may then be used as inputs to another module

Inputs are given as pairs (name, type), where name is a short name to identify the input and type is the datatype that the input is expected to have. Here, and most commonly, this is a subclass of PimlicoDatatype and Pimlico will check that a dataset supplied for this input is either of this type, or has a type that is a subclass of this.

Here we take just a single input: a sparse matrix.

Outputs are given in a similar way. It is up to the module’s executor (see below) to ensure that these outputs get written, but here we describe the datatypes that will be produced, so that we can use them as input to other modules.

Here we produce two Numpy arrays, the factorization of the input matrix.

Dependencies: Since we require Scikit-learn to execute this module, we override get_software_dependencies() to specify this. As Scikit-learn is available through Pip, this is very easy: all we need to do is specify the Pip package name. Pimlico will check that Scikit-learn is installed before executing the module and, if not, allow it to be installed automatically.

Finally, we also define some options. The values for these can be specified in the pipeline config file. When the ModuleInfo is instantiated, the processed options will be available in its options attribute. So, for example, we can get the number of components (specified in the config file, or the default of 200) using info.options["components"].

Executor

Here is a sample executor for the module info given above, placed in the file execute.py.

from pimlico.core.modules.base import BaseModuleExecutor
from pimlico.datatypes.arrays import NumpyArrayWriter
from sklearn.decomposition import NMF

class ModuleExecutor(BaseModuleExecutor):
    def execute(self):
        input_matrix = self.info.get_input("matrix").array
        self.log.info("Loaded input matrix: %s" % str(input_matrix.shape))

        # Convert input matrix to CSR
        input_matrix = input_matrix.tocsr()
        # Initialize the transformation
        components = self.info.options["components"]
        self.log.info("Initializing NMF with %d components" % components)
        nmf = NMF(components)

        # Apply transformation to the matrix
        self.log.info("Fitting NMF transformation on input matrix" % transform_type)
        transformed_matrix = transformer.fit_transform(input_matrix)

        self.log.info("Fitting complete: storing H and W matrices")
        # Use built-in Numpy array writers to output results in an appropriate format
        with NumpyArrayWriter(self.info.get_absolute_output_dir("w")) as w_writer:
            w_writer.set_array(transformed_matrix)
        with NumpyArrayWriter(self.info.get_absolute_output_dir("h")) as h_writer:
            h_writer.set_array(transformer.components_)

The executor is always defined as a class in execute.py called ModuleExecutor. It should always be a subclass of BaseModuleExecutor (though, again, note that there are more specific subclasses and class factories that we might want to use in other circumstances).

The execute() method defines what happens when the module is executed.

The instance of the module’s ModuleInfo, complete with options from the pipeline config, is available as self.info. A standard Python logger is also available, as self.log, and should be used to keep the user updated on what’s going on.

Getting hold of the input data is done through the module info’s get_input() method. In the case of a Scipy matrix, here, it just provides us with the matrix as an attribute.

Then we do whatever our module is designed to do. At the end, we write the output data to the appropriate output directory. This should always be obtained using the get_absolute_output_dir() method of the module info, since Pimlico takes care of the exact location for you.

Most Pimlico datatypes provide a corresponding writer, ensuring that the output is written in the correct format for it to be read by the datatype’s reader. When we leave the with block, in which we give the writer the data it needs, this output is written to disk.

Pipeline config

Our module is now ready to use and we can refer to it in a pipeline config file. We’ll assume we’ve prepared a suitable Scipy sparse matrix earlier in the pipeline, available as the default output of a module called matrix. Then we can add section like this to use our new module:

[matrix]
...(Produces sparse matrix output)...

[factorize]
type=myproject.modules.nmf
components=300
input=matrix

Note that, since there’s only one input, we don’t need to give its name. If we had defined multiple inputs, we’d need to specify this one as input_matrix=matrix.

You can now run the module as part of your pipeline in the usual ways.

Skeleton new module

To make developing a new module a little quicker, here’s a skeleton module info and executor.

from pimlico.core.modules.base import BaseModuleInfo

class ModuleInfo(BaseModuleInfo):
    module_type_name = "NAME"
    module_readable_name = "READABLE NAME"
    module_inputs = [("NAME", REQUIRED_TYPE)]
    module_outputs = [("NAME", PRODUCED_TYPE)]
    # Delete module_options if you don't need any
    module_options = {
        "OPTION_NAME": {
            "help": "DESCRIPTION",
            "type": TYPE,
            "default": VALUE,
        },
    }

    def get_software_dependencies(self):
        return super(ModuleInfo, self).get_software_dependencies() + [
            # Add your own dependencies to this list
            # Remove this method if you don't need to add any
        ]
from pimlico.core.modules.base import BaseModuleExecutor

class ModuleExecutor(BaseModuleExecutor):
    def execute(self):
        input_data = self.info.get_input("NAME")
        self.log.info("MESSAGES")

        # DO STUFF

        with SOME_WRITER(self.info.get_absolute_output_dir("NAME")) as writer:
            # Do what the writer requires