custom_module_example¶

This is an example Pimlico pipeline.

The complete config file for this example pipeline is below. Source file

A simple example pipeline that loads some textual data and tokenizes it, then performs some custom processing.

This is intended as an example of how to use your own code in a pipeline module. The pipeline contains some core modules, but also one the is defined especially for this pipeline: filter_prop_nns.

See the src/ subdirectory for the module’s code.

Pipeline config¶

# Options for the whole pipeline
[pipeline]
name=custom_module_example
# Specify the version of Pimlico this config is designed to work with
release=latest
# Here you can add path(s) to Python source directories the pipeline needs
# We need to do this here, since we use a custom module type
# The path is relative to this config file
python_path=src/

# Specify some things in variables at the top of the file, so they're easy to find
[vars]
# The main pipeline input dir is given here
# It's good to put all paths to input data here, so that it's easy for people to point
#  them to other locations
# Here we define where the example input text data can be found
text_path=%(pimlico_root)s/examples/data/input/bbc/data

# Read in the raw text files
[input_text]
type=pimlico.modules.input.text.raw_text_files
files=%(text_path)s/*

# Tokenize the text using a simple tokenizer from NLTK
[tokenize]
type=pimlico.modules.spacy.tokenize
input=input_text

# A rough simple filter to remove words that look like proper nouns
# This is here to demonstrate how to use a custom module that is not
#  part of Pimlico's core modules
[filter_prop_nns]
type=pim_example.modules.filter_prop_nns
input=tokenize

# Build a vocabulary from the words used in the resulting corpus
# This can later be used to map words in the corpus to IDs
[vocab]
type=pimlico.modules.corpora.vocab_builder
input=filter_prop_nns
# Only include words that occur at least 5 times
threshold=5

Modules¶

The following Pimlico module types are used in this pipeline:

pimlico.modules.spacy.tokenize

pimlico.modules.corpora.vocab_builder