Tokenized corpus to ID mapper

Path        pimlico.modules.corpora.vocab_mapper
Executable  yes

Maps all the words in a tokenized textual corpus to integer IDs, storing just lists of integers in the output.

This is typically done before training models on textual corpora. It ensures that a consistent mapping from words to IDs is used throughout the pipeline. The training modules take this pre-mapped form as input, rather than performing the mapping as they read the data, because mapping once is much more efficient when the corpus has to be iterated over many times, as is typical in model training.

First use the vocab_builder module to construct the word-ID mapping and filter the vocabulary as you wish, then use this module to apply the mapping to the corpus.
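The two-step workflow described above can be sketched in plain Python. This is only an illustration of the idea, not Pimlico's implementation: the function name build_vocab, the min_count threshold, the alphabetical ID assignment and the in-memory nested-list corpus are all assumptions made for the example. Filtered-out words are simply dropped here to keep the sketch short; in the module itself that behaviour is governed by the oov option.

```python
from collections import Counter

def build_vocab(docs, min_count=1):
    """Hypothetical helper: count words and keep those seen at least min_count times."""
    counts = Counter(tok for doc in docs for sent in doc for tok in sent)
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    # Assign IDs in alphabetical order so the mapping is deterministic
    return {word: idx for idx, word in enumerate(kept)}

# A toy tokenized corpus: documents -> sentences -> tokens
corpus = [
    [["the", "cat", "sat"], ["the", "cat", "slept"]],
    [["the", "dog", "sat"]],
]

# Step 1: build and filter the vocabulary (what vocab_builder is for)
vocab = build_vocab(corpus, min_count=2)

# Step 2: apply the mapping, storing only lists of integers
ids = [[[vocab[tok] for tok in sent if tok in vocab] for sent in doc]
       for doc in corpus]

print(vocab)
print(ids)
```

Once the corpus is stored as integer lists, any later stage can iterate over it repeatedly without doing word lookups again.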


Inputs

Name   Type(s)
text   grouped_corpus <TokenizedDocumentType>
vocab  dictionary


Outputs

Name  Type(s)
ids   grouped_corpus <IntegerListsDocumentType>


Options

Name  Description  Type
oov   If given, the special token to which all out-of-vocabulary (OOV) tokens are mapped. Otherwise, OOV tokens are assigned the index vocab_size+1. The special value ‘skip’ leaves OOV tokens out of the output entirely.  string
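The three behaviours of the oov option can be illustrated with a short Python sketch. It follows the option description literally and is not taken from the module's source: the function map_tokens, the plain dict used as a vocabulary, and the assumption that the special token (here "<unk>") is already present in the vocabulary are all hypothetical.

```python
def map_tokens(tokens, vocab, oov=None):
    """Hypothetical helper mirroring the documented oov behaviour."""
    if oov == "skip":
        # Special value: leave out-of-vocabulary tokens out of the output
        return [vocab[tok] for tok in tokens if tok in vocab]
    if oov is not None:
        # Assumption: the special OOV token already has an ID in the vocabulary
        oov_id = vocab[oov]
    else:
        # Default described in the option text: vocab_size+1
        oov_id = len(vocab) + 1
    return [vocab.get(tok, oov_id) for tok in tokens]

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
sentence = ["the", "dog", "sat"]  # "dog" is not in the vocabulary

default_ids = map_tokens(sentence, vocab)             # [0, 5, 2]
unk_ids = map_tokens(sentence, vocab, oov="<unk>")    # [0, 3, 2]
skip_ids = map_tokens(sentence, vocab, oov="skip")    # [0, 2]
```

Note that ‘skip’ changes the length of each sequence, which matters if a later stage expects the IDs to stay aligned with the original tokens.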

Example config

This is an example of how this module can be used in a pipeline config file.


This example usage includes more options.


Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.