Tokenized corpus to ID mapper

Path	pimlico.modules.corpora.vocab_mapper
Executable	yes

Maps all the words in a tokenized textual corpus to integer IDs, storing just lists of integers in the output.

This is typically done before training models on textual corpora. It ensures that a consistent mapping from words to IDs is used throughout the pipeline. Training modules take this pre-mapped input rather than performing the mapping as they read the data, because pre-mapping is much more efficient when the corpus has to be iterated over many times, as is typical in model training.

First use the vocab_builder module to construct the word-ID mapping and filter the vocabulary as you wish, then use this module to apply the mapping to the corpus.
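The two-step process above can be illustrated with a minimal, library-independent sketch. It does not use Pimlico's API: `build_vocab` and `map_corpus` are hypothetical helpers, a plain `dict` stands in for the `dictionary` datatype, and a document is assumed to be a list of sentences, each a list of token strings. Vocabulary filtering is left out, so every token is guaranteed to have an ID.

```python
from collections import Counter
from typing import Dict, List

Document = List[List[str]]  # assumed shape: a list of sentences, each a list of tokens


def build_vocab(docs: List[Document]) -> Dict[str, int]:
    """Hypothetical stand-in for the vocabulary-building step (no filtering)."""
    counts = Counter(tok for doc in docs for sent in doc for tok in sent)
    # Most frequent words get the lowest IDs; ties are broken alphabetically
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered)}


def map_corpus(docs: List[Document], vocab: Dict[str, int]) -> List[List[List[int]]]:
    """Hypothetical stand-in for the mapping step: replace each token by its ID."""
    return [[[vocab[tok] for tok in sent] for sent in doc] for doc in docs]


docs = [
    [["the", "cat", "sat"], ["the", "dog"]],
    [["a", "cat"]],
]
vocab = build_vocab(docs)
ids = map_corpus(docs, vocab)
```

Because the mapping is computed once and stored, any later stage can iterate over `ids` repeatedly without touching strings again, and every stage sees the same word-to-ID assignment.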

Inputs

Name	Type(s)
text	`grouped_corpus` <`TokenizedDocumentType`>
vocab	`dictionary`

Outputs

Name	Type(s)
ids	`grouped_corpus` <`IntegerListsDocumentType`>

Options

Name	Description	Type
oov	If given, a special token that all out-of-vocabulary (OOV) tokens are mapped to. Otherwise, OOV tokens are mapped to the index vocab_size+1. The special value ‘skip’ drops OOV tokens from the output	string
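
The three behaviours of the oov option can be sketched in plain Python. This is an illustration of the description above, not the module's actual code: `map_tokens` is a hypothetical helper, the vocabulary is a plain `dict`, and the fallback index is computed as `len(vocab) + 1`, reading "vocab_size+1" literally. The sketch also assumes that a special OOV token, when given, already has an entry in the vocabulary; the real module may handle that case differently.

```python
from typing import Dict, List, Optional


def map_tokens(tokens: List[str], vocab: Dict[str, int],
               oov: Optional[str] = None) -> List[int]:
    """Map tokens to IDs, treating OOV tokens as the `oov` option describes."""
    if oov == "skip":
        # Special value 'skip': OOV tokens are dropped from the output
        return [vocab[t] for t in tokens if t in vocab]
    if oov is None:
        # No special token given: fall back to vocab_size+1 (literal reading)
        oov_id = len(vocab) + 1
    else:
        # Assumption: the special token is already present in the vocabulary
        if oov not in vocab:
            raise ValueError(f"OOV token {oov!r} is not in the vocabulary")
        oov_id = vocab[oov]
    return [vocab.get(t, oov_id) for t in tokens]


vocab = {"the": 0, "cat": 1, "sat": 2, "UNK": 3}
tokens = ["the", "dog", "sat"]

default_ids = map_tokens(tokens, vocab)            # "dog" -> len(vocab) + 1
unk_ids = map_tokens(tokens, vocab, oov="UNK")     # "dog" -> ID of "UNK"
skipped_ids = map_tokens(tokens, vocab, oov="skip")  # "dog" is dropped
```

Note that 'skip' changes the length of the output lists, so it is only suitable when downstream modules do not rely on token positions lining up with the original text.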

Example config

This is an example of how this module can be used in a pipeline config file.

[my_vocab_mapper_module]
type=pimlico.modules.corpora.vocab_mapper
input_text=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_vocab_mapper_module]
type=pimlico.modules.corpora.vocab_mapper
input_text=module_a.some_output
input_vocab=module_a.some_output
oov=value

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

vocab_mapper