Tokenized corpus to ID mapper

Path pimlico.modules.corpora.vocab_mapper
Executable yes

Maps all the words in a tokenized textual corpus to integer IDs, storing just lists of integers in the output.

This is typically done before steps such as training models on textual corpora. It ensures that a consistent mapping from words to IDs is used throughout the pipeline. The training modules take this pre-mapped form as input, rather than performing the mapping as they read the data, because pre-mapping is much more efficient when the corpus is iterated over many times, as is typical in model training.

First use the vocab_builder module to construct the word-ID mapping and filter the vocabulary as you wish, then use this module to apply the mapping to the corpus.
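The two-step workflow can be illustrated outside Pimlico with a minimal Python sketch. This is not the module's implementation: `build_vocab`, `map_corpus` and `min_count` are hypothetical names, documents are assumed to be lists of sentences, each a list of word strings, and words missing from the filtered vocabulary are simply dropped here.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Assign an integer ID to every word seen at least min_count times."""
    counts = Counter(word for doc in docs for sentence in doc for word in sentence)
    # Sort so that the word-to-ID assignment is deterministic
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: idx for idx, word in enumerate(kept)}


def map_corpus(docs, vocab):
    """Replace each in-vocabulary word with its ID, keeping the sentence structure."""
    return [
        [[vocab[word] for word in sentence if word in vocab] for sentence in doc]
        for doc in docs
    ]


corpus = [
    [["the", "cat", "sat"], ["the", "dog", "sat"]],
    [["a", "cat", "ran"]],
]

# Step 1: build and filter the vocabulary once
vocab = build_vocab(corpus, min_count=2)
# Step 2: apply the mapping once; later stages only ever see lists of integers
ids = map_corpus(corpus, vocab)
```

Because the mapping is applied a single time, any number of training passes can iterate over `ids` without repeating the dictionary lookups.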

Inputs

Name Type(s)
text grouped_corpus <TokenizedDocumentType>
vocab dictionary

Outputs

Name Type(s)
ids grouped_corpus <IntegerListsDocumentType>

Options

Name (type): Description
oov (string): If given, the special token to which all out-of-vocabulary (OOV) tokens are mapped. Otherwise, OOV tokens are mapped to the index vocab_size+1. The special value ‘skip’ drops OOV tokens from the output instead of mapping them.
row_length_bytes (int): Number of bytes used to store the length of each row. The default is 2. If your dataset contains very long lines, increase this value so that larger row lengths can be stored.
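The three behaviours of the oov option can be sketched in Python as follows. This is only an illustration of the description above, not Pimlico's code: `map_tokens` is a hypothetical helper, and it assumes that a named OOV token already has an entry in the vocabulary.

```python
def map_tokens(tokens, vocab, oov=None):
    """Map tokens to IDs, following the three OOV behaviours described above."""
    if oov == "skip":
        # Special value: leave OOV tokens out of the output entirely
        return [vocab[token] for token in tokens if token in vocab]
    if oov is not None:
        # Assumption: the special OOV token is itself present in the vocabulary
        oov_id = vocab[oov]
    else:
        # No OOV token given: fall back to the index vocab_size+1
        oov_id = len(vocab) + 1
    return [vocab.get(token, oov_id) for token in tokens]


vocab = {"the": 0, "cat": 1, "OOV": 2}
tokens = ["the", "zebra", "cat"]

default_ids = map_tokens(tokens, vocab)
special_ids = map_tokens(tokens, vocab, oov="OOV")
skipped_ids = map_tokens(tokens, vocab, oov="skip")
```

Note that ‘skip’ changes the length of each row, so positions in the output no longer line up with positions in the original text.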

Example config

This is an example of how this module can be used in a pipeline config file.

[my_vocab_mapper_module]
type=pimlico.modules.corpora.vocab_mapper
input_text=module_a.some_output
input_vocab=module_b.some_output

This example usage includes more options.

[my_vocab_mapper_module]
type=pimlico.modules.corpora.vocab_mapper
input_text=module_a.some_output
input_vocab=module_b.some_output
oov=value
row_length_bytes=2
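To see why row_length_bytes matters, the following Python sketch shows the largest row length that fits in a given number of bytes. It assumes an unsigned, little-endian length field and uses hypothetical helper names; the storage format Pimlico actually uses may differ.

```python
import struct

# Hypothetical mapping from byte width to an unsigned little-endian struct format
LENGTH_FORMATS = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}


def max_row_length(row_length_bytes):
    """Largest row length representable in an unsigned field of this width."""
    return 2 ** (8 * row_length_bytes) - 1


def pack_row_length(n_ids, row_length_bytes=2):
    """Encode a row's length; raises struct.error if the value does not fit."""
    return struct.pack(LENGTH_FORMATS[row_length_bytes], n_ids)


two_byte_limit = max_row_length(2)
four_byte_limit = max_row_length(4)

try:
    pack_row_length(70000, row_length_bytes=2)
    fits_in_two_bytes = True
except struct.error:
    fits_in_two_bytes = False

# The same length fits once the field is widened to 4 bytes
packed = pack_row_length(70000, row_length_bytes=4)
```

Under these assumptions, the default of 2 bytes allows rows of up to 65535 integers, and a row of 70000 integers would need row_length_bytes=4.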

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.