ID to tokenized corpus mapper¶

Path	pimlico.modules.corpora.vocab_unmapper
Executable	yes

Maps all the IDs in an integer lists corpus to their corresponding words in a vocabulary, producing a tokenized textual corpus.

This is the inverse of vocab_mapper, which maps words to IDs. Typically, the resulting integer IDs are used for model training, but sometimes we need to map in the opposite direction.

Inputs¶

Name	Type(s)
ids	`grouped_corpus` <`IntegerListsDocumentType`>
vocab	`dictionary`

Outputs¶

Name	Type(s)
text	`grouped_corpus` <`TokenizedDocumentType`>

Options¶

Name	Description	Type
oov	If given, assume the vocab_size+1 was used to represent out-of-vocabulary words and map this index to the given token. Special value ‘skip’ simply skips over vocab_size+1 indices	string

Example config¶

This is an example of how this module can be used in a pipeline config file.

[my_vocab_unmapper_module]
type=pimlico.modules.corpora.vocab_unmapper
input_ids=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_vocab_unmapper_module]
type=pimlico.modules.corpora.vocab_unmapper
input_ids=module_a.some_output
input_vocab=module_a.some_output
oov=value

Test pipelines¶

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

vocab_unmapper