Token frequency counter
| Path | pimlico.modules.corpora.vocab_counter |
| Executable | yes |
Count the frequency of each token of a vocabulary in a given corpus (most often the corpus on which the vocabulary was built).
Note that this distribution is not otherwise available along with the vocabulary. The vocabulary stores only document frequency counts (how many documents each token appears in), which may sometimes be a close enough approximation to the actual token frequencies. With character-level tokens, however, the estimate will be very poor, since a common character appears in nearly every document regardless of how often it occurs.
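To make the difference between the two counts concrete, here is a minimal sketch on a toy character-level corpus. The data and variable names are hypothetical, and the code uses only the Python standard library, not Pimlico's own data structures.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of character-level tokens
toy_docs = [list("banana"), list("abba")]

# Token frequency: every occurrence of a token is counted
token_freq = Counter(tok for doc in toy_docs for tok in doc)

# Document frequency: each token is counted at most once per document
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc))

for tok in sorted(token_freq):
    print(tok, "token frequency:", token_freq[tok], "document frequency:", doc_freq[tok])
```

Here "a" occurs five times but has a document frequency of only two, which is the kind of gap that makes document frequencies a poor stand-in for character-level counts.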
The output is a 1D array whose size is the length of the vocabulary, or
the length plus one if oov_excluded=T. Use that option when the corpus has
been mapped so that OOVs are represented by the ID vocab_size+1, instead of
by a special token in the vocabulary.
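The shape of the output can be illustrated with a short NumPy sketch. This is not the module's actual implementation: the function name `count_token_ids` is hypothetical, each document is assumed to be a list of lists of integer token IDs, and the extra OOV entry is assumed to occupy the last slot of the array.

```python
import numpy as np

def count_token_ids(docs, vocab_size, oov_excluded=False):
    """Count how often each token ID occurs across all documents.

    Assumptions (not taken from the Pimlico source): each document is a
    list of lists of integer IDs, and with oov_excluded=True the extra
    OOV ID maps to the final slot of the output array.
    """
    size = vocab_size + 1 if oov_excluded else vocab_size
    counts = np.zeros(size, dtype=np.int64)
    for doc in docs:
        for line in doc:
            if len(line) > 0:
                ids = np.asarray(line, dtype=np.int64)
                # minlength pads the per-line counts up to the full array size;
                # an ID beyond the last slot causes a shape mismatch error here
                counts += np.bincount(ids, minlength=size)
    return counts

# Hypothetical corpus: vocabulary of 3 tokens (IDs 0-2), with ID 3 used for OOVs
example_docs = [[[0, 1, 1], [2]], [[1, 3]]]
distribution = count_token_ids(example_docs, vocab_size=3, oov_excluded=True)
print(distribution.shape, distribution)
```

With `oov_excluded=True` the array has one more entry than the vocabulary, and that extra entry holds the OOV count.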
Inputs
| Name | Type(s) |
|---|---|
| corpus | grouped_corpus <IntegerListsDocumentType> |
| vocab | dictionary |
Outputs
| Name | Type(s) |
|---|---|
| distribution | numpy_array |
Options
| Name | Description | Type |
|---|---|---|
| oov_excluded | Indicates that the corpus has been mapped so that OOVs are represented by the ID vocab_size+1, instead of having a special token in the vocab | bool |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_vocab_counter_module]
type=pimlico.modules.corpora.vocab_counter
input_corpus=module_a.some_output
input_vocab=module_a.some_output
This example usage includes more options.
[my_vocab_counter_module]
type=pimlico.modules.corpora.vocab_counter
input_corpus=module_a.some_output
input_vocab=module_a.some_output
oov_excluded=T
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.