Token frequency counter
Path | pimlico.modules.corpora.vocab_counter |
Executable | yes |
Count the frequency of each token of a vocabulary in a given corpus (most often the corpus on which the vocabulary was built).
Note that this distribution is not otherwise available along with the vocabulary. The vocabulary stores only document frequency counts (how many documents each token appears in), which may sometimes be a close enough approximation to the actual frequencies. But when working with character-level tokens, for example, this estimate will be very poor, since most characters appear in almost every document.
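The difference between the two kinds of count can be illustrated with a minimal sketch in plain Python. It does not use Pimlico's API: the toy corpus `docs` and the names `token_freq` and `doc_freq` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens
docs = [
    ["a", "b", "a", "a"],
    ["a", "c"],
    ["b", "b", "a"],
]

# Token frequency: total number of occurrences across the whole corpus
token_freq = Counter(tok for doc in docs for tok in doc)

# Document frequency: number of documents in which the token occurs at least once
doc_freq = Counter(tok for doc in docs for tok in set(doc))

for tok in sorted(token_freq):
    print(tok, token_freq[tok], doc_freq[tok])
```

Here "a" occurs five times but in only three documents, so the document frequency underestimates how often it is used. The gap grows as tokens become more likely to repeat within a document.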
The output will be a 1D array whose size is the length of the vocabulary, or the length plus one if oov_excluded=T (used if the corpus has been mapped so that OOVs are represented by the ID vocab_size+1, instead of having a special token).
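The shape of the output can be sketched as follows. This is a dependency-free illustration and not the module's implementation. The function `count_token_ids` is hypothetical, a plain list stands in for the NumPy array, each document is assumed to be a list of lists of integer IDs, and the extra OOV slot is assumed to be the last index of the array. Check the actual OOV ID against the mapping that produced your corpus.

```python
def count_token_ids(docs, vocab_size, oov_excluded=False):
    """Count occurrences of each token ID (hypothetical helper, not Pimlico's API)."""
    # One slot per vocabulary entry, plus one extra slot for OOVs if requested
    size = vocab_size + 1 if oov_excluded else vocab_size
    counts = [0] * size
    for doc in docs:
        for id_list in doc:
            for token_id in id_list:
                counts[token_id] += 1
    return counts


# Toy corpus with a vocabulary of 3 tokens (IDs 0-2).
# Assumption: ID 3, the last slot, stands for OOV tokens in this sketch.
toy_docs = [
    [[0, 1, 0], [2]],
    [[1, 3]],
]
counts = count_token_ids(toy_docs, vocab_size=3, oov_excluded=True)
print(len(counts), counts)
```

With `oov_excluded=False` the same corpus would raise an `IndexError` on the OOV ID, which is why the option has to match the way the corpus was mapped.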
Inputs
Name | Type(s) |
---|---|
corpus | grouped_corpus<IntegerListsDocumentType> |
vocab | dictionary |
Outputs
Name | Type(s) |
---|---|
distribution | numpy_array |
Options
Name | Description | Type |
---|---|---|
oov_excluded | Indicates that the corpus has been mapped so that OOVs are represented by the ID vocab_size+1, instead of having a special token in the vocab | bool |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_vocab_counter_module]
type=pimlico.modules.corpora.vocab_counter
input_corpus=module_a.some_output
input_vocab=module_a.some_output
This example usage includes more options.
[my_vocab_counter_module]
type=pimlico.modules.corpora.vocab_counter
input_corpus=module_a.some_output
input_vocab=module_a.some_output
oov_excluded=T
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.