Token frequency counter¶

Path	pimlico.modules.corpora.vocab_counter
Executable	yes

Count the frequency of each token of a vocabulary in a given corpus (most often the corpus on which the vocabulary was built).

Note that this distribution is not otherwise available along with the vocabulary. It stores the document frequency counts - how many documents each token appears in - which may sometimes be a close enough approximation to the actual frequencies. But, for example, when working with character-level tokens, this estimate will be very poor.

The output will be a 1D array whose size is the length of the vocabulary, or the length plus one, if oov_excluded=T (used if the corpus has been mapped so that OOVs are represented by the ID vocab_size+1, instead of having a special token).

Inputs¶

Name	Type(s)
corpus	TarredCorpus<IntegerListsDocumentType>
vocab	`Dictionary`

Outputs¶

Name	Type(s)
distribution	`NumpyArray`

Options¶

Name	Description	Type
oov_excluded	Indicates that the corpus has been mapped so that OOVs are represented by the ID vocab_size+1, instead of having a special token in the vocab	bool