Corpus vocab builder

Path pimlico.modules.corpora.vocab_builder
Executable yes

Builds a dictionary (or vocabulary) for a tokenized corpus. This is a data structure that assigns an integer ID to every distinct word seen in the corpus, optionally applying thresholds so that some words are left out.
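As a rough illustration of the data structure described above, the following sketch maps each distinct token to an integer ID in order of first appearance. The function name `build_vocab` and the plain-dict representation are assumptions made for this example; they are not Pimlico's Dictionary type or its API.

```python
def build_vocab(docs):
    """Assign consecutive integer IDs to distinct tokens.

    `docs` is an iterable of documents, each a list of token strings.
    IDs are handed out in order of first appearance (an assumption of
    this sketch, not necessarily what Pimlico does).
    """
    token_to_id = {}
    for doc in docs:
        for token in doc:
            if token not in token_to_id:
                # The next free ID is the current size of the mapping
                token_to_id[token] = len(token_to_id)
    return token_to_id


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab = build_vocab(docs)
# Encode a document as a list of IDs using the finished mapping
encoded = [vocab[token] for token in docs[1]]
```

Once the mapping exists, any document from the corpus can be stored as a sequence of integers rather than strings.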

This module is similar to pimlico.modules.features.vocab_builder, which builds two vocabs, one for terms and one for features.


Inputs

Name  Type(s)
text  TokenizedCorpus


Outputs

Name   Type(s)
vocab  Dictionary


Options

Name       Description                                                                   Type
threshold  Minimum number of occurrences a term must have to be included                 int
max_prop   Include only terms that occur in at most this proportion of documents         float
limit      Limit vocab size to this number of most common entries (after other filters)  int
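The three options can be read as successive filters. The sketch below shows one plausible interpretation and is not the module's actual implementation: it assumes that threshold counts total occurrences across the corpus, that max_prop is compared against the fraction of documents containing the term, and that limit is applied last, keeping the most frequent survivors. The function name `filter_vocab` and the alphabetical tie-breaking are hypothetical.

```python
from collections import Counter


def filter_vocab(docs, threshold=None, max_prop=None, limit=None):
    """Build a token -> ID mapping, applying the three filters in turn."""
    term_counts = Counter()  # total occurrences of each term
    doc_counts = Counter()   # number of documents containing each term
    num_docs = 0
    for doc in docs:
        num_docs += 1
        term_counts.update(doc)
        doc_counts.update(set(doc))

    kept = [
        term for term in term_counts
        # threshold: drop terms seen fewer than `threshold` times (assumed: total count)
        if (threshold is None or term_counts[term] >= threshold)
        # max_prop: drop terms found in more than this proportion of documents
        and (max_prop is None or doc_counts[term] <= max_prop * num_docs)
    ]
    # limit: keep only the most common survivors; ties broken alphabetically
    kept.sort(key=lambda term: (-term_counts[term], term))
    if limit is not None:
        kept = kept[:limit]
    return {term: idx for idx, term in enumerate(kept)}


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["the", "cat", "ran"],
]
filtered = filter_vocab(docs, threshold=2, max_prop=0.7)
limited = filter_vocab(docs, threshold=2, max_prop=0.7, limit=1)
```

In this toy corpus, "the" appears in every document and is removed by max_prop, while "dog", "down" and "ran" occur only once and fall below the threshold, leaving "cat" and "sat".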