Corpus vocab builder
Path | pimlico.modules.corpora.vocab_builder |
Executable | yes |
Builds a dictionary (or vocabulary) for a tokenized corpus. This is a data structure that assigns an integer
ID to every distinct word seen in the corpus, optionally applying thresholds so that some words are left out.
Similar to pimlico.modules.features.vocab_builder, which builds two vocabs, one for terms and one for
features.
Options
Name | Description | Type |
threshold | Minimum number of occurrences required for a term to be included | int |
max_prop | Include only terms that occur in at most this proportion of documents | float |
limit | Limit vocab size to this number of most common entries (applied after the other filters) | int |
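
To illustrate how the three options might interact, here is a minimal standalone sketch in plain Python. It is not Pimlico's implementation and does not use Pimlico's API: `build_vocab` is a hypothetical helper. It assumes that threshold counts total occurrences across the corpus, that max_prop is compared against the fraction of documents containing the term, and that IDs are assigned in order of decreasing frequency. The module itself may differ on any of these points.

```python
from collections import Counter


def build_vocab(docs, threshold=None, max_prop=None, limit=None):
    """Map each kept term to an integer ID.

    Hypothetical helper for illustration only, not part of Pimlico.
    `docs` is an iterable of documents, each a list of tokens.
    """
    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of documents containing each term
    num_docs = 0
    for doc in docs:
        num_docs += 1
        term_freq.update(doc)
        doc_freq.update(set(doc))

    terms = list(term_freq)
    if threshold is not None:
        # Assumption: threshold applies to total occurrence counts
        terms = [t for t in terms if term_freq[t] >= threshold]
    if max_prop is not None:
        # Drop terms that appear in too large a share of documents
        terms = [t for t in terms if doc_freq[t] / num_docs <= max_prop]
    # Most frequent first; ties broken alphabetically so IDs are reproducible
    terms.sort(key=lambda t: (-term_freq[t], t))
    if limit is not None:
        # The size limit is applied after the other filters
        terms = terms[:limit]
    return {term: idx for idx, term in enumerate(terms)}


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = build_vocab(docs, threshold=2, max_prop=0.7, limit=10)
print(vocab)  # {'cat': 0, 'sat': 1}
```

In this toy corpus, "dog" and "ran" fall below the threshold of 2, and "the" is dropped because it occurs in every document, which exceeds the max_prop of 0.7.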