Corpus vocab builder

Path	pimlico.modules.corpora.vocab_builder
Executable	yes

Builds a dictionary (or vocabulary) for a tokenized corpus. This is a data structure that assigns an integer ID to every distinct word seen in the corpus, optionally applying thresholds so that some words are left out.
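
As a rough illustration of the data structure described above, the following sketch assigns an integer ID to each distinct word in order of first appearance. It is not the module's actual implementation: the function name `build_vocab` and the nested-list document layout (documents made of sentences made of words) are assumptions made for the example.

```python
from collections import Counter


def build_vocab(docs):
    """Map every distinct word to an integer ID (hypothetical helper).

    docs: iterable of documents, each a list of sentences,
          each sentence a list of word strings.
    Returns the word-to-ID mapping and the raw word counts.
    """
    token2id = {}
    counts = Counter()
    for doc in docs:
        for sentence in doc:
            for word in sentence:
                counts[word] += 1
                if word not in token2id:
                    # The next free ID is the current vocabulary size
                    token2id[word] = len(token2id)
    return token2id, counts


# Toy corpus: two documents, already tokenized
docs = [
    [["the", "cat", "sat"], ["the", "dog", "barked"]],
    [["a", "cat", "slept"]],
]
token2id, counts = build_vocab(docs)
```

With this toy corpus, `token2id` holds seven entries and `counts` records how often each word was seen, which is the information that thresholds would later be applied to.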

Similar to pimlico.modules.features.vocab_builder, which builds two vocabs, one for terms and one for features.

Inputs

Name	Type(s)
text	TarredCorpus<TokenizedDocumentType>

Outputs

Name	Type(s)
vocab	`Dictionary`

Options

Name	Description	Type
prune_at	Prune the dictionary whenever it reaches this size while it is being built. A lower value keeps the dictionary from growing too large to prune efficiently, which would slow things down, but means that the final pruning will reflect the true corpus statistics less accurately. Should be considerably higher than limit (if used). Set to 0 to disable. Default: 2M (Gensim’s default)	int
max_prop	Include only terms that occur in at most this proportion of documents	float
oov	String used to represent out-of-vocabulary (OOV) items in the vocabulary. If given, the final index is used to represent terms that fall out of the vocabulary after the threshold/limit filters are applied. The entry is added even if its count is 0	string
limit	Limit vocab size to this number of most common entries (after other filters)	int
threshold	Minimum number of occurrences a term must have to be included	int
include	Ensure that certain words are always included in the vocabulary, even if they don’t make it past the various filters, or are never seen in the corpus. Give as a comma-separated list	comma-separated list of strings
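
To make the interaction of the filtering options more concrete, here is a minimal sketch of one way they could combine. It is an illustration only, not Pimlico's code. Everything in it is an assumption: the function name `filter_vocab`, the flat token lists, reading threshold as total occurrences, the tie-breaking rule, and the order in which the steps run (apart from limit coming after the other filters and the OOV entry taking the final index, which the descriptions above state).

```python
from collections import Counter


def filter_vocab(docs, threshold=None, max_prop=None, limit=None,
                 include=(), oov=None):
    """Build a filtered word-to-ID mapping (hypothetical helper).

    docs: list of documents, each a flat list of word strings.
    """
    term_freq = Counter()  # total occurrences of each word
    doc_freq = Counter()   # number of documents containing each word
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))

    # Apply the occurrence threshold and the document-proportion cap
    kept = [
        w for w in term_freq
        if (threshold is None or term_freq[w] >= threshold)
        and (max_prop is None or doc_freq[w] / len(docs) <= max_prop)
    ]

    # Most common first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda w: (-term_freq[w], w))
    if limit is not None:
        kept = kept[:limit]

    # Words in `include` survive even if filtered out or never seen
    for w in include:
        if w not in kept:
            kept.append(w)

    # The OOV symbol, if requested, takes the final index
    if oov is not None and oov not in kept:
        kept.append(oov)

    return {w: i for i, w in enumerate(kept)}


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = filter_vocab(docs, threshold=2, max_prop=0.7, limit=1,
                     include=["ran"], oov="OOV")

# Words missing from the vocabulary fall back to the OOV index
ids = [vocab.get(w, vocab["OOV"]) for w in ["the", "cat"]]
```

In this toy run, "the" is dropped because it appears in every document, "dog" and "ran" fall below the threshold, the limit keeps only "cat", and "ran" is restored by the include list before the OOV entry is appended.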