pimlico.datatypes.dictionary module

This module implements the concept of Dictionary – a mapping between words and their integer ids.

The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel. We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just for one data structure.
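As a rough illustration of the word-to-id mapping described above, here is a minimal sketch in plain Python. It is not Pimlico's or Gensim's code: the function name build_mapping and the lowercasing used as "normalization" are assumptions made only for this example.

```python
def build_mapping(documents):
    """Assign consecutive integer ids to words in order of first appearance.

    Hypothetical helper for illustration; not part of Pimlico or Gensim.
    """
    token2id = {}
    for document in documents:
        for word in document:
            # Lowercasing stands in for whatever normalization is applied upstream
            normalized = word.lower()
            if normalized not in token2id:
                token2id[normalized] = len(token2id)
    # Reverse lookup: integer id back to word
    id2token = {idx: tok for tok, idx in token2id.items()}
    return token2id, id2token


docs = [["The", "cat", "sat"], ["the", "dog", "sat"]]
token2id, id2token = build_mapping(docs)
print(token2id)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
```

Each document is treated as an iterable of word strings, and ids are handed out densely starting from 0, so the mapping can be inverted with a simple dict.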

class pimlico.datatypes.dictionary.Dictionary(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.base.PimlicoDatatype

Dictionary encapsulates the mapping between normalized words and their integer ids.

data_ready()

get_data()

class pimlico.datatypes.dictionary.DictionaryWriter(base_dir)

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

add_documents(documents, prune_at=2000000)

filter(threshold=None, no_above=None, limit=None)
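The two methods above are undocumented here, so the following is only a sketch of what they plausibly do, by analogy with Gensim's Dictionary.add_documents and Dictionary.filter_extremes. The class ToyDictionaryWriter is hypothetical. The reading of threshold as a minimum document frequency, no_above as a maximum fraction of documents, and limit as a cap on the number of words kept is an assumption, not something this page confirms. The prune_at parameter is not modelled.

```python
from collections import Counter


class ToyDictionaryWriter:
    """Hypothetical illustration; not Pimlico's DictionaryWriter."""

    def __init__(self):
        self.token2id = {}
        self.dfs = Counter()  # word -> number of documents containing it
        self.num_docs = 0

    def add_documents(self, documents):
        for document in documents:
            self.num_docs += 1
            # Count each word once per document (document frequency)
            for word in set(document):
                self.dfs[word] += 1
            for word in document:
                if word not in self.token2id:
                    self.token2id[word] = len(self.token2id)

    def filter(self, threshold=None, no_above=None, limit=None):
        # Assumed semantics: see the caveats in the text above
        min_df = 0 if threshold is None else threshold
        max_df = self.num_docs if no_above is None else no_above * self.num_docs
        kept = [w for w, df in self.dfs.items() if min_df <= df <= max_df]
        # Most frequent first; ties broken alphabetically for determinism
        kept.sort(key=lambda w: (-self.dfs[w], w))
        if limit is not None:
            kept = kept[:limit]
        # Reassign compact ids to the surviving words
        self.token2id = {w: i for i, w in enumerate(kept)}
        self.dfs = Counter({w: self.dfs[w] for w in kept})


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]

writer = ToyDictionaryWriter()
writer.add_documents(docs)
writer.filter(threshold=2, no_above=0.9)
print(writer.token2id)  # {'sat': 0}

capped = ToyDictionaryWriter()
capped.add_documents(docs)
capped.filter(limit=2)
print(capped.token2id)  # {'the': 0, 'sat': 1}
```

In the first call, "the" is dropped because it occurs in all three documents (more than 90% of them), and the words seen only once fall below the threshold. In the second call, only the two words with the highest document frequency survive.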