pimlico.datatypes.dictionary module

This module implements the concept of Dictionary – a mapping between words and their integer ids.

The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel. We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just for one data structure.
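To make the idea of a word-to-id mapping concrete, here is a minimal sketch in plain Python. It is not Pimlico's code: the class name `ToyDictionary` and its attributes (`token2id`, `id2token`, `dfs`) and methods are hypothetical, chosen only to illustrate the kind of bookkeeping such a dictionary typically does.

```python
from collections import defaultdict


class ToyDictionary:
    """Hypothetical stand-in for illustration, not Pimlico's Dictionary class."""

    def __init__(self):
        self.token2id = {}           # word -> integer id
        self.id2token = {}           # integer id -> word
        self.dfs = defaultdict(int)  # id -> number of documents containing the word

    def add_document(self, tokens):
        # dict.fromkeys gives the unique tokens in first-seen order,
        # so each word is counted once per document
        for token in dict.fromkeys(tokens):
            if token not in self.token2id:
                new_id = len(self.token2id)
                self.token2id[token] = new_id
                self.id2token[new_id] = token
            self.dfs[self.token2id[token]] += 1

    def doc2ids(self, tokens):
        # Words that were never added are silently dropped
        return [self.token2id[t] for t in tokens if t in self.token2id]


vocab = ToyDictionary()
vocab.add_document(["the", "cat", "sat"])
vocab.add_document(["the", "dog", "sat", "down"])
ids = vocab.doc2ids(["the", "dog", "barked"])
print(ids)
```

Ids are assigned in the order words are first seen, so a corpus processed in the same order always yields the same mapping.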

class Dictionary(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.base.PimlicoDatatype

Dictionary encapsulates the mapping between normalized words and their integer ids.

datatype_name = 'dictionary'
get_data()
data_ready()
get_detailed_status()

class DictionaryWriter(base_dir)

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

add_documents(documents, prune_at=2000000)
filter(threshold=None, no_above=None, limit=None)
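
The signatures above do not document what their parameters mean. The sketch below shows one plausible reading, by analogy with Gensim's `Dictionary.add_documents` and `Dictionary.filter_extremes`. It assumes that `prune_at` caps the vocabulary size while counting, `threshold` is a minimum document frequency, `no_above` is a maximum fraction of documents, and `limit` keeps only the most frequent words. These semantics are assumptions, and the helper functions `count_document_frequencies` and `filter_vocab` are hypothetical, not part of Pimlico.

```python
from collections import Counter


def count_document_frequencies(documents, prune_at=2000000):
    """Count in how many documents each word occurs.

    Assumption: prune_at bounds the vocabulary size during counting.
    """
    dfs = Counter()
    num_docs = 0
    for tokens in documents:
        num_docs += 1
        dfs.update(set(tokens))  # each word counts once per document
        if prune_at is not None and len(dfs) > prune_at:
            # Keep only the prune_at most frequent words to bound memory use
            dfs = Counter(dict(dfs.most_common(prune_at)))
    return dfs, num_docs


def filter_vocab(dfs, num_docs, threshold=None, no_above=None, limit=None):
    """Return a word -> id mapping for the words that survive filtering."""
    kept = dict(dfs)
    if threshold is not None:
        # Assumed: drop words seen in fewer than `threshold` documents
        kept = {w: n for w, n in kept.items() if n >= threshold}
    if no_above is not None:
        # Assumed: drop words seen in more than this fraction of documents
        kept = {w: n for w, n in kept.items() if n <= no_above * num_docs}
    if limit is not None:
        # Assumed: keep only the `limit` most frequent remaining words
        kept = dict(Counter(kept).most_common(limit))
    # Reassign compact ids so there are no gaps after removal
    return {word: new_id for new_id, word in enumerate(sorted(kept))}


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
dfs, num_docs = count_document_frequencies(docs)
token2id = filter_vocab(dfs, num_docs, threshold=2, no_above=0.9)
print(token2id)
```

In this toy corpus, "the" occurs in every document and is removed by `no_above`, while words occurring only once are removed by `threshold`, leaving only "sat".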