pimlico.datatypes.dictionary module

This module implements the concept of Dictionary – a mapping between words and their integer ids.

The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel. We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just for one data structure.

class Dictionary(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.base.PimlicoDatatype

Dictionary encapsulates the mapping between normalized words and their integer ids.
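
To illustrate the idea of a word-to-id mapping, here is a minimal sketch. `ToyDictionary` and its attributes `token2id` and `id2token` are hypothetical names chosen for this example; it is not the Pimlico implementation. It assumes ids are assigned in order of first appearance.

```python
class ToyDictionary:
    """Hypothetical illustration of a word <-> integer id mapping."""

    def __init__(self):
        self.token2id = {}  # word -> integer id
        self.id2token = {}  # integer id -> word

    def add_documents(self, documents):
        # Each document is a list of (already normalized) words
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    new_id = len(self.token2id)  # next unused id
                    self.token2id[token] = new_id
                    self.id2token[new_id] = token


toy = ToyDictionary()
toy.add_documents([["the", "cat", "sat"], ["the", "dog"]])
print(toy.token2id)
```

Keeping both directions of the mapping makes lookups by word and by id equally cheap, at the cost of storing each word twice.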

datatype_name = 'dictionary'
get_data()
data_ready()

Check whether the data corresponding to this datatype instance exists and is ready to be read.

The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this entirely if the data dir isn’t needed.

get_detailed_status()

Returns a list of strings, containing detailed information about the data. Only called if data_ready() == True.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.
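
The override pattern described for data_ready() and get_detailed_status() might look roughly like the sketch below. `BaseDatatype` is a hypothetical stand-in for the real base class, and `VocabDatatype` and the `vocab.txt` filename are invented for illustration; only the behaviour stated above (the default check is that the data dir exists, and overrides call the super method) is taken from the documentation.

```python
import os
import tempfile


class BaseDatatype:
    """Hypothetical stand-in for PimlicoDatatype; not the real class."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def data_ready(self):
        # Mirrors the documented default: only check that the data dir exists
        return os.path.isdir(self.data_dir)

    def get_detailed_status(self):
        return []


class VocabDatatype(BaseDatatype):
    VOCAB_FILE = "vocab.txt"  # hypothetical filename

    def data_ready(self):
        # Add an extra check on top of the default one
        path = os.path.join(self.data_dir, self.VOCAB_FILE)
        return super().data_ready() and os.path.isfile(path)

    def get_detailed_status(self):
        # Call the super method first, then append datatype-specific lines
        status = super().get_detailed_status()
        with open(os.path.join(self.data_dir, self.VOCAB_FILE)) as f:
            size = sum(1 for line in f if line.strip())
        status.append("Vocabulary size: %d" % size)
        return status


with tempfile.TemporaryDirectory() as tmp:
    datatype = VocabDatatype(tmp)
    ready_before = datatype.data_ready()  # dir exists, file does not
    with open(os.path.join(tmp, VocabDatatype.VOCAB_FILE), "w") as f:
        f.write("the\ncat\nsat\n")
    ready_after = datatype.data_ready()
    # Only ask for the detailed status once the data is ready
    status_lines = datatype.get_detailed_status() if ready_after else []
```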

class DictionaryWriter(base_dir)

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

add_documents(documents, prune_at=2000000)
filter(threshold=None, no_above=None, limit=None)
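
The parameters of filter() are not described here. Since the module is based on Gensim, one plausible reading is that they parallel Gensim's `filter_extremes(no_below, no_above, keep_n)`. The sketch below illustrates that reading only: it assumes `threshold` is a minimum document frequency, `no_above` a maximum fraction of documents, and `limit` a cap on vocabulary size. The helper names `doc_frequencies` and `filter_vocab` are hypothetical and not part of Pimlico.

```python
from collections import Counter


def doc_frequencies(documents):
    """Count, for each word, the number of documents it occurs in."""
    dfs = Counter()
    for doc in documents:
        dfs.update(set(doc))  # count each word once per document
    return dfs


def filter_vocab(dfs, num_docs, threshold=None, no_above=None, limit=None):
    """Assumed semantics, modelled on Gensim's filter_extremes."""
    kept = dict(dfs)
    if threshold is not None:
        # Drop words appearing in fewer than `threshold` documents
        kept = {w: c for w, c in kept.items() if c >= threshold}
    if no_above is not None:
        # Drop words appearing in more than this fraction of documents
        kept = {w: c for w, c in kept.items() if c <= no_above * num_docs}
    if limit is not None:
        # Keep only the `limit` most frequent words (ties broken by word)
        top = sorted(kept.items(), key=lambda wc: (-wc[1], wc[0]))[:limit]
        kept = dict(top)
    return kept


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
dfs = doc_frequencies(docs)
filtered = filter_vocab(dfs, len(docs), threshold=2, no_above=0.9)
print(filtered)
```

With these inputs, "the" is dropped for occurring in every document and "dog" and "sat" are dropped for occurring in only one, leaving "cat".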