pimlico.datatypes.dictionary module
This module implements the concept of Dictionary – a mapping between words and
their integer ids.
The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel.
We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just
for one data structure.
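As an illustration of the word-to-id mapping described above, here is a minimal sketch. It is not Pimlico's or Gensim's implementation: the class name `TinyDictionary` and its attributes `token2id`, `id2token` and `dfs` are assumptions chosen for this example only.

```python
from collections import defaultdict


class TinyDictionary:
    """Hypothetical stand-in for illustration; not the Pimlico Dictionary class."""

    def __init__(self):
        self.token2id = {}           # word -> integer id
        self.id2token = {}           # integer id -> word
        self.dfs = defaultdict(int)  # integer id -> number of documents containing the word

    def add_documents(self, documents):
        for doc in documents:
            # dict.fromkeys() removes duplicate words while keeping first-seen order
            for word in dict.fromkeys(doc):
                if word not in self.token2id:
                    new_id = len(self.token2id)
                    self.token2id[word] = new_id
                    self.id2token[new_id] = word
                self.dfs[self.token2id[word]] += 1


vocab = TinyDictionary()
vocab.add_documents([["the", "cat", "sat"], ["the", "dog"]])
print(vocab.token2id)
```

Ids are assigned in order of first appearance, so every word gets a stable integer that can be mapped back through `id2token`.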
-
class Dictionary(base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatype
Dictionary encapsulates the mapping between normalized words and their integer ids.
-
datatype_name = 'dictionary'
-
get_data()
-
data_ready()
Check whether the data corresponding to this datatype instance exists and is ready to be read.
Default implementation just checks whether the data dir exists. Subclasses might want to add their own
checks, or even override this, if the data dir isn’t needed.
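The pattern described above, a default check on the data directory plus an extra check in a subclass, might look roughly like this. The classes `BaseDatatype` and `FileBackedDatatype`, the `data_dir` attribute, the `data` subdirectory and the `required_file` name are all hypothetical and are not taken from Pimlico.

```python
import os
import tempfile


class BaseDatatype:
    """Hypothetical base class standing in for a Pimlico datatype."""

    def __init__(self, base_dir):
        # Assumption: data lives in a "data" subdirectory of base_dir
        self.data_dir = os.path.join(base_dir, "data")

    def data_ready(self):
        # Default check: the data directory exists
        return os.path.isdir(self.data_dir)


class FileBackedDatatype(BaseDatatype):
    required_file = "vocab.txt"  # hypothetical filename

    def data_ready(self):
        # Keep the default check and additionally require one file
        path = os.path.join(self.data_dir, self.required_file)
        return super().data_ready() and os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp:
    datatype = FileBackedDatatype(tmp)
    print(datatype.data_ready())  # the data directory has not been created yet
```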
-
get_detailed_status()
Returns a list of strings, containing detailed information about the data.
Only called if data_ready() == True.
Subclasses may override this to supply useful (human-readable) information specific to the datatype.
They should call the super method.
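A sketch of such an override, extending the list returned by the super method instead of replacing it. `BaseDatatype` and `VocabDatatype` are hypothetical classes written for this example; only the method name `get_detailed_status` comes from the documentation above.

```python
class BaseDatatype:
    """Hypothetical base class; the real base may return other status lines."""

    def get_detailed_status(self):
        return []


class VocabDatatype(BaseDatatype):
    def __init__(self, vocab):
        self.vocab = vocab

    def get_detailed_status(self):
        # Start from the parent's lines, then append datatype-specific information
        status = super().get_detailed_status()
        status.append("Vocabulary size: %d" % len(self.vocab))
        return status


print(VocabDatatype(["cat", "dog"]).get_detailed_status())
```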
-
class DictionaryWriter(base_dir)
Bases: pimlico.datatypes.base.PimlicoDatatypeWriter
-
add_documents(documents, prune_at=2000000)
-
filter(threshold=None, no_above=None, limit=None)