dictionary
This module implements the concept of a Dictionary – a mapping between words and their integer ids.
The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel. We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just for one data structure.
However, it is possible to retrieve a Gensim dictionary directly from the Pimlico data structure if you need to use it with Gensim.
class Dictionary(*args, **kwargs)

    Bases: pimlico.datatypes.base.PimlicoDatatype

    Dictionary encapsulates the mapping between normalized words and their integer ids. This class is responsible for reading and writing dictionaries. DictionaryData is the data structure itself, which is very closely related to Gensim's dictionary.

    datatype_name = 'dictionary'

    datatype_supports_python2 = True
    class Reader(datatype, setup, pipeline, module=None)

        Bases: pimlico.datatypes.base.Reader

        Reader class for Dictionary.

        get_data()

            Load the dictionary and return a DictionaryData object.
        class Setup(datatype, data_paths)

            Bases: pimlico.datatypes.base.Setup

            Setup class for Dictionary.Reader.

            reader_type
                alias of Dictionary.Reader
    class Writer(*args, **kwargs)

        Bases: pimlico.datatypes.base.Writer

        When the context manager is created, a new, empty DictionaryData instance is created. You can build your dictionary by calling add_documents() on the writer, by accessing the dictionary data structure directly (via the data attribute), or by simply replacing it with a fully formed DictionaryData instance of your own.

        You can specify a list/set of stopwords when instantiating the writer. These will be excluded from the dictionary if seen in the corpus.

        metadata_defaults = {}

        writer_param_defaults = {}
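The stopword exclusion described above can be illustrated with a self-contained sketch. This is not the Pimlico Writer API: build_dictionary is a hypothetical stand-in showing only the documented behaviour, that words in the stopword set never receive ids while the dictionary is built.

```python
# Minimal sketch of the writer's stopword handling (illustrative only,
# not the real Pimlico Writer): stopwords are never assigned ids.

def build_dictionary(documents, stopwords=()):
    """Map each non-stopword token to an integer id, in order of first sight."""
    stopwords = set(stopwords)
    token2id = {}
    for doc in documents:
        for token in doc:
            if token in stopwords:
                continue  # excluded from the dictionary, as the writer does
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
d = build_dictionary(docs, stopwords={"the"})
# "the" gets no id; the rest are numbered in order of first appearance
```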
class DictionaryData

    Bases: object

    Dictionary encapsulates the mapping between normalized words and their integer ids. This is taken almost directly from Gensim.

    We also store a set of stopwords. These can be set explicitly (see add_stopwords()) and will also include any words removed by filters on the basis that they are too common. This means we can tell apart words that are OOV because we have never seen them (or not seen them often) and words that are common but filtered.
    id2token
    add_stopwords(new_stopwords)

        Add some stopwords to the list.

        Raises an error if a stopword is currently in the dictionary. We do not remove the term here, because that would end up unexpectedly changing the IDs of other words. Instead, it is left to the user to ensure a stopword has been removed from the dictionary before adding it to the list.

        Terms already in the stopword list will not be added to the dictionary later.
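A self-contained sketch of this contract (illustrative, not the actual DictionaryData code; TinyDict and its methods are hypothetical): adding a stopword that is still in the dictionary raises an error, and terms already marked as stopwords are silently skipped by later additions.

```python
# Sketch of the documented add_stopwords() contract: removing the term
# silently would change other words' ids, so it is an error instead.

class TinyDict:
    def __init__(self):
        self.token2id = {}
        self.stopwords = set()

    def add_term(self, term):
        # stopwords are never added to the dictionary
        if term not in self.token2id and term not in self.stopwords:
            self.token2id[term] = len(self.token2id)

    def add_stopwords(self, new_stopwords):
        for word in new_stopwords:
            if word in self.token2id:
                raise ValueError(
                    "'%s' is in the dictionary: remove it before "
                    "adding it as a stopword" % word)
            self.stopwords.add(word)

d = TinyDict()
d.add_term("cat")
d.add_stopwords(["the"])   # fine: "the" was never added to the dictionary
d.add_term("the")          # ignored: already a stopword
```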
    add_term(term)

        Add a term to the dictionary, without any occurrence count. Note that if you run threshold-based filters after adding a term like this, it will be removed.
    add_documents(documents, prune_at=2000000)

        Update the dictionary from a collection of documents. Each document is a list of tokens: tokenized and normalized strings (either utf8 or unicode).

        This is a convenience wrapper that calls doc2bow on each document with allow_update=True, and also prunes infrequent words, keeping the total number of unique words <= prune_at. This saves memory on very large inputs. To disable pruning, set prune_at=None.

        The decision of when to prune is based on the total number of documents added, not just those added in this call. Otherwise, making many calls with a small number of docs in each would result in pruning on every call.
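The prune_at mechanic can be sketched in a few lines. This is not the real implementation: the function below is hypothetical, uses a much smaller limit than the real default of 2000000, and omits keep_n and id bookkeeping, showing only how the vocabulary is capped by dropping the least frequent words.

```python
# Illustrative sketch of prune_at-style pruning: when the vocabulary
# grows past the limit, the least frequent words are dropped so the
# number of unique words stays <= prune_at.

from collections import Counter

def add_documents(vocab_counts, documents, prune_at=4):
    """vocab_counts: Counter mapping word -> document frequency."""
    for doc in documents:
        vocab_counts.update(set(doc))  # +1 per document a word appears in
        if prune_at is not None and len(vocab_counts) > prune_at:
            # keep only the prune_at most frequent words
            vocab_counts = Counter(dict(vocab_counts.most_common(prune_at)))
    return vocab_counts

vocab = add_documents(Counter(), [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
], prune_at=4)
# the vocabulary never exceeds 4 unique words; frequent words survive
```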
    doc2bow(document, allow_update=False, return_missing=False)

        Convert a document (a list of words) into bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. before calling this method.

        If allow_update is set, the dictionary is also updated in the process: ids are created for new words and document frequencies are updated. For each word appearing in this document, its document frequency (self.dfs) is increased by one.

        If allow_update is not set, this function is const, i.e. read-only.
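The behaviour described above can be shown with a self-contained sketch. This is not the actual DictionaryData method (which operates on the object's own mappings and supports return_missing); it is a minimal stand-in taking token2id and dfs dicts explicitly.

```python
# Sketch of doc2bow-style conversion: turn a token list into sorted
# (token_id, token_count) pairs, optionally growing the dictionary and
# updating document frequencies.

from collections import Counter

def doc2bow(token2id, dfs, document, allow_update=False):
    counts = Counter(document)
    result = []
    for token, count in counts.items():
        if token not in token2id:
            if not allow_update:
                continue          # read-only: unknown words are skipped
            token2id[token] = len(token2id)
        result.append((token2id[token], count))
    if allow_update:
        # document frequency: +1 per word per document it appears in
        for token in counts:
            dfs[token2id[token]] = dfs.get(token2id[token], 0) + 1
    return sorted(result)

token2id, dfs = {}, {}
bow = doc2bow(token2id, dfs, ["cat", "sat", "cat"], allow_update=True)
# bow == [(0, 2), (1, 1)]: "cat" -> id 0 (count 2), "sat" -> id 1
```

Without allow_update, words not yet in the dictionary are simply dropped from the result and dfs is left untouched.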
    filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

        Filter out tokens that appear in

        1. fewer than no_below documents (absolute number), or
        2. more than no_above documents (fraction of the total corpus size, not an absolute number);
        3. after (1) and (2), keep only the keep_n most frequent tokens (or keep all if keep_n=None).

        After the pruning, shrink the resulting gaps in word ids.

        Note: due to the gap shrinking, the same word may have a different word id before and after the call to this function!
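The id-shrinking caveat is worth seeing concretely. The sketch below is illustrative only, not the real method: it uses smaller thresholds than the real defaults and omits keep_n for brevity, but shows why a surviving word's id can change after filtering.

```python
# Illustrative sketch of filter_extremes(): drop too-rare and too-common
# words by document frequency, then reassign contiguous ids, so the
# words that survive may end up with different ids.

def filter_extremes(token2id, dfs, num_docs, no_below=2, no_above=0.5):
    keep = [t for t, i in token2id.items()
            if no_below <= dfs[i] <= no_above * num_docs]
    # gap shrinking: new contiguous ids in the order the kept tokens appear
    return {token: new_id for new_id, token in enumerate(keep)}

token2id = {"the": 0, "cat": 1, "sat": 2, "zyx": 3}
dfs = {0: 10, 1: 4, 2: 3, 3: 1}    # document frequencies over 10 docs
new = filter_extremes(token2id, dfs, num_docs=10, no_below=2, no_above=0.5)
# "the" (too common) and "zyx" (too rare) are gone; "cat" now has id 0
```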
    filter_high_low_extremes(no_below=5, no_above=0.5, keep_n=100000, add_stopwords=True)

        Filter out tokens that appear in

        1. fewer than no_below documents (absolute number), or
        2. more than no_above documents (fraction of the total corpus size, not an absolute number);
        3. after (1) and (2), keep only the keep_n most frequent tokens (or keep all if keep_n=None).

        This is the same as filter_extremes(), but returns separate lists of the terms removed because they are too frequent and those removed because they are not frequent enough.

        If add_stopwords=True (the default), any frequent words filtered out are added to the stopwords list.
    filter_tokens(bad_ids=None, good_ids=None)

        Remove the selected bad_ids tokens from all dictionary mappings, or keep only the selected good_ids and remove the rest.

        bad_ids and good_ids are collections of word ids to be removed or kept, respectively.
    compactify()

        Assign new word ids to all words.

        This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method removes the gaps.