dictionary

This module implements the concept of a Dictionary – a mapping between words and their integer ids.

The implementation is based on Gensim, because Gensim is wonderful and there’s no need to reinvent the wheel. We don’t use Gensim’s data structure directly, because it’s unnecessary to depend on the whole of Gensim just for one data structure.

However, it is possible to retrieve a Gensim dictionary directly from the Pimlico data structure if you need to use it with Gensim.
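
For example, a minimal sketch, assuming reader is a Dictionary.Reader (e.g. obtained as a module input in a running pipeline):

    dict_data = reader.get_data()                   # DictionaryData instance
    gensim_dict = dict_data.as_gensim_dictionary()  # Gensim is imported at this point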

class Dictionary(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Dictionary encapsulates the mapping between normalized words and their integer ids. This class is responsible for reading and writing dictionaries.

DictionaryData is the data structure itself, which is very closely related to Gensim’s dictionary.

datatype_name = 'dictionary'
datatype_supports_python2 = True
class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for Dictionary

get_data()[source]

Load the dictionary and return a DictionaryData object.

class Setup(datatype, data_paths)[source]

Bases: pimlico.datatypes.base.Setup

Setup class for Dictionary.Reader

get_required_paths()[source]

Require that the dictionary file has been written.

reader_type

alias of Dictionary.Reader

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the data.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.

class Writer(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

When the context manager is created, a new, empty DictionaryData instance is created. You can build your dictionary by calling add_documents() on the writer, by accessing the dictionary data structure directly (via the data attribute), or by simply replacing it (through the same data attribute) with a fully formed DictionaryData instance of your own.

You can specify a list/set of stopwords when instantiating the writer. These will be excluded from the dictionary if seen in the corpus.
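
A minimal sketch of building a dictionary with the writer, assuming writer is a Dictionary.Writer obtained for a module output, and assuming that filter()'s threshold argument is a minimum count, by analogy with filter_extremes() below:

    docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
    with writer:
        # Build up the vocabulary from tokenized, normalized documents
        writer.add_documents(docs)
        # Optionally prune the vocabulary before it is written out
        writer.filter(threshold=2)
    # On leaving the context manager, the dictionary is written to disk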

add_documents(documents, prune_at=2000000)[source]
filter(threshold=None, no_above=None, limit=None)[source]
filter_high_low(threshold=None, no_above=None, limit=None)[source]
metadata_defaults = {}
writer_param_defaults = {}
run_browser(reader, opts)[source]

Browse the vocab simply by printing out all the words

class DictionaryData[source]

Bases: object

Dictionary encapsulates the mapping between normalized words and their integer ids. This is taken almost directly from Gensim.

We also store a set of stopwords. These can be set explicitly (see add_stopwords()), and will also include any words that are removed as a result of filters on the basis that they’re too common. This means that we can tell which words are OOV because we’ve never seen them (or not seen them often) and which are common but filtered.

id2token
keys()[source]

Return a list of all token ids.

refresh_id2token()[source]
add_stopwords(new_stopwords)[source]

Add some stopwords to the list.

Raises an error if a stopword is already in the dictionary. We don't remove the term here, because that would end up changing the IDs of other words unexpectedly. Instead, we leave it to the user to ensure that a stopword is removed from the dictionary before it is added to the stopword list.

Terms already in the stopword list will not be added to the dictionary later.
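
For example (a sketch; the no-argument DictionaryData() construction and the module path are assumptions based on the description above):

    from pimlico.datatypes.dictionary import DictionaryData  # assumed module path

    data = DictionaryData()
    data.add_stopwords({"the", "a"})
    data.add_documents([["the", "cat", "sat", "on", "a", "mat"]])
    # "the" and "a" never receive ids; the remaining words are added as normal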

add_term(term)[source]

Add a term to the dictionary, without any occurrence count. Note that if you run threshold-based filters after adding a term like this, it will get removed.

add_documents(documents, prune_at=2000000)[source]

Update dictionary from a collection of documents. Each document is a list of tokens = tokenized and normalized strings (either utf8 or unicode).

This is a convenience wrapper for calling doc2bow on each document with allow_update=True, which also prunes infrequent words, keeping the total number of unique words <= prune_at. This is to save memory on very large inputs. To disable this pruning, set prune_at=None.

Keeps track of total documents added, rather than just those added in this call, to decide when to prune. Otherwise, making many calls with a small number of docs in each results in pruning on every call.
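
For instance, continuing with the DictionaryData instance data from the sketch above:

    batch_1 = [["cat", "sat"], ["dog", "sat"]]
    batch_2 = [["cat", "ran"], ["fish", "swam"]]
    data.add_documents(batch_1)                 # default pruning at 2,000,000 words
    data.add_documents(batch_2, prune_at=None)  # disable pruning for small corpora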

doc2bow(document, allow_update=False, return_missing=False)[source]

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
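
For example, continuing with the same data instance (a sketch; the exact ids shown and the (bow, missing) return shape with return_missing=True are assumptions, based on the Gensim behaviour this is derived from):

    data.add_documents([["cat", "sat", "mat"]])
    bow = data.doc2bow(["cat", "sat", "cat"])
    # e.g. [(0, 2), (1, 1)]: (token_id, token_count) pairs
    bow, missing = data.doc2bow(["cat", "aardvark"], return_missing=True)
    # `missing` maps out-of-vocabulary words to their counts, e.g. {"aardvark": 1}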

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)[source]

Filter out tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number).

After (1) and (2), keep only the first keep_n most frequent tokens, or keep all if keep_n is None.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
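
For example (a sketch, continuing with a DictionaryData instance data):

    # Keep words occurring in at least 5 documents and in no more than half of
    # all documents, then keep only the 100,000 most frequent of those
    data.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
    # Ids may have been reassigned: don't rely on ids obtained before this call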

filter_high_low_extremes(no_below=5, no_above=0.5, keep_n=100000, add_stopwords=True)[source]

Filter out tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number).

After (1) and (2), keep only the first keep_n most frequent tokens, or keep all if keep_n is None.

This is the same as filter_extremes(), but it returns separately the terms removed because they are too frequent and those removed because they are not frequent enough.

If add_stopwords=True (default), any frequent words filtered out will be added to the stopwords list.
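
A sketch of the difference from filter_extremes() (the exact structure of the return value is an assumption here, since only its content is described above):

    removed = data.filter_high_low_extremes(no_below=2, no_above=0.3, keep_n=None)
    # `removed` distinguishes the terms dropped for being too frequent from
    # those dropped for being too rare; with add_stopwords=True (the default),
    # the too-frequent terms are also added to the stopword list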

filter_tokens(bad_ids=None, good_ids=None)[source]

Remove the selected bad_ids tokens from all dictionary mappings, or keep the selected good_ids in the mappings and remove the rest.

bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.

compactify()[source]

Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
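
For example (a sketch, following the note above about removing tokens and then closing the id gaps):

    # Remove two specific words by id, then close the gaps in the id series
    data.filter_tokens(bad_ids=[3, 7])
    data.compactify()
    # Or keep only a whitelist of ids, dropping everything else:
    # data.filter_tokens(good_ids=[0, 1, 2])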

as_gensim_dictionary()[source]

Convert to Gensim’s dictionary type, which this type is based on. If you call this, Gensim will be imported, so your code becomes dependent on having Gensim installed.

Returns: a Gensim dictionary