pimlico.datatypes.embeddings module

class Vocab(word, index, count=0)[source]

Bases: object

A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim’s Vocab.
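As a rough illustration of what such a vocabulary item holds, here is a minimal stand-in. It assumes the item does nothing more than store its three constructor arguments. The name `VocabSketch` is hypothetical, and the actual class may define additional behaviour.

```python
class VocabSketch:
    """Hypothetical stand-in for a vocabulary item (not the Pimlico class)."""

    def __init__(self, word, index, count=0):
        self.word = word    # the word itself
        self.index = index  # position of the word in the vocabulary
        self.count = count  # frequency of the word, 0 if not known

    def __repr__(self):
        return "VocabSketch(%r, index=%d, count=%d)" % (
            self.word, self.index, self.count
        )


# A word seen 12 times, stored at vocabulary position 0
item = VocabSketch("cat", 0, count=12)
```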

class Embeddings(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Datatype to store embedding vectors together with their words. Based on Gensim’s KeyedVectors object, but adapted for use in Pimlico and written so as not to depend on Gensim. (This means it can be used more generally for storing embeddings, even in pipelines that do not otherwise depend on Gensim.)

Provides a method to map to Gensim’s KeyedVectors type for compatibility.

Doesn’t provide all of the functionality of KeyedVectors. The main purpose of this datatype is storage of vectors; other functionality, such as similarity computations, can be provided by utilities or by direct use of Gensim.

vectors
normed_vectors
vector_size
index2vocab
index2word
vocab
word_vec(word)[source]

Accept a single word as input. Returns the word’s representation in vector space, as a 1D numpy array.

word_vecs(words)[source]

Accept multiple words as input. Returns the words’ representations in vector space, as a 2D numpy array with one row per word.
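The two lookups can be pictured as follows. This is only a sketch of the idea, not the Pimlico implementation: the toy data, the `word_to_index` mapping and the `*_sketch` function names are all hypothetical, and nested lists stand in for the numpy matrix so that the example has no dependencies.

```python
# Hypothetical toy data: three words with 2-dimensional vectors.
# Nested lists stand in for the numpy matrix of vectors.
vectors = [
    [0.1, 0.2],  # row 0
    [0.3, 0.4],  # row 1
    [0.5, 0.6],  # row 2
]
word_to_index = {"cat": 0, "dog": 1, "fish": 2}


def word_vec_sketch(word):
    """Return the single row for one word (KeyError if the word is unknown)."""
    return vectors[word_to_index[word]]


def word_vecs_sketch(words):
    """Return one row per word, in the order the words were given."""
    return [word_vec_sketch(w) for w in words]


single = word_vec_sketch("dog")               # one vector
several = word_vecs_sketch(["fish", "cat"])   # a list of two vectors
```

In this sketch a single-word lookup gives one vector, while a multi-word lookup gives one row per requested word, in request order.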

to_keyed_vectors()[source]
class EmbeddingsWriter(base_dir, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

write_vectors(arr)[source]

Write out vectors from a numpy array.

write_word_counts(word_counts)[source]

Write out vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words.
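To make the ordering rule concrete, the following sketch assigns each word an index equal to its position in the input list. The helper `assign_indices` is hypothetical and only illustrates the rule stated above; it does not write anything to disk.

```python
def assign_indices(word_counts):
    """Turn (word, count) pairs into (word, index, count) triples.

    The index of each word is simply its position in the input list.
    """
    return [
        (word, index, count)
        for index, (word, count) in enumerate(word_counts)
    ]


word_counts = [("the", 120), ("cat", 7), ("sat", 3)]
vocab_triples = assign_indices(word_counts)
```

Under this rule, the order of the list presumably needs to match the row order of the vectors written alongside it.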
write_vocab_list(vocab_items)[source]

Write out vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocab objects
write_keyed_vectors(*kvecs)[source]

Write both vectors and vocabulary straight from Gensim’s KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.
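One plausible reading of the concatenation is sketched below. It assumes that rows are stacked and words appended in the order the objects are passed. How the real method treats words that occur in more than one object is not stated here, so the sketch does not handle that case. The function name and the (words, rows) pair representation are hypothetical.

```python
def concatenate_sketch(*parts):
    """Concatenate (words, rows) pairs in the order given.

    Assumption: rows are stacked and words appended; duplicate words
    across parts are not handled here.
    """
    all_words, all_rows = [], []
    for words, rows in parts:
        if len(words) != len(rows):
            raise ValueError("each part needs exactly one row per word")
        all_words.extend(words)
        all_rows.extend(rows)
    return all_words, all_rows


part_a = (["cat", "dog"], [[0.1, 0.2], [0.3, 0.4]])
part_b = (["fish"], [[0.5, 0.6]])
words, rows = concatenate_sketch(part_a, part_b)
```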