pimlico.datatypes.embeddings module

class Vocab(word, index, count=0)

Bases: object

A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim’s Vocab.

class Embeddings(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.base.PimlicoDatatype

Datatype to store embedding vectors, together with their words. Based on Gensim’s KeyedVectors object, but adapted for use in Pimlico so that it does not depend on Gensim. This means it can be used more generally for storing embeddings, even in pipelines that do not otherwise depend on Gensim.

Provides a method to map to Gensim’s KeyedVectors type for compatibility.

Doesn’t provide all of the functionality of KeyedVectors, since the main purpose of this datatype is to store vectors. Other functionality, such as similarity computations, can be provided by utilities or by using Gensim directly.

vectors
normed_vectors
vector_size
word_counts
index2vocab
index2word
vocab

word_vec(word)

Accept a single word as input. Returns the word’s representation in vector space, as a 1D numpy array.

word_vecs(words)

Accept multiple words as input. Returns the words’ representations in vector space, as a 2D numpy array with one row per word.
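
The lookup behind both methods can be sketched as follows. This is a minimal illustration, not Pimlico’s implementation: `ToyEmbeddings` is a hypothetical stand-in that only mirrors the `vectors` and `index2word` attributes listed above, and the `word2index` mapping is an assumption made for the example. With NumPy row indexing, a single word yields one row and a list of words yields one row per word.

```python
import numpy as np


class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings store (not the Pimlico class)."""

    def __init__(self, words, vectors):
        self.index2word = list(words)
        self.vectors = np.asarray(vectors, dtype=np.float32)
        # Assumption: each word's row in `vectors` is its position in `words`
        self.word2index = {w: i for i, w in enumerate(self.index2word)}

    def word_vec(self, word):
        # Single word: one row, i.e. a 1D array of length vector_size
        return self.vectors[self.word2index[word]]

    def word_vecs(self, words):
        # Several words: indexing with a list of row numbers stacks the rows
        return self.vectors[[self.word2index[w] for w in words]]


emb = ToyEmbeddings(["cat", "dog", "fish"], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
single = emb.word_vec("dog")
several = emb.word_vecs(["fish", "cat"])
```

A word missing from the mapping raises `KeyError` in this sketch; how the real class reports unknown words is not covered here.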

to_keyed_vectors()

Map these embeddings to Gensim’s KeyedVectors type. NB: this assumes a recent version of Gensim (e.g. >=3.5.0). If you have problems with the KeyedVectors class, you probably need a later version than the one installed.

class EmbeddingsWriter(base_dir, **kwargs)

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

write_vectors(arr)

Write out vectors from a NumPy array.

write_word_counts(word_counts)

Write out vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of words.
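
As a rough illustration of how list order fixes the indices, the sketch below turns `(word, count)` pairs into vocabulary items. `VocabItem` and `build_vocab_items` are hypothetical names introduced here. `VocabItem` only mirrors the `Vocab(word, index, count=0)` signature above and is not the Pimlico class.

```python
class VocabItem:
    """Hypothetical stand-in mirroring the Vocab(word, index, count=0) signature."""

    def __init__(self, word, index, count=0):
        self.word = word
        self.index = index
        self.count = count


def build_vocab_items(word_counts):
    # The position of each pair in the list becomes the word's vocab index
    return [VocabItem(word, index, count)
            for index, (word, count) in enumerate(word_counts)]


word_counts = [("the", 120), ("cat", 7), ("sat", 3)]
items = build_vocab_items(word_counts)
index2word = [item.word for item in items]
```

Because indices come from position alone, the list should be in the same order as the rows of the vector array written alongside it.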

write_vocab_list(vocab_items)

Write out vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocab objects

write_keyed_vectors(*kvecs)

Write both vectors and vocabulary straight from Gensim’s KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.

class TSVVecFiles(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)

Bases: pimlico.datatypes.files.NamedFileCollection

Embeddings stored in TSV files. This format is used by TensorFlow and can be used, for example, as input to the TensorFlow Projector.

The format is just a TSV file with one vector per row, plus a metadata TSV file giving the name associated with each point and its count. The counts are optional, so the metadata can be written without them.
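
The two-file layout can be sketched with the standard library alone. This is an illustration under assumptions, not the Pimlico writer: the function name `write_tsv_embeddings` and the `Word`/`Count` header labels are hypothetical, and the header row is emitted only for the two-column form on the assumption that the consumer expects column labels only when the metadata has more than one column.

```python
import csv
import os
import tempfile


def write_tsv_embeddings(out_dir, vectors, words, counts=None):
    """Write embeddings.tsv and metadata.tsv into out_dir (illustrative sketch)."""
    # One vector per row, components separated by tabs
    path = os.path.join(out_dir, "embeddings.tsv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for vec in vectors:
            writer.writerow(vec)

    # One label per row, in the same order as the vectors
    path = os.path.join(out_dir, "metadata.tsv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if counts is None:
            for word in words:
                writer.writerow([word])
        else:
            # Assumed header labels for the two-column form
            writer.writerow(["Word", "Count"])
            for word, count in zip(words, counts):
                writer.writerow([word, count])


out_dir = tempfile.mkdtemp()
write_tsv_embeddings(out_dir, [[0.1, 0.2], [0.3, 0.4]], ["cat", "dog"], counts=[7, 3])
with open(os.path.join(out_dir, "embeddings.tsv"), encoding="utf-8") as f:
    embedding_lines = f.read().splitlines()
with open(os.path.join(out_dir, "metadata.tsv"), encoding="utf-8") as f:
    metadata_lines = f.read().splitlines()
```

Passing `counts=None` produces the single-column metadata file, one word per line.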

datatype_name = 'tsv_vec_files'
filenames = ['embeddings.tsv', 'metadata.tsv']

class TSVVecFilesWriter(base_dir)

Bases: pimlico.datatypes.files.NamedFileCollectionWriter

Write embeddings and their labels to TSV files, as used by TensorFlow.

filenames = ['embeddings.tsv', 'metadata.tsv']

write_vectors(array)

write_vocab_with_counts(word_counts)

write_vocab_without_counts(words)