pimlico.datatypes.embeddings module
-
class Vocab(word, index, count=0)
Bases: object
A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim’s Vocab.
-
class Embeddings(base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatype
Datatype to store embedding vectors, together with their words. Based on Gensim’s KeyedVectors object, but adapted for use in Pimlico and so as not to depend on Gensim. (This means it can be used more generally for storing embeddings, even when we’re not depending on Gensim.)
Provides a method to map to Gensim’s KeyedVectors type for compatibility.
Doesn’t provide all of the functionality of KeyedVectors, since the main purpose of this datatype is storing vectors; other functionality, such as similarity computations, can be provided by utilities or by direct use of Gensim.
-
vectors
-
normed_vectors
-
vector_size
-
word_counts
-
index2vocab
-
index2word
-
vocab
-
word_vec(word)
Accepts a single word as input and returns the word’s representation in vector space, as a 1D numpy array.
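The relationship between the attributes listed above and word_vec can be pictured with a minimal sketch. This is not Pimlico's implementation: ToyVocab and ToyEmbeddings are hypothetical names, plain Python lists stand in for the numpy array, and the assumption that vocab maps each word to its vocabulary item (as in Gensim's KeyedVectors) is inferred from the attribute names only.

```python
# Illustrative sketch only, NOT Pimlico's code. ToyVocab and ToyEmbeddings
# are hypothetical names; lists of floats stand in for numpy arrays.

class ToyVocab:
    """Per-word record: the word, its row index and its frequency count."""

    def __init__(self, word, index, count=0):
        self.word = word
        self.index = index
        self.count = count


class ToyEmbeddings:
    def __init__(self, words_with_counts, vectors):
        # vectors holds one row per word, in the same order as the words
        if len(words_with_counts) != len(vectors):
            raise ValueError("need exactly one vector per word")
        self.vectors = vectors
        # Assumed layout: index -> item, index -> word, word -> item
        self.index2vocab = [
            ToyVocab(word, index, count)
            for index, (word, count) in enumerate(words_with_counts)
        ]
        self.index2word = [item.word for item in self.index2vocab]
        self.vocab = {item.word: item for item in self.index2vocab}

    @property
    def vector_size(self):
        # Dimensionality of the stored vectors (0 if nothing is stored)
        return len(self.vectors[0]) if self.vectors else 0

    def word_vec(self, word):
        # Find the word's row index, then return that row of the matrix
        return self.vectors[self.vocab[word].index]


emb = ToyEmbeddings(
    [("cat", 10), ("dog", 7)],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
)
dog_vector = emb.word_vec("dog")
```

In this sketch an unknown word raises a KeyError from the dictionary lookup; how the real datatype handles out-of-vocabulary words is not stated here.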
-
-
class EmbeddingsWriter(base_dir, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatypeWriter
-
write_word_counts(word_counts)
Write out the vocabulary from a list of words with counts.
Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words in this list.
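As a rough illustration of "indices are determined by the order of the words", the sketch below assigns each word its position in the input list. The helper assign_vocab_indices is hypothetical and does not call the Pimlico writer; it only shows the ordering rule described above.

```python
# Hypothetical helper, not part of Pimlico: shows how list order
# translates into vocabulary indices.

def assign_vocab_indices(word_counts):
    """Map each word to (index, count), with indices following input order."""
    return {
        word: (index, count)
        for index, (word, count) in enumerate(word_counts)
    }


word_counts = [("cat", 8), ("the", 120), ("sat", 3)]

# Indices follow the list as given: "cat" gets 0 even though "the" is more frequent
as_given = assign_vocab_indices(word_counts)

# Sorting by descending count first gives frequent words the lowest indices
by_frequency = assign_vocab_indices(
    sorted(word_counts, key=lambda pair: pair[1], reverse=True)
)
```

Because the order alone fixes the indices, the caller needs to pass the words in the same order as the rows of the vector matrix being stored.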
-
-
class TSVVecFiles(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
Bases: pimlico.datatypes.files.NamedFileCollection
Embeddings stored in TSV files. This format is used by TensorFlow and can be used, for example, as input to the TensorFlow Projector.
The format consists of one TSV file with one vector per row, and a second, metadata TSV file with the names associated with the points and their counts. The counts are optional, so the metadata can be written without them.
-
datatype_name = 'tsv_vec_files'
-
filenames = ['embeddings.tsv', 'metadata.tsv']
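The two-file layout described above can be sketched with the standard csv module. The function build_tsv_vec_files is a hypothetical stand-in, not the Pimlico writer, and it returns the file contents as strings rather than writing to disk. The header row in the two-column case follows the Embedding Projector's convention for multi-column metadata; whether Pimlico emits one is an assumption.

```python
import csv
import io

# Hypothetical helper, not part of Pimlico: builds the contents of the
# two TSV files in memory so the layout is easy to inspect.

def build_tsv_vec_files(words, vectors, counts=None):
    if len(words) != len(vectors):
        raise ValueError("need exactly one vector per word")
    if counts is not None and len(counts) != len(words):
        raise ValueError("need exactly one count per word")

    # embeddings.tsv: one vector per row, components separated by tabs
    emb_buf = io.StringIO()
    emb_writer = csv.writer(emb_buf, delimiter="\t", lineterminator="\n")
    for vector in vectors:
        emb_writer.writerow([repr(float(x)) for x in vector])

    # metadata.tsv: one label per row, in the same order as the vectors
    meta_buf = io.StringIO()
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")
    if counts is None:
        for word in words:
            meta_writer.writerow([word])
    else:
        # Assumption: a header row names the columns when there are two
        meta_writer.writerow(["word", "count"])
        for word, count in zip(words, counts):
            meta_writer.writerow([word, count])

    return {
        "embeddings.tsv": emb_buf.getvalue(),
        "metadata.tsv": meta_buf.getvalue(),
    }


files = build_tsv_vec_files(
    ["cat", "dog"], [[0.1, 0.2], [0.3, 0.4]], counts=[10, 7]
)
labels_only = build_tsv_vec_files(["cat"], [[1.0, 2.0]])
```

Row i of embeddings.tsv and row i of the metadata (after any header) describe the same point, so the two files must always be written in the same word order.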
-
-
class TSVVecFilesWriter(base_dir)
Bases: pimlico.datatypes.files.NamedFileCollectionWriter
Write embeddings and their labels to TSV files, as used by TensorFlow.
-
filenames = ['embeddings.tsv', 'metadata.tsv']
-