pimlico.datatypes.embeddings module
-
class
Vocab
(word, index, count=0) Bases:
object
A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim’s
Vocab
.
-
class
Embeddings
(base_dir, pipeline, **kwargs) Bases:
pimlico.datatypes.base.PimlicoDatatype
Datatype to store embedding vectors, together with their words. Based on Gensim’s
KeyedVectors
object, but adapted for use in Pimlico and so as not to depend on Gensim. (This means that it can be used more generally for storing embeddings, even when we’re not depending on Gensim.) Provides a method to map to Gensim’s
KeyedVectors
type for compatibility. Doesn’t provide all of the functionality of
KeyedVectors
, since the main purpose of this datatype is the storage of vectors. Other functionality, like similarity computations, can be provided by utilities or by direct use of Gensim.
-
vectors
-
normed_vectors
-
vector_size
-
index2vocab
-
index2word
-
vocab
-
word_vec
(word) Accepts a single word as input and returns the word’s representation in vector space, as a 1D numpy array.
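The relationship between the attributes listed above and word_vec can be illustrated with a small in-memory sketch. This is not Pimlico's implementation: VocabItem and ToyEmbeddings are hypothetical stand-ins, and the meaning given to each attribute (in particular, that normed_vectors holds unit-length rows and that vocab maps a word to its vocabulary item) is an assumption based on the attribute names and on Gensim's KeyedVectors.

```python
import numpy as np


class VocabItem:
    """Hypothetical stand-in for Vocab: a word, its index and its count."""

    def __init__(self, word, index, count=0):
        self.word = word
        self.index = index
        self.count = count


class ToyEmbeddings:
    """Hypothetical in-memory stand-in, not Pimlico's Embeddings datatype."""

    def __init__(self, words_with_counts, vectors):
        # One row per word; row i belongs to the word with index i
        self.vectors = np.asarray(vectors, dtype=np.float32)
        self.index2vocab = [
            VocabItem(word, i, count)
            for i, (word, count) in enumerate(words_with_counts)
        ]
        self.index2word = [item.word for item in self.index2vocab]
        # Assumed layout: word -> vocabulary item
        self.vocab = {item.word: item for item in self.index2vocab}

    @property
    def vector_size(self):
        # Dimensionality of each embedding
        return self.vectors.shape[1]

    @property
    def normed_vectors(self):
        # Assumption: each row scaled to unit L2 norm
        norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
        return self.vectors / norms

    def word_vec(self, word):
        # Look up the word's row index, then return that row as a 1D array
        return self.vectors[self.vocab[word].index]


emb = ToyEmbeddings([("cat", 10), ("dog", 7)], [[3.0, 4.0], [1.0, 0.0]])
vec = emb.word_vec("cat")
```

In this sketch an unknown word raises a KeyError from the dictionary lookup. How the real datatype handles out-of-vocabulary words is not stated here.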
-
-
class
EmbeddingsWriter
(base_dir, **kwargs) Bases:
pimlico.datatypes.base.PimlicoDatatypeWriter
-
write_word_counts
(word_counts) Write out the vocabulary from a list of words with counts.
Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words in this list.
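As a rough illustration of the expected input, the sketch below builds a word_counts list from a token sequence using only the standard library. Sorting by descending frequency is an assumption made for this example; the documentation only says that indices follow the list order. The token list and the word_to_index mapping are hypothetical and are there only to make the resulting indices visible. In Python 3, str plays the role of unicode.

```python
from collections import Counter

# Hypothetical toy corpus
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

counts = Counter(tokens)

# List of (word, count) pairs, most frequent first
word_counts = counts.most_common()

# The position of each pair in the list is the index the word would receive
word_to_index = {word: i for i, (word, count) in enumerate(word_counts)}
```

A list of this shape is what would be passed to write_word_counts. Constructing the writer itself requires a Pimlico output directory, so that step is left out of the sketch.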
-