Word2vec embedding trainer

Path	pimlico.modules.embeddings.word2vec
Executable	yes

Word2vec embedding learning algorithm, using Gensim’s implementation.

Find out more about word2vec.

This module is simply a wrapper that calls Gensim’s Python (+C) implementation of word2vec on a Pimlico corpus.

Inputs

Name	Type(s)
text	`grouped_corpus` <`TokenizedDocumentType`>

Outputs

Name	Type(s)
model	`embeddings`

Options

Name	Description	Type
iters	number of iterations over the data to perform. Default: 5	int
min_count	word2vec’s min_count option: prunes from the dictionary any words that appear fewer than this number of times in the corpus. Default: 5	int
negative_samples	number of negative samples to include per positive. Default: 5	int
size	number of dimensions in learned vectors. Default: 200	int
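
For orientation, these options correspond to well-known parameters of Gensim’s `Word2Vec` class. The sketch below shows one plausible mapping, assuming Gensim 4.x keyword names (`vector_size`, `epochs`, `min_count`, `negative`); older Gensim releases call the first two `size` and `iter`. The helper `to_gensim_kwargs` is hypothetical and not part of Pimlico, and the module’s actual code may pass the values differently.

```python
# Hypothetical helper, not part of Pimlico: translates this module's option
# names into keyword arguments for gensim.models.Word2Vec (Gensim 4.x names).

# Defaults as documented in the options table
MODULE_DEFAULTS = {"iters": 5, "min_count": 5, "negative_samples": 5, "size": 200}

# Assumed correspondence between module options and Gensim 4.x parameters
OPTION_TO_GENSIM = {
    "iters": "epochs",
    "min_count": "min_count",
    "negative_samples": "negative",
    "size": "vector_size",
}


def to_gensim_kwargs(options=None):
    """Merge user options over the defaults and rename them for Gensim."""
    merged = dict(MODULE_DEFAULTS)
    merged.update(options or {})
    unknown = set(merged) - set(OPTION_TO_GENSIM)
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    # All four options are integers, so coerce string values from a config file
    return {OPTION_TO_GENSIM[name]: int(value) for name, value in merged.items()}


if __name__ == "__main__":
    # With Gensim installed, the result could be passed on like this
    # (tokenized_sentences is a placeholder for a list of token lists):
    #   from gensim.models import Word2Vec
    #   model = Word2Vec(sentences=tokenized_sentences, **to_gensim_kwargs({"size": 100}))
    print(to_gensim_kwargs({"size": 100}))
```

Keeping the renaming in one table makes it easy to adapt the sketch to an older Gensim release by changing only the values of `OPTION_TO_GENSIM`.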

This is an example of how this module can be used in a pipeline config file.

[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output

This example usage includes more options.

[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output
iters=5
min_count=5
negative_samples=5
size=200
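
Config sections like this one follow an INI-style layout, so their structure can be inspected with Python’s standard `configparser`. The sketch below only illustrates how the option values in the section would be read as integers. It is not Pimlico’s own config loader, and it makes no attempt to resolve `input_text` to another module’s output. The function name `read_word2vec_options` is hypothetical.

```python
import configparser

# The example section from this page, embedded as a string for illustration
CONFIG_TEXT = """
[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output
iters=5
min_count=5
negative_samples=5
size=200
"""

# Names of the integer-valued options documented for this module
INT_OPTIONS = ("iters", "min_count", "negative_samples", "size")


def read_word2vec_options(text, section="my_word2vec_module"):
    """Return the integer options found in the given config section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sec = parser[section]
    # Only options present in the section are returned; absent ones would
    # fall back to the module's documented defaults
    return {name: sec.getint(name) for name in INT_OPTIONS if name in sec}


if __name__ == "__main__":
    print(read_word2vec_options(CONFIG_TEXT))
```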

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

word2vec_train