Word2vec embedding trainer
| Path | pimlico.modules.embeddings.word2vec |
| Executable | yes |
Word2vec embedding learning algorithm, using Gensim’s implementation.
Find out more about word2vec.
This module is a thin wrapper that calls Gensim’s Python (and C) implementation of word2vec on a Pimlico corpus.
Inputs
| Name | Type(s) |
|---|---|
| text | grouped_corpus <TokenizedDocumentType> |
Outputs
| Name | Type(s) |
|---|---|
| model | embeddings |
Options
| Name | Description | Type |
|---|---|---|
| iters | Number of iterations (passes) over the data to perform. Default: 5 | int |
| min_count | word2vec’s min_count option: prunes from the vocabulary any word that appears fewer than this number of times in the corpus. Default: 5 | int |
| negative_samples | Number of negative samples to draw per positive example. Default: 5 | int |
| size | Number of dimensions in the learned vectors. Default: 200 | int |
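To make the options above more concrete, here is a minimal sketch of how they might be translated into keyword arguments for Gensim's `Word2Vec` class. The helper `to_gensim_kwargs` is hypothetical and is not part of Pimlico; the actual wrapper may forward the options differently. The keyword names `vector_size` and `epochs` are those used by Gensim 4.x, while earlier releases called them `size` and `iter`.

```python
# Defaults as documented in the options table above.
DEFAULTS = {"iters": 5, "min_count": 5, "negative_samples": 5, "size": 200}


def to_gensim_kwargs(options):
    """Translate module options into Word2Vec keyword arguments.

    Hypothetical helper for illustration only; not Pimlico's own code.
    """
    opts = {**DEFAULTS, **options}  # explicitly set options override defaults
    return {
        "vector_size": opts["size"],           # `size` in Gensim < 4.0
        "epochs": opts["iters"],               # `iter` in Gensim < 4.0
        "min_count": opts["min_count"],
        "negative": opts["negative_samples"],
    }


kwargs = to_gensim_kwargs({"size": 100})

# With Gensim installed, training would then look roughly like:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=tokenized_docs, **kwargs)
# where `tokenized_docs` (an assumed name) is an iterable of token lists.
```

Keeping the translation in one place like this makes it easier to adapt to whichever Gensim version is installed, since only the keyword names differ between versions.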
Example config
This is an example of how this module can be used in a pipeline config file.
[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output
This example usage includes more options.
[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output
iters=5
min_count=5
negative_samples=5
size=200
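The relationship between the two configs above (omitted options fall back to their documented defaults) can be sketched as follows. This uses Python's standard `configparser` purely for illustration; Pimlico has its own config loader, so the parsing code and the `DEFAULTS` dictionary here are assumptions rather than the library's implementation. The section contents are a variation on the example above with two options overridden.

```python
import configparser

# A module section in the style of the examples above, overriding two options.
CONFIG = """
[my_word2vec_module]
type=pimlico.modules.embeddings.word2vec
input_text=module_a.some_output
min_count=2
size=100
"""

# Defaults as documented in the options table.
DEFAULTS = {"iters": 5, "min_count": 5, "negative_samples": 5, "size": 200}

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
section = parser["my_word2vec_module"]

# Read each option as an int, falling back to the default when it is omitted.
options = {
    name: section.getint(name, fallback=default)
    for name, default in DEFAULTS.items()
}
```

After this runs, `options` holds the two overridden values alongside the defaults for `iters` and `negative_samples`.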
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.