GloVe embedding trainer¶
Path | pimlico.modules.embeddings.glove |
Executable | yes |
Train GloVe embeddings on a tokenized corpus.
Uses the original GloVe code <https://github.com/stanfordnlp/GloVe>, called in a subprocess.
This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.
Inputs¶
Name | Type(s) |
---|---|
text | grouped_corpus <TokenizedDocumentType> |
Outputs¶
Name | Type(s) |
---|---|
embeddings | embeddings |
glove_output | named_file_collection |
Options¶
Name | Description | Type |
---|---|---|
alpha | Parameter in exponent of weighting function; default 0.75 | float |
array_size | Limit to length <array_size> the buffer which stores chunks of data to shuffle before writing to disk. This value overrides that which is automatically produced by ‘memory’ | int |
distance_weighting | If False, do not weight cooccurrence count by distance between words; if True (default), weight the cooccurrence count by inverse of distance between words | bool |
eta | Initial learning rate; default 0.05 | float |
grad_clip | Gradient component clipping parameter. Values will be clipped to the [-grad_clip, grad_clip] interval | float |
iter | Number of training iterations; default 25 | int |
max_product | Limit the size of dense cooccurrence array by specifying the max product of the frequency counts of the two cooccurring words. This value overrides that which is automatically produced by ‘memory’. Typically only needs adjustment for use with very large corpora. | int |
max_vocab | Upper bound on vocabulary size, i.e. keep the <max_vocab> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet. Default: 0 (no limit) | int |
memory | Soft limit for memory consumption, in GB – based on simple heuristic, so not extremely accurate; default 4.0 | float |
min_count | Lower limit such that words which occur fewer than <min_count> times are discarded. Default: 0 | int |
overflow_length | Limit to length <overflow_length> the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array, before writing to disk. This value overrides that which is automatically produced by ‘memory’. Typically only needs adjustment for use with very large corpora. | int |
seed | Random seed to use for shuffling. If not set, will be randomized using current time | int |
symmetric | If False, only use left context; if True (default), use left and right | bool |
threads | Number of threads during training; default 8 | int |
vector_size | Dimension of word vector representations (excluding bias term); default 50 | int |
window_size | Number of context words to the left (and to the right, if symmetric = 1); default 15 | int |
x_max | Parameter specifying cutoff in weighting function; default 100.0 | float |
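The alpha and x_max options are the parameters of GloVe's cooccurrence weighting function from the original paper: f(x) = (x / x_max)^alpha for x < x_max, and 1 otherwise. A minimal sketch of this function (the function name is illustrative, not part of this module's API):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's loss weighting for a cooccurrence count x:
    f(x) = (x / x_max) ** alpha if x < x_max, else 1.
    Defaults match this module's x_max and alpha defaults."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Counts at or above x_max get full weight; rarer pairs are downweighted
print(glove_weight(200.0))  # -> 1.0
print(glove_weight(50.0))   # (50/100) ** 0.75, roughly 0.595
```

Raising x_max gives more frequent pairs full weight only at higher counts; lowering alpha flattens the penalty on rare pairs.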
Example config¶
This is an example of how this module can be used in a pipeline config file.
[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output
This example usage includes more options.
[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output
alpha=0.75
array_size=0
distance_weighting=T
eta=0.05
grad_clip=0.1
iter=25
max_product=0
max_vocab=0
memory=4.00
min_count=0
overflow_length=0
seed=0
symmetric=T
threads=8
vector_size=50
window_size=15
x_max=100.00
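The glove_output collection contains the raw files written by the underlying GloVe tool, which stores trained vectors in a plain-text format: one word per line, followed by its space-separated vector components. A minimal parsing sketch for that format, assuming you have located the vectors file within the output collection (the reader function here is illustrative, not part of Pimlico's API):

```python
def read_glove_vectors(path):
    """Parse GloVe's plain-text vector format, where each line is
    '<word> <v1> <v2> ...' with space-separated float components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            # First field is the word; the rest are vector components
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```

For most pipelines the embeddings output is more convenient, since it exposes the vectors through Pimlico's standard embeddings datatype rather than raw files.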
Test pipelines¶
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.