GloVe embedding trainer

Path pimlico.modules.embeddings.glove
Executable yes

Train GloVe embeddings on a tokenized corpus.

Uses the original GloVe code <https://github.com/stanfordnlp/GloVe>, called in a subprocess.

This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.
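As a rough illustration of what calling the GloVe code in a subprocess can involve, the sketch below builds argument lists for the four command-line tools in the GloVe repository (vocab_count, cooccur, shuffle, glove). The tool and flag names are those of the GloVe tools. The function name, the file names and the way options are mapped to flags are assumptions made for this example, not a description of how the module is actually implemented.

```python
from pathlib import Path


def glove_commands(build_dir, work_dir, opts):
    """Build argument lists for GloVe's four command-line tools.

    Hypothetical helper: the file names under work_dir and the mapping from
    option names to flags are assumptions made for illustration.
    """
    build, work = Path(build_dir), Path(work_dir)
    vocab = str(work / "vocab.txt")               # assumed file name
    shuffled = str(work / "cooccurrence.shuf")    # assumed file name
    vectors = str(work / "vectors")               # assumed output prefix

    return [
        # 1. Count word frequencies: corpus on stdin, vocabulary on stdout
        [str(build / "vocab_count"),
         "-min-count", str(opts["min_count"]),
         "-max-vocab", str(opts["max_vocab"])],
        # 2. Collect cooccurrence counts: corpus on stdin, records on stdout
        [str(build / "cooccur"),
         "-vocab-file", vocab,
         "-memory", str(opts["memory"]),
         "-window-size", str(opts["window_size"]),
         "-symmetric", str(int(opts["symmetric"]))],
        # 3. Shuffle the records: stdin to stdout (redirected to `shuffled`)
        [str(build / "shuffle"),
         "-memory", str(opts["memory"])],
        # 4. Train the embeddings on the shuffled records
        [str(build / "glove"),
         "-input-file", shuffled,
         "-vocab-file", vocab,
         "-save-file", vectors,
         "-vector-size", str(opts["vector_size"]),
         "-threads", str(opts["threads"]),
         "-iter", str(opts["iter"]),
         "-eta", str(opts["eta"]),
         "-alpha", str(opts["alpha"]),
         "-x-max", str(opts["x_max"])],
    ]


# Example with the default option values listed on this page
defaults = {
    "min_count": 0, "max_vocab": 0, "memory": 4.0, "window_size": 15,
    "symmetric": True, "vector_size": 50, "threads": 8, "iter": 25,
    "eta": 0.05, "alpha": 0.75, "x_max": 100.0,
}
commands = glove_commands("GloVe/build", "work", defaults)
```

Each list could then be passed to `subprocess.run(cmd, stdin=..., stdout=..., check=True)`, with the first three tools reading from standard input and writing to standard output.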

Inputs

Name Type(s)
text grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
embeddings embeddings
glove_output named_file_collection
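
GloVe's text vector format has one word per line, followed by its vector components separated by spaces. The sketch below parses lines in that format into a dictionary. Whether the glove_output collection contains such a text file, and under what name, is an assumption here; the embeddings output would normally be read through Pimlico's own datatypes instead. The function name is hypothetical.

```python
def parse_glove_text(lines):
    """Parse lines of GloVe's text vector format into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, components = parts[0], parts[1:]
        vectors[word] = [float(value) for value in components]
    return vectors


# Two made-up lines in the expected layout (not real training output)
sample = ["the 0.1 -0.2 0.3\n", "cat 0.4 0.5 -0.6\n"]
vectors = parse_glove_text(sample)
```

With a real file, the same function could be applied to an open file handle, for example `parse_glove_text(open(path, encoding="utf-8"))`, where `path` points to a text vectors file.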

Options

Name Description Type
alpha Parameter in exponent of weighting function; default 0.75 float
array_size Limit the buffer that stores chunks of data to shuffle before writing to disk to length <array_size>. This value overrides the one automatically derived from ‘memory’ int
distance_weighting If False, do not weight cooccurrence count by distance between words; if True (default), weight the cooccurrence count by inverse of distance between words bool
eta Initial learning rate; default 0.05 float
grad_clip Gradient components clipping parameter. Values will be clipped to the interval [-grad_clip, grad_clip] float
iter Number of training iterations; default 25 int
max_product Limit the size of the dense cooccurrence array by specifying the maximum product of the frequency counts of the two cooccurring words. This value overrides the one automatically derived from ‘memory’. Typically only needs adjustment for use with very large corpora. int
max_vocab Upper bound on vocabulary size, i.e. keep the <max_vocab> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet. Default: 0 (no limit) int
memory Soft limit for memory consumption, in GB – based on simple heuristic, so not extremely accurate; default 4.0 float
min_count Lower limit such that words which occur fewer than <min_count> times are discarded. Default: 0 int
overflow_length Limit the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array before writing to disk, to length <overflow_length>. This value overrides the one automatically derived from ‘memory’. Typically only needs adjustment for use with very large corpora. int
seed Random seed to use for shuffling. If not set, will be randomized using current time int
symmetric If False, only use left context; if True (default), use left and right bool
threads Number of threads during training; default 8 int
vector_size Dimension of word vector representations (excluding bias term); default 50 int
window_size Number of context words to the left (and to the right, if symmetric is True); default 15 int
x_max Parameter specifying cutoff in weighting function; default 100.0 float
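
The alpha and x_max options control the weighting function from the GloVe model, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, which limits the influence of very frequent cooccurrences. The distance_weighting option controls how much each cooccurrence adds to the count. The sketch below illustrates both; the function names are made up for this example and the code is not taken from GloVe or Pimlico.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max) ** alpha, capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


def cooccurrence_increment(distance, distance_weighting=True):
    """Contribution of one cooccurrence of two words `distance` tokens apart."""
    if distance_weighting:
        return 1.0 / distance  # closer words count for more
    return 1.0                 # every cooccurrence counts equally


# A pair seen 50 times gets a weight of 0.5 ** 0.75, roughly 0.59;
# pairs seen x_max times or more all get the full weight of 1.0.
half_weight = glove_weight(50.0)
full_weight = glove_weight(250.0)
```

A smaller alpha flattens the curve, so rare pairs are weighted closer to frequent ones, while a larger x_max raises the count at which a pair reaches full weight.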

Example config

This is an example of how this module can be used in a pipeline config file.

[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output

This example usage includes more options.

[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output
alpha=0.75
array_size=0
distance_weighting=T
eta=0.05
grad_clip=0.1
iter=25
max_product=0
max_vocab=0
memory=4.00
min_count=0
overflow_length=0
seed=0
symmetric=T
threads=8
vector_size=50
window_size=15
x_max=100.00

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.