GloVe embedding trainer

Path	pimlico.modules.embeddings.glove
Executable	yes

Train GloVe embeddings on a tokenized corpus.

Uses the original GloVe code <https://github.com/stanfordnlp/GloVe>, called in a subprocess.

This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.
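As a rough illustration of what calling the GloVe code in a subprocess can involve: the upstream GloVe distribution provides four command-line tools (`vocab_count`, `cooccur`, `shuffle` and `glove`) that its demo script runs in sequence. The sketch below only builds the argument lists and does not execute anything. The function name `glove_commands`, the file names under `work_dir`, and the mapping from this module's options to flags are assumptions made for illustration; the module's actual invocation may differ.

```python
import os


def glove_commands(build_dir, work_dir, opts):
    """Build argument lists for the four upstream GloVe tools.

    Hypothetical helper: the file names below are illustrative only.
    In the upstream demo script, vocab_count, cooccur and shuffle read
    from stdin and write to stdout; glove takes its files as arguments.
    """
    vocab_file = os.path.join(work_dir, "vocab.txt")
    shuffled_file = os.path.join(work_dir, "cooccurrence.shuf.bin")
    save_file = os.path.join(work_dir, "vectors")
    return {
        # Step 1: count word frequencies to build the vocabulary
        "vocab_count": [
            os.path.join(build_dir, "vocab_count"),
            "-min-count", str(opts["min_count"]),
        ],
        # Step 2: collect word-word cooccurrence counts
        "cooccur": [
            os.path.join(build_dir, "cooccur"),
            "-memory", str(opts["memory"]),
            "-vocab-file", vocab_file,
            "-window-size", str(opts["window_size"]),
        ],
        # Step 3: shuffle the cooccurrence records
        "shuffle": [
            os.path.join(build_dir, "shuffle"),
            "-memory", str(opts["memory"]),
        ],
        # Step 4: train the embeddings
        "glove": [
            os.path.join(build_dir, "glove"),
            "-save-file", save_file,
            "-input-file", shuffled_file,
            "-vocab-file", vocab_file,
            "-threads", str(opts["threads"]),
            "-x-max", str(opts["x_max"]),
            "-iter", str(opts["iter"]),
            "-vector-size", str(opts["vector_size"]),
        ],
    }


# Example using this module's documented default values
options = {
    "min_count": 0, "memory": 4.0, "window_size": 15, "threads": 8,
    "x_max": 100.0, "iter": 25, "vector_size": 50,
}
commands = glove_commands("build", "work", options)
```

Each list could then be passed to `subprocess.run`, with stdin and stdout redirected to files for the first three tools.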

Inputs

Name	Type(s)
text	`grouped_corpus` <`TokenizedDocumentType`>

Outputs

Name	Type(s)
embeddings	`embeddings`
glove_output	`named_file_collection`
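For orientation, GloVe's plain-text vector output stores one word per line, followed by its space-separated vector components. The sketch below parses lines in that layout. It assumes the `glove_output` collection contains such a text file; the file's name is not given here, and `parse_glove_text` is a hypothetical helper, not part of Pimlico.

```python
def parse_glove_text(lines):
    """Parse GloVe text-format lines into a {word: vector} dict.

    Assumes each line is: word v1 v2 ... vn, separated by whitespace.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# In-memory example in place of a real output file
sample_lines = ["the 0.5 -0.25 0.125\n", "cat 1.0 2.0 3.0\n", "\n"]
sample_vectors = parse_glove_text(sample_lines)
```

Within a pipeline, the `embeddings` output is the more convenient way to consume the result; reading the raw file is mainly useful for inspecting what GloVe wrote.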

Options

Name	Description	Type
alpha	Parameter in exponent of weighting function; default 0.75	float
array_size	Limit to length <array_size> the buffer which stores chunks of data to shuffle before writing to disk. This value overrides that which is automatically produced by ‘memory’	int
distance_weighting	If False, do not weight cooccurrence count by distance between words; if True (default), weight the cooccurrence count by inverse of distance between words	bool
eta	Initial learning rate; default 0.05	float
grad_clip	Clipping parameter for gradient components. Values are clipped to the interval [-grad_clip, grad_clip]	float
iter	Number of training iterations; default 25	int
max_product	Limit the size of dense cooccurrence array by specifying the max product of the frequency counts of the two cooccurring words. This value overrides that which is automatically produced by ‘memory’. Typically only needs adjustment for use with very large corpora.	int
max_vocab	Upper bound on vocabulary size, i.e. keep the <max_vocab> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet. Default: 0 (no limit)	int
memory	Soft limit for memory consumption, in GB. Based on a simple heuristic, so not extremely accurate; default 4.0	float
min_count	Lower limit such that words which occur fewer than <min_count> times are discarded. Default: 0	int
overflow_length	Limit to length <overflow_length> the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array, before writing to disk. This value overrides that which is automatically produced by ‘memory’. Typically only needs adjustment for use with very large corpora.	int
seed	Random seed to use for shuffling. If not set, will be randomized using current time	int
symmetric	If False, only use left context; if True (default), use left and right	bool
threads	Number of threads during training; default 8	int
vector_size	Dimension of word vector representations (excluding bias term); default 50	int
window_size	Number of context words to the left (and to the right, if symmetric is True); default 15	int
x_max	Parameter specifying cutoff in weighting function; default 100.0	float
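To make the roles of `alpha`, `x_max` and `distance_weighting` concrete, here is a minimal sketch of the weighting function from the GloVe model and of the per-pair count increment. The function names are hypothetical; the formulas follow the published GloVe model and are not taken from this module's code.

```python
def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weight applied to a cooccurrence count in the GloVe loss.

    Counts below x_max are scaled as (count / x_max) ** alpha;
    counts at or above x_max get the full weight of 1.
    """
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


def cooccurrence_increment(distance, distance_weighting=True):
    """Amount added to a pair's count for one cooccurrence.

    With distance weighting, a context word `distance` positions away
    contributes 1 / distance; without it, every cooccurrence adds 1.
    """
    return 1.0 / distance if distance_weighting else 1.0


weight_rare = glove_weight(50.0)       # below the cutoff, scaled down
weight_frequent = glove_weight(200.0)  # above the cutoff, capped at 1
```

Raising `x_max` delays the point at which frequent pairs reach full weight, while a smaller `alpha` flattens the curve so that rare pairs are down-weighted less.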

Example config

This is an example of how this module can be used in a pipeline config file.

[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output

This example usage includes more options.

[my_glove_module]
type=pimlico.modules.embeddings.glove
input_text=module_a.some_output
alpha=0.75
array_size=0
distance_weighting=T
eta=0.05
grad_clip=0.1
iter=25
max_product=0
max_vocab=0
memory=4.00
min_count=0
overflow_length=0
seed=0
symmetric=T
threads=8
vector_size=50
window_size=15
x_max=100.00

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

glove_train