fastText embedding trainer

Path pimlico.modules.embeddings.fasttext
Executable yes

Train fastText embeddings on a tokenized corpus.

Uses the fastText Python package <https://fasttext.cc/docs/en/python-module.html>.

FastText embeddings store more than a single vector per word: they also include sub-word (character n-gram) representations. The module therefore produces two outputs: a standard embeddings output containing the word vectors, and a special fastText embeddings output.

This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.

Inputs

Name Type(s)
text grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
embeddings embeddings
model fasttext_embeddings

Options

Name            Description                                                     Type
bucket          number of buckets. Default: 2,000,000                           int
dim             size of word vectors. Default: 100                              int
epoch           number of epochs. Default: 5                                    int
loss            loss function: ns, hs, softmax, ova. Default: ns                ‘ns’, ‘hs’, ‘softmax’ or ‘ova’
lr              learning rate. Default: 0.05                                    float
lr_update_rate  change the rate of updates for the learning rate. Default: 100  int
maxn            max length of char ngram. Default: 6                            int
min_count       minimal number of word occurrences. Default: 5                  int
minn            min length of char ngram. Default: 3                            int
model           unsupervised fasttext model: cbow, skipgram. Default: skipgram  ‘skipgram’ or ‘cbow’
neg             number of negatives sampled. Default: 5                         int
t               sampling threshold. Default: 0.0001                             float
verbose         verbosity level. Default: 2                                     int
word_ngrams     max length of word ngram. Default: 1                            int
ws              size of the context window. Default: 5                          int
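These options closely mirror the keyword arguments of `fasttext.train_unsupervised` in the fastText Python package, which uses camelCase for a few of them (`minCount`, `wordNgrams`, `lrUpdateRate`). The sketch below shows one way the option names could be translated. The names `RENAMED`, `DEFAULTS` and `to_fasttext_kwargs` are hypothetical, and this is an assumption about the correspondence, not a copy of Pimlico's own code.

```python
# Option names that are assumed to differ from the fastText keyword names
RENAMED = {
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
    "lr_update_rate": "lrUpdateRate",
}

# Defaults as listed in the options table
DEFAULTS = {
    "bucket": 2000000, "dim": 100, "epoch": 5, "loss": "ns", "lr": 0.05,
    "lr_update_rate": 100, "maxn": 6, "min_count": 5, "minn": 3,
    "model": "skipgram", "neg": 5, "t": 0.0001, "verbose": 2,
    "word_ngrams": 1, "ws": 5,
}


def to_fasttext_kwargs(options):
    """Merge user options over the defaults and rename keys for fastText."""
    merged = dict(DEFAULTS)
    merged.update(options)
    return {RENAMED.get(name, name): value for name, value in merged.items()}


if __name__ == "__main__":
    kwargs = to_fasttext_kwargs({"dim": 300, "min_count": 2})
    print(kwargs)
    # With the fasttext package installed and a plain-text corpus at a
    # path of your choosing (hypothetical here), training would look like:
    #   import fasttext
    #   model = fasttext.train_unsupervised("corpus.txt", **kwargs)
```

Keeping the defaults in one dictionary makes it easy to see which settings a given pipeline actually overrides.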

Example config

This is an example of how this module can be used in a pipeline config file.

[my_fasttext_module]
type=pimlico.modules.embeddings.fasttext
input_text=module_a.some_output

This example usage includes more options.

[my_fasttext_module]
type=pimlico.modules.embeddings.fasttext
input_text=module_a.some_output
bucket=2000000
dim=100
epoch=5
loss=ns
lr=0.05
lr_update_rate=100
maxn=6
min_count=5
minn=3
model=skipgram
neg=5
t=0.0001
verbose=2
word_ngrams=1
ws=5
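Before running a pipeline, it can be useful to sanity-check a section like the one above. The following sketch uses only Python's standard `configparser` to read such a section and check the restricted-choice options (`loss`, `model`) and the n-gram length bounds. Pimlico has its own config loader, so this is purely illustrative; the names `EXAMPLE`, `ALLOWED` and `check_section` are hypothetical.

```python
import configparser

# A shortened copy of the example section, used as test input
EXAMPLE = """
[my_fasttext_module]
type=pimlico.modules.embeddings.fasttext
input_text=module_a.some_output
dim=100
loss=ns
model=skipgram
minn=3
maxn=6
"""

# Permitted values for the options that take a fixed set of choices
ALLOWED = {
    "loss": {"ns", "hs", "softmax", "ova"},
    "model": {"skipgram", "cbow"},
}


def check_section(text, section):
    """Return a list of human-readable problems found in one config section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    opts = dict(parser[section])
    problems = []
    for name, allowed in ALLOWED.items():
        if name in opts and opts[name] not in allowed:
            problems.append("%s=%s is not one of %s" % (name, opts[name], sorted(allowed)))
    # The minimum n-gram length should not exceed the maximum
    if "minn" in opts and "maxn" in opts and int(opts["minn"]) > int(opts["maxn"]):
        problems.append("minn must not exceed maxn")
    return problems


if __name__ == "__main__":
    print(check_section(EXAMPLE, "my_fasttext_module"))
```

An empty list means no problems were found by these particular checks; it does not guarantee that Pimlico will accept the config.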

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.