fastText embedding trainer¶
Path | pimlico.modules.embeddings.fasttext |
Executable | yes |
Train fastText embeddings on a tokenized corpus.
Uses the fastText Python package <https://fasttext.cc/docs/en/python-module.html>.
FastText embeddings store more than just a vector for each word, since they also have sub-word representations. We therefore store a standard embeddings output, with the word vectors in, and also a special fastText embeddings output.
This module does not support Python 2, so can only be used when Pimlico is being run under Python 3
Inputs¶
Name | Type(s) |
---|---|
text | grouped_corpus <TokenizedDocumentType > |
Outputs¶
Name | Type(s) |
---|---|
embeddings | embeddings |
model | fasttext_embeddings |
Options¶
Name | Description | Type |
---|---|---|
bucket | number of buckets. Default: 2,000,000 | int |
dim | size of word vectors. Default: 100 | int |
epoch | number of epochs. Default: 5 | int |
loss | loss function: ns, hs, softmax, ova. Default: ns | ‘ns’, ‘hs’, ‘softmax’ or ‘ova’ |
lr | learning rate. Default: 0.05 | float |
lr_update_rate | change the rate of updates for the learning rate. Default: 100 | int |
maxn | max length of char ngram. Default: 6 | int |
min_count | minimal number of word occurences. Default: 5 | int |
minn | min length of char ngram. Default: 3 | int |
model | unsupervised fasttext model: cbow, skipgram. Default: skipgram | ‘skipgram’ or ‘cbow’ |
neg | number of negatives sampled. Default: 5 | int |
t | sampling threshold. Default: 0.0001 | float |
verbose | verbose. Default: 2 | int |
word_ngrams | max length of word ngram. Default: 1 | int |
ws | size of the context window. Default: 5 | int |
Example config¶
This is an example of how this module can be used in a pipeline config file.
[my_fasttext_module]
type=pimlico.modules.embeddings.fasttext
input_text=module_a.some_output
This example usage includes more options.
[my_fasttext_module]
type=pimlico.modules.embeddings.fasttext
input_text=module_a.some_output
bucket=2000000
dim=100
epoch=5
loss=ns
lr=0.05
lr_update_rate=100
maxn=6
min_count=5
minn=3
model=skipgram
neg=5
t=0.00
verbose=2
word_ngrams=1
ws=5
Test pipelines¶
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.