LDA trainer
Path | pimlico.modules.gensim.lda |
Executable | yes |
Trains LDA using Gensim’s basic LDA implementation, or the multicore version.
This module does not support Python 2, since Gensim has dropped Python 2 support, so it can only be used when Pimlico is run under Python 3.
Inputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <IntegerListsDocumentType> |
vocab | dictionary |
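As a rough illustration of how the two inputs relate, the sketch below turns one document of word IDs into the bag-of-words form that Gensim's LDA classes consume, that is, a list of (term ID, count) pairs. It assumes that a document of type IntegerListsDocumentType can be viewed as a list of lists of integer IDs and that the vocabulary maps IDs to strings. The function `doc_to_bow` and the toy data are hypothetical and not part of Pimlico; the filtering step only suggests how terms named in ignore_terms might be dropped.

```python
from collections import Counter


def doc_to_bow(sentences, id2word, ignore_terms=()):
    """Flatten a document (list of lists of word IDs) into sorted (id, count) pairs."""
    ignored = set(ignore_terms)
    # Look up the IDs of the terms to skip (hypothetical handling of ignore_terms)
    ignore_ids = {word_id for word_id, word in id2word.items() if word in ignored}
    counts = Counter(
        word_id
        for sentence in sentences
        for word_id in sentence
        if word_id not in ignore_ids
    )
    return sorted(counts.items())


# Toy vocabulary and document, made up for illustration
id2word = {0: "OOV", 1: "topic", 2: "model", 3: "corpus"}
doc = [[1, 2, 0], [2, 3, 2]]

bow = doc_to_bow(doc, id2word, ignore_terms=["OOV"])
print(bow)
```

Here the OOV term (ID 0) is left out of the counts, which matches the intent described for the ignore_terms option.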
Options
Name | Description | Type |
---|---|---|
alpha | Alpha prior over the topic distribution. May be one of the special values ‘symmetric’, ‘asymmetric’ and ‘auto’, or a single float, or a list of floats. Default: symmetric | ‘symmetric’, ‘asymmetric’, ‘auto’ or a float |
chunksize | Model’s chunksize parameter. Chunk size to use for distributed/multicore computing. Default: 2000 | int |
decay | Decay parameter. Default: 0.5 | float |
distributed | Turn on distributed computing. Default: False. Ignored by multicore implementation | bool |
eta | Eta prior over the word distribution. May be one of the special values ‘auto’ and ‘symmetric’, or a float. Default: symmetric | ‘symmetric’, ‘auto’ or a float |
eval_every | | int |
gamma_threshold | | float |
ignore_terms | Ignore any of these terms in the bags of words when iterating over the corpus to train the model. Typically, you’ll want to include an OOV term here if your corpus has one, and any other special terms that are not part of a document’s content | comma-separated list of strings |
iterations | Max number of iterations in each update. Default: 50 | int |
minimum_phi_value | | float |
minimum_probability | | float |
multicore | Use Gensim’s multicore implementation of LDA training (gensim.models.ldamulticore). The default is to use gensim.models.ldamodel. The number of cores used for training is set by Pimlico’s processes parameter | bool |
num_topics | Number of topics for the trained model to have. Default: 100 | int |
offset | Offset parameter. Default: 1.0 | float |
passes | Passes parameter. Default: 1 | int |
tfidf | Transform word counts using TF-IDF when presenting documents to the model for training. Default: False | bool |
update_every | Model’s update_every parameter. Default: 1. Ignored by multicore implementation | int |
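The options above correspond closely to keyword arguments of Gensim's `LdaModel` and `LdaMulticore` classes. The sketch below shows one plausible way the choice between the two implementations could be made; it is not Pimlico's actual code. The helper `build_lda_call` is hypothetical, and passing the processes setting as Gensim's `workers` argument is an assumption based on the description of the multicore option. To keep the example free of dependencies, it returns the class path as a string rather than importing Gensim.

```python
def build_lda_call(opts, processes=1):
    """Return (class_path, kwargs) for the Gensim LDA class matching the options.

    Hypothetical helper: defaults follow the options table, not Pimlico's source.
    """
    kwargs = {
        "num_topics": opts.get("num_topics", 100),
        "chunksize": opts.get("chunksize", 2000),
        "passes": opts.get("passes", 1),
        "alpha": opts.get("alpha", "symmetric"),
        "eta": opts.get("eta", "symmetric"),
        "decay": opts.get("decay", 0.5),
        "offset": opts.get("offset", 1.0),
        "iterations": opts.get("iterations", 50),
    }
    # Options without a documented default are only passed on when given
    for name in ("eval_every", "gamma_threshold",
                 "minimum_probability", "minimum_phi_value"):
        if name in opts:
            kwargs[name] = opts[name]

    if opts.get("multicore", False):
        # distributed and update_every are ignored by the multicore implementation
        kwargs["workers"] = processes  # assumption: processes maps to workers
        return "gensim.models.ldamulticore.LdaMulticore", kwargs

    kwargs["distributed"] = opts.get("distributed", False)
    kwargs["update_every"] = opts.get("update_every", 1)
    return "gensim.models.ldamodel.LdaModel", kwargs


name, kwargs = build_lda_call(
    {"multicore": True, "num_topics": 50, "update_every": 4}, processes=4
)
print(name)
print(sorted(kwargs))
```

With multicore enabled, update_every and distributed are left out of the keyword arguments, mirroring the "Ignored by multicore implementation" notes in the table.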
Example config
This is an example of how this module can be used in a pipeline config file.
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
This example usage includes more options.
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
alpha=symmetric
chunksize=2000
decay=0.50
distributed=F
eta=symmetric
eval_every=10
gamma_threshold=0.00
ignore_terms=
iterations=50
minimum_phi_value=0.01
minimum_probability=0.01
multicore=F
num_topics=100
offset=1.00
passes=1
tfidf=F
update_every=1
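Option values in the config file are plain strings, so they need converting to the types listed in the options table. The sketch below shows what that conversion might look like for a few of the options, using Python's standard `configparser` as a stand-in. Pimlico has its own config reader and option processing, so the converter functions here are hypothetical, and treating `F` and `T` as boolean spellings is an assumption drawn from the example above.

```python
import configparser

# A shortened copy of the example section above
EXAMPLE = """
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
alpha=symmetric
chunksize=2000
decay=0.50
distributed=F
ignore_terms=
multicore=F
num_topics=100
"""


def parse_bool(value):
    # Assumed spellings of "true"; anything else counts as false
    return value.strip().lower() in ("t", "true", "1", "yes")


def parse_prior(value):
    # A prior is either a float or a special name such as 'symmetric'
    value = value.strip()
    try:
        return float(value)
    except ValueError:
        return value


def parse_terms(value):
    # Comma-separated list of strings; an empty value gives an empty list
    return [term.strip() for term in value.split(",") if term.strip()]


CONVERTERS = {
    "alpha": parse_prior,
    "chunksize": int,
    "decay": float,
    "distributed": parse_bool,
    "ignore_terms": parse_terms,
    "multicore": parse_bool,
    "num_topics": int,
}

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)
section = parser["my_lda_trainer_module"]

# Convert only the options we know how to handle; skip keys such as "type"
options = {
    name: CONVERTERS[name](raw)
    for name, raw in section.items()
    if name in CONVERTERS
}
print(options)
```

Note that an empty `ignore_terms=` line yields an empty list, so no terms are filtered.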
Example pipelines
This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.