LDA trainer
Path | pimlico.modules.gensim.lda |
Executable | yes |
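Trains an LDA topic model using Gensim’s basic LDA implementation (gensim.models.ldamodel) or, if the multicore option is set, its multicore version (gensim.models.ldamulticore).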
Todo
Add test pipeline and test
Inputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <IntegerListsDocumentType> |
vocab | dictionary |
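To make the relationship between these two inputs concrete, here is a minimal sketch (not Pimlico's actual code) of how a document of integer lists might be turned into the bag-of-words form that Gensim's LDA classes consume. It assumes that each document is a list of sentences, that each sentence is a list of integer word IDs, and that those IDs index the vocabulary. The function `doc_to_bow` and the sample data are hypothetical. The `ignore_ids` argument mimics the effect of the `ignore_terms` option listed under Options, after the terms have been looked up in the vocabulary.

```python
from collections import Counter


def doc_to_bow(doc, ignore_ids=frozenset()):
    """Convert a document (list of lists of word IDs) into a sorted
    list of (word_id, count) pairs, skipping any ignored IDs."""
    counts = Counter(
        word_id
        for sentence in doc
        for word_id in sentence
        if word_id not in ignore_ids
    )
    return sorted(counts.items())


# Hypothetical data: ID 0 stands for an out-of-vocabulary (OOV) term
doc = [[3, 0, 5, 3], [5, 0, 7]]
bow = doc_to_bow(doc, ignore_ids={0})
print(bow)
```

Dropping the OOV ID before counting keeps a placeholder token from dominating the topics, which is the motivation given for `ignore_terms`.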
Outputs
Name | Type(s) |
---|---|
model | lda_model |
Options
Name | Description | Type |
---|---|---|
alpha | Alpha prior over topic distribution. May be one of special values ‘symmetric’, ‘asymmetric’ and ‘auto’, or a single float, or a list of floats. Default: symmetric | ‘symmetric’, ‘asymmetric’, ‘auto’ or a float |
chunksize | Model’s chunksize parameter. Chunk size to use for distributed/multicore computing. Default: 2000 | int |
decay | Decay parameter. Default: 0.5 | float |
distributed | Turn on distributed computing. Default: False. Ignored by multicore implementation | bool |
eta | Eta prior of word distribution. May be one of special values ‘auto’ and ‘symmetric’, or a float. Default: symmetric | ‘symmetric’, ‘auto’ or a float |
eval_every | | int |
gamma_threshold | | float |
ignore_terms | Ignore any of these terms in the bags of words when iterating over the corpus to train the model. Typically, you’ll want to include an OOV term here if your corpus has one, and any other special terms that are not part of a document’s content | comma-separated list of strings |
iterations | Max number of iterations in each update. Default: 50 | int |
minimum_phi_value | | float |
minimum_probability | | float |
multicore | Use Gensim’s multicore implementation of LDA training (gensim.models.ldamulticore). Default is to use gensim.models.ldamodel. The number of cores used for training is set by Pimlico’s processes parameter | bool |
num_topics | Number of topics for the trained model to have. Default: 100 | int |
offset | Offset parameter. Default: 1.0 | float |
passes | Passes parameter. Default: 1 | int |
tfidf | Transform word counts using TF-IDF when presenting documents to the model for training. Default: False | bool |
update_every | Model’s update_every parameter. Default: 1. Ignored by multicore implementation | int |
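Most of the options above share their names with constructor arguments of Gensim's `LdaModel` and `LdaMulticore` classes. The sketch below illustrates one way the options could be split into a class choice and keyword arguments, dropping `distributed` and `update_every` for the multicore case, as the table says they are ignored there. It is an illustration under assumptions, not the module's implementation: `build_lda_kwargs` and `train` are hypothetical helpers, and passing Pimlico's `processes` value as Gensim's `workers` argument is an assumption.

```python
# Options accepted by LdaModel but not by LdaMulticore
SINGLE_CORE_ONLY = ("distributed", "update_every")


def build_lda_kwargs(options, processes=1):
    """Return a Gensim class name and the constructor kwargs for it."""
    opts = dict(options)
    multicore = opts.pop("multicore", False)
    # Assumed to be applied while preparing the corpus, not passed to Gensim
    opts.pop("tfidf", None)
    opts.pop("ignore_terms", None)
    if multicore:
        for name in SINGLE_CORE_ONLY:
            opts.pop(name, None)
        # Assumption: Pimlico's processes setting becomes Gensim's workers
        opts["workers"] = processes
        return "LdaMulticore", opts
    return "LdaModel", opts


def train(corpus, id2word, options, processes=1):
    """Train a model; corpus is an iterable of bag-of-words documents."""
    # Imported lazily so build_lda_kwargs can be used without Gensim installed
    from gensim import models

    class_name, kwargs = build_lda_kwargs(options, processes)
    model_cls = getattr(models, class_name)
    return model_cls(corpus=corpus, id2word=id2word, **kwargs)


name, kwargs = build_lda_kwargs(
    {"multicore": True, "num_topics": 100, "update_every": 1}, processes=4
)
print(name, kwargs)
```

Keeping the argument-building step separate from the training call makes it easy to check which options reach Gensim before starting a long training run.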
Example config
This is an example of how this module can be used in a pipeline config file.
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
This example usage includes more options.
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
alpha=symmetric
chunksize=2000
decay=0.50
distributed=F
eta=symmetric
eval_every=10
gamma_threshold=0.00
ignore_terms=
iterations=50
minimum_phi_value=0.01
minimum_probability=0.01
multicore=F
num_topics=100
offset=1.00
passes=1
tfidf=F
update_every=1
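Option values in a config file are plain strings, so they have to be converted to the types listed in the Options table before use. The following sketch shows one way to do that with Python's standard `configparser`, used here only for illustration; Pimlico has its own config loader, and the converter functions and the accepted boolean spellings are assumptions.

```python
import configparser

CONFIG = """
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
alpha=symmetric
decay=0.50
ignore_terms=
multicore=F
num_topics=100
"""


def parse_bool(value):
    """Accept short forms such as T/F as well as true/false (an assumption)."""
    return value.strip().lower() in ("t", "true", "1", "yes")


def parse_alpha(value):
    """Keep the special keywords as strings; otherwise read a float."""
    value = value.strip()
    if value in ("symmetric", "asymmetric", "auto"):
        return value
    return float(value)


def parse_list(value):
    """Split a comma-separated list, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Hypothetical converters, keyed by option name
CONVERTERS = {
    "alpha": parse_alpha,
    "decay": float,
    "ignore_terms": parse_list,
    "multicore": parse_bool,
    "num_topics": int,
}

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
section = parser["my_lda_trainer_module"]
options = {
    name: convert(section[name])
    for name, convert in CONVERTERS.items()
    if name in section
}
print(options)
```

An empty `ignore_terms=` line, as in the example config, yields an empty list here, so no terms are filtered out.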