LDA trainer

Path pimlico.modules.gensim.lda
Executable yes

Trains LDA using Gensim’s basic LDA implementation, or the multicore version.

Gensim has dropped Python 2 support, so this module can only be used when Pimlico is run under Python 3.

Inputs

| Name | Type(s) |
| --- | --- |
| corpus | grouped_corpus <IntegerListsDocumentType> |
| vocab | dictionary |
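
Gensim's LDA implementations take each document as a bag of words, that is, a list of (word ID, count) pairs. The sketch below shows one way a document from the corpus input could be converted to that form. It assumes that a document is a list of lists of integer word IDs and that the ignore_terms option (see Options) has already been mapped to IDs through the vocab input. The function name `doc_to_bow` is hypothetical, and the module's own conversion may differ.

```python
from collections import Counter


def doc_to_bow(sentences, ignore_ids=frozenset()):
    """Flatten a document (a list of lists of word IDs) into a sorted bag of words."""
    counts = Counter(
        word_id
        for sentence in sentences
        for word_id in sentence
        if word_id not in ignore_ids
    )
    # Each document becomes a list of (word_id, count) pairs, sorted by word ID
    return sorted(counts.items())


# One document with two sentences; ID 0 stands in for a hypothetical OOV term
document = [[3, 1, 0, 3], [2, 0, 1]]
print(doc_to_bow(document, ignore_ids={0}))  # → [(1, 2), (2, 1), (3, 2)]
```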

Outputs

| Name | Type(s) |
| --- | --- |
| model | lda_model |

Options

| Name | Description | Type |
| --- | --- | --- |
| alpha | Alpha prior over topic distribution. May be one of the special values ‘symmetric’, ‘asymmetric’ and ‘auto’, or a single float, or a list of floats. Default: symmetric | ‘symmetric’, ‘asymmetric’, ‘auto’ or a float |
| chunksize | Model’s chunksize parameter. Chunk size to use for distributed/multicore computing. Default: 2000 | int |
| decay | Decay parameter. Default: 0.5 | float |
| distributed | Turn on distributed computing. Default: False. Ignored by the multicore implementation | bool |
| eta | Eta prior over word distribution. May be one of the special values ‘auto’ and ‘symmetric’, or a float. Default: symmetric | ‘symmetric’, ‘auto’ or a float |
| eval_every | | int |
| gamma_threshold | | float |
| ignore_terms | Ignore any of these terms in the bags of words when iterating over the corpus to train the model. Typically, you’ll want to include an OOV term here if your corpus has one, and any other special terms that are not part of a document’s content | comma-separated list of strings |
| iterations | Max number of iterations in each update. Default: 50 | int |
| minimum_phi_value | | float |
| minimum_probability | | float |
| multicore | Use Gensim’s multicore implementation of LDA training (gensim.models.ldamulticore). The default is to use gensim.models.ldamodel. The number of cores used for training is set by Pimlico’s processes parameter | bool |
| num_topics | Number of topics for the trained model to have. Default: 100 | int |
| offset | Offset parameter. Default: 1.0 | float |
| passes | Passes parameter. Default: 1 | int |
| tfidf | Transform word counts using TF-IDF when presenting documents to the model for training. Default: False | bool |
| update_every | Model’s update_every parameter. Default: 1. Ignored by the multicore implementation | int |
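
To illustrate how the multicore option affects the other options, here is a minimal sketch of selecting a Gensim class and its keyword arguments. The helper names `build_lda_call` and `train` are hypothetical, and passing the processes value directly as Gensim's `workers` argument is an assumption; the module's actual mapping may differ. Gensim is only imported when `train` is called.

```python
from importlib import import_module

# Constructor parameters accepted by both LdaModel and LdaMulticore
SHARED = (
    "num_topics", "chunksize", "passes", "alpha", "eta", "decay", "offset",
    "eval_every", "iterations", "gamma_threshold", "minimum_probability",
    "minimum_phi_value",
)


def build_lda_call(options, processes=1):
    """Return (class_path, kwargs) for the Gensim trainer implied by the options."""
    kwargs = {name: options[name] for name in SHARED if name in options}
    if options.get("multicore", False):
        # distributed and update_every are ignored by the multicore implementation
        # Assumption: the processes value is used as the worker count unchanged
        kwargs["workers"] = processes
        return "gensim.models.ldamulticore.LdaMulticore", kwargs
    for name in ("distributed", "update_every"):
        if name in options:
            kwargs[name] = options[name]
    return "gensim.models.ldamodel.LdaModel", kwargs


def train(corpus, id2word, options, processes=1):
    """Train a model on a bag-of-words corpus (requires Gensim to be installed)."""
    class_path, kwargs = build_lda_call(options, processes)
    module_name, class_name = class_path.rsplit(".", 1)
    model_cls = getattr(import_module(module_name), class_name)
    return model_cls(corpus=corpus, id2word=id2word, **kwargs)
```

Keeping the argument selection separate from the training call makes it possible to check which options reach Gensim without running a training job.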

Example config

This is an example of how this module can be used in a pipeline config file.

[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
alpha=symmetric
chunksize=2000
decay=0.50
distributed=F
eta=symmetric
eval_every=10
gamma_threshold=0.00
ignore_terms=
iterations=50
minimum_phi_value=0.01
minimum_probability=0.01
multicore=F
num_topics=100
offset=1.00
passes=1
tfidf=F
update_every=1
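
The option values above are plain strings that have to be converted to the types listed under Options. The following sketch does this for a few of them using Python's standard `configparser`, not Pimlico's own config loader, which may accept other syntax. The helper names, the set of spellings treated as true, and the use of commas to separate a list of floats for alpha are all assumptions made for illustration.

```python
from configparser import ConfigParser

CONFIG = """
[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
alpha=symmetric
decay=0.50
distributed=F
ignore_terms=
num_topics=100
"""


def parse_bool(value):
    # Assumption: T/F-style flags as in the example; the other spellings are a guess
    return value.strip().lower() in ("t", "true", "1", "y", "yes")


def parse_alpha(value):
    """Return a special value, a single float, or a list of floats."""
    value = value.strip()
    if value in ("symmetric", "asymmetric", "auto"):
        return value
    # Assumption: a list of floats is written as comma-separated numbers
    parts = [float(part) for part in value.split(",")]
    return parts[0] if len(parts) == 1 else parts


def parse_list(value):
    """Split a comma-separated list of strings, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = ConfigParser(interpolation=None)
parser.read_string(CONFIG)
section = parser["my_lda_trainer_module"]

options = {
    "alpha": parse_alpha(section["alpha"]),
    "decay": float(section["decay"]),
    "distributed": parse_bool(section["distributed"]),
    "ignore_terms": parse_list(section["ignore_terms"]),
    "num_topics": int(section["num_topics"]),
}
print(options)
```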

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.