LDA trainer

Path	pimlico.modules.gensim.lda
Executable	yes

Trains an LDA topic model using either Gensim’s basic LDA implementation (gensim.models.ldamodel) or its multicore version (gensim.models.ldamulticore).

Todo

Add test pipeline and test

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`IntegerListsDocumentType`>
vocab	`dictionary`
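
To make the two inputs concrete: Gensim’s LDA consumes bag-of-words documents, i.e. lists of (word ID, count) pairs, whereas the corpus input supplies integer word IDs that index into the vocabulary. The following is a minimal sketch of that conversion, assuming each document arrives as a list of sentences, each a list of word IDs. The function name `to_bag_of_words` and the toy data are illustrative only and are not part of Pimlico.

```python
from collections import Counter
from itertools import chain


def to_bag_of_words(doc):
    """Flatten a document (list of sentences of word IDs) into
    a sorted list of (word_id, count) pairs."""
    counts = Counter(chain.from_iterable(doc))
    return sorted(counts.items())


# Hypothetical toy vocabulary: the position in the list is the word ID
vocab = ["OOV", "topic", "model", "word"]

# One document with two sentences, expressed as word IDs
doc = [[1, 2, 3], [1, 1, 0]]

bow = to_bag_of_words(doc)
for word_id, count in bow:
    print(vocab[word_id], count)
```

Sorting by word ID is a convenience for readability here; it is not stated in this page whether the module orders the pairs.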

Outputs

Name	Type(s)
model	`lda_model`

Options

Name	Description	Type
alpha	Alpha prior over the topic distribution. May be one of the special values ‘symmetric’, ‘asymmetric’ and ‘auto’, or a single float, or a list of floats. Default: symmetric	‘symmetric’, ‘asymmetric’, ‘auto’ or a float
chunksize	Model’s chunksize parameter. Chunk size to use for distributed/multicore computing. Default: 2000	int
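decay	Model’s decay parameter. In Gensim this controls how much of the previous estimate is forgotten as each new chunk is processed (kappa in online LDA). Default: 0.5	float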
distributed	Turn on distributed computing. Default: False. Ignored by multicore implementation	bool
eta	Eta prior over the word distribution. May be one of the special values ‘auto’ and ‘symmetric’, or a float. Default: symmetric	‘symmetric’, ‘auto’ or a float
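eval_every	Model’s eval_every parameter. In Gensim, log perplexity is estimated once every this many updates	int
gamma_threshold	Model’s gamma_threshold parameter. In Gensim, the minimum change in the gamma parameters required to continue iterating	float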
ignore_terms	Ignore any of these terms in the bags of words when iterating over the corpus to train the model. Typically, you’ll want to include an OOV term here if your corpus has one, and any other special terms that are not part of a document’s content	comma-separated list of strings
iterations	Max number of iterations in each update. Default: 50	int
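minimum_phi_value	Model’s minimum_phi_value parameter. In Gensim, a lower bound on term probabilities, used when per-word topics are requested	float
minimum_probability	Model’s minimum_probability parameter. In Gensim, topics with a probability below this threshold are filtered out	float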
multicore	Use Gensim’s multicore implementation of LDA training (gensim.models.ldamulticore). The default is to use gensim.models.ldamodel. The number of cores used for training is set by Pimlico’s processes parameter	bool
num_topics	Number of topics for the trained model to have. Default: 100	int
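offset	Model’s offset parameter. In Gensim this controls how much the first few iterations are slowed down (tau_0 in online LDA). Default: 1.0	float
passes	Number of passes through the corpus during training. Default: 1	int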
tfidf	Transform word counts using TF-IDF when presenting documents to the model for training. Default: False	bool
update_every	Model’s update_every parameter. Default: 1. Ignored by multicore implementation	int
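
The `ignore_terms` and `tfidf` options both change what the model sees for each document. The sketch below illustrates the idea on bag-of-words data: terms listed in `ignore_terms` are dropped, and counts are then reweighted by a plain tf × ln(N / df) scheme. This is an assumption-laden illustration, not the module’s code: the helper names are hypothetical, and the exact TF-IDF weighting and normalization used by the module or by Gensim may differ.

```python
import math
from collections import Counter


def filter_ignored(bow, ignore_ids):
    """Drop (word_id, count) pairs whose word ID is in ignore_ids."""
    return [(wid, n) for wid, n in bow if wid not in ignore_ids]


def tfidf(corpus):
    """Reweight counts by ln(N / df). Illustrative weighting only;
    no length normalization is applied."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each word ID occurs
    df = Counter(wid for bow in corpus for wid, _ in bow)
    return [
        [(wid, n * math.log(n_docs / df[wid])) for wid, n in bow]
        for bow in corpus
    ]


# Hypothetical toy vocabulary: the position in the list is the word ID
vocab = ["OOV", "topic", "model", "word"]

# ignore_terms as it would appear in the config: comma-separated strings
ignore_terms = "OOV"
ignore_ids = {vocab.index(t) for t in ignore_terms.split(",") if t}

corpus = [[(0, 1), (1, 3), (2, 1)], [(1, 1), (3, 2)]]
filtered = [filter_ignored(bow, ignore_ids) for bow in corpus]
weighted = tfidf(filtered)
```

Note that under this weighting a term occurring in every document gets weight zero, which is one reason special terms such as an OOV marker are better removed explicitly than left to the weighting.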

Example config

This is an example of how this module can be used in a pipeline config file.

[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_lda_trainer_module]
type=pimlico.modules.gensim.lda
input_corpus=module_a.some_output
input_vocab=module_a.some_output
alpha=symmetric
chunksize=2000
decay=0.50
distributed=F
eta=symmetric
eval_every=10
gamma_threshold=0.00
ignore_terms=
iterations=50
minimum_phi_value=0.01
minimum_probability=0.01
multicore=F
num_topics=100
offset=1.00
passes=1
tfidf=F
update_every=1
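
All option values in a config file are strings, so they must be interpreted before reaching Gensim. The following sketch shows one way this could work, including how `multicore` selects the Gensim class and why `distributed` and `update_every` are left out in that case. Everything here is hypothetical: the helper names, the accepted boolean spellings, and the comma-separated form for a list of alpha values are assumptions, not Pimlico’s actual parsing.

```python
def parse_bool(value):
    """Interpret config-style booleans such as 'T' or 'F' (assumed spellings)."""
    return value.strip().lower() in ("t", "true", "1", "yes")


def parse_alpha(value):
    """Return a special keyword, a single float, or a list of floats.
    A comma-separated list is an assumed syntax."""
    value = value.strip()
    if value in ("symmetric", "asymmetric", "auto"):
        return value
    parts = [float(p) for p in value.split(",")]
    return parts[0] if len(parts) == 1 else parts


def build_lda_call(options):
    """Map string options to (Gensim class path, keyword arguments)."""
    kwargs = {
        "num_topics": int(options.get("num_topics", "100")),
        "alpha": parse_alpha(options.get("alpha", "symmetric")),
        "chunksize": int(options.get("chunksize", "2000")),
        "passes": int(options.get("passes", "1")),
        "iterations": int(options.get("iterations", "50")),
        "decay": float(options.get("decay", "0.5")),
        "offset": float(options.get("offset", "1.0")),
    }
    if parse_bool(options.get("multicore", "F")):
        # The multicore class takes neither distributed nor update_every
        return "gensim.models.ldamulticore.LdaMulticore", kwargs
    kwargs["distributed"] = parse_bool(options.get("distributed", "F"))
    kwargs["update_every"] = int(options.get("update_every", "1"))
    return "gensim.models.ldamodel.LdaModel", kwargs


# Options as they would be read from a config section (values are strings)
name, kwargs = build_lda_call(
    {"multicore": "T", "num_topics": "20", "alpha": "0.1,0.9"}
)
default_name, default_kwargs = build_lda_call({})
```

Returning the class path as a string keeps the sketch runnable without Gensim installed; real code would import the class and call it with the corpus and these keyword arguments.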