LDA-seq (DTM) document topic analysis

Path pimlico.modules.gensim.ldaseq_doc_topics
Executable yes

Takes a trained DTM model and produces the topic vector for every document in a corpus.

The corpus is given as integer list documents: each document is a list of sentences, and each sentence is a list of the integer IDs of its words. The corpus is assumed to use the same mapping from words to integer IDs as the DTM model’s training corpus, so no further mapping is needed.
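To illustrate the input format, the sketch below flattens one document's sentences into the sparse bag-of-words form, a list of (word ID, count) pairs, that Gensim topic models generally consume. The function name `doc_to_bow` and the sample data are hypothetical. Flattening across sentences is an assumption about how such input might be prepared, not a description of the module's actual code.

```python
from collections import Counter


def doc_to_bow(sentences):
    """Flatten a document (a list of sentences, each a list of word IDs)
    into a sorted list of (word_id, count) pairs."""
    # Count every word ID across all sentences of the document
    counts = Counter(word_id for sentence in sentences for word_id in sentence)
    return sorted(counts.items())


# Hypothetical document: two sentences of integer word IDs
example_doc = [[4, 7, 7, 2], [2, 9]]
example_bow = doc_to_bow(example_doc)
print(example_bow)
```

Because the IDs are assumed to match the training vocabulary, no dictionary lookup is involved at this stage.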

We also require a corpus of labels to say what time slice each document is in. These should be from the same set of labels that the DTM model was trained on, so that each document label can be mapped to a trained slice.
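The label-to-slice mapping can be pictured as a simple lookup. The following is a minimal sketch, assuming the trained model's slice labels are available as an ordered list. The names `build_slice_index` and `slice_for_label` and the example labels are hypothetical, and the module's own handling of unseen labels may differ.

```python
def build_slice_index(slice_labels):
    """Map each time-slice label to its position in the training-time slice order."""
    return {label: index for index, label in enumerate(slice_labels)}


def slice_for_label(label, slice_index):
    """Return the slice index for a document label, failing loudly on unseen labels."""
    try:
        return slice_index[label]
    except KeyError:
        raise ValueError("label %r was not seen when the model was trained" % (label,))


# Hypothetical slice labels, in the order assumed to have been used at training time
trained_labels = ["1990s", "2000s", "2010s"]
slice_index = build_slice_index(trained_labels)
example_slice = slice_for_label("2000s", slice_index)
print(example_slice)
```

Raising an error for an unknown label is a design choice of this sketch: it makes a mismatch between the label corpus and the trained slices visible immediately.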

This module does not support Python 2, since Gensim has dropped Python 2 support. It can only be used when Pimlico is run under Python 3.

Inputs

Name Type(s)
corpus grouped_corpus <IntegerListsDocumentType>
labels grouped_corpus <LabelDocumentType>
model ldaseq_model

Outputs

Name Type(s)
vectors grouped_corpus <VectorDocumentType>

Example config

This is an example of how this module can be used in a pipeline config file.

[my_ldaseq_doc_topics_module]
type=pimlico.modules.gensim.ldaseq_doc_topics
input_corpus=module_a.some_output
input_labels=module_a.some_output
input_model=module_a.some_output

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.