Topic model topic coherence¶

Path	pimlico.modules.gensim.coherence
Executable	yes

Compute topic coherence.

Takes input as a list of the top words for each topic. This can be produced from various types of topic model, so they can all be evaluated using this method.

Also requires a corpus from which to compute the PMI statistics. This should typically be a different corpus to that on which the model was trained.

For now, this just computes statistics and outputs them to a text file, and also outputs a single number representing the mean topic coherence across topics.

This module does not support Python 2, so can only be used when Pimlico is being run under Python 3

Inputs¶

Name	Type(s)
topics_top_words	`topics_top_words`
corpus	`grouped_corpus` <`TokenizedDocumentType`>
vocab	`dictionary`

Outputs¶

Name	Type(s)
output	`named_file`
mean_coherence	`numeric_result`

Options¶

Name	Description	Type
coherence	Coherence measure to use, selecting from one of Gensim’s pre-defined measures: ‘u_mass’, ‘c_v’, ‘c_uci’, ‘c_npmi’. Default: ‘u_mass’	‘u_mass’, ‘c_v’, ‘c_uci’ or ‘c_npmi’
window_size	Size of the window to be used for coherence measures using boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None, the default window sizes are used which are: ‘c_v’ - 110, ‘c_uci’ - 10, ‘c_npmi’ - 10.	int

Example config¶

This is an example of how this module can be used in a pipeline config file.

[my_topic_coherence_module]
type=pimlico.modules.gensim.coherence
input_topics_top_words=module_a.some_output
input_corpus=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_topic_coherence_module]
type=pimlico.modules.gensim.coherence
input_topics_top_words=module_a.some_output
input_corpus=module_a.some_output
input_vocab=module_a.some_output
coherence=u_mass
window_size=0

Test pipelines¶

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

lda_coherence