Topic model topic coherence

Path pimlico.modules.gensim.coherence
Executable yes

Compute topic coherence.

Takes input as a list of the top words for each topic. This can be produced from various types of topic model, so they can all be evaluated using this method.
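
As an illustration of the kind of input this expects, the sketch below derives top-word lists from a topic-word weight matrix. It is a toy under stated assumptions: `top_words_per_topic`, the vocabulary and the weights are all hypothetical, and it does not show how Pimlico stores the topics_top_words datatype.

```python
def top_words_per_topic(topic_word_weights, vocab, num_words=3):
    """Return the `num_words` highest-weighted words for each topic, best first."""
    tops = []
    for weights in topic_word_weights:
        # Rank vocabulary indices by this topic's weight, highest first
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        tops.append([vocab[i] for i in ranked[:num_words]])
    return tops


# Hypothetical vocabulary and per-topic word weights (one row per topic)
vocab = ["cat", "dog", "pet", "market", "trade", "stock"]
weights = [
    [0.30, 0.25, 0.35, 0.04, 0.03, 0.03],
    [0.02, 0.03, 0.05, 0.40, 0.30, 0.20],
]

topics = top_words_per_topic(weights, vocab)
for topic in topics:
    print(" ".join(topic))
```

Any model that can rank words per topic can be reduced to this form, which is why one coherence module can evaluate several model types.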

Also requires a corpus from which to compute the PMI statistics. This should typically be a different corpus to that on which the model was trained.

For now, the module computes the coherence statistics, writes them to a text file, and also outputs a single number: the mean coherence across topics.
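
To make the per-topic score and the mean concrete, here is a minimal pure-Python sketch of a document co-occurrence coherence in the style of ‘u_mass’. It is not the module's or Gensim's implementation: the function name, the toy corpus and the +1 smoothing (taken from the common textbook formulation) are assumptions, so the numbers will not match Gensim's.

```python
import math


def umass_style_coherence(top_words, doc_sets):
    """Average log((D(w_m, w_l) + 1) / D(w_l)) over pairs where w_l ranks above w_m.

    D(.) counts the documents containing the given word(s).
    """
    scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            d_l = sum(1 for d in doc_sets if w_l in d)
            d_ml = sum(1 for d in doc_sets if w_l in d and w_m in d)
            if d_l == 0:
                continue  # word absent from the reference corpus: skip the pair
            scores.append(math.log((d_ml + 1) / d_l))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical tokenized reference corpus, ideally not the training corpus
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
    ["cat", "pet", "food"],
]
doc_sets = [set(d) for d in docs]

# Hypothetical top words: two coherent topics and one mixed topic
topics = [
    ["pet", "dog", "cat"],
    ["market", "trade", "stock"],
    ["cat", "market", "price"],
]

per_topic = [umass_style_coherence(t, doc_sets) for t in topics]
mean_coherence = sum(per_topic) / len(per_topic)

for words, score in zip(topics, per_topic):
    print(f"{score:8.3f}  {' '.join(words)}")
print(f"mean coherence: {mean_coherence:.3f}")
```

The mixed topic scores lowest because its words rarely share a document, and the final figure plays the role of the single mean value the module reports.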

This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.

Inputs

Name Type(s)
topics_top_words topics_top_words
corpus grouped_corpus <TokenizedDocumentType>
vocab dictionary

Outputs

Name Type(s)
output named_file
mean_coherence numeric_result

Options

Name | Description | Type
coherence | Coherence measure to use, selecting from one of Gensim’s pre-defined measures: ‘u_mass’, ‘c_v’, ‘c_uci’, ‘c_npmi’. Default: ‘u_mass’ | ‘u_mass’, ‘c_v’, ‘c_uci’ or ‘c_npmi’
window_size | Size of the window to be used for coherence measures using boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None, the default window sizes are used, which are: ‘c_v’ - 110, ‘c_uci’ - 10, ‘c_npmi’ - 10. | int
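
To show why window_size matters for the sliding-window measures, the following sketch estimates word probabilities from boolean sliding windows and feeds them into a PMI score. All names here are hypothetical, and the handling of documents shorter than the window is an assumption, so treat it as an illustration of the idea and not as Gensim's code.

```python
import math


def sliding_windows(tokens, window_size):
    """Yield each window of `window_size` tokens as a set.

    Assumption: a document no longer than the window counts as one window.
    """
    if len(tokens) <= window_size:
        yield set(tokens)
        return
    for start in range(len(tokens) - window_size + 1):
        yield set(tokens[start:start + window_size])


def window_probabilities(docs, word_a, word_b, window_size):
    """Estimate P(a), P(b), P(a, b) as the fraction of windows containing them."""
    total = n_a = n_b = n_ab = 0
    for doc in docs:
        for window in sliding_windows(doc, window_size):
            total += 1
            has_a, has_b = word_a in window, word_b in window
            n_a += has_a
            n_b += has_b
            n_ab += has_a and has_b
    return n_a / total, n_b / total, n_ab / total


def pmi(p_a, p_b, p_ab, eps=1e-12):
    """Pointwise mutual information; eps keeps the log finite when p_ab is 0.

    Assumes both words occur at least once, so p_a and p_b are non-zero.
    """
    return math.log((p_ab + eps) / (p_a * p_b))


# Hypothetical one-document corpus
docs = [["the", "cat", "sat", "on", "the", "mat", "with", "the", "dog"]]

narrow = window_probabilities(docs, "cat", "dog", window_size=3)
wide = window_probabilities(docs, "cat", "dog", window_size=9)

print("window_size=3:", narrow, "PMI:", pmi(*narrow))
print("window_size=9:", wide, "PMI:", pmi(*wide))
```

With the narrow window the two words never share a window, so the pair is heavily penalised; with the wide window they always co-occur. This is why the measures come with different default window sizes.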

Example config

This is an example of how this module can be used in a pipeline config file.

[my_topic_coherence_module]
type=pimlico.modules.gensim.coherence
input_topics_top_words=module_a.some_output
input_corpus=module_a.some_output
input_vocab=module_a.some_output

This example usage includes more options.

[my_topic_coherence_module]
type=pimlico.modules.gensim.coherence
input_topics_top_words=module_a.some_output
input_corpus=module_a.some_output
input_vocab=module_a.some_output
coherence=u_mass
window_size=0

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.