Corpus statistics

Path pimlico.modules.corpora.corpus_stats
Executable yes

Some basic statistics about tokenized corpora

Counts the number of tokens, sentences and distinct tokens in a corpus.
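
As a rough illustration of the three counts, here is a minimal standalone sketch. It is not Pimlico's implementation and does not use the Pimlico API: the function name `corpus_stats`, the nested-list representation of a tokenized corpus (documents containing sentences containing token strings), and the keys of the returned dictionary are all assumptions made for this example.

```python
from collections import Counter


def corpus_stats(documents):
    """Count tokens, sentences and distinct tokens.

    `documents` is assumed to be an iterable of documents, where each
    document is a list of sentences and each sentence is a list of
    token strings. This layout is hypothetical, chosen for illustration.
    """
    token_counts = Counter()
    num_sentences = 0
    for doc in documents:
        for sentence in doc:
            num_sentences += 1
            # Tally every token occurrence in this sentence
            token_counts.update(sentence)
    return {
        "tokens": sum(token_counts.values()),   # total token occurrences
        "sentences": num_sentences,             # total sentences
        "distinct_tokens": len(token_counts),   # vocabulary size
    }


# Toy corpus: two documents, three sentences in total
docs = [
    [["the", "cat", "sat"], ["the", "dog", "ran"]],
    [["a", "cat", "ran"]],
]
stats = corpus_stats(docs)
print(stats)
```

Note that this sketch treats tokens as case-sensitive strings, so "The" and "the" would count as two distinct tokens. Whether the module normalizes case is not stated here.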

Inputs

Name Type(s)
corpus grouped_corpus <TokenizedDocumentType>

Outputs

Name Type(s)
stats named_file

Example config

This is an example of how this module can be used in a pipeline config file.

[my_corpus_stats_module]
type=pimlico.modules.corpora.corpus_stats
input_corpus=module_a.some_output

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.