Corpus statistics
| Path | pimlico.modules.corpora.corpus_stats |
| Executable | yes |
Some basic statistics about tokenized corpora.
Counts the number of tokens, the number of sentences and the number of distinct tokens (word types) in a corpus.
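To make the three counts concrete, here is a minimal sketch in plain Python. It is not Pimlico's implementation and does not use Pimlico's API: the function name `corpus_stats`, the dictionary keys and the nested-list corpus representation (documents as lists of sentences, sentences as lists of token strings) are all assumptions made for illustration.

```python
from collections import Counter


def corpus_stats(documents):
    """Count tokens, sentences and distinct tokens.

    `documents` is assumed to be an iterable of documents, where each
    document is a list of sentences and each sentence is a list of
    token strings. This layout is hypothetical, chosen for the sketch.
    """
    token_counts = Counter()
    sentence_count = 0
    for doc in documents:
        for sentence in doc:
            sentence_count += 1
            # Counter.update adds one count per token in the sentence
            token_counts.update(sentence)
    return {
        "tokens": sum(token_counts.values()),
        "sentences": sentence_count,
        "distinct_tokens": len(token_counts),
    }


# A toy corpus: two documents, three sentences in total
toy_corpus = [
    [["the", "cat", "sat"], ["the", "dog", "barked"]],
    [["a", "cat", "slept"]],
]
stats = corpus_stats(toy_corpus)
print(stats)  # {'tokens': 9, 'sentences': 3, 'distinct_tokens': 7}
```

In the toy corpus, "the" and "cat" each occur twice, so nine tokens yield only seven distinct tokens.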
Inputs
| Name | Type(s) |
|---|---|
| corpus | grouped_corpus <TokenizedDocumentType> |
Outputs
| Name | Type(s) |
|---|---|
| stats | named_file |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_corpus_stats_module]
type=pimlico.modules.corpora.corpus_stats
input_corpus=module_a.some_output
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.