Corpus statistics
Path | pimlico.modules.corpora.corpus_stats |
Executable | yes |
Computes some basic statistics about a tokenized corpus.
Counts the number of tokens, sentences and distinct tokens (word types) in the corpus.
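The three counts can be illustrated with a minimal sketch in plain Python. This is not the module's actual implementation: the function name `corpus_stats`, the in-memory input format (each document is a list of sentences, each sentence a list of token strings) and the keys of the returned dictionary are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Hypothetical types for illustration: a sentence is a list of token strings,
# a document is a list of sentences.
Sentence = List[str]
Document = List[Sentence]


def corpus_stats(documents: Iterable[Document]) -> Dict[str, int]:
    """Count tokens, sentences and distinct tokens over tokenized documents."""
    num_tokens = 0
    num_sentences = 0
    vocabulary = set()  # every distinct token string seen so far

    for document in documents:
        for sentence in document:
            num_sentences += 1
            num_tokens += len(sentence)
            vocabulary.update(sentence)

    return {
        "tokens": num_tokens,
        "sentences": num_sentences,
        "distinct_tokens": len(vocabulary),
    }


# Toy corpus: two documents, three sentences in total.
example_corpus = [
    [["the", "cat", "sat"], ["the", "dog"]],
    [["a", "cat"]],
]
example_stats = corpus_stats(example_corpus)
```

In this sketch distinct tokens are compared as exact strings, so `The` and `the` would count as two different types. Whether the real module normalizes case is not stated here.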
Inputs
Name | Type(s) |
---|---|
corpus | TarredCorpus<TokenizedDocumentType> |
Outputs
Name | Type(s) |
---|---|
stats | NamedFile() |