Corpus manipulation

Core Pimlico modules for generic manipulation of corpora, mainly iterable corpora.

  • Corpus concatenation
  • Corpus statistics
  • Human-readable formatting
  • Archive grouper (filter)
  • Interleaved corpora
  • Corpus document list filter
  • Random shuffle
  • Random shuffle (linear)
  • Corpus split
  • Store a corpus
  • Random subsample
  • Corpus subset
  • Corpus vocab builder
  • Token frequency counter
  • Tokenized corpus to ID mapper
  • ID to tokenized corpus mapper

© Copyright 2016, Mark Granroth-Wilding Revision 2a6e05f5.
