Text corpus directory

Path pimlico.modules.output.text_corpus
Executable yes

Output module for producing a directory containing a text corpus, with documents stored in separate files.

The input must be a raw text grouped corpus. Corpora with other document types can be converted to raw text using the format module.

Inputs

Name | Type(s)
corpus | grouped_corpus <RawTextDocumentType>

Outputs

No outputs

Options

Name | Description | Type
archive_dirs | Create a subdirectory for each archive of the grouped corpus to store that archive’s documents in. Otherwise, all documents are stored in the same directory (or in subdirectories where the document names include directory separators). | bool
invalid | What to do with invalid documents (where there’s been a problem reading/processing the document somewhere in the pipeline). ‘skip’ (default): don’t output the document at all. ‘empty’: output an empty file. | ‘skip’ or ‘empty’
path | (required) Directory to write the corpus to. | string
suffix | Suffix to use for each document’s filename. | string
tar | Add all files to a single tar archive, instead of just outputting to disk in the given directory. This is a good choice for very large corpora, for which storing to files on disk can cause filesystem problems. If given, the value is used as the basename for the tar archive. Default: do not output tar. | string
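
How archive_dirs, suffix and invalid might interact is easier to see in code. The following is only a rough sketch of the behaviour described in the option descriptions, not Pimlico's actual implementation: the helpers output_path and write_document are hypothetical, document names are assumed to use "/" as the directory separator, and an invalid document is represented here by None.

```python
import os


def output_path(base_dir, archive_name, doc_name, archive_dirs=False, suffix=""):
    """Hypothetical helper: compute where a document would be written."""
    parts = [base_dir]
    if archive_dirs:
        # One subdirectory per archive of the grouped corpus
        parts.append(archive_name)
    # Separators in the document name become nested subdirectories
    parts.extend(doc_name.split("/"))
    return os.path.join(*parts) + suffix


def write_document(path, text, invalid="skip"):
    """Hypothetical helper: write one document, returning True if a file was written.

    A text of None stands in for an invalid document (an assumption of this sketch).
    """
    if text is None:
        if invalid == "skip":
            return False  # 'skip': do not output the document at all
        text = ""  # 'empty': output an empty file
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True
```

Under these assumptions, a document named news/doc1 in an archive called archive_0, written with archive_dirs enabled and suffix=.txt, would land in archive_0/news/doc1.txt below the directory given by path.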

Example config

This is an example of how this module can be used in a pipeline config file.

[my_text_corpus_module]
type=pimlico.modules.output.text_corpus
input_corpus=module_a.some_output
path=value

This example usage includes more options.

[my_text_corpus_module]
type=pimlico.modules.output.text_corpus
input_corpus=module_a.some_output
archive_dirs=T
invalid=skip
path=value
suffix=value
tar=value
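
When the tar option is set, all documents go into a single archive rather than into individual files on disk. The sketch below shows one way this could work, using Python's standard tarfile module. It is an illustration under assumptions, not the module's real code: write_corpus_tar is a hypothetical function, and how the tar value is turned into the archive's file name is not specified here, so the caller passes the full archive path.

```python
import io
import tarfile


def write_corpus_tar(tar_path, documents):
    """Hypothetical helper: store (member_name, text) pairs in one tar archive."""
    with tarfile.open(tar_path, "w") as tar:
        for name, text in documents:
            data = text.encode("utf-8")
            # Build the member from memory, so no per-document file touches the disk
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Writing members straight from memory avoids creating many small files, which is the filesystem problem the tar option description mentions for very large corpora.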