Text corpus directory

Path	pimlico.modules.output.text_corpus
Executable	yes

Output module for producing a directory containing a text corpus, with documents stored in separate files.

The input must be a raw text grouped corpus. Corpora with other document types can be converted to raw text using the format module.

Inputs

Name	Type(s)
corpus	`grouped_corpus` <`RawTextDocumentType`>

Outputs

No outputs

Options

Name	Description	Type
archive_dirs	Create a subdirectory for each archive of the grouped corpus to store that archive’s documents in. Otherwise, all documents are stored in the same directory (or subdirectories where the document names include directory separators)	bool
invalid	What to do with invalid documents (where there’s been a problem reading/processing the document somewhere in the pipeline). ‘skip’ (default): don’t output the document at all. ‘empty’: output an empty file	‘skip’ or ‘empty’
path	(required) Directory to write the corpus to	string
suffix	Suffix to use for each document’s filename	string
tar	Add all files to a single tar archive, instead of just outputting to disk in the given directory. This is a good choice for very large corpora, for which storing to files on disk can cause filesystem problems. If given, the value is used as the basename for the tar archive. Default: do not output tar	string
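
The interaction between archive_dirs, suffix and invalid can be illustrated with a small Python sketch. This is not Pimlico's implementation: write_corpus is a hypothetical helper, documents are assumed to arrive as (archive name, document name, text) triples with None standing in for an invalid document, and the empty default suffix is an assumption.

```python
import os
import tempfile


def write_corpus(docs, path, archive_dirs=False, suffix="", invalid="skip"):
    """Write (archive_name, doc_name, text) triples under `path`.

    Hypothetical helper for illustration only. `text` is None for an
    invalid document. Returns the written paths, relative to `path`.
    """
    written = []
    for archive_name, doc_name, text in docs:
        if text is None:
            if invalid == "skip":
                continue  # 'skip': do not output the document at all
            text = ""  # 'empty': output an empty file
        # archive_dirs: one subdirectory per archive, otherwise one directory
        parts = [path, archive_name] if archive_dirs else [path]
        filename = os.path.join(*parts, doc_name + suffix)
        # Document names may include directory separators, so create parents
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(os.path.relpath(filename, path))
    return written


if __name__ == "__main__":
    example_docs = [
        ("archive0", "doc1", "hello"),
        ("archive0", "sub/doc2", "world"),
        ("archive1", "doc3", None),  # invalid document
    ]
    out_dir = tempfile.mkdtemp()
    print(write_corpus(example_docs, out_dir, archive_dirs=True, suffix=".txt"))
```

With invalid left at 'skip', the third document produces no file; with 'empty' it would produce an empty archive1/doc3.txt.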

Example config

This is an example of how this module can be used in a pipeline config file.

[my_text_corpus_module]
type=pimlico.modules.output.text_corpus
input_corpus=module_a.some_output
path=value

This example usage includes more options.

[my_text_corpus_module]
type=pimlico.modules.output.text_corpus
input_corpus=module_a.some_output
archive_dirs=T
invalid=skip
path=value
suffix=value
tar=value
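
When the tar option is used, documents end up as members of a single archive rather than as individual files on disk. The following Python sketch shows the general idea using only the standard library's tarfile module. It is an illustration under assumptions, not the module's own code: add_text_to_tar and the file name corpus.tar are hypothetical, and the archive name and location that Pimlico actually derives from the tar value are not shown here.

```python
import io
import os
import tarfile
import tempfile


def add_text_to_tar(tar, name, text):
    """Add one text document to an open tar archive without a temp file.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name=name)
    info.size = len(data)  # size must be set before adding the member
    tar.addfile(info, io.BytesIO(data))


if __name__ == "__main__":
    # Assumed archive name, chosen only for this example
    tar_path = os.path.join(tempfile.mkdtemp(), "corpus.tar")
    with tarfile.open(tar_path, "w") as tar:
        add_text_to_tar(tar, "archive0/doc1.txt", "hello")
        add_text_to_tar(tar, "archive0/doc2.txt", "world")
    # Read the member names back to check what was stored
    with tarfile.open(tar_path, "r") as tar:
        print(tar.getnames())
```

Writing members directly from memory avoids creating one filesystem entry per document, which is the situation the tar option is meant to help with for very large corpora.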