Tar archive grouper

Path pimlico.modules.corpora.tar
Executable yes

Group the files of a multi-file iterable corpus into tar archives. This is typically done at the start of a pipeline, since it is a convenient way to store many (potentially small) files without running into the filesystem problems that large numbers of individual files can cause.

The files are grouped linearly into a series of tar archives, so that every archive, apart from possibly the last, contains exactly the number of documents given by archive_size.
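The linear grouping described above can be sketched with Python's standard tarfile module. This is a minimal illustration, not Pimlico's implementation: the function name group_into_archives, the (name, bytes) document representation and the "basename-N.tar" file naming are all assumptions made for the example. Only the archive_size and archive_basename parameters and their defaults come from the module's options.

```python
import io
import tarfile
from itertools import islice
from pathlib import Path


def group_into_archives(docs, out_dir, archive_size=1000, archive_basename="archive"):
    """Write (name, bytes) pairs into numbered tar files, archive_size per file.

    Hypothetical helper: the name, signature and "<basename>-<N>.tar" naming
    scheme are assumptions for illustration, not Pimlico's actual API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    docs = iter(docs)
    written = []
    archive_num = 0
    while True:
        # Take the next archive_size documents, in input order
        batch = list(islice(docs, archive_size))
        if not batch:
            break
        path = out_dir / f"{archive_basename}-{archive_num}.tar"
        with tarfile.open(path, "w") as tar:
            for name, data in batch:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        written.append(path)
        archive_num += 1
    return written
```

With seven documents and archive_size=3, this sketch produces three archives holding 3, 3 and 1 documents respectively; only the last archive is short.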

After grouping documents in this way, document map modules can be called on the corpus and the grouping will be preserved as the corpus passes through the pipeline.
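To illustrate what preserving the grouping means, the sketch below applies a function to every document in one tar archive and writes the results to a new archive with the same member names in the same order. It is only an assumption about the general idea: map_archive is a hypothetical helper, and Pimlico's document map modules are not implemented this way as far as this page states.

```python
import io
import tarfile


def map_archive(src_path, dst_path, func):
    """Apply func (bytes -> bytes) to each document in src_path.

    Hypothetical helper for illustration only. Member names and their
    order are copied unchanged, so the archive grouping is preserved.
    """
    with tarfile.open(src_path, "r") as src, tarfile.open(dst_path, "w") as dst:
        for member in src:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            data = func(src.extractfile(member).read())
            info = tarfile.TarInfo(name=member.name)
            info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
```

Running this once per input archive, with each output archive named after its input, yields a corpus with the same archive boundaries as the original.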

Note

This module has a fundamental problem. It stores the raw data it receives as input and reports its output type as identical to its input type, but it does not write that type correctly. Often this causes no trouble, but it means the module does not write corpus metadata that the datatype may need in order to read the documents correctly.

The new datatypes system will provide a solution to this problem. Until then, the safest approach is to avoid this module and always use tar_filter instead, which does not have this problem.

Inputs

Name       Type(s)
documents  IterableCorpus

Outputs

Name       Type(s)
documents  tarred corpus with input doc type

Options

Name              Description                                                                                         Type
archive_size      Number of documents to include in each archive (default: 1000)                                      string
archive_basename  Base name to use for archive tar files. The archive number is appended to this (default: ‘archive’)  string