Archive grouper (filter)

Path pimlico.modules.corpora.group
Executable no

Group the data points (documents) of an iterable corpus into fixed-size archives. This is typically done at the start of a pipeline, since it’s a handy way to store many (potentially small) files without running into filesystem problems.

The documents are grouped linearly, in the order they appear in the input corpus, into a series of groups (archives). Each archive contains archive_size documents, except the last, which may contain fewer.
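The linear grouping described above can be illustrated with a minimal Python sketch. This is not Pimlico's actual implementation: the function name group_linearly and the "basename-number" archive naming scheme are assumptions made for illustration only. The parameter names mirror the module's archive_size and archive_basename options.

```python
from itertools import islice


def group_linearly(doc_names, archive_size=1000, archive_basename="archive"):
    """Lazily yield (archive_name, [doc_name, ...]) pairs.

    Hypothetical helper, not part of Pimlico.
    """
    if archive_size < 1:
        raise ValueError("archive_size must be at least 1")
    doc_iter = iter(doc_names)
    archive_num = 0
    while True:
        # Take the next archive_size documents, or fewer if the corpus runs out
        chunk = list(islice(doc_iter, archive_size))
        if not chunk:
            break
        # Assumed naming scheme: the archive number is appended to the base name
        yield "{}-{}".format(archive_basename, archive_num), chunk
        archive_num += 1


# Seven documents in archives of three: the last archive is smaller
groups = list(group_linearly(["doc%d" % i for i in range(7)], archive_size=3))
for archive_name, docs in groups:
    print(archive_name, docs)
```

Because the function is a generator, archives are only built as a consumer asks for them, which loosely mirrors how a filter module produces its output on demand.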

After grouping documents in this way, document map modules can be called on the corpus and the grouping will be preserved as the corpus passes through the pipeline.
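As a rough illustration of what preserving the grouping means, the sketch below applies a function to each document while passing the archive name and document name through unchanged. The tuple layout (archive_name, doc_name, text) and the function map_documents are hypothetical and do not reflect Pimlico's real document map API.

```python
def map_documents(grouped_docs, func):
    """Apply func to each document's text, leaving its archive and name untouched.

    Hypothetical helper, not part of Pimlico.
    """
    for archive_name, doc_name, text in grouped_docs:
        yield archive_name, doc_name, func(text)


# A tiny grouped corpus: two archives, three documents (illustrative data)
corpus = [
    ("archive-0", "doc0", "Hello World"),
    ("archive-0", "doc1", "Foo Bar"),
    ("archive-1", "doc2", "Last One"),
]

lowered = list(map_documents(corpus, str.lower))
for archive_name, doc_name, text in lowered:
    print(archive_name, doc_name, text)
```

Each output document stays in the same archive as its input, so downstream modules see the same archive structure.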

Note

This module used to be called tar_filter, but has been renamed in keeping with other changes in the new datatype system.

There also used to be a tar module that wrote the grouped corpus to disk. This has now been removed, since most of the time it’s fine to use this filter module instead. If you really want to store the grouped corpus, you can use the store module.

This is a filter module. It is not executable, so it won’t appear in a pipeline’s list of modules that can be run. Instead, it produces its output on the fly, as the next module requests it.

Inputs

Name Type(s)
documents iterable_corpus

Outputs

Name Type(s)
documents grouped corpus with input doc type

Options

Name Description Type
archive_basename Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’) string
archive_size Number of documents to include in each archive. (Default: 1000) int

Example config

This is an example of how this module can be used in a pipeline config file.

[my_group_module]
type=pimlico.modules.corpora.group
input_documents=module_a.some_output

This example usage includes more options.

[my_group_module]
type=pimlico.modules.corpora.group
input_documents=module_a.some_output
archive_basename=archive
archive_size=1000

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.