Archive grouper (filter)
Path | pimlico.modules.corpora.group |
Executable | no |
Group the data points (documents) of an iterable corpus into fixed-size archives. This is a standard thing to do at the start of the pipeline, since it’s a handy way to store many (potentially small) files without running into filesystem problems.
The documents are grouped linearly, in the order they arrive, into a series of groups (archives). Each archive contains the given number of documents, except the last, which may contain fewer.
After grouping documents in this way, document map modules can be called on the corpus and the grouping will be preserved as the corpus passes through the pipeline.
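The linear grouping described above can be sketched in a few lines of Python. This is only an illustration, not Pimlico's implementation: the function `group_into_archives` is hypothetical, and the zero-padded `<basename>-<number>` naming is an assumption about how the archive number is appended.

```python
from itertools import islice


def group_into_archives(documents, archive_size=1000, archive_basename="archive"):
    """Yield (archive_name, docs) pairs, filling each archive in input order.

    Hypothetical helper: the "<basename>-<number>" naming is an assumption
    made for this sketch, not a documented format.
    """
    if archive_size < 1:
        raise ValueError("archive_size must be at least 1")
    iterator = iter(documents)
    archive_number = 0
    while True:
        # Take the next archive_size documents; the final batch may be shorter
        batch = list(islice(iterator, archive_size))
        if not batch:
            break
        yield "{}-{:04d}".format(archive_basename, archive_number), batch
        archive_number += 1


# Seven toy documents grouped three at a time
docs = [("doc{}".format(i), "text {}".format(i)) for i in range(7)]
archives = list(group_into_archives(docs, archive_size=3))
archive_sizes = [len(batch) for _, batch in archives]
```

With seven documents and an archive size of three, only the final archive is short, which matches the "apart from the last" behaviour described above.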
Note
This module used to be called tar_filter, but has been renamed in keeping with other changes in the new datatype system.
There also used to be a tar module that wrote the grouped corpus to disk. This has now been removed, since most of the time it’s fine to use this filter module instead. If you really want to store the grouped corpus, you can use the store module.
This is a filter module. It is not executable, so it won’t appear in a pipeline’s list of modules that can be run. Instead, it produces its output on the fly, whenever the next module asks for it.
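To illustrate what producing output on the fly means, here is a minimal sketch using Python generators. The names `counting_source` and `lazy_groups` are hypothetical and stand in for an input corpus and the filter; Pimlico's actual classes are not shown here.

```python
def counting_source(n, consumed):
    """Yield n document names, recording each one as it is actually read."""
    for i in range(n):
        consumed.append(i)
        yield "doc{}".format(i)


def lazy_groups(documents, archive_size):
    """Yield lists of up to archive_size documents, reading input on demand."""
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) == archive_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly shorter, group
        yield batch


consumed = []
groups = lazy_groups(counting_source(10, consumed), archive_size=4)
read_before = len(consumed)        # nothing is read until a group is requested
first = next(groups)
read_after_first = len(consumed)   # only the first group's documents were read
```

Nothing is read from the source until a consumer requests the first group, and nothing is written to disk at any point, which is the practical difference between a filter and an executable module.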
Inputs
Name | Type(s) |
---|---|
documents | iterable_corpus |
Outputs
Name | Type(s) |
---|---|
documents | grouped corpus with input doc type |
Options
Name | Description | Type |
---|---|---|
archive_basename | Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’) | string |
archive_size | Number of documents to include in each archive. (Default: 1000) | int |
Example config
This is an example of how this module can be used in a pipeline config file.
[my_group_module]
type=pimlico.modules.corpora.group
input_documents=module_a.some_output
This example usage includes more options.
[my_group_module]
type=pimlico.modules.corpora.group
input_documents=module_a.some_output
archive_basename=archive
archive_size=1000
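Because both options have defaults, a section that omits them behaves as if the defaults had been written out. The sketch below shows that fallback logic using Python's standard `configparser`. It is not Pimlico's own config loader, and the value `archive_size=500` is an arbitrary choice for the illustration.

```python
import configparser

# A module section that overrides archive_size but omits archive_basename
CONFIG_TEXT = """
[my_group_module]
type=pimlico.modules.corpora.group
input_documents=module_a.some_output
archive_size=500
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section = parser["my_group_module"]

# Fall back to the documented defaults when an option is not given
archive_basename = section.get("archive_basename", fallback="archive")
archive_size = section.getint("archive_size", fallback=1000)
```

Here `archive_basename` falls back to its default, while `archive_size` takes the value given in the section.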
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.