Interleaved corpora

Path        pimlico.modules.corpora.interleave
Executable  no

Interleave data points from two (or more) corpora to produce a bigger corpus.

Similar to concat, but interleaves the documents when iterating. Preserves the order of documents within each corpus and takes documents from each corpus at intervals inversely proportional to its length, i.e. spreads out a smaller corpus so that we don’t finish iterating over it earlier than the longer one.
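
The spreading behaviour can be illustrated with a small sketch. This is not Pimlico’s implementation: it assumes each corpus is an in-memory sequence of known length, and the interleave function and its midpoint-position scheme are illustrative choices. Document i of a corpus of length n is given the position (2i + 1) / 2n between 0 and 1, and documents are emitted in order of position, so a shorter corpus is spaced more widely.

```python
import heapq
from fractions import Fraction
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def interleave(corpora: Sequence[Sequence[T]]) -> Iterator[T]:
    """Yield every document of every corpus, spreading each corpus evenly.

    Illustrative only: an assumption about how proportional interleaving
    could work, not the module's actual code.
    """
    # Heap entries: (position in (0, 1), corpus index, document index).
    # Exact fractions avoid floating-point surprises when positions tie;
    # ties are broken by corpus index.
    heap: List[Tuple[Fraction, int, int]] = []
    for c, corpus in enumerate(corpora):
        if len(corpus) > 0:
            heap.append((Fraction(1, 2 * len(corpus)), c, 0))
    heapq.heapify(heap)

    while heap:
        _, c, i = heapq.heappop(heap)
        yield corpora[c][i]
        nxt = i + 1
        if nxt < len(corpora[c]):
            # Within a corpus, positions increase with the index,
            # so the original document order is preserved.
            position = Fraction(2 * nxt + 1, 2 * len(corpora[c]))
            heapq.heappush(heap, (position, c, nxt))


long_corpus = ["a1", "a2", "a3", "a4", "a5", "a6"]
short_corpus = ["b1", "b2"]
print(list(interleave([long_corpus, short_corpus])))
# ['a1', 'a2', 'b1', 'a3', 'a4', 'a5', 'b2', 'a6']
```

With six documents in one corpus and two in the other, the two documents of the shorter corpus land about a quarter and three quarters of the way through the output, rather than both appearing at the start.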

The input corpora must have the same data point type, or one must be a subtype of the other.

In theory, we could find the most specific common ancestor and use that as the output type, but this is not currently implemented and may not be worth the trouble. Perhaps we will add this in future.
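
As a rough sketch of both the current rule and the unimplemented idea, the following uses Python’s class hierarchy. The class names (BaseDoc, TextDoc, TokenizedDoc, ParsedDoc) and both functions are hypothetical stand-ins, not Pimlico’s own types or API, and walking the method resolution order is only one possible way to pick a common ancestor.

```python
# Hypothetical stand-ins for data point types; not Pimlico's class names.
class BaseDoc:
    pass


class TextDoc(BaseDoc):
    pass


class TokenizedDoc(TextDoc):
    pass


class ParsedDoc(TextDoc):
    pass


def compatible(a: type, b: type) -> bool:
    """The stated rule: same type, or one is a subtype of the other."""
    return issubclass(a, b) or issubclass(b, a)


def most_specific_common_ancestor(first: type, *others: type) -> type:
    """Return the first class in first's MRO that all others inherit from.

    The MRO lists the most specific classes first, so the first match is
    the most specific ancestor shared by every argument.
    """
    for candidate in first.__mro__:
        if all(issubclass(other, candidate) for other in others):
            return candidate
    return object  # safety net: every class inherits from object


print(compatible(TextDoc, TokenizedDoc))    # True
print(compatible(TokenizedDoc, ParsedDoc))  # False
print(most_specific_common_ancestor(TokenizedDoc, ParsedDoc).__name__)  # TextDoc
```

Under the current rule the sibling types in this sketch would be rejected, whereas a common-ancestor approach would accept them and treat the output as the shared parent type.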

This is a filter module. It is not executable, so it won’t appear in a pipeline’s list of modules that can be run. Instead, it produces its output on the fly whenever the next module needs it.
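
On-the-fly production of this kind resembles a Python generator: nothing is computed until the consumer asks for the next item. The sketch below is only an analogy under that assumption. The filter_stage function and the log list are hypothetical and do not reflect how Pimlico implements filter modules.

```python
from typing import Iterable, Iterator, List


def filter_stage(source: Iterable[str], log: List[str]) -> Iterator[str]:
    """Pass documents through lazily, recording when each one is produced."""
    for doc in source:
        log.append(f"produced {doc}")
        yield doc


log: List[str] = []
stage = filter_stage(["doc1", "doc2", "doc3"], log)

# Creating the generator does no work; the log is still empty here.
assert log == []

# The downstream consumer pulls one document, so exactly one is produced.
first = next(stage)
print(first, log)
```

Because nothing is stored between stages, the work happens each time the downstream module iterates over the output.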

Inputs

Name     Type(s)
corpora  list of grouped_corpus

Outputs

Name    Type(s)
corpus  grouped corpus with input doc type

Options

Name              Description                                                                                                                                   Type
archive_basename  Documents are regrouped into new archives. Base name to use for archive tar files. The archive number is appended to this. (default: ‘archive’)  string
archive_size      Documents are regrouped into new archives. Number of documents to include in each archive. (default: 1k)                                      string
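
The regrouping controlled by these two options can be sketched as follows. The regroup function is hypothetical. How the archive number is formatted and whether numbering starts at zero are assumptions made for illustration, since the description only says that the number is appended to the base name.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def regroup(
    docs: Iterable[T],
    archive_basename: str = "archive",
    archive_size: int = 1000,
) -> Iterator[Tuple[str, T]]:
    """Assign each document to a new archive holding archive_size documents.

    Assumption: archives are numbered from 0 and the number is appended
    directly to the base name.
    """
    for index, doc in enumerate(docs):
        archive_number = index // archive_size
        yield f"{archive_basename}{archive_number}", doc


for archive_name, doc in regroup(["d1", "d2", "d3", "d4", "d5"], archive_size=2):
    print(archive_name, doc)
```

The point of the sketch is that the archive boundaries of the input corpora are discarded: after interleaving, documents are simply counted off into fresh archives of the requested size.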

Example config

This is an example of how this module can be used in a pipeline config file.

[my_interleave_module]
type=pimlico.modules.corpora.interleave
input_corpora=module_a.some_output

This example usage includes more options.

[my_interleave_module]
type=pimlico.modules.corpora.interleave
input_corpora=module_a.some_output
archive_basename=archive
archive_size=1000

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.