Interleaved corpora

Path	pimlico.modules.corpora.interleave
Executable	no

Interleave data points from two (or more) corpora to produce a bigger corpus.

Similar to concat, but interleaves the documents when iterating. Preserves the order of documents within each corpus and takes documents from each corpus at a rate proportional to its length, i.e. spreads out a smaller corpus so we don’t finish iterating over it earlier than the longer one.
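The interleaving behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not Pimlico's actual implementation: the function name `interleave` and the midpoint-based scheduling rule are assumptions made for the example. Each document is assigned a relative position within its own corpus, and the document with the smallest position is always emitted next, so a shorter corpus is spread evenly through a longer one.

```python
import heapq


def interleave(corpora):
    """Yield (corpus_index, doc) pairs from several sized corpora.

    Hypothetical sketch: document i of a corpus of length n is given the
    relative position (i + 0.5) / n, and documents are emitted in order of
    increasing position. Order within each corpus is preserved.
    """
    iterators = [iter(corpus) for corpus in corpora]
    lengths = [len(corpus) for corpus in corpora]
    # Heap entries: (relative position of next doc, corpus index, docs taken)
    heap = [(0.5 / n, idx, 0) for idx, n in enumerate(lengths) if n > 0]
    heapq.heapify(heap)
    while heap:
        _, idx, taken = heapq.heappop(heap)
        yield idx, next(iterators[idx])
        taken += 1
        if taken < lengths[idx]:
            # Schedule the next document from the same corpus
            heapq.heappush(heap, ((taken + 0.5) / lengths[idx], idx, taken))


long_corpus = ["a1", "a2", "a3", "a4", "a5", "a6"]
short_corpus = ["b1", "b2"]
order = [doc for _, doc in interleave([long_corpus, short_corpus])]
print(order)
# ['a1', 'a2', 'b1', 'a3', 'a4', 'a5', 'b2', 'a6']
```

The two documents of the short corpus land roughly a quarter and three quarters of the way through the output, rather than being used up in the first few steps.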

The input corpora must all have the same data point type, or one type must be a subtype of the other.

In theory, we could find the most specific common ancestor and use that as the output type, but this is not currently implemented and may not be worth the trouble. Perhaps we will add this in future.
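As a rough illustration of the type constraint, the check below accepts a set of document types only if one of them is a supertype of (or identical to) all the others. The class names `BaseDoc`, `TextDoc` and `OtherDoc` and the function `common_output_type` are hypothetical and are not part of Pimlico's API. Returning the more general type as the output type is likewise an assumption of this sketch.

```python
class BaseDoc(object):
    """Hypothetical general document type."""


class TextDoc(BaseDoc):
    """Hypothetical subtype of BaseDoc."""


class OtherDoc(object):
    """Hypothetical type unrelated to BaseDoc."""


def common_output_type(doc_types):
    """Return the input type that every other input type is a subtype of.

    Only types that appear among the inputs are considered; no search for
    a shared ancestor outside the inputs is attempted.
    """
    for candidate in doc_types:
        if all(issubclass(t, candidate) for t in doc_types):
            return candidate
    raise TypeError("input corpora have incompatible data point types")


print(common_output_type([TextDoc, BaseDoc]).__name__)
```

With these definitions, `[TextDoc, BaseDoc]` is accepted, while `[TextDoc, OtherDoc]` raises a `TypeError` even though both ultimately derive from `object`, mirroring the restriction that no common ancestor is looked up.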

This is a filter module. It is not executable, so it won’t appear in a pipeline’s list of modules that can be run. Instead, it produces its output on the fly whenever the next module requests it.

Inputs

Name	Type(s)
corpora	`list` of `grouped_corpus`

Outputs

Name	Type(s)
corpus	`grouped corpus with input doc type`

Options

Name	Description	Type
archive_basename	Documents are regrouped into new archives. Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’)	string
archive_size	Documents are regrouped into new archives. Number of documents to include in each archive (default: 1k)	string
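To make the effect of the two options concrete, here is a small sketch of regrouping a stream of documents into fixed-size archives. The function `regroup` is hypothetical, and the exact way the archive number is appended (a hyphen followed by a zero-padded four-digit number) is an assumption of the example, not something stated above.

```python
def regroup(docs, archive_basename="archive", archive_size=1000):
    """Yield (archive_name, doc) pairs, starting a new archive every
    archive_size documents.

    The naming format "<basename>-<4-digit number>" is assumed for
    illustration only.
    """
    for i, doc in enumerate(docs):
        archive_num = i // archive_size
        yield "{}-{:04d}".format(archive_basename, archive_num), doc


for archive_name, doc in regroup(["d1", "d2", "d3", "d4", "d5"], archive_size=2):
    print(archive_name, doc)
```

With `archive_size=2`, the five documents fall into three archives: two full ones and a final archive holding the single remaining document.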

Example config

This is an example of how this module can be used in a pipeline config file.

[my_interleave_module]
type=pimlico.modules.corpora.interleave
input_corpora=module_a.some_output

This example usage includes more options.

[my_interleave_module]
type=pimlico.modules.corpora.interleave
input_corpora=module_a.some_output
archive_basename=archive
archive_size=1000

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

interleave