Corpus subset

Path pimlico.modules.corpora.subset
Executable no

Simple filter to truncate a dataset after a given number of documents, potentially offsetting by a number of documents. Mainly useful for creating small subsets of a corpus for testing a pipeline before running on the full corpus.

Can be run on an iterable corpus or a tarred corpus. If the input is a tarred corpus, the filter will emulate a tarred corpus with the appropriate datatype, passing through the archive names from the input.

When a number of valid documents is required (calculating corpus length when skipping invalid docs), if one is stored in the metadata as valid_documents, that count is used instead of iterating over the data to count them up.

This is a filter module. It is not executable, so won’t appear in a pipeline’s list of modules that can be run. It produces its output for the next module on the fly when the next module needs it.

Inputs

Name Type(s)
corpus iterable_corpus

Outputs

Name Type(s)
corpus corpus with data-point from input

Options

Name Description Type
offset Number of documents to skip at the beginning of the corpus (default: 0, start at beginning) int
size (required) Number of documents to include int
skip_invalid Skip over any invalid documents so that the output subset contains the chosen number of (valid) documents (or as many as possible) and no invalid ones. By default, invalid documents are passed through and counted towards the subset size bool

Example config

This is an example of how this module can be used in a pipeline config file.

[my_subset_module]
type=pimlico.modules.corpora.subset
input_corpus=module_a.some_output
size=100

This example usage includes more options.

[my_subset_module]
type=pimlico.modules.corpora.subset
input_corpus=module_a.some_output
offset=0
size=100
skip_invalid=T

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.