Random subsample

Path pimlico.modules.corpora.subsample
Executable yes

Randomly subsample documents of a corpus at a given rate to create a smaller corpus.

Inputs

Name Type(s)
corpus grouped_corpus

Outputs

Name Type(s)
corpus corpus with data-point from input

Options

Name Description Type
p (required) Probability of including any given document. The resulting corpus will be roughly this proportion of the size of the input. Alternatively, you can specify an integer, which will be interpreted as the target size of the output; a p value is then calculated from the size of the input corpus. float
seed Random seed. A random seed is always set before starting, to ensure some level of reproducibility. int
skip_invalid Skip over any invalid documents so that the output subset contains only valid documents. By default, invalid documents are passed through. bool
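
The interaction of the three options can be illustrated with a minimal Python sketch. It is not the module's actual implementation: the function names `resolve_rate` and `subsample`, the use of `None` to stand for an invalid document, the default seed of 0, and the choice to sample invalid documents at the same rate as valid ones when `skip_invalid` is off are all assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, Optional, Union


def resolve_rate(p: Union[int, float], corpus_size: int) -> float:
    """Turn the p option into an inclusion probability.

    Assumption: an integer p is a target output size, so the rate is
    target / corpus_size (capped at 1.0); a float p is used directly.
    """
    if isinstance(p, int) and not isinstance(p, bool):
        if corpus_size <= 0:
            return 0.0
        return min(1.0, p / corpus_size)
    return float(p)


def subsample(
    docs: Iterable[Optional[str]],
    p: Union[int, float],
    corpus_size: int,
    seed: int = 0,  # hypothetical default, chosen only for this sketch
    skip_invalid: bool = False,
) -> Iterator[Optional[str]]:
    """Yield each document independently with probability p."""
    rate = resolve_rate(p, corpus_size)
    # A dedicated generator seeded up front makes repeated runs identical
    rng = random.Random(seed)
    for doc in docs:
        # Hypothetical convention: None marks an invalid document
        if skip_invalid and doc is None:
            continue
        if rng.random() < rate:
            yield doc


if __name__ == "__main__":
    corpus = ["doc%d" % i for i in range(1000)]
    sample = list(subsample(corpus, 0.1, len(corpus), seed=1234))
    # The size is only roughly 10% of the input, since each document
    # is an independent draw
    print(len(sample))
```

Because every document is drawn independently, an integer target size is met only approximately in this sketch, in the same way that a float `p` gives a corpus of roughly, not exactly, that proportion.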

Example config

This is an example of how this module can be used in a pipeline config file.

[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1

This example usage includes more options.

[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1
seed=1234
skip_invalid=T

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.