Random subsample

Path	pimlico.modules.corpora.subsample
Executable	yes

Randomly subsample documents of a corpus at a given rate to create a smaller corpus.

Inputs

Name	Type(s)
corpus	`grouped_corpus`

Outputs

Name	Type(s)
corpus	`corpus with data-point from input`

Options

Name	Description	Type
p	(required) Probability of including any given document. The resulting corpus will be roughly this proportion of the size of the input. Alternatively, specify an integer, which is interpreted as the target size of the output; the inclusion probability is then calculated from the size of the input corpus	float
seed	Random seed. We always set a random seed before starting to ensure some level of reproducibility	int
skip_invalid	Skip over any invalid documents so that the output subset contains only valid documents. By default, invalid documents are passed through	bool
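
The behaviour these options describe can be sketched in plain Python. This is a minimal illustration, not Pimlico's actual implementation: the `subsample` function, the list-based corpus, the `is_valid` check and the default seed value are all assumptions made for the example.

```python
import random


def subsample(docs, p, seed=1234, skip_invalid=False,
              is_valid=lambda doc: doc is not None):
    """Keep each document independently with probability p.

    Illustrative only: `docs` is a plain list standing in for a corpus and
    `is_valid` is a hypothetical validity check, not part of Pimlico's API.
    """
    docs = list(docs)
    # An integer p is read as a target output size and converted to a rate
    # using the size of the input.
    if isinstance(p, int) and not isinstance(p, bool):
        p = p / len(docs) if docs else 0.0
    # Seeding a dedicated generator makes the selection repeatable.
    rng = random.Random(seed)
    kept = []
    for doc in docs:
        if skip_invalid and not is_valid(doc):
            continue  # drop invalid documents instead of passing them through
        if rng.random() < p:
            kept.append(doc)
    return kept


docs = list(range(1000))
sample = subsample(docs, 0.1, seed=1234)
# The same seed and rate select the same documents again.
assert sample == subsample(docs, 0.1, seed=1234)
```

Because each document is included independently, the output size is only approximately `p` times the input size (or approximately the integer target), not exactly that number.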

This is an example of how this module can be used in a pipeline config file.

[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1

This example usage includes more options.

[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1
seed=1234
skip_invalid=T

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

subsample