Random subsample
Path | pimlico.modules.corpora.subsample |
Executable | yes |
Randomly subsample documents of a corpus at a given rate to create a smaller corpus.
Inputs
Name | Type(s) |
---|---|
corpus | grouped_corpus |
Outputs
Name | Type(s) |
---|---|
corpus | corpus with data-point from input |
Options
Name | Description | Type |
---|---|---|
p | (required) Probability of including any given document. The resulting corpus will be roughly this proportion of the size of the input. Alternatively, you can specify an integer, which is interpreted as the target size of the output; a p value is then calculated from the size of the input corpus | float |
seed | Random seed. A random seed is always set before sampling starts, to ensure some level of reproducibility | int |
skip_invalid | Skip over any invalid documents, so that the output subset contains only valid documents. By default, invalid documents are passed through | bool |
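To make the options concrete, here is a minimal Python sketch of how they could interact. It is an illustration only, not the module's actual implementation: the function names `inclusion_probability` and `subsample`, the `is_valid` check, the default seed and the rule that an `int` means a target size are all assumptions made for this example.

```python
import random


def inclusion_probability(p, corpus_size):
    """Turn the `p` option into a per-document inclusion probability.

    Assumption: an int is read as a target output size, a float as a probability.
    """
    if isinstance(p, int) and not isinstance(p, bool):
        if corpus_size == 0:
            return 0.0
        # Target size divided by input size, capped at 1
        return min(1.0, p / corpus_size)
    return float(p)


def subsample(docs, p, seed=1234, skip_invalid=False,
              is_valid=lambda doc: doc is not None):
    """Keep each document independently with the probability derived from `p`.

    `is_valid` is a hypothetical stand-in for however a corpus marks
    invalid documents; here None plays that role.
    """
    docs = list(docs)
    # Seed a private generator so repeated runs give the same subset
    rng = random.Random(seed)
    prob = inclusion_probability(p, len(docs))
    kept = []
    for doc in docs:
        if skip_invalid and not is_valid(doc):
            # Drop invalid documents instead of passing them through
            continue
        if rng.random() < prob:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    corpus = ["doc%d" % i for i in range(1000)]
    sample = subsample(corpus, 0.1, seed=1234)
    # The size is only roughly 10% of the input, since each draw is independent
    print(len(sample))
```

Because every document is drawn independently, an integer `p` in this sketch gives an output size that is only approximately the requested target, which matches the "roughly this proportion" wording above.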
Example config
This is an example of how this module can be used in a pipeline config file.
[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1
This example usage includes more options.
[my_subsample_module]
type=pimlico.modules.corpora.subsample
input_corpus=module_a.some_output
p=0.1
seed=1234
skip_invalid=T
Test pipelines
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.