Corpus subset
Path | pimlico.modules.corpora.subset
Executable | no
Simple filter that truncates a dataset after a given number of documents, optionally skipping a number
of documents at the start of the corpus first. It is mainly useful for creating small subsets of a corpus
to test a pipeline before running it on the full corpus.
Can be run on an iterable corpus or a tarred corpus. If the input is a tarred corpus, the filter will
emulate a tarred corpus with the appropriate datatype, passing through the archive names from the input.
This is a filter module. It is not executable, so it won’t appear in a pipeline’s list of modules that can be run. Instead, it produces its output on the fly, as the next module requests it.
Outputs
Name | Type(s)
documents | corpus with data-point from input
Options
Name | Description | Type
offset | Number of documents to skip at the beginning of the corpus (default: 0, start at beginning) | int
skip_invalid | Skip over any invalid documents so that the output subset contains the chosen number of (valid) documents (or as many as possible) and no invalid ones. By default, invalid documents are passed through and counted towards the subset size | bool
size | (required) | int
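
The way the three options interact can be illustrated with a minimal Python sketch. This is not Pimlico's implementation: the function `subset` and the `INVALID` marker are hypothetical names introduced here, and the sketch assumes that `offset` counts every input document (valid or not) and that `skip_invalid` is applied only after the offset.

```python
from itertools import islice

# Hypothetical marker standing in for an invalid document (an assumption
# for this sketch, not Pimlico's own representation).
INVALID = None


def subset(docs, size, offset=0, skip_invalid=False):
    """Lazily yield up to `size` documents from `docs` after skipping `offset`.

    Assumption: the offset counts all input documents, including invalid ones.
    """
    # Skip the first `offset` documents without reading the rest of the corpus.
    remaining = islice(docs, offset, None)
    if skip_invalid:
        # Drop invalid documents so they do not count towards `size`.
        remaining = (doc for doc in remaining if doc is not INVALID)
    # Stop after `size` documents, or earlier if the input runs out.
    yield from islice(remaining, size)


corpus = ["doc0", INVALID, "doc2", "doc3", "doc4"]

# Default behaviour: the invalid document is passed through and counted.
default_subset = list(subset(corpus, size=2, offset=1))

# With skip_invalid, the subset contains two valid documents.
valid_subset = list(subset(corpus, size=2, offset=1, skip_invalid=True))
```

Because `subset` is a generator, nothing is read from the input until a consumer asks for documents, which mirrors the on-the-fly behaviour of a filter module described above.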