Corpus document list filter

Path        pimlico.modules.corpora.list_filter
Executable  yes

Similar to the split module, but instead of splitting the dataset at random, this module splits it according to a given list of documents: documents named in the list go into one set and all remaining documents go into the other.
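The partitioning behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pimlico's implementation: the function name `split_by_doc_list`, the representation of a corpus as `(name, content)` pairs, and the sample document names are all assumptions made for the example. Which output (set1 or set2) receives the listed documents is not stated here, so the sketch simply returns the listed documents first.

```python
def split_by_doc_list(documents, doc_names):
    """Partition (name, content) pairs by whether the name is in doc_names.

    Hypothetical helper for illustration only; not part of Pimlico.
    Returns (listed, rest), preserving the original document order.
    """
    wanted = set(doc_names)  # set membership tests are O(1)
    listed, rest = [], []
    for name, content in documents:
        if name in wanted:
            listed.append((name, content))
        else:
            rest.append((name, content))
    return listed, rest


# Toy corpus: three named documents (made-up data).
corpus = [("doc1.txt", "alpha"), ("doc2.txt", "beta"), ("doc3.txt", "gamma")]

# Documents named in the list go to one set, everything else to the other.
listed, rest = split_by_doc_list(corpus, ["doc2.txt"])
```

Unlike a random split, the result is fully determined by the list: running it again with the same list always yields the same two sets.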

Inputs

Name    Type(s)
corpus  grouped_corpus
list    string_list

Outputs

Name  Type(s)
set1  grouped corpus with input doc type
set2  grouped corpus with input doc type

Example config

This is an example of how this module can be used in a pipeline config file.

[my_list_filter_module]
type=pimlico.modules.corpora.list_filter
input_corpus=module_a.some_output
input_list=module_a.some_output

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.