Huggingface text corpus

Path	pimlico.modules.input.text.huggingface
Executable	yes

Input reader to fetch a text corpus from Huggingface’s datasets library. See: https://huggingface.co/datasets/.

Uses Huggingface’s load_dataset() function to download a dataset and then converts it to a Pimlico raw text archive.
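
As a rough sketch of the download step, the helper below calls load_dataset() and pulls the requested columns out of each row. It assumes the `datasets` package is installed. The helper name `fetch_columns` and its `loader` parameter are illustrative only and are not part of Pimlico; the conversion to a Pimlico raw text archive is not shown.

```python
def fetch_columns(dataset, columns, name=None, split="train", loader=None):
    """Yield one tuple of column values per row of a Huggingface dataset.

    Hypothetical helper for illustration, not Pimlico's implementation.
    """
    if loader is None:
        # Imported lazily, so the datasets package is only needed when
        # no alternative loader is supplied
        from datasets import load_dataset
        loader = load_dataset
    # name is the dataset configuration, the second argument of load_dataset()
    data = loader(dataset, name, split=split)
    for row in data:
        # Each row is a dict mapping column names to values
        yield tuple(row[col] for col in columns)
```

The `loader` argument exists only so that the sketch can be exercised without network access; in normal use it is left as None.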

This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.

Inputs

No inputs

Outputs

Name	Type(s)
default	`grouped_corpus` <`RawTextDocumentType`>

Further conditional outputs

In addition to the default output (named “default”), if more than one column is specified, further outputs are provided, each containing one column and named after that column.

The first column named is always provided as the first output, called “default”.
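
One possible reading of this naming rule is sketched below. It assumes that only the columns after the first get an output named after themselves; the text above does not say whether the first column is also available under its own name. The function `output_names` is hypothetical and is not part of Pimlico.

```python
def output_names(columns):
    """Map output names to column names (an assumed reading of the rule)."""
    if not columns:
        raise ValueError("at least one column must be given")
    # The first column is always the output called "default"
    outputs = {"default": columns[0]}
    # Assumption: each remaining column gets an output named after itself
    for col in columns[1:]:
        outputs[col] = col
    return outputs
```

With columns=text,title (where `title` is a made-up column name), this gives the outputs “default” (holding `text`) and “title”.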

Options

Name	Description	Type
columns	(required) Name(s) of column(s) to store as Pimlico datasets. At least one must be given	comma-separated list of strings
dataset	(required) Name of the dataset to download	string
doc_name	Take the doc names from the named column. The special value ‘enum’ (default) just numbers the sequence of documents	string
name	Name defining the dataset configuration. This corresponds to the second argument of load_dataset()	string
split	Restrict to a split of the data. Must be one of the splits that this dataset provides. The default value of ‘train’ will work for many datasets, but is not guaranteed to be appropriate	string
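
The behaviour of the doc_name option might be pictured as follows. This is a sketch under assumptions: counting from 0 and using the plain number as the name are guesses, since the description only says that ‘enum’ numbers the sequence of documents. The function `doc_names` is hypothetical.

```python
def doc_names(rows, doc_name="enum"):
    """Yield a document name for each row (illustrative only)."""
    for index, row in enumerate(rows):
        if doc_name == "enum":
            # Special value: number the documents in sequence
            # (assumption: zero-based, no padding)
            yield str(index)
        else:
            # Otherwise take the name from the named column
            yield str(row[doc_name])
```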

Example config

This is an example of how this module can be used in a pipeline config file.

[my_huggingface_text_module]
type=pimlico.modules.input.text.huggingface
columns=text,text,...
dataset=value

This example usage includes more options.

[my_huggingface_text_module]
type=pimlico.modules.input.text.huggingface
columns=text,text,...
dataset=value
doc_name=enum
name=value
split=train
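
To illustrate how the option values are interpreted, the sketch below reads a section like the one above with Python’s standard configparser and splits columns on commas. Pimlico uses its own config reader, so this is only an approximation; `read_options` and the `title` column are made up for the example.

```python
import configparser

CONFIG = """
[my_huggingface_text_module]
type=pimlico.modules.input.text.huggingface
columns=text,title
dataset=value
"""


def read_options(text, section):
    """Parse one module section and apply the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    opts = dict(parser[section])
    # columns is a comma-separated list of strings
    opts["columns"] = [c.strip() for c in opts["columns"].split(",") if c.strip()]
    # Defaults stated in the options table
    opts.setdefault("doc_name", "enum")
    opts.setdefault("split", "train")
    return opts


options = read_options(CONFIG, "my_huggingface_text_module")
```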

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

huggingface_dataset