Huggingface text corpus¶
Path | pimlico.modules.input.text.huggingface |
Executable | yes |
Input reader to fetch a text corpus from Huggingface’s datasets library. See: https://huggingface.co/datasets/.
Uses Huggingface’s load_dataset() function to download a dataset and then converts it to a Pimlico raw text archive.
This module does not support Python 2, so it can only be used when Pimlico is run under Python 3.
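As a rough illustration of the download step described above, the following sketch calls load_dataset() and collects the requested columns as plain lists of text. It is not Pimlico's actual implementation: the function name fetch_columns and its loader parameter are hypothetical, introduced only so that the sketch can be exercised without network access. The conversion to a Pimlico raw text archive is not shown.

```python
def fetch_columns(dataset, columns, name=None, split="train", loader=None):
    """Download one split of a dataset and return {column: list of values}.

    `fetch_columns` and `loader` are hypothetical names for this sketch;
    they are not part of Pimlico or of the datasets library.
    """
    if loader is None:
        # Imported lazily so the sketch can be read without the library installed.
        from datasets import load_dataset
        loader = load_dataset

    # The configuration name is the second argument of load_dataset().
    # When a split is given, the result is iterable and yields one dict per row.
    data = loader(dataset, name, split=split)

    texts = {column: [] for column in columns}
    for row in data:
        for column in columns:
            texts[column].append(row[column])
    return texts
```

Passing the dataset, name and split options through unchanged mirrors how the module's options are described below; any further processing is left out of the sketch.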
Inputs¶
No inputs
Outputs¶
Name | Type(s) |
---|---|
default | grouped_corpus <RawTextDocumentType> |
Further conditional outputs¶
In addition to the default output (default), if more than one column is specified, further outputs will be provided, each containing a column and named after the column.
The first column name given is always provided as the first (default) output, called “default”.
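The naming rule above can be sketched as a small function mapping output names to column names. This is only an illustration under one assumption: that the first column appears solely under the name “default” and not also under its own name. The function name output_names is hypothetical.

```python
def output_names(columns):
    """Map each output name to the column it holds.

    Assumption for this sketch: the first column is exposed only as
    "default"; every later column gets an output named after itself.
    """
    if not columns:
        raise ValueError("at least one column must be given")

    outputs = {"default": columns[0]}
    for column in columns[1:]:
        outputs[column] = column
    return outputs
```

With columns given as text and title, this yields an output “default” holding the text column and an output “title” holding the title column.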
Options¶
Name | Description | Type |
---|---|---|
columns | (required) Name(s) of column(s) to store as Pimlico datasets. At least one must be given | comma-separated list of strings |
dataset | (required) Name of the dataset to download | string |
doc_name | Take the doc names from the named column. The special value ‘enum’ (default) just numbers the sequence of documents | string |
name | Name defining the dataset configuration. This corresponds to the second argument of load_dataset() | string |
split | Restrict to a split of the data. Must be one of the splits that this dataset provides. The default value of ‘train’ will work for many datasets, but is not guaranteed to be appropriate | string |
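To make the doc_name option more concrete, here is a minimal sketch of the two naming modes. The function name document_names is hypothetical, and the exact format of the generated numbers (starting at 0, no zero-padding) is an assumption of this sketch, not something the description above specifies.

```python
def document_names(rows, doc_name="enum"):
    """Return one document name per row.

    With the special value "enum", documents are simply numbered in order
    (numbering from 0 is an assumption here). Otherwise the names are read
    from the column called `doc_name`.
    """
    if doc_name == "enum":
        return [str(index) for index, _ in enumerate(rows)]
    return [str(row[doc_name]) for row in rows]
```

A column used for doc_name should contain a distinct value for every row; the sketch does not check this.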
Example config¶
This is an example of how this module can be used in a pipeline config file.
[my_huggingface_text_module]
type=pimlico.modules.input.text.huggingface
columns=text,text,...
dataset=value
This example usage includes more options.
[my_huggingface_text_module]
type=pimlico.modules.input.text.huggingface
columns=text,text,...
dataset=value
doc_name=enum
name=value
split=train
Test pipelines¶
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.