20 Newsgroups fetcher (sklearn)¶

Path	pimlico.modules.input.text.20newsgroups.sklearn_download
Executable	yes

Input reader to fetch the 20 Newsgroups dataset from Sklearn. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

The original data can be downloaded from http://qwone.com/~jason/20Newsgroups/.

This module does not support Python 2, so can only be used when Pimlico is being run under Python 3

Inputs¶

No inputs

Outputs¶

Name	Type(s)
text	`grouped_corpus` <`RawTextDocumentType`>
labels	`grouped_corpus` <`IntegerDocumentType`>

Options¶

Name	Description	Type
limit	Truncate corpus	int
random_state	Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple runs	int
remove	May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these are kinds of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata	comma-separated list of strings
shuffle	Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent	bool
subset	Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering	‘train’, ‘test’ or ‘all’

Example config¶

This is an example of how this module can be used in a pipeline config file.

[my_20ng_fetcher_module]
type=pimlico.modules.input.text.20newsgroups.sklearn_download

This example usage includes more options.

[my_20ng_fetcher_module]
type=pimlico.modules.input.text.20newsgroups.sklearn_download
limit=0
random_state=0
remove=text,text,...
shuffle=T
subset=train