renate.benchmark.datasets.nlp_datasets module#
- class renate.benchmark.datasets.nlp_datasets.HuggingFaceTextDataModule(data_path, tokenizer, dataset_name='ag_news', input_column='text', target_column='label', tokenizer_kwargs=None, val_size=0.0, seed=0)[source]#
Bases: RenateDataModule
Data module wrapping Hugging Face text datasets.
This is a convenience wrapper to expose a Hugging Face dataset as a RenateDataModule. Datasets will be pre-tokenized and will return input, target = dataset[i], where input is a dictionary with fields ["input_ids", "attention_mask"], and target is a tensor.
We expect the dataset to have a “train” and a “test” split. An additional “validation” split will be used if present. Otherwise, a validation set may be split off of the training data using the val_size argument.
- Parameters:
data_path (str) – The path to the folder containing the dataset files.
tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.
dataset_name (str) – Name of the dataset, see https://huggingface.co/datasets. This is a wrapper for text datasets only.
input_column (str) – Name of the column containing the input text.
target_column (str) – Name of the column containing the target (e.g., class label).
tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.
val_size (float) – Fraction of the training data to be used for validation.
seed (int) – Seed used to fix random number generation.
- class renate.benchmark.datasets.nlp_datasets.MultiTextDataModule(data_path, tokenizer, data_id, tokenizer_kwargs=None, train_size=115000, test_size=7600, val_size=0.0, seed=0)[source]#
Bases: DataIncrementalDataModule
Inspired by the dataset used in “Episodic Memory in Lifelong Language Learning” by d’Autume et al., this is a collection of four different datasets that we call domains: AGNews, Yelp, DBPedia, and Yahoo Answers.
The output space is the union of the output spaces of all the domains. The dataset has 33 classes: 4 from AGNews, 5 from Yelp, 14 from DBPedia, and 10 from Yahoo.
The largest available size for the training set is 115000 and for the test set is 7600.
- Parameters:
data_path (str) – The path to the folder where the data files will be downloaded to.
tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.
tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.
data_id (str) – The dataset (domain) to be used; see domains below for the valid options.
train_size (int) – The size of the data stored as training set, must be smaller than 115000.
test_size (int) – The size of the data stored as test set, must be smaller than 7600.
val_size (float) – Fraction of the training data to be used for validation.
seed (int) – Seed used to fix random number generation.
- domains = ['ag_news', 'yelp_review_full', 'dbpedia_14', 'yahoo_answers_topics']#
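Example. The sketch below shows how one domain from domains might be loaded. It assumes the same prepare_data()/setup()/train_data() workflow as above and an arbitrary tokenizer checkpoint ("distilbert-base-uncased"); treat it as an illustration, not a verified recipe.

    # Hypothetical usage sketch for a single domain.
    from transformers import AutoTokenizer
    from renate.benchmark.datasets.nlp_datasets import MultiTextDataModule

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
    data_module = MultiTextDataModule(
        data_path="data/",
        tokenizer=tokenizer,
        data_id="dbpedia_14",  # one of MultiTextDataModule.domains
        train_size=10000,      # must be smaller than 115000
        test_size=1000,        # must be smaller than 7600
        val_size=0.1,
        seed=42,
    )
    data_module.prepare_data()
    data_module.setup()
    inputs, target = data_module.train_data()[0]
    # target lies in the joint 33-class output space shared across all four domains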