renate.benchmark.datasets.nlp_datasets module#

class renate.benchmark.datasets.nlp_datasets.HuggingFaceTextDataModule(data_path, tokenizer, dataset_name='ag_news', input_column='text', target_column='label', tokenizer_kwargs=None, val_size=0.0, seed=0)[source]#

Bases: RenateDataModule

Data module wrapping Hugging Face text datasets.

This is a convenience wrapper to expose a Hugging Face dataset as a RenateDataModule. Datasets will be pre-tokenized and will return input, target = dataset[i], where input is a dictionary with fields [“input_ids”, “attention_mask”], and target is a tensor.

We expect the dataset to have a “train” and a “test” split. An additional “validation” split will be used if present. Otherwise, a validation set may be split off of the training data using the val_size argument.

Parameters:
  • data_path (str) – the path to the folder containing the dataset files.

  • tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.

  • dataset_name (str) – Name of the dataset, see https://huggingface.co/datasets. This is a wrapper for text datasets only.

  • input_column (str) – Name of the column containing the input text.

  • target_column (str) – Name of the column containing the target (e.g., class label).

  • tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.

  • val_size (float) – Fraction of the training data to be used for validation.

  • seed (int) – Seed used to fix random number generation.
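Example (a minimal instantiation sketch; the tokenizer checkpoint, data_path, and tokenizer_kwargs values below are illustrative assumptions, not requirements):

from transformers import AutoTokenizer

from renate.benchmark.datasets.nlp_datasets import HuggingFaceTextDataModule

# Any Hugging Face PreTrainedTokenizer can be used; "distilbert-base-uncased" is just an example.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

data_module = HuggingFaceTextDataModule(
    data_path="./data",  # folder containing (or receiving) the dataset files
    tokenizer=tokenizer,
    dataset_name="ag_news",  # any Hugging Face text dataset with "train" and "test" splits
    input_column="text",
    target_column="label",
    tokenizer_kwargs={"padding": "max_length", "max_length": 128, "truncation": True},
    val_size=0.1,  # split off 10% of the training data for validation
    seed=0,
)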

prepare_data()[source]#

Download data.

Return type:

None

setup()[source]#

Set up train, test and val datasets.

Return type:

None
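The usual workflow is to call prepare_data() followed by setup() and then access the pre-tokenized splits. The following sketch continues the example above; it assumes the standard RenateDataModule accessors (train_data(), val_data(), test_data()), which are not documented in this section:

data_module.prepare_data()  # downloads the dataset files into data_path
data_module.setup()  # builds the pre-tokenized train/val/test datasets

train_dataset = data_module.train_data()  # assumed RenateDataModule accessor
inputs, target = train_dataset[0]
print(inputs["input_ids"].shape, inputs["attention_mask"].shape, target)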

class renate.benchmark.datasets.nlp_datasets.MultiTextDataModule(data_path, tokenizer, data_id, tokenizer_kwargs=None, train_size=115000, test_size=7600, val_size=0.0, seed=0)[source]#

Bases: DataIncrementalDataModule

Inspired by the dataset used in “Episodic Memory in Lifelong Language Learning” by d’Autume et al., this is a collection of four datasets that we call domains: AGNews, Yelp, DBPedia and Yahoo Answers.

The output space is the union of the output spaces of all the domains. The dataset has 33 classes in total: 4 from AGNews, 5 from Yelp, 14 from DBPedia, and 10 from Yahoo.

The largest available size for the training set is 115000 and for the test set is 7600.

Parameters:
  • data_path (str) – The path to the folder where the data files will be downloaded to.

  • tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.

  • tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.

  • data_id (str) – The dataset (domain) to be used; must be one of the entries in domains.

  • train_size (int) – The size of the data stored as training set; must be at most 115000.

  • test_size (int) – The size of the data stored as test set; must be at most 7600.

  • val_size (float) – Fraction of the training data to be used for validation.

  • seed (int) – Seed used to fix random number generation.

domains = ['ag_news', 'yelp_review_full', 'dbpedia_14', 'yahoo_answers_topics']#
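Example (a minimal sketch; the tokenizer checkpoint and data_path are illustrative assumptions, and data_id must be one of the values in domains):

from transformers import AutoTokenizer

from renate.benchmark.datasets.nlp_datasets import MultiTextDataModule

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # illustrative choice

data_module = MultiTextDataModule(
    data_path="./data",
    tokenizer=tokenizer,
    data_id="dbpedia_14",  # one of MultiTextDataModule.domains
    train_size=115000,  # at most 115000
    test_size=7600,  # at most 7600
    val_size=0.1,
    seed=0,
)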
prepare_data()[source]#

Download dataset.

Return type:

None

setup()[source]#

Set up train, test and val datasets.

Return type:

None
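Because each data_id corresponds to one domain, a data-incremental experiment can iterate over domains and build one data module per domain. This is a conservative sketch that relies only on the constructor and methods documented above; the training and evaluation calls in the final comment are placeholders:

for domain in MultiTextDataModule.domains:
    dm = MultiTextDataModule(
        data_path="./data",
        tokenizer=tokenizer,  # tokenizer from the example above
        data_id=domain,
        train_size=10000,  # smaller subsets keep the loop fast; at most 115000
        test_size=5000,  # at most 7600
        val_size=0.1,
        seed=0,
    )
    dm.prepare_data()
    dm.setup()
    # ... update the model on this domain's training data, then evaluate ...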