renate.benchmark.datasets.nlp_datasets module#
- class renate.benchmark.datasets.nlp_datasets.HuggingFaceTextDataModule(data_path, tokenizer, dataset_name='ag_news', input_column='text', target_column='label', tokenizer_kwargs=None, val_size=0.0, seed=0)[source]#
Bases: RenateDataModule
Data module wrapping Hugging Face text datasets.
This is a convenience wrapper to expose a Hugging Face dataset as a RenateDataModule. Datasets will be pre-tokenized and will return input, target = dataset[i], where input is a dictionary with fields ["input_ids", "attention_mask"], and target is a tensor.
We expect the dataset to have a “train” and a “test” split. An additional “validation” split will be used if present. Otherwise, a validation set may be split off of the training data using the val_size argument.
- Parameters:
data_path (str) – The path to the folder containing the dataset files.
tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.
dataset_name (str) – Name of the dataset, see https://huggingface.co/datasets. This is a wrapper for text datasets only.
input_column (str) – Name of the column containing the input text.
target_column (str) – Name of the column containing the target (e.g., class label).
tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.
val_size (float) – Fraction of the training data to be used for validation.
seed (int) – Seed used to fix random number generation.
- class renate.benchmark.datasets.nlp_datasets.MultiTextDataModule(data_path, tokenizer, data_id, tokenizer_kwargs=None, train_size=115000, test_size=7600, val_size=0.0, seed=0)[source]#
Bases: DataIncrementalDataModule
Inspired by the dataset used in “Episodic Memory in Lifelong Language Learning” by d’Autume et al., this is a collection of four different datasets that we call domains: AGNews, Yelp, DBPedia, and Yahoo Answers.
The output space is the union of the output spaces of all the domains. The dataset has 33 classes: 4 from AGNews, 5 from Yelp, 14 from DBPedia, and 10 from Yahoo.
The largest available size for the training set is 115000 and for the test set is 7600.
- Parameters:
data_path (str) – The path to the folder where the data files will be downloaded to.
tokenizer (PreTrainedTokenizer) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.
tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function. Typical options are max_length, padding, and truncation. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers. If None is passed, this defaults to {"padding": "max_length", "max_length": 128, "truncation": True}.
data_id (str) – The dataset (domain) to be used; see domains below for the valid options.
train_size (int) – The size of the data stored as training set, must be smaller than 115000.
test_size (int) – The size of the data stored as test set, must be smaller than 7600.
val_size (float) – Fraction of the training data to be used for validation.
seed (int) – Seed used to fix random number generation.
- domains = ['ag_news', 'yelp_review_full', 'dbpedia_14', 'yahoo_answers_topics']#
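Example. The sketch below shows how one domain from domains might be loaded. It assumes the same prepare_data()/setup()/train_data() workflow as above and an arbitrary tokenizer checkpoint ("distilbert-base-uncased"); treat it as an illustration, not a verified recipe.

    # Hypothetical usage sketch for a single domain.
    from transformers import AutoTokenizer
    from renate.benchmark.datasets.nlp_datasets import MultiTextDataModule

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
    data_module = MultiTextDataModule(
        data_path="data/",
        tokenizer=tokenizer,
        data_id="dbpedia_14",  # one of MultiTextDataModule.domains
        train_size=10000,      # must be smaller than 115000
        test_size=1000,        # must be smaller than 7600
        val_size=0.1,
        seed=42,
    )
    data_module.prepare_data()
    data_module.setup()
    inputs, target = data_module.train_data()[0]
    # target lies in the joint 33-class output space shared across all four domains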