renate.benchmark.datasets.wild_time_data module

class renate.benchmark.datasets.wild_time_data.WildTimeDataModule(data_path, dataset_name, src_bucket=None, src_object_name=None, time_step=0, tokenizer=None, tokenizer_kwargs=None, val_size=0.0, seed=0)

Bases: DataIncrementalDataModule

Data module wrapping the Wild-Time datasets.

Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh, Chelsea Finn: Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. NeurIPS 2022

Parameters:

data_path (Union[Path, str]) – Path to the folder containing the dataset files.
dataset_name (str) – Name of the Wild-Time dataset.
src_bucket (Optional[str]) – Name of the S3 bucket. If not provided, the data is downloaded from the original source.
src_object_name (Optional[str]) – Folder path in the S3 bucket.
time_step (int) – Time slice to be loaded.
tokenizer (Optional[PreTrainedTokenizer]) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.
tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function.
val_size (float) – Fraction of the training data to be used for validation.
seed (int) – Seed used to fix random number generation.
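
The parameters above can be combined as in the following minimal sketch. The helper names `wild_time_kwargs` and `load_time_slice`, the dataset name `"arxiv"`, and the chosen values are assumptions made for illustration only; the constructor arguments and the `prepare_data()` and `setup()` calls follow the signature documented on this page.

```python
from pathlib import Path
from typing import Any, Dict


def wild_time_kwargs(data_root: str, time_step: int) -> Dict[str, Any]:
    """Collect constructor arguments for one time slice (hypothetical helper)."""
    return {
        "data_path": Path(data_root),
        "dataset_name": "arxiv",  # assumed dataset name, for illustration only
        "time_step": time_step,  # which time slice to load
        "val_size": 0.1,  # hold out 10% of the training data for validation
        "seed": 0,
    }


def load_time_slice(data_root: str, time_step: int):
    """Build the data module for one time slice (hypothetical helper)."""
    # Imported lazily so that wild_time_kwargs works without Renate installed.
    from renate.benchmark.datasets.wild_time_data import WildTimeDataModule

    data_module = WildTimeDataModule(**wild_time_kwargs(data_root, time_step))
    data_module.prepare_data()  # download (from S3 if src_bucket is set, else original source)
    data_module.setup()  # set up the train, test and val datasets
    return data_module
```

Since `src_bucket` is left at its default of `None` here, the download would come from the original source.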

prepare_data()

Download data.

If an S3 bucket is given, the data is downloaded from S3; otherwise it is downloaded from the original source.

Return type: None
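
The source selection described for `prepare_data()` can be pictured with the sketch below. It is not Renate's implementation; the function name `choose_download_source` and its return values are hypothetical and only mirror the documented rule that a given `src_bucket` selects S3.

```python
from typing import Optional


def choose_download_source(src_bucket: Optional[str]) -> str:
    """Return "s3" when a bucket name is given, otherwise "original"."""
    # Mirrors the documented rule only; the real method also performs the download.
    return "s3" if src_bucket is not None else "original"
```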

setup()

Set up the training, test, and validation datasets.

Return type: None
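
To illustrate how `val_size` and `seed` could interact when the validation set is carved out of the training data, here is a minimal sketch of a seeded split using only the standard library. The function `split_train_val` is hypothetical, and the splitting procedure Renate actually uses may differ.

```python
import random
from typing import List, Tuple


def split_train_val(num_samples: int, val_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Split sample indices into train and validation parts, reproducibly."""
    if not 0.0 <= val_size < 1.0:
        raise ValueError("val_size must be in [0.0, 1.0)")
    indices = list(range(num_samples))
    # A dedicated, seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    num_val = int(num_samples * val_size)
    # The first num_val shuffled indices form the validation set.
    return indices[num_val:], indices[:num_val]
```

With the default `val_size=0.0`, this sketch returns an empty validation part, and calling it twice with the same seed yields the same split.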