renate.benchmark.datasets.wild_time_data module

class renate.benchmark.datasets.wild_time_data.WildTimeDataModule(data_path, dataset_name, src_bucket=None, src_object_name=None, time_step=0, tokenizer=None, tokenizer_kwargs=None, val_size=0.0, seed=0)

Bases: DataIncrementalDataModule

Data module wrapping the Wild-Time benchmark datasets.

Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh, Chelsea Finn: Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. NeurIPS 2022

Parameters:
  • data_path (Union[Path, str]) – Path to the folder containing the dataset files.

  • dataset_name (str) – Name of the Wild-Time dataset.

  • src_bucket (Optional[str]) – Name of the S3 bucket. If not provided, the data is downloaded from the original source.

  • src_object_name (Optional[str]) – Folder path in the S3 bucket.

  • time_step (int) – Index of the time slice to be loaded.

  • tokenizer (Optional[PreTrainedTokenizer]) – Tokenizer to apply to the dataset. See https://huggingface.co/docs/tokenizers/ for more information on tokenizers.

  • tokenizer_kwargs (Optional[Dict[str, Any]]) – Keyword arguments passed when calling the tokenizer’s __call__ function.

  • val_size (float) – Fraction of the training data to be used for validation.

  • seed (int) – Seed used to fix random number generation.
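
To illustrate how val_size and seed interact, the following is a minimal sketch of a seeded random split using only the standard library. It is not Renate's implementation; split_train_val is a hypothetical helper, and the exact rounding and shuffling used by the library may differ.

```python
import random


def split_train_val(num_examples, val_size, seed):
    """Return (train_indices, val_indices) for a seeded random split.

    Hypothetical helper for illustration only; not part of Renate.
    """
    if not 0.0 <= val_size < 1.0:
        raise ValueError("val_size must be in [0.0, 1.0)")
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the split reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    num_val = int(round(val_size * num_examples))
    # The first num_val shuffled indices form the validation set.
    return indices[num_val:], indices[:num_val]
```

With val_size=0.0 (the default in the signature above) the validation set is empty, and calling the helper twice with the same seed yields the same split.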

prepare_data()

Download data.

If an S3 bucket is given, the data is downloaded from S3; otherwise, it is downloaded from the original source.

Return type:

None
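
The source selection described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's code: describe_download_source is a hypothetical helper, and the way Renate combines src_bucket and src_object_name into an S3 location is assumed here.

```python
from typing import Optional


def describe_download_source(
    src_bucket: Optional[str], src_object_name: Optional[str]
) -> str:
    """Describe where the data would be fetched from.

    Hypothetical helper for illustration only; not part of Renate.
    """
    if src_bucket is None:
        # No bucket given: fall back to the dataset's original source.
        return "original source"
    if src_object_name is None:
        return f"s3://{src_bucket}"
    # Assumed layout: the object name is a folder path inside the bucket.
    return f"s3://{src_bucket}/{src_object_name}"
```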

setup()

Set up the training, validation, and test datasets.

Return type:

None
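
A typical call sequence, based only on the constructor signature and the two methods documented above, might look like the following sketch. load_time_slice is a hypothetical wrapper; the import is done lazily inside the function so that the snippet can be imported without Renate installed.

```python
from pathlib import Path


def load_time_slice(data_path, dataset_name, time_step, val_size=0.1, seed=0):
    """Download (if needed) and set up one Wild-Time time slice.

    Hypothetical wrapper for illustration; requires Renate to be installed
    when called.
    """
    # Lazy import: only needed when the function is actually called.
    from renate.benchmark.datasets.wild_time_data import WildTimeDataModule

    data_module = WildTimeDataModule(
        data_path=Path(data_path),
        dataset_name=dataset_name,
        time_step=time_step,
        val_size=val_size,
        seed=seed,
    )
    # No src_bucket is passed, so the data comes from the original source.
    data_module.prepare_data()
    # Builds the training, validation, and test datasets for this time slice.
    data_module.setup()
    return data_module
```

For example, load_time_slice("data/wild_time", "yearbook", time_step=0) would prepare the first time slice, assuming "yearbook" is an accepted value for dataset_name. Calling the function once per time step gives a sequence of data modules for a data-incremental experiment.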