Working with NLP and Large Language Models#

This example demonstrates how to use Renate to train NLP models. We will train a sequence classifier to distinguish between positive and negative movie reviews. Using Renate, we will sequentially train this model on two movie review datasets, called "imdb" and "rotten_tomatoes".


Let us take a look at the configuration file for this example. In the model_fn function, we use the Hugging Face transformers library to instantiate a sequence classification model. Since this model is static, we can easily turn it into a RenateModule by wrapping it in RenateWrapper.
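As a rough illustration of the wrapping idea (this sketch is not Renate's actual RenateWrapper implementation, just an assumption of its basic behavior), wrapping a static model amounts to forwarding calls to the underlying torch.nn.Module:

```python
import torch

# Illustrative sketch only (not Renate's actual code): a wrapper around a
# static torch.nn.Module that simply forwards calls to the wrapped model.
class StaticModelWrapper(torch.nn.Module):
    def __init__(self, model: torch.nn.Module) -> None:
        super().__init__()
        self._model = model

    def forward(self, *args, **kwargs):
        return self._model(*args, **kwargs)

# Wrapping a toy linear model; the wrapper is transparent to callers.
wrapped = StaticModelWrapper(torch.nn.Linear(4, 2))
out = wrapped(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```

Because the wrapper registers the model as a submodule, state_dict saving and loading works as usual.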

In the data_module_fn, we load the matching tokenizer from the transformers library. We then use Renate's HuggingFaceTextDataModule to access datasets from the Hugging Face datasets hub. This data module expects the name of a dataset as well as a tokenizer. Here, we load the "imdb" dataset in the first training stage (chunk_id = 0) and the "rotten_tomatoes" dataset for the subsequent model update (chunk_id = 1).
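The chunk-id-to-dataset mapping can be sketched in isolation (dataset_for_chunk is a hypothetical helper, not part of Renate):

```python
# Hypothetical helper mirroring the mapping described above:
# chunk 0 loads "imdb", any later chunk loads "rotten_tomatoes".
def dataset_for_chunk(chunk_id: int) -> str:
    return "imdb" if chunk_id == 0 else "rotten_tomatoes"

print(dataset_for_chunk(0))  # imdb
print(dataset_for_chunk(1))  # rotten_tomatoes
```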

The function loss_fn defines the appropriate loss criterion. As this is a classification problem, we use torch.nn.CrossEntropyLoss.
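Note that the criterion below is constructed with reduction="none". A small sketch shows the effect: the loss is returned per sample (shape [batch_size]) rather than reduced to a scalar, which allows the updater to weight individual examples:

```python
import torch

# With reduction="none", CrossEntropyLoss returns one loss value per sample
# instead of averaging over the batch.
criterion = torch.nn.CrossEntropyLoss(reduction="none")
logits = torch.tensor([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])  # 3 samples, 2 classes
labels = torch.tensor([0, 1, 0])
per_sample_loss = criterion(logits, labels)
print(per_sample_loss.shape)  # torch.Size([3]) -- one value per sample
```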

The data module returns pre-tokenized data, so no further transforms are needed in this case.

from typing import Optional

import torch
import transformers

import renate.defaults as defaults
from renate.benchmark.datasets.nlp_datasets import HuggingFaceTextDataModule
from renate.data.data_module import RenateDataModule
from renate.models import RenateModule
from renate.models.renate_module import RenateWrapper

def model_fn(model_state_url: Optional[str] = None) -> RenateModule:
    """Returns a DistilBert classification model."""
    transformer_model = transformers.DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2, return_dict=False
    )
    model = RenateWrapper(transformer_model)
    if model_state_url is not None:
        state_dict = torch.load(model_state_url)
        model.load_state_dict(state_dict)
    return model

def loss_fn() -> torch.nn.Module:
    return torch.nn.CrossEntropyLoss(reduction="none")

def data_module_fn(data_path: str, chunk_id: int, seed: int = defaults.SEED) -> RenateDataModule:
    """Returns one of two movie review datasets depending on `chunk_id`."""
    tokenizer = transformers.DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    dataset_name = "imdb" if chunk_id == 0 else "rotten_tomatoes"
    data_module = HuggingFaceTextDataModule(
        data_path,
        dataset_name=dataset_name,
        tokenizer=tokenizer,
        val_size=0.2,
        seed=seed,
    )
    return data_module


As in previous examples, we also include a launch script. For more details, see previous examples or How to Run a Training Job.

import boto3
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role

from renate.training import run_training_job

config_space = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 0.0,
    "learning_rate": 0.001,
    "alpha": 0.5,
    "batch_size": 64,
    "batch_memory_frac": 0.5,
    "memory_size": 300,
    "loss_normalization": 0,
    "loss_weight": 0.5,
}

if __name__ == "__main__":
    AWS_ID = boto3.client("sts").get_caller_identity().get("Account")
    AWS_REGION = "us-west-2"  # use your preferred AWS region here

    run_training_job(
        config_space=config_space,
        mode="max",
        metric="val_accuracy",
        updater="ER",  # we train with Experience Replay
        max_epochs=5,
        # For this example, we can train on two binary movie review datasets: "rotten_tomatoes" and
        # "imdb". Set chunk_id to 0 or 1 to switch between the two.
        chunk_id=0,
        config_file="renate_config.py",
        # replace the url below with a different one if you already ran it and you want to avoid
        # overwriting
        next_state_url=f"s3://sagemaker-{AWS_REGION}-{AWS_ID}/renate-training-nlp-finetuning/",
        # uncomment the line below only if you already created a model with this script and you want
        # to update it
        # input_state_url=f"s3://sagemaker-{AWS_REGION}-{AWS_ID}/renate-training-nlp-finetuning/",
        backend="sagemaker",  # run on SageMaker, select "local" to run this locally
        role=get_execution_role(),
        instance_type="ml.g4dn.2xlarge",
        devices=1,
        strategy="deepspeed_stage_2",
        precision="32",
    )

Support for training large models#

To support training methods for larger models, we expose two arguments in run_experiment_job to enable training on multiple GPUs. For this, we exploit the strategy functionality provided by Lightning; see Lightning's large model tutorial and documentation for details. Currently, we support the following strategies:

  • "ddp_find_unused_parameters_false"

  • "ddp"

  • "deepspeed"

  • "deepspeed_stage_1"

  • "deepspeed_stage_2"

  • "deepspeed_stage_2_offload"

  • "deepspeed_stage_3"

  • "deepspeed_stage_3_offload"

  • "deepspeed_stage_3_offload_nvme"

These can be enabled by passing one of the above options to the strategy argument. The number of devices to be used for parallel training can be specified using the devices argument, which defaults to 1. We also support lower-precision training by passing the precision argument, which accepts the options "16", "32", "64", and "bf16". Note that it has to be a string and not the integer 32. "bf16" is restricted to newer hardware and thus needs slightly more attention before using it.
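The string requirement for precision can be illustrated with a small sketch (check_precision is a hypothetical helper, not part of Renate's API):

```python
# Hypothetical helper (not part of Renate) illustrating the contract of the
# precision argument: it must be one of the strings listed above, never the
# integer 32.
ALLOWED_PRECISION = {"16", "32", "64", "bf16"}

def check_precision(precision) -> str:
    if not isinstance(precision, str):
        raise TypeError("precision must be a string such as '32', not an int")
    if precision not in ALLOWED_PRECISION:
        raise ValueError(f"unsupported precision: {precision!r}")
    return precision

print(check_precision("bf16"))  # bf16
```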

See the last four lines in the previous code example.