How to Run a Training Job#

Renate offers the possibility to run training jobs using both CLIs and functions that can be called programmatically in Python. The best choice may differ depending on your requirements (e.g., a CLI can be convenient for running remote jobs). In the following we illustrate the solution that we find to be the simplest and most convenient. The complete documentation is available in run_training_job().

Setup#

The first step that needs to be completed before running a training job is to define which model needs to be trained and on which data. This is explained in How to Write a Config File.

Once the first step is completed, a simple way to run a training job is to use run_training_job(). It works for most training needs: it can launch trainings with and without HPO, either locally or on Amazon SageMaker.

Run a local training job#

Running a local training job is very easy: it requires providing a configuration and calling a function. The configuration is stored in a dictionary and may contain different arguments depending on the method you want to use to update your model.

In this example we will use a simple Experience Replay method and provide a configuration similar to the following one. The arguments to be specified in the configuration are passed to the ModelUpdater instantiating the method you selected. See Supported Algorithms for more information about the methods.

configuration = {
    "optimizer": "SGD",
    "momentum": 0.0,
    "weight_decay": 1e-2,
    "learning_rate": 0.05,
    "batch_size": 32,
    "max_epochs": 50,
    "memory_batch_size": 32,  # number of samples drawn from the memory buffer in each batch
    "memory_size": 500,  # maximum number of samples kept in the memory buffer
}

Note

If you have defined the optimizer_fn function in your Renate config, do not pass values for the keys optimizer, momentum, weight_decay, or learning_rate, unless you have specified them as custom arguments.
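
For example, if your Renate config defines optimizer_fn, the configuration above reduces to a sketch like the following, where the optimizer-related keys are simply omitted:

configuration = {
    "batch_size": 32,
    "max_epochs": 50,
    "memory_batch_size": 32,
    "memory_size": 500,
}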

Once the configuration of the learning algorithm is specified, we need to set a few more arguments in the run_training_job() function to make sure we obtain the desired behavior:

  • mode: it can be either min or max and defines whether the aim is to minimize or maximize the metric.

  • metric: the target metric. Metrics measured on the validation set are prefixed with val_, while the ones measured on the training set are prefixed with train_. mode and metric are used to checkpoint the best model if a validation set is provided; otherwise, do not pass these arguments.

  • updater: the name of the algorithm to be used for updating the model. See Supported Algorithms for more info.

  • max_epochs: the maximum number of training epochs.

  • input_state_url: the location at which the state of the learner and the model to be updated are made available. If this argument is not passed, the model will be trained from scratch.

  • output_state_url: the location at which the output of the training job (e.g., model, state) will be stored.

  • backend: when set to local, the training job will run on the local machine.

Both state URLs can point to local folders or S3 locations.

Putting everything together will result in a script like the following.

from renate.training import run_training_job

configuration = {
    "optimizer": "SGD",
    "momentum": 0.0,
    "weight_decay": 1e-2,
    "learning_rate": 0.05,
    "batch_size": 64,
    "batch_memory_frac": 0.5,
    "max_epochs": 50,
    "memory_size": 500,
}

if __name__ == "__main__":
    run_training_job(
        config_space=configuration,
        mode="max",
        metric="val_accuracy",
        updater="ER",
        max_epochs=50,
        chunk_id=0,  # identifier of the data chunk to train on
        config_file="renate_config.py",
        output_state_url="./output_folder/",  # this is where the model will be stored
        backend="local",  # the training job will run on the local machine
    )

Once the training has been executed you will see some metrics printed on the screen (e.g., validation accuracy) and you will find the output of the training process in the folder specified. For more information about the output see Renate’s output.
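
If you later want to update the same model with new data, you can point input_state_url to the location used as output_state_url in the previous run. The following is a minimal sketch building on the script above; the chunk ID and the new output folder are placeholder values:

from renate.training import run_training_job

if __name__ == "__main__":
    run_training_job(
        config_space=configuration,  # the configuration dictionary from the previous example
        mode="max",
        metric="val_accuracy",
        updater="ER",
        max_epochs=50,
        chunk_id=1,  # placeholder: identifier of the new data chunk
        config_file="renate_config.py",
        input_state_url="./output_folder/",  # state produced by the previous job
        output_state_url="./output_folder_updated/",  # placeholder: new output location
        backend="local",
    )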

Run a training job on SageMaker#

Running a job on SageMaker is very similar to running the training job locally, but it requires a few changes to the arguments passed to run_training_job():

  • backend: the backend will have to be set to sagemaker.

  • role: an execution role will need to be passed.

  • instance_type: the type of machine to be used for training. AWS provides a list of available training instances.

  • job_name: (optional) a prefix used to name the training job to make it recognizable in the SageMaker jobs list.

When using the SageMaker backend, you should use an S3 location as output_state_url to make sure you have access to the result after the job has finished. We provide an example in Training and Tuning on SageMaker.
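
Putting these arguments together, a SageMaker job could be launched with a call like the following sketch; the S3 bucket, IAM role ARN, and instance type are placeholders that need to be adapted to your account:

from renate.training import run_training_job

if __name__ == "__main__":
    run_training_job(
        config_space=configuration,  # same configuration dictionary as in the local example
        mode="max",
        metric="val_accuracy",
        updater="ER",
        max_epochs=50,
        chunk_id=0,
        config_file="renate_config.py",
        output_state_url="s3://my-bucket/renate-output/",  # placeholder S3 location
        backend="sagemaker",  # run the job on Amazon SageMaker
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
        instance_type="ml.g4dn.xlarge",  # example training instance type
        job_name="renate-training",  # optional prefix to identify the job
    )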

Run a training job with HPO#

Running the training job with hyperparameter optimization (HPO) will require a few minor additions to the components already discussed.

The first step to run an HPO job is to define the search space. For this purpose, it is sufficient to extend our configuration to include ranges instead of exact values. If a hyperparameter does not need to be tuned, an exact value can be provided.

from syne_tune.config_space import choice, randint, uniform

config_space = {
    "optimizer": "SGD",
    "momentum": 0.0,
    "weight_decay": 1e-2,
    "learning_rate": uniform(0.001, 0.1),
    "batch_size": choice([32, 64, 128]),
    "max_epochs": 50,
    "memory_batch_size": randint(1, 32),
    "memory_size": 500,
}

For more suggestions and details about how to design a search space, see the Syne Tune documentation. If you do not know which search space to use, you can adopt a default one by calling config_space() and passing the name of your algorithm to it.

from renate.utils.config_spaces import config_space

config_space("ER")
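
Assuming the returned default search space behaves like a regular dictionary, you can also start from it and override individual entries, for example:

from renate.utils.config_spaces import config_space

# Start from the default search space for Experience Replay and override single entries.
my_config_space = config_space("ER")
my_config_space["max_epochs"] = 50
my_config_space["memory_size"] = 500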

After configuring the search space, it is sufficient to add a few more arguments to the run_training_job() call. To start, please make sure that mode and metric (already introduced above) reflect your aim. Also, make sure that in data_module_fn a reasonable fraction of the data is assigned to the validation set, otherwise it will not be possible to measure validation performance reliably (val_size controls this).

It is also possible to define more aspects of the HPO process:

  • n_workers: the number of workers evaluating configurations in parallel (useful for multi-CPU or multi-GPU machines).

  • scheduler: decides which optimizer to use for searching the hyperparameters (e.g., “bo”, “asha”).

  • a stopping criterion: specify one of the available stopping criteria; for example, max_time stops the tuning job after a certain amount of time.

After defining these arguments it will be sufficient to run the script and wait :) The output will be available in the location specified in output_state_url.
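
Putting the HPO-specific arguments together, a tuning job could be launched with a call like the following sketch; the number of workers, scheduler, and stopping criterion are example values:

from renate.training import run_training_job

if __name__ == "__main__":
    run_training_job(
        config_space=config_space,  # the search space dictionary defined above
        mode="max",
        metric="val_accuracy",
        updater="ER",
        max_epochs=50,
        chunk_id=0,
        config_file="renate_config.py",
        output_state_url="./output_folder/",
        backend="local",
        n_workers=2,  # example: evaluate two configurations in parallel
        scheduler="asha",  # example HPO scheduler
        max_time=3600,  # example stopping criterion: limit the overall tuning time
    )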

We provide an example of training on SageMaker with HPO at Training and Tuning on SageMaker.

Custom Function Arguments#

Now that we know how to run basic training jobs, we can discuss how to use custom-defined function arguments. We are building upon the linear model example introduced in the previous chapter, where we added num_inputs and num_outputs to data_module_fn. The values for these inputs are passed via the configuration alongside the other arguments.

config_space = {
    # Define all remaining standard arguments as well
    "num_inputs": 28 * 28,
    "num_outputs": 10,
}

While it does not make any sense for this example, we can also define ranges for our custom function arguments and automatically optimize them during hyperparameter optimization.
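
For illustration, a range for a hypothetical custom argument (here num_hidden, which is not part of the linear model example) would be defined like any other hyperparameter:

from syne_tune.config_space import randint

config_space = {
    # Define all remaining standard arguments as well
    "num_inputs": 28 * 28,
    "num_outputs": 10,
    "num_hidden": randint(32, 256),  # hypothetical custom argument tuned via HPO
}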

Note

If you have multiple functions defined with the same argument name, the value defined in the configuration will be passed to all of them.