Add NetmindMixins to your Training Code
Note that the documentation below applies if you wish to submit a training job through the UI. It does NOT apply if you rent GPUs via SSH, in which case you should refer to this section.
Here we share some examples and general guidance on how to train machine learning models on our platform using the `NetmindMixins` library. `NetmindMixins` is a proprietary library developed by NetMind.AI to facilitate training of machine learning models across our distributed platform ecosystem, the NetMind Power platform. It is pre-installed in all our environments, so you need not worry about installation.
Please specify any other library required by your project in a `requirements.txt` file and add it to the root folder of your project structure. This ensures that every listed library is installed before your code runs. If the file has a different name, it will not be recognized by our system.
Below are the instructions and general principles to follow in order to add our `NetmindMixins` framework to your existing training code. This is necessary to run your code successfully on our platform, as it allows us to monitor training progress, estimate training time, save checkpoints to the cloud, bill accounts appropriately, and distribute the workload across multiple machines. While detailed general instructions are shown below, if you want to see specific examples of how they are implemented in practice, please see the following example files (available in this repository) and notebooks (links below).
Files
`story_custom_trainer.py`: in this file, we show how to apply our `NetmindMixins` wrapper to a custom training function.
`story_hf_trainer_data_parallel.py`: in this file, we use the `Trainer` from `transformers` to do the training and use our `NetmindMixins` wrapper to implement data parallelism.
`story_hf_trainer_model_parallel.py`: in this file, we use the `Trainer` from `transformers` to do the training and use our `NetmindMixins` wrapper to implement model parallelism.
Notebooks
Custom Trainer: `story_custom_trainer.py`
HuggingFace Trainer: `story_hf_trainer.py`
Model Parallel: `story_hf_trainer_model_parallel.py`
If you convert a Google Colab notebook to a python file, remember to remove library install lines such as
!pip install ...
!git clone ...
and anything else which is not valid python code.
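The cleanup described above can be partly automated. The following is a minimal sketch, not an official tool: `strip_notebook_commands` is a hypothetical helper that drops lines starting with `!` (shell escapes), `%` (magics), or `get_ipython()` (the form some exporters convert them to). It is line-based, so it could also remove a legitimate continuation line that happens to begin with `%`, and you should review the result by hand.

```python
def strip_notebook_commands(source: str) -> str:
    """Return `source` without notebook-only lines (hypothetical helper).

    A line is dropped when, ignoring leading whitespace, it starts with
    '!' (shell escape), '%' (IPython magic) or 'get_ipython()'.
    """
    notebook_prefixes = ("!", "%", "get_ipython()")
    kept_lines = []
    for line in source.splitlines():
        if line.lstrip().startswith(notebook_prefixes):
            continue  # not valid Python outside a notebook
        kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


# Small in-memory example instead of a real exported notebook.
exported = (
    "!pip install transformers\n"
    "import os\n"
    "%matplotlib inline\n"
    "print('ready')\n"
)
cleaned = strip_notebook_commands(exported)
print(cleaned)
```

Any library removed this way still needs to be listed in `requirements.txt` so that it is installed before your code runs.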
The instructions given below vary based on whether you are implementing a custom training loop using the PyTorch framework or using the HuggingFace `Trainer` class.
The two examples should be generalizable and adaptable to the vast majority of use cases. However, if you feel they do not apply or cannot be applied to your codebase, feel free to join our Discord support channel or reach out at hello@netmind.ai and a member of our team will aim to provide more tailored guidance.
Example 1: Implementing your own Training Loop with PyTorch
Uploading Datasets
In the UI, you will be asked to upload your code and your dataset separately. The dataset upload supports a much larger size limit than the code upload. Both your code and dataset folders should be uploaded as zip files, unless the code consists of a single Python file, in which case it can be uploaded directly. If your project has a multi-folder structure, we ask that the "entry file" (the script that is run to train your model) be placed in the root folder of the project. This file can then import modules located in other folders. To access the dataset from your Python file, define a `data` training argument in your entry file; your data folder can then be accessed via `training_args.data`. See the example below.
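As a minimal sketch of defining such an argument, the snippet below uses the standard library's `argparse`. It assumes the platform passes the dataset location as a `--data` command-line flag; the flag spelling and the `setup_args` helper name are assumptions for illustration, not something this page confirms.

```python
import argparse


def setup_args(argv=None):
    """Parse the entry file's arguments (hypothetical helper).

    Assumption: the dataset folder arrives as a `--data` flag.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data",
        type=str,
        default=None,
        help="Path to the uploaded dataset folder",
    )
    # parse_known_args tolerates extra arguments instead of exiting,
    # which is useful if other flags are passed to the script as well.
    args, _unknown = parser.parse_known_args(argv)
    return args


# In an entry file you would call setup_args() with no arguments so that
# sys.argv is used; an explicit list is passed here to keep the example
# self-contained.
training_args = setup_args(["--data", "/path/to/data"])
print(training_args.data)
```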
The Power platform will then automatically pass the `data` argument to your code, where it can be accessed through `training_args.data`. Below are two examples showing how you can load your data via `training_args.data` (each codebase varies, so you will need to tailor this to your code).
If the folder you uploaded has a structure that can be loaded using the HuggingFace `datasets` library, you can load it in this way
If your uploaded data folder contains files within a subfolder that you need to access, for example the `data/SQuAD/train-v1.1.json` file, you can do it in this way
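A minimal sketch of the subfolder case follows, using only the standard library. It assumes that `training_args.data` points at the uploaded `data` folder itself, so the file is reached by joining the remaining relative path; `load_json_from_data_dir` is a hypothetical helper name, and a temporary directory stands in for the uploaded dataset.

```python
import json
import os
import tempfile


def load_json_from_data_dir(data_dir: str, relative_path: str):
    """Load a JSON file located at `relative_path` inside `data_dir`."""
    full_path = os.path.join(data_dir, relative_path)
    with open(full_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in for the uploaded dataset: a temporary folder containing
# SQuAD/train-v1.1.json with placeholder content.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "SQuAD"))
    placeholder_path = os.path.join(data_dir, "SQuAD", "train-v1.1.json")
    with open(placeholder_path, "w", encoding="utf-8") as f:
        json.dump({"version": "1.1", "data": []}, f)

    # In real code, data_dir would be training_args.data.
    loaded = load_json_from_data_dir(
        data_dir, os.path.join("SQuAD", "train-v1.1.json")
    )

print(loaded["version"])
```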
Initialization
These are the required import statements
and the library is initialized as follows
This initialization line should be placed in your code before the training loop.
Please do not call `torch.distributed.init_process_group` after `nmp.init`, as it is already called within `nmp.init` and calling it twice will raise an error.
Set `use_ddp=True` to use data parallelism and `use_ddp=False` to use model parallelism. At the moment, we support either model parallelism or data parallelism, but not both at the same time. We leave it to users, as the experts in their models, to decide which technique to use. If in doubt, we recommend starting with data parallelism and switching to model parallelism if you encounter a memory error (for example, your model cannot fit within the GPU RAM even with low batch sizes).
If you intend to run the model using data parallelism with `use_ddp=True` as shown above, you also need to move the model and any relevant tensors (such as input IDs) to device number zero, as shown below
We will take care of distributing your training across multiple devices. This step does not apply if you are using model parallelism.
Model and Optimizer Instantiation
After you have instantiated your machine learning model and optimizer in the usual way (which will depend on the specifics of your codebase), you should wrap both the model and the optimizer in our model and optimizer classes as shown below.
If you are using distributed data parallelism
If you are not using distributed data parallelism (for example you're using distributed model parallelism instead, or no parallelism)
Training
Before you run the training loop, you should initialize the training progress bar in place
where `max_steps` allows you to control the total number of training steps your model will run for. If you do not want to set a maximum number of steps, we suggest setting this parameter to a very large number, larger than the length of your training dataset multiplied by the number of training epochs, so that it will never be reached regardless of the batch size chosen.
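To see why that bound is safe, the sketch below compares the number of steps a full run takes with the suggested upper bound. It assumes one training step per batch; both helper names are hypothetical. Because every batch holds at least one example, the step count can never exceed the dataset length multiplied by the number of epochs.

```python
import math


def steps_for_full_run(num_examples: int, num_epochs: int, batch_size: int) -> int:
    """Batches per epoch (last partial batch included) times epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


def never_reached_max_steps(num_examples: int, num_epochs: int) -> int:
    """A value just above dataset length times epochs, as suggested in the text."""
    return num_examples * num_epochs + 1


actual_steps = steps_for_full_run(num_examples=1000, num_epochs=3, batch_size=32)
safe_max_steps = never_reached_max_steps(num_examples=1000, num_epochs=3)
print(actual_steps, safe_max_steps)
```

Even in the worst case of a batch size of 1, the run takes exactly the dataset length times the number of epochs, which is still below the suggested value.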
Inside the training loop you will need to call `nmp.step`, which allows us to monitor the training progress. Without this call, we cannot estimate the total training time, and you cannot monitor metrics such as the loss via our user interface. Where exactly you place this call depends on your code structure, but in general it should go at the end of each training step, after you have calculated the loss. If you have a standard structure with two `for` loops, one over epochs and one over training steps, your code should look like this
If you have both a training and an evaluation loop in your code, please make sure `nmp.step` is added to the training loop. You can also use `nmp.evaluate` outside the evaluation loop (if you have one) as shown below.
While `nmp.step` needs to be placed inside the training loop, at the end of the innermost loop after each training step, `nmp.evaluate` needs to be placed outside the evaluation loop, after it finishes. Adding `nmp.evaluate` is optional, but it helps us calculate the evaluation metrics and update the UI.
The variables shown above, such as `average_training_loss_per_batch`, are for illustration purposes, and you should adapt them based on your code.
At the very end of the training code (after all epochs have finished, outside all training loops), you must call `nmp.finish_training` in place, exactly once.
Your trained model checkpoints will then become available for you to download from the NetMind Power platform UI once the training is complete.
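The call placement described in this section can be sketched as follows. Because the real `nmp` object is only available on the platform, the sketch uses `RecordingNmp`, a hypothetical stand-in that merely records calls; passing a dictionary of metrics to `step` and `evaluate` is an assumption, and the "losses" are placeholder numbers rather than results of real training.

```python
class RecordingNmp:
    """Hypothetical stand-in for the platform's `nmp` object; records calls only."""

    def __init__(self):
        self.calls = []

    def step(self, metrics):
        self.calls.append(("step", dict(metrics)))

    def evaluate(self, metrics):
        self.calls.append(("evaluate", dict(metrics)))

    def finish_training(self):
        self.calls.append(("finish_training", None))


def run_training(nmp, num_epochs, train_batches, eval_batches):
    for _epoch in range(num_epochs):
        for batch in train_batches:
            loss = float(batch)  # placeholder for forward/backward pass
            nmp.step({"loss": loss})  # end of each training step

        # Evaluation loop, with the evaluate call placed after it.
        eval_losses = [float(batch) for batch in eval_batches]
        nmp.evaluate({"eval_loss": sum(eval_losses) / len(eval_losses)})

    nmp.finish_training()  # once, after all epochs, outside every loop


nmp = RecordingNmp()
run_training(nmp, num_epochs=2, train_batches=[0.9, 0.7, 0.5], eval_batches=[0.6, 0.4])
call_names = [name for name, _ in nmp.calls]
print(call_names)
```

With two epochs of three training batches each, the recorded order is three `step` calls followed by one `evaluate` call per epoch, and a single `finish_training` call at the end.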
Example 2: Using HuggingFace Trainer Class
Uploading Datasets and Setting Training Arguments
In the UI, you will be asked to upload your code and your dataset separately. The dataset upload supports a much larger size limit than the code upload. Both your code and dataset folders should be uploaded as zip files, unless the code consists of a single Python file, in which case it can be uploaded directly. If your project has a multi-folder structure, we ask that the "entry file" (the script that is run to train your model) be placed in the root folder of the project. This file can then import modules located in other folders. In your code you will need to define a `ModelTrainingArguments` class that inherits from the HuggingFace `transformers.TrainingArguments` class, as shown below.
This class sets the training arguments to be passed to the `Trainer`. As shown above, you should set `data` to `None` and `max_steps` as required; the latter allows you to control the total number of training steps your model will run for. If you do not want to set a maximum number of steps, we suggest setting this parameter to a very large number, larger than the length of your training dataset multiplied by the number of training epochs, so that it will never be reached regardless of the batch size chosen. Any other training argument you wish to pass to the `Trainer` class should be set within the `ModelTrainingArguments` class above.
If you uploaded a dataset, you can access it from your code through `training_args.data`. Below are two examples showing how you can load your data via `training_args.data` (each codebase varies, so you will need to tailor this to your code).
If the folder you uploaded has a structure that can be loaded using the HuggingFace `datasets` library, you can load it in this way
If your uploaded data folder contains files within a subfolder that you need to access, for example the `data/SQuAD/train-v1.1.json` file, you can do it in this way
Initialization
These are the required import statements
and the library is initialized as follows
This initialization line should be placed in your code before instantiating the HuggingFace `Trainer` class (see below).
Please do not call `torch.distributed.init_process_group` after `nmp.init`, as it is already called within `nmp.init` and calling it twice will raise an error.
Set `use_ddp=True` to use data parallelism and `use_ddp=False` to use model parallelism. At the moment, we support either model parallelism or data parallelism, but not both at the same time. We leave it to users, as the experts in their models, to decide which technique to use. If in doubt, we recommend starting with data parallelism and switching to model parallelism if you encounter a memory error (for example, your model cannot fit within the GPU RAM even with low batch sizes).
If you need to implement your own training and/or evaluation metrics, you should create a class inheriting from `NetmindTrainerCallback` and adapt it to your use case, as shown in the example below.
If you do not need to add any custom metrics for training or evaluation, you can skip the step above and use `NetmindTrainerCallback` directly when instantiating the `Trainer` (see explanation below).
Training
When instantiating the HuggingFace `Trainer`, you will need to set `args=training_args` (see above for how `training_args` was defined) and pass `callbacks=[CustomTrainerCallback]` as an argument to your `Trainer` class, as shown below. If you didn't define a `CustomTrainerCallback` class, you can set this argument to `callbacks=[NetmindTrainerCallback]`. After the training is done, you also need to call `nmp.finish_training` in place. See the example below.
Your trained model checkpoints will then become available for you to download from the NetMind Power platform UI once the training is complete.