This repository provides a PyTorch Lightning template for distributed training on Azure ML.
The template targets PyTorch Lightning 2.0 or higher.
PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research. It keeps your code neatly organized and provides many useful features, such as the ability to run a model on CPU, GPU, multi-GPU clusters, and TPU.
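As a quick illustration of that flexibility, the same `Trainer` object can switch between hardware targets through its constructor arguments alone. This is a minimal sketch assuming the `lightning` package (2.0+) is installed; `model` and `dm` stand in for your own `LightningModule` and `LightningDataModule`:

```python
# Hypothetical excerpt from src/trainer.py; assumes `pip install lightning`.
import lightning as L

# "auto" lets Lightning pick CPU, GPU, or TPU based on what is available.
trainer = L.Trainer(accelerator="auto", devices="auto", max_epochs=10)

# For multi-node training, the same code only needs different arguments, e.g.:
# trainer = L.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")

# trainer.fit(model, datamodule=dm)  # model/dm are your own modules
```

The training code itself does not change between these configurations, which is what makes the `src/` folder portable across compute targets.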
Azure also offers Azure OpenAI, which grew out of the partnership between Microsoft Azure and OpenAI. This collaboration provides a cloud-based platform designed to help developers and data scientists build and deploy AI models quickly. Through Azure OpenAI, users gain access to a comprehensive suite of AI tools and technologies for building intelligent applications that draw on natural language processing, computer vision, and deep learning.
To adhere to best practices, we store all Azure SDK-related code in separate Python files in the azure-jobs folder. Jobs can be seen as the connecting element between the compute cluster, the data asset components, and the PyTorch code. The native PyTorch code, in turn, lives in the src folder. As a result, if we decide to run our training on a different cloud provider, no modifications are required in the src folder. Here is an example of what the folder structure could look like:
```
.
├── README.md
├── azure-jobs/
│   ├── config/
│   │   └── workspace.json
│   └── job.py
└── src/
    ├── datamodule.py
    ├── model.py
    ├── transforms.py
    └── trainer.py
```
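The `workspace.json` file holds the connection details that the Azure ML client needs to find your workspace. A sketch of its standard shape, with placeholders instead of real identifiers:

```json
{
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>"
}
```

Keeping these values in a config file rather than hard-coding them in `job.py` makes it easy to point the same job script at a different workspace.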
I wrote this tutorial as a comprehensive guide to distributed training (multiple nodes, multiple GPUs per node) with PyTorch Lightning on Azure ML.
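To make the role of `azure-jobs/job.py` concrete, here is a hedged sketch of how a multi-node job could be configured and submitted with the Azure ML Python SDK v2 (`azure-ai-ml`). The compute name `gpu-cluster` and the environment string are hypothetical placeholders you would replace with your own resources:

```python
# azure-jobs/job.py — a configuration sketch, assuming `pip install azure-ai-ml azure-identity`.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Read subscription/resource group/workspace from config/workspace.json.
ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path="config/workspace.json",
)

# Describe the job: the entire src/ folder is uploaded, and only the
# distribution settings (not the PyTorch code) encode the cluster layout.
job = command(
    code="../src",
    command="python trainer.py",
    environment="<your-curated-or-custom-environment>@latest",  # placeholder
    compute="gpu-cluster",  # hypothetical compute cluster name
    instance_count=2,  # number of nodes
    distribution={"type": "PyTorch", "process_count_per_instance": 4},  # GPUs per node
)

# Submit the job to the workspace.
# ml_client.create_or_update(job)
```

Because the distribution settings live here, switching between single-GPU debugging and a multi-node run is a change to `job.py` only; the code in `src/` stays untouched.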