Welcome to the PyTorch Docker Assignment. This assignment is designed to help you understand and work with Docker and PyTorch.
In this assignment, you will:
- Create a Dockerfile for a PyTorch (CPU version) environment.
- Keep the size of your Docker image under 1GB (uncompressed).
- Train any model on the MNIST dataset inside the Docker container.
- Save the trained model checkpoint to the host file system (outside the container).
- Add an option to resume model training from a checkpoint.
The starter code in train.py provides a basic structure for loading data, defining a model, and running the training and testing loops. You will need to complete the code at the locations marked by TODO: comments.
When you have completed the assignment, push your code to your GitHub repository. The GitHub Actions workflow will automatically build your Docker image, run your training script, and check whether the assignment requirements have been met. Check the GitHub Actions tab for the results of these checks, and make sure all checks pass before you submit the assignment.
This repository contains a PyTorch implementation to train and test a neural network on the MNIST dataset. The model architecture includes convolutional and fully connected layers designed to classify images of handwritten digits (0-9). The script allows for customizable training options via command-line arguments.
- Customizable model training with configurable batch size, epochs, learning rate, and more.
- Model checkpointing for saving and resuming training from saved states.
- Logging of training progress and performance metrics during each epoch.
- Support for CUDA and macOS Metal (MPS) GPU acceleration (see the device-selection sketch after this list).
- Command-line argument parsing for ease of use.
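As a rough illustration of how the CUDA/MPS support above typically works, here is a minimal sketch with illustrative names; it is not the exact logic of the provided script:

```python
import torch

def select_device(no_cuda: bool = False, no_mps: bool = False) -> torch.device:
    """Prefer CUDA, then Apple MPS, and fall back to CPU otherwise."""
    if not no_cuda and torch.cuda.is_available():
        return torch.device("cuda")
    if not no_mps and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

if __name__ == "__main__":
    device = select_device()  # pass True to mimic --no-cuda / --no-mps
    print(f"Training on: {device}")
```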
- Python 3.7+
- PyTorch 1.9+ (1.12 or newer for MPS acceleration on macOS)
- Torchvision
- argparse (part of the Python standard library; no separate install needed)
Install the required dependencies with:
pip install torch torchvision
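To satisfy the image-size requirement of the Docker assignment, it is common to install the CPU-only builds; one typical approach (assuming PyTorch's documented CPU wheel index) is:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu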
To train the model from scratch, use the following command:
python train.py --batch-size 64 --epochs 14 --lr 1.0
The following arguments are supported to customize the training process:
- `--batch-size` (default: 64): Input batch size for training.
- `--test-batch-size` (default: 1000): Input batch size for testing.
- `--epochs` (default: 14): Number of epochs to train.
- `--lr` (default: 1.0): Learning rate for the optimizer.
- `--gamma` (default: 0.7): Learning rate step decay factor.
- `--no-cuda`: Disable CUDA (GPU) training even if CUDA is available.
- `--no-mps`: Disable macOS GPU (MPS) training.
- `--dry-run`: Run a quick single batch to check that the pipeline works.
- `--log-interval` (default: 10): Number of batches to wait before logging training status.
- `--save-model` (default: True): Save the model at each epoch.
- `--resume`: Resume training from the last saved checkpoint.
- `--seed` (default: 1): Seed for random number generation.
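A minimal sketch of how these flags could be declared with argparse (defaults mirror the list above; the provided train.py may differ in details such as help text):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="PyTorch MNIST training")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--test-batch-size", type=int, default=1000)
    parser.add_argument("--epochs", type=int, default=14)
    parser.add_argument("--lr", type=float, default=1.0)
    parser.add_argument("--gamma", type=float, default=0.7)
    parser.add_argument("--no-cuda", action="store_true")    # disable CUDA even if available
    parser.add_argument("--no-mps", action="store_true")     # disable macOS MPS backend
    parser.add_argument("--dry-run", action="store_true")    # single-batch sanity check
    parser.add_argument("--log-interval", type=int, default=10)
    parser.add_argument("--save-model", action="store_true", default=True)
    parser.add_argument("--resume", action="store_true")     # resume from model_checkpoint.pth
    parser.add_argument("--seed", type=int, default=1)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)
```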
To resume training from a saved checkpoint, use the `--resume` flag:
python train.py --resume
Ensure that the `model_checkpoint.pth` file is present in the current directory.
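A minimal sketch of the resume logic this implies (the checkpoint key names are assumptions for illustration and may differ from the actual script):

```python
import os
import torch

def maybe_resume(model, optimizer, path="model_checkpoint.pth"):
    """Restore model/optimizer state if a checkpoint exists; return the next epoch."""
    start_epoch = 1
    if os.path.exists(path):
        checkpoint = torch.load(path, map_location="cpu")
        model.load_state_dict(checkpoint["model_state_dict"])          # assumed key name
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # assumed key name
        start_epoch = checkpoint["epoch"] + 1                          # assumed key name
        print(f"Resumed from {path}, continuing at epoch {start_epoch}")
    return start_epoch
```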
Example of training with additional options:
python train.py --batch-size 64 --epochs 10 --lr 0.1 --gamma 0.9 --log-interval 20 --save-model
The model consists of the following layers:
- Two convolutional layers (`conv1` and `conv2`)
- Two dropout layers (`dropout1` and `dropout2`) to prevent overfitting
- Two fully connected layers (`fc1` and `fc2`)
- A log-softmax output for multi-class classification
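A sketch of a network with the layers listed above, in the spirit of the standard PyTorch MNIST example (the exact filter counts and hidden sizes are assumptions, not read from this repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)   # grayscale input, 32 filters
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)   # 64 channels * 12 * 12 after conv + pooling
        self.fc2 = nn.Linear(128, 10)     # 10 digit classes

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)    # log-probabilities, paired with the NLL loss
```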
The model is trained using the Negative Log-Likelihood (NLL) loss function and optimized using the Adadelta optimizer. The script implements both a training loop and a testing loop to evaluate model performance on the MNIST test set after each epoch.
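A condensed sketch of such training and testing loops (data loading and argument handling omitted; the function signatures are illustrative, not necessarily those of the starter code):

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, device, train_loader, optimizer, epoch, log_interval=10):
    """One pass over the training data using the negative log-likelihood loss."""
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)               # log-probabilities from log_softmax
        loss = F.nll_loss(output, target)  # NLL loss
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print(f"Epoch {epoch}, batch {batch_idx}: loss {loss.item():.6f}")

def evaluate(model, device, test_loader):
    """Report average loss and accuracy on the MNIST test set."""
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += F.nll_loss(output, target, reduction="sum").item()
            correct += (output.argmax(dim=1) == target).sum().item()
    n = len(test_loader.dataset)
    print(f"Test set: average loss {total_loss / n:.4f}, accuracy {correct}/{n}")

# Adadelta optimizer with a per-epoch step decay (the --gamma factor), assuming `model` exists:
#   optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
#   scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.7)
#   (call scheduler.step() once per epoch)
```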
Training logs are printed periodically based on the `--log-interval` argument, showing the progress and loss for each batch.
The model, optimizer state, and current epoch are saved after each epoch in `model_checkpoint.pth`. This allows you to resume training from where you left off using the `--resume` flag.
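A minimal sketch of writing such a checkpoint (the key names are illustrative):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="model_checkpoint.pth"):
    """Persist model weights, optimizer state, and the current epoch."""
    torch.save(
        {
            "epoch": epoch,                                  # assumed key name
            "model_state_dict": model.state_dict(),          # assumed key name
            "optimizer_state_dict": optimizer.state_dict(),  # assumed key name
        },
        path,
    )
```

When training inside the Docker container, write the checkpoint to a directory that is bind-mounted from the host (for example via docker run -v), so the file ends up on the host file system as the assignment requires.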