This project implements an image captioning system on the COCO 2017 dataset. It pairs a pre-trained ResNet-50 encoder with one of three recurrent decoder architectures (RNN, GRU, or LSTM) for caption generation.
```
├── coco_dataset/
│   ├── annotations/
│   │   ├── captions_train2017.json
│   │   └── captions_val2017.json
│   ├── train2017/
│   └── val2017/
├── models/
│   ├── EncoderCNN.py
│   └── DecoderModels.py
├── train.py
├── test.py
└── requirements.txt
```
The dependencies listed in `requirements.txt`:

```
torch
torchvision
transformers
pycocotools
nltk
pillow
tqdm
numpy
matplotlib
pycocoevalcap
```
Install dependencies:

```
pip install -r requirements.txt
```
The project uses the COCO 2017 dataset. Download it from the [COCO website](https://cocodataset.org/#download) and place it in the `coco_dataset` directory.

Alternatively, a shell script is provided that does this for you; run it to download the dataset:

```
./setup_coco_caption_dataset.sh
```
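For reference, a minimal Python sketch of the download step using the official COCO 2017 URLs; the provided shell script's exact behavior may differ:

```python
import urllib.request
import zipfile
from pathlib import Path

# Official COCO 2017 download URLs (images and caption annotations).
# Note: train2017.zip is roughly 18 GB.
COCO_URLS = [
    "http://images.cocodataset.org/zips/train2017.zip",
    "http://images.cocodataset.org/zips/val2017.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
]

def download_coco(root: str = "coco_dataset") -> None:
    """Download and extract COCO 2017 into `root`."""
    dest = Path(root)
    dest.mkdir(exist_ok=True)
    for url in COCO_URLS:
        archive = dest / url.rsplit("/", 1)[-1]
        if not archive.exists():
            print(f"Downloading {url} ...")
            urllib.request.urlretrieve(url, archive)
        # The zips extract to train2017/, val2017/, and annotations/.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)

if __name__ == "__main__":
    download_coco()
```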
- Encoder:
  - Pre-trained ResNet-50
  - Feature dimension: 2048 → embed_size
- Decoders:
  - RNN: Basic recurrent neural network
  - GRU: Gated Recurrent Unit
  - LSTM: Long Short-Term Memory
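For orientation, a minimal sketch of the encoder → decoder data flow, using the default sizes listed at the end of this README (constructor signatures and `vocab_size` are assumptions):

```python
import torch
from models.EncoderCNN import EncoderCNN
from models.DecoderModels import DecoderLSTM

embed_size, hidden_size, vocab_size = 256, 512, 10000  # vocab_size is illustrative

encoder = EncoderCNN(embed_size)                        # ResNet-50: 2048 -> embed_size
decoder = DecoderLSTM(embed_size, hidden_size, vocab_size)

images = torch.randn(4, 3, 224, 224)                    # a batch of preprocessed images
captions = torch.randint(0, vocab_size, (4, 20))        # tokenized reference captions
features = encoder(images)                              # (batch_size, embed_size)
logits = decoder(features, captions)                    # per-step vocabulary logits
```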
`models/EncoderCNN.py` implements the encoder using a pre-trained ResNet-50 model.

Classes:
- `EncoderCNN`: Extracts image features using ResNet-50
  - Input: Images `(batch_size, 3, 224, 224)`
  - Output: Image features `(batch_size, embed_size)`
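A sketch of what such an encoder typically looks like, given the stated input/output shapes; freezing the backbone and the batch-norm on the projection are assumptions, not necessarily what `models/EncoderCNN.py` does:

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final classification layer, keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False          # assumption: backbone is frozen
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)  # 2048 -> embed_size
        self.bn = nn.BatchNorm1d(embed_size)

    def forward(self, images):
        # images: (batch_size, 3, 224, 224) -> features: (batch_size, embed_size)
        feats = self.backbone(images).flatten(1)
        return self.bn(self.fc(feats))
```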
`models/DecoderModels.py` contains the decoder architectures for caption generation.

Classes:
- `DecoderRNN`: Basic RNN decoder
- `DecoderGRU`: GRU-based decoder
- `DecoderLSTM`: LSTM-based decoder

Each decoder class implements:
- `forward()`: Training forward pass
- `sample()`: Caption generation for inference (see the sketch below)
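As a rough guide, here is what the LSTM variant could look like, assuming the widely used pattern of feeding the image feature as the first input step and greedy decoding in `sample()` (constructor arguments and the end-token id are assumptions, not taken from `models/DecoderModels.py`):

```python
import torch
import torch.nn as nn

class DecoderLSTM(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: prepend the image feature, drop the final token.
        emb = self.embed(captions[:, :-1])                   # (B, T-1, embed)
        inputs = torch.cat([features.unsqueeze(1), emb], 1)  # (B, T, embed)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                               # (B, T, vocab)

    @torch.no_grad()
    def sample(self, features, max_len=20, end_token=2):     # end-token id is assumed
        inputs, states, ids = features.unsqueeze(1), None, []
        for _ in range(max_len):
            hidden, states = self.lstm(inputs, states)
            logits = self.fc(hidden.squeeze(1))              # greedy decoding
            predicted = logits.argmax(dim=1)
            ids.append(predicted)
            if (predicted == end_token).all():
                break
            inputs = self.embed(predicted).unsqueeze(1)
        return torch.stack(ids, 1)                           # (B, <=max_len)
```

The GRU and RNN variants differ only in the recurrent cell (`nn.GRU` / `nn.RNN` in place of `nn.LSTM`).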
`train.py` is the main training script for the image captioning models.

Key Components:
- `CocoDataset`: Custom dataset class for COCO
- `train_model()`: Training loop implementation (sketched after this list)
  - Inputs:
    - `encoder`: CNN encoder model
    - `decoder`: RNN decoder model
    - `data_loader`: Training data loader
    - `num_epochs`: Number of training epochs
  - Outputs:
    - Saved model checkpoints
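A condensed sketch of what a training loop with these inputs and outputs might look like; the loss, optimizer, and per-epoch checkpointing are assumptions, and the actual `train_model()` may differ:

```python
import torch
import torch.nn as nn

def train_model(encoder, decoder, data_loader, num_epochs, model_type="lstm",
                lr=0.001, device="cuda" if torch.cuda.is_available() else "cpu"):
    encoder, decoder = encoder.to(device), decoder.to(device)
    criterion = nn.CrossEntropyLoss()          # assumption: token-level CE loss
    params = [p for p in list(encoder.parameters()) + list(decoder.parameters())
              if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)

    for epoch in range(1, num_epochs + 1):
        for images, captions in data_loader:
            images, captions = images.to(device), captions.to(device)
            features = encoder(images)
            outputs = decoder(features, captions)            # (B, T, vocab)
            loss = criterion(outputs.reshape(-1, outputs.size(-1)),
                             captions.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Checkpoints follow the naming scheme described later in this README.
        torch.save(encoder.state_dict(), f"checkpoints/encoder-{model_type}-{epoch}.pkl")
        torch.save(decoder.state_dict(), f"checkpoints/decoder-{model_type}-{epoch}.pkl")
```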
`test.py` is the evaluation script for trained models.

Key Functions:
- `generate_caption()`: Generates a caption for a single image (see the sketch below)
- `calculate_metrics()`: Computes evaluation metrics
  - BLEU-1/2/3/4
  - METEOR
  - CIDEr
  - ROUGE
- `visualize_prediction()`: Creates visualizations of predictions
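A sketch of how `generate_caption()` might tie the encoder and decoder together; the preprocessing pipeline, the `idx2word` vocabulary mapping, and the special-token names are assumptions:

```python
import torch
from torchvision import transforms
from PIL import Image

# Standard ImageNet preprocessing, matching the ResNet-50 encoder's expectations.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def generate_caption(image_path, encoder, decoder, idx2word, device="cpu"):
    encoder.eval(); decoder.eval()
    image = preprocess(Image.open(image_path).convert("RGB"))
    image = image.unsqueeze(0).to(device)            # (1, 3, 224, 224)
    with torch.no_grad():
        features = encoder(image)                    # (1, embed_size)
        token_ids = decoder.sample(features)[0]      # greedy-decoded token ids
    words = [idx2word[int(i)] for i in token_ids]
    # Strip special tokens; the exact token names are an assumption.
    return " ".join(w for w in words if w not in {"<start>", "<end>", "<pad>"})
```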
```
python train.py
```

This will train all decoder variants sequentially.
```
python test.py
```

This evaluates all trained models and computes metrics.
Checkpoints are saved in the following format:
- Encoder: `checkpoints/encoder-{model_type}-{epoch}.pkl`
- Decoder: `checkpoints/decoder-{model_type}-{epoch}.pkl`

where `model_type` is one of `['rnn', 'gru', 'lstm']`.
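A sketch of reloading one checkpoint pair under this naming scheme, assuming the files hold `state_dict`s (sizes follow the defaults listed at the end of this README; `vocab_size` is illustrative):

```python
import torch
from models.EncoderCNN import EncoderCNN
from models.DecoderModels import DecoderLSTM

model_type, epoch = "lstm", 5
encoder = EncoderCNN(embed_size=256)
decoder = DecoderLSTM(embed_size=256, hidden_size=512, vocab_size=10000)

encoder.load_state_dict(torch.load(f"checkpoints/encoder-{model_type}-{epoch}.pkl"))
decoder.load_state_dict(torch.load(f"checkpoints/decoder-{model_type}-{epoch}.pkl"))
```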
Results are saved in:
- Model predictions: `results/prediction_{name}_{random_id}.png`
- Metrics printed for each model:
  - BLEU scores (1-4)
  - METEOR score
  - CIDEr score
  - ROUGE score
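A sketch of a `visualize_prediction()`-style helper that writes a figure under the naming scheme above; its body is an assumption, not the repository's implementation:

```python
import random
import matplotlib
matplotlib.use("Agg")                      # render without a display
import matplotlib.pyplot as plt
from PIL import Image

def visualize_prediction(image_path, caption, name):
    random_id = random.randint(0, 99999)
    fig, ax = plt.subplots()
    ax.imshow(Image.open(image_path))
    ax.axis("off")
    ax.set_title(caption, wrap=True)       # show the generated caption above the image
    fig.savefig(f"results/prediction_{name}_{random_id}.png", bbox_inches="tight")
    plt.close(fig)
```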
Default hyperparameters:
- Embedding size: 256
- Hidden size: 512
- Batch size: 128
- Learning rate: 0.001
- Number of epochs: 5
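The same values collected as a Python snippet for convenience (constant names are illustrative, not necessarily those used in `train.py`):

```python
# Default hyperparameters used across all decoder variants.
EMBED_SIZE = 256      # encoder output / word embedding dimension
HIDDEN_SIZE = 512     # recurrent hidden state dimension
BATCH_SIZE = 128
LEARNING_RATE = 0.001
NUM_EPOCHS = 5
```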