Decoupled Knowledge Distillation (DKD) for simple models

This repo is based on the CVPR 2022 paper: Decoupled Knowledge Distillation.

Framework & Performance



Benchmark results from the DKD paper on CIFAR-100 (top-1 accuracy, %).

Same-style teacher/student pairs:

| Teacher | ResNet56 | ResNet110 | ResNet32x4 | WRN-40-2 | WRN-40-2 | VGG13 |
|---------|----------|-----------|------------|----------|----------|-------|
| Student | ResNet20 | ResNet32  | ResNet8x4  | WRN-16-2 | WRN-40-1 | VGG8  |
| KD      | 70.66    | 73.08     | 73.33      | 74.92    | 73.54    | 72.98 |
| DKD     | 71.97    | 74.11     | 76.32      | 76.23    | 74.81    | 74.68 |

Heterogeneous teacher/student pairs:

| Teacher | ResNet32x4    | WRN-40-2      | VGG13        | ResNet50     | ResNet32x4    |
|---------|---------------|---------------|--------------|--------------|---------------|
| Student | ShuffleNet-V1 | ShuffleNet-V1 | MobileNet-V2 | MobileNet-V2 | ShuffleNet-V2 |
| KD      | 74.07         | 74.83         | 67.37        | 67.35        | 74.45         |
| DKD     | 76.45         | 76.70         | 69.71        | 70.35        | 77.07         |

Environment

  • Python 3.8.10
  • PyTorch 2.0.0
  • CUDA 11.7

Motivation

  • The main motivation for this experiment is to measure the performance of decoupled knowledge distillation on very simple, lightweight models and compare it with classical knowledge distillation, as introduced in the classical paper Distilling the Knowledge in a Neural Network. The DKD paper did not explore very simple models with fewer than 10 layers.

  • For the experiment, I chose a 5-layer architecture for the teacher model (2 convolutional + 3 fully connected layers); a PyTorch sketch of both models follows this list:

    • For regularization:
      • I added dropout with p = 0.15 after flattening the intermediate features
      • I added batch normalization after each convolution and pooling block
    • The architecture is as follows:
      • conv2d->maxpool->batchnorm->conv2d->maxpool->batchnorm->flatten->dropout->fc+relu->fc+relu->fc
  • I kept the student model as a slightly smaller architecture (2 convolutional + 2 fully connected layers):

    • The architecture is as follows:
      • conv2d->maxpool->conv2d->maxpool->flatten->fc+relu->fc
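
A minimal PyTorch sketch of the two architectures described above. The layer ordering follows the chains listed; channel widths, kernel sizes, and hidden-layer sizes are illustrative assumptions, since they are not specified in this README:

```python
import torch.nn as nn

class TeacherNet(nn.Module):
    """Teacher: 2 conv + 3 FC layers, with batch norm and dropout (p=0.15)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # conv2d
            nn.MaxPool2d(2),                              # maxpool: 32x32 -> 16x16
            nn.BatchNorm2d(32),                           # batchnorm
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # conv2d
            nn.MaxPool2d(2),                              # maxpool: 16x16 -> 8x8
            nn.BatchNorm2d(64),                           # batchnorm
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # flatten
            nn.Dropout(p=0.15),                           # dropout
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),        # fc+relu (width assumed)
            nn.Linear(256, 128), nn.ReLU(),               # fc+relu (width assumed)
            nn.Linear(128, num_classes),                  # fc
        )

    def forward(self, x):
        return self.classifier(self.features(x))

class StudentNet(nn.Module):
    """Student: 2 conv + 2 FC layers, no batch norm or dropout."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # conv2d
            nn.MaxPool2d(2),                              # maxpool
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # conv2d
            nn.MaxPool2d(2),                              # maxpool
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # flatten
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),        # fc+relu (width assumed)
            nn.Linear(128, num_classes),                  # fc
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```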

Evaluation

  • I evaluate the models on the CIFAR-10 dataset and measure the mean validation accuracy.
  • I compare the student model trained with decoupled knowledge distillation, with classical knowledge distillation, and with no distillation; a sketch of the training loop follows this list.
  • Some of the results can be viewed in checkpoints.
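
A minimal sketch of the training and evaluation loop, assuming an Adam optimizer and a plain sum of the hard-label cross-entropy and the distillation term (the README specifies neither). `distill_loss` stands in for the KD or DKD objective sketched under Results:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_student(student, teacher, distill_loss, device="cpu",
                  epochs=30, batch_size=64):
    # 30 epochs and batch size 64, as used in the experiments below.
    tfm = transforms.ToTensor()
    train_set = datasets.CIFAR10("./data", train=True, download=True, transform=tfm)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)  # optimizer/lr assumed

    student.to(device).train()
    teacher.to(device).eval()  # the frozen teacher only provides soft targets
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                logits_t = teacher(x)
            logits_s = student(x)
            # Hard-label loss plus distillation term; for the No-KD baseline,
            # drop the second term.
            loss = F.cross_entropy(logits_s, y) + distill_loss(logits_s, logits_t, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

def evaluate(model, device="cpu", batch_size=64):
    # Validation accuracy (%) on the CIFAR-10 test split.
    test_set = datasets.CIFAR10("./data", train=False, download=True,
                                transform=transforms.ToTensor())
    loader = DataLoader(test_set, batch_size=batch_size)
    model.to(device).eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.size(0)
    return 100.0 * correct / total
```

For a DKD run, `distill_loss` would be, e.g., `lambda s, t, y: dkd_loss(s, t, y, a=0.7, b=0.25, T=1.0)`.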

Results

  • For both decoupled knowledge distillation and classical knowledge distillation, the hyperparameters are alpha (a), beta (b), and temperature (T); both loss functions are sketched below the results table.
  • For DKD, setting T > 1.0 (e.g., T = 2.0) makes the loss NaN and drops accuracy to 1.0%. I tried a range of alpha and beta values for DKD, and the best accuracy found was 59.27%.
  • Making the DKD loss rely entirely on TCKD (target-class knowledge distillation) also drops accuracy. There is no prior explanation of why such hyperparameter values give NaN losses or leave accuracy unchanged. The DKD paper reports performance increasing as beta is raised through 2.0, 4.0, and 8.0, but those results were obtained with more complex teacher-student architectures.
  • Experimenting with a = 0.5 and a > 1.0 also showed worse performance for the student model.
  • Training used 30 epochs, because the models otherwise overfit and validation accuracy starts decreasing. Based on experimentation trials, a batch size of 64 best suited the training.

CIFAR-10 validation accuracy (%):

| DKD student                | Accuracy | KD student        | Accuracy | No-KD student | Accuracy |
|----------------------------|----------|-------------------|----------|---------------|----------|
| DKD (T=1.0, a=1.0, b=2.0)  | 56.28    | KD (T=5.0, a=0.5) | 57.92    | No-KD         | 56.49    |
| DKD (T=1.0, a=1.0, b=1.5)  | 53.66    | KD (T=5.0, a=0.5) | 57.92    | No-KD         | 56.49    |
| DKD (T=1.0, a=1.0, b=2.5)  | 36.71    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
| DKD (T=1.0, a=0.7, b=1.5)  | 52.84    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
| DKD (T=1.0, a=0.7, b=1.0)  | 56.07    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
| DKD (T=1.0, a=0.7, b=0.75) | 56.84    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
| DKD (T=1.0, a=0.7, b=0.50) | 58.15    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
| DKD (T=1.0, a=0.7, b=0.25) | 59.27    | KD (T=1.0, a=0.5) | 59.87    | No-KD         | 56.74    |
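
The repo's exact loss code is not reproduced here; the following is a minimal sketch of the two objectives as defined in the KD and DKD papers. TCKD is the KL divergence over the binary target/non-target distribution, and NCKD is the KL divergence over the non-target classes, obtained by masking the target logit with a large negative offset before the softmax (a trick used in the DKD authors' reference implementation). How `a` weights the KD term against the cross-entropy loss is an assumption:

```python
import torch
import torch.nn.functional as F

def kd_loss(logits_s, logits_t, target, T=1.0, a=0.5):
    # Classical KD (Hinton et al.): KL divergence between temperature-softened
    # distributions, scaled by T^2 and weighted by a (weighting assumed).
    log_p_s = F.log_softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)
    return a * F.kl_div(log_p_s, p_t, reduction="batchmean") * (T ** 2)

def dkd_loss(logits_s, logits_t, target, a=1.0, b=2.0, T=1.0):
    # Decoupled KD (Zhao et al.): a * TCKD + b * NCKD.
    gt_mask = F.one_hot(target, num_classes=logits_s.size(1)).bool()

    # TCKD: KL over the binary (target vs. all non-target) distribution.
    p_s = F.softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)
    pt_s = p_s[gt_mask].unsqueeze(1)              # (B, 1) target-class probability
    pt_t = p_t[gt_mask].unsqueeze(1)
    bin_s = torch.cat([pt_s, 1.0 - pt_s], dim=1)  # (B, 2)
    bin_t = torch.cat([pt_t, 1.0 - pt_t], dim=1)
    # Note: log(0) = -inf if a probability saturates, making the loss
    # non-finite; this may relate to the NaN behaviour reported above.
    tckd = F.kl_div(torch.log(bin_s), bin_t, reduction="batchmean") * (T ** 2)

    # NCKD: KL over the non-target classes only; the -1000 offset effectively
    # removes the target class from the softmax.
    log_p_s_nt = F.log_softmax(logits_s / T - 1000.0 * gt_mask.float(), dim=1)
    p_t_nt = F.softmax(logits_t / T - 1000.0 * gt_mask.float(), dim=1)
    nckd = F.kl_div(log_p_s_nt, p_t_nt, reduction="batchmean") * (T ** 2)

    return a * tckd + b * nckd
```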


  • My results show that it is not always the case that decoupled knowledge distillation outperforms classical knowledge distillation. One reason could lie in improper tuning of the hyperparameters. It is also difficult to determine the weights/coefficients for the TCKD and NCKD losses, as there is not much study of the importance of each loss component. Finally, the extreme simplicity of the model architectures could hamper DKD's performance.

  • Out of curiosity, I also tested one of the paper's benchmarks with ResNet-50 (teacher) and MobileNet-V2 (student) on the CIFAR-100 dataset, training for 240 epochs with batch size 64. The procedure closely followed the paper and gave positive results (top-1 accuracy, %):

| Method | ResNet50 → MobileNet-V2 |
|--------|-------------------------|
| No-KD  | 52.81                   |
| KD     | 56.41                   |
| DKD    | 57.89                   |


Citation

```bibtex
@article{zhao2022dkd,
  title={Decoupled Knowledge Distillation},
  author={Zhao, Borui and Cui, Quan and Song, Renjie and Qiu, Yiyu and Liang, Jiajun},
  journal={arXiv preprint arXiv:2203.08679},
  year={2022}
}
```

License

This project is under the MIT license. See LICENSE for details.
