- Knowledge Distillation (KD) - PyTorch
- PyTorch implementation of "Distilling the Knowledge in a Neural Network" (Hinton et al., 2015)
- This repository is forked from pytorch-cifar.
- Dataset: CIFAR10
- Teacher Network: VGG16
- Student Network: CNN with 3 convolutional blocks
- Python 3.6+
- PyTorch 1.0+
- See the `docker/` folder.
- Teacher network (VGG16) result: 91.90%
- SGD optimizer, no weight decay.
- Learning rate schedule:
  - 0.1 for epochs [1, 150]
  - 0.01 for epochs [151, 250]
  - 0.001 for epochs [251, 300]
```
python -m pretrainer --optimizer=sgd --lr=0.1 --start_epoch=1 --n_epoch=150 --model_name=ckpt
python -m pretrainer --optimizer=sgd --lr=0.01 --start_epoch=151 --n_epoch=100 --model_name=ckpt --resume
python -m pretrainer --optimizer=sgd --lr=0.001 --start_epoch=251 --n_epoch=50 --model_name=ckpt --resume
```
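The three staged runs implement the step decay by hand via `--resume`. For reference, a minimal sketch of the same schedule using PyTorch's built-in `MultiStepLR` (the `nn.Linear` model and the elided training loop are placeholders, not this repo's code):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)  # placeholder; the actual teacher is VGG16
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.0)
# Multiply the LR by 0.1 after epochs 150 and 250, reproducing:
# 0.1 for epochs [1, 150], 0.01 for [151, 250], 0.001 for [251, 300].
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 250], gamma=0.1)

for epoch in range(1, 301):
    # ... one epoch of training here ...
    scheduler.step()
```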
- Student network trained from scratch, using the Adam optimizer for a fair comparison.
- max epoch: 300
- learning rate: 0.0001
- no weight decay, also for a fair comparison.
- Result: 85.01%
```
python -m pretrainer --optimizer=adam --lr=0.0001 --start_epoch=1 --n_epoch=300 --model_name=student-scratch --network=studentnet
```
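The student is only described above as a CNN with 3 convolutional blocks; the actual `studentnet` definition lives in the repo and may differ. A hypothetical sketch of such a network for 32x32 CIFAR10 inputs (channel widths and pooling layout are assumptions):

```python
import torch.nn as nn

class StudentNet(nn.Module):
    """Hypothetical 3-block student; widths and layout are assumptions."""

    def __init__(self, num_classes=10):
        super().__init__()

        def block(cin, cout):
            # conv -> BN -> ReLU -> 2x2 max-pool halves the spatial size
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        # 32x32 -> 16x16 -> 8x8 -> 4x4
        self.features = nn.Sequential(block(3, 64), block(64, 128), block(128, 256))
        self.classifier = nn.Linear(256 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```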
- The two KD modes, `cse` and `mse`, show similar performance at T=1.0, alpha=1.0:
```
python -m trainer --T=1.0 --alpha=1.0 --kd_mode=cse # 84.99%
python -m trainer --T=1.0 --alpha=1.0 --kd_mode=mse # 84.85%
```
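The flag names suggest `cse` is cross-entropy against the teacher's softened distribution and `mse` is mean squared error on raw logits (the matching-logits special case from the paper); the repo's exact definitions may differ. A sketch:

```python
import torch.nn.functional as F

def kd_loss_cse(student_logits, teacher_logits, T=1.0):
    # Cross-entropy between the teacher's and student's softened distributions.
    # teacher_logits are assumed to be produced under torch.no_grad().
    # The paper scales this term by T**2 so gradient magnitudes stay comparable
    # across temperatures; whether this repo does the same is an assumption.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T ** 2)

def kd_loss_mse(student_logits, teacher_logits):
    # Directly regress the teacher's logits; no temperature needed.
    return F.mse_loss(student_logits, teacher_logits)
```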
- alpha=0.5 may perform better than alpha=1.0:
```
python -m trainer --T=1.0 --alpha=1.0 --kd_mode=cse # 84.99%
python -m trainer --T=1.0 --alpha=0.5 --kd_mode=cse # 85.38%
python -m trainer --T=1.0 --alpha=1.0 --kd_mode=mse # 84.85%
python -m trainer --T=1.0 --alpha=0.5 --kd_mode=mse # 84.92%
```
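Presumably `alpha` weights the distillation term against the ordinary hard-label loss, so `alpha=1.0` trains on the teacher signal alone. A sketch of the likely combination, reusing `kd_loss_cse` from the sketch above:

```python
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)         # ground-truth term
    soft = kd_loss_cse(student_logits, teacher_logits, T)  # distillation term
    return alpha * soft + (1.0 - alpha) * hard
```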
- Higher temperatures generally perform better, consistent with the paper:
```
python -m trainer --T=1.0 --alpha=0.5 --kd_mode=cse # 85.38%
python -m trainer --T=2.0 --alpha=0.5 --kd_mode=cse # 85.27%
python -m trainer --T=4.0 --alpha=0.5 --kd_mode=cse # 86.46%
python -m trainer --T=8.0 --alpha=0.5 --kd_mode=cse # 86.33%
python -m trainer --T=16.0 --alpha=0.5 --kd_mode=cse # 86.58%
```
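Intuitively, a higher T flattens the teacher's output distribution, so the student also sees the relative probabilities of the wrong classes (the paper's "dark knowledge"). A quick illustration with made-up logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, -1.0])
for T in (1.0, 4.0, 16.0):
    print(T, F.softmax(logits / T, dim=0))
# T=1.0  -> ~[0.95, 0.05, 0.00]: nearly one-hot
# T=16.0 -> ~[0.40, 0.33, 0.27]: class similarities become visible
```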
- alpha=0.5 appears to be a local optimum:
```
python -m trainer --T=16.0 --alpha=0.1 --kd_mode=cse # 85.69%
python -m trainer --T=16.0 --alpha=0.3 --kd_mode=cse # 86.48%
python -m trainer --T=16.0 --alpha=0.5 --kd_mode=cse # 86.58%
python -m trainer --T=16.0 --alpha=0.7 --kd_mode=cse # 86.16%
python -m trainer --T=16.0 --alpha=0.9 --kd_mode=cse # 86.08%
```
- The teacher's SGD schedule (`sgd-cifar10`) improves both the distilled student and the from-scratch baseline:
```
python -m trainer --T=16.0 --alpha=0.5 --kd_mode=cse --optimizer=sgd-cifar10 # 87.04%
python -m pretrainer --model_name=student-scratch-sgd-cifar10 --network=studentnet --optimizer=sgd-cifar10 # 86.34%
```
- fix seed.
- multi-GPU handling.
- split validation set.
- experiments with 5 random seeds.
- remove code redundancy.
- check whether the optimal T equals the calibrated T (see the sketch after this list).
- Progress bar code fix in `trainer.py`.
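For the calibrated-T item above: a standard way to obtain a calibrated temperature is to fit a single scalar T that minimizes NLL on held-out validation logits (temperature scaling, Guo et al., 2017). A minimal sketch; `val_logits` and `val_labels` are placeholders, not names from this repo:

```python
import torch
import torch.nn.functional as F

def calibrate_temperature(val_logits, val_labels):
    # Optimize log T so that T = exp(log_t) stays positive.
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```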