Wired behaviors of AdaHessian on ResNext-50 #11

XuezheMax · 2020-11-18T08:38:13Z

Hi,

Thanks for this great work. Recently, we tried to train ResNext-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.

However, I got some wired observations. Please see the training log:

Epoch: 1/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 6.1249, top1: 2.74%, top5: 8.40%, time: 9660.5s

Avg  loss: 4.7754, top1: 10.54%, top5: 27.53%

Best loss: 4.7754, top1: 10.54%, top5: 27.53%, epoch: 1

Epoch: 2/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 4.2148, top1: 18.27%, top5: 38.85%, time: 9638.9s

Avg  loss: 3.4256, top1: 27.41%, top5: 53.10%

Best loss: 3.4256, top1: 27.41%, top5: 53.10%, epoch: 2

Epoch: 3/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 3.3622, top1: 30.28%, top5: 55.08%, time: 9635.2s

Avg  loss: 2.7773, top1: 38.40%, top5: 65.36%

Best loss: 2.7773, top1: 38.40%, top5: 65.36%, epoch: 3

Epoch: 4/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.9959, top1: 36.21%, top5: 61.72%, time: 9636.2s

Avg  loss: 2.6380, top1: 40.47%, top5: 67.98%

Best loss: 2.6380, top1: 40.47%, top5: 67.98%, epoch: 4

Epoch: 5/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.8171, top1: 39.26%, top5: 64.87%, time: 9630.8s

Avg  loss: 2.5880, top1: 41.73%, top5: 68.91%

Best loss: 2.5880, top1: 41.73%, top5: 68.91%, epoch: 5

Epoch: 6/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.7149, top1: 41.07%, top5: 66.66%, time: 9640.7s

Avg  loss: 2.3805, top1: 45.68%, top5: 72.20%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 7/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.6456, top1: 42.30%, top5: 67.90%, time: 9639.8s

Avg  loss: 5.2944, top1: 13.36%, top5: 30.77%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 8/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5855, top1: 43.46%, top5: 68.86%, time: 9637.7s

Avg  loss: 14.9700, top1: 0.14%, top5: 0.49%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 9/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5401, top1: 44.36%, top5: 69.65%, time: 9642.6s

Avg  loss: 8.2867, top1: 0.10%, top5: 0.50%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 10/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5080, top1: 45.03%, top5: 70.24%, time: 9633.9s

Avg  loss: 11.4105, top1: 0.10%, top5: 0.50%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

We see that at the first 6 epochs, AdaHessian worked well. But from the 7th epoch, the training loss still decreased normally. But the test lost increased and the test accuracy declined, rapidly. We have tried several hyper-parameters and different random seeds, but this always happens.

We provided the details of our setting below for your reference.
The implementation of ResNext-50 is the standard one in PyTorch. The training is performed across 8 V100 GPUs, with total batch size 256 (32 per GPU).
We have tried to search the hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}. For other hyper-parameters, we used the default values.
We also applied linear warmup of the learning rate at the first 100 steps, otherwise AdaHessian crashed at the beginning of model training.

The text was updated successfully, but these errors were encountered:

yaozhewei · 2020-12-07T02:53:50Z

Hi Xuezhe,

Please let me know if the version in our branch can solve your problem.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wired behaviors of AdaHessian on ResNext-50 #11

Wired behaviors of AdaHessian on ResNext-50 #11

XuezheMax commented Nov 18, 2020 •

edited

Loading

yaozhewei commented Dec 7, 2020

Wired behaviors of AdaHessian on ResNext-50 #11

Wired behaviors of AdaHessian on ResNext-50 #11

Comments

XuezheMax commented Nov 18, 2020 • edited Loading

yaozhewei commented Dec 7, 2020

XuezheMax commented Nov 18, 2020 •

edited

Loading