
Questions regarding training parameters #4

Open
hetong007 opened this issue May 4, 2018 · 11 comments

@hetong007

First of all, thank you for providing the training script and parameters for MobileNetV2 (the first repo I've seen that does so).

I'm reproducing it for GluonCV and have a couple of questions regarding the training:

  1. How did you decide to set the number of epochs to 480 and the batch size to 160?
  2. Have you tried training other MobileNetV2 variants, e.g. with width multipliers 0.75 and 0.5?
  3. Have you found a significant difference between training with and without your PR for nnvm?

I appreciate your help with my questions.

@liangfu
Owner

liangfu commented May 5, 2018

Regarding your questions:

  1. For data augmentation I refer to resnet, which has been reproduced in this repo. I use 480xN because resnet training can be reproduced that way, and a batch size of 160 is the maximum that two GTX-1080 cards can handle.
  2. The training with multiplier=1.4 is ongoing, and I plan to release results for 0.75 and 0.5 as well. I don't have the accuracy numbers at the moment, but I think you can reproduce the results listed in the link as long as you handle data augmentation and the learning rate correctly.
  3. The PR is only for inference with nnvm, whose mxnet frontend lacks the clip operator. It has no effect on prediction accuracy (a short sketch follows at the end of this comment).

Hope the above notes help with your adventures with MobileNetV2.
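
For anyone wondering what the clip operator has to do with MobileNetV2: the network's ReLU6 activations are commonly expressed in MXNet with the clip symbol, which is the op the nnvm mxnet frontend was missing for inference. A minimal sketch (the relu6 helper is only for illustration, not code from this repo):

```python
import mxnet as mx

def relu6(data, name):
    # ReLU6(x) = min(max(x, 0), 6), written with MXNet's clip operator;
    # this is the op the nnvm frontend PR adds a converter for.
    return mx.sym.clip(data=data, a_min=0.0, a_max=6.0, name=name)

x = mx.sym.Variable('data')
y = relu6(x, name='relu6_demo')
```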

@hetong007
Author

Thank you so much for the quick reply!

I am curious about the num_epochs = 480 at https://github.com/liangfu/mxnet-mobilenet-v2/blob/master/train_imagenet.py#L56. What made you decide to train the model for 480 epochs?

Using your settings, I can reach 71.7% with multiplier=1.0. With the same settings for multiplier=0.75, however, I'm not getting close to the claimed 69.8% (68.7% so far at epoch 200). I will double-check my settings, and I look forward to your result for 1.4!

@liangfu
Owner

liangfu commented May 5, 2018

Good question; your success is just around the corner! At around epoch 200, I turned the augmentation level to 3 and set the random scale to between 0.533 and 0.6; this step fine-tunes the network to focus on the specific region and prevents overfitting. After 30 to 40 more epochs, I turned the aug_level down to 1 and set the random scale range to between 0.533 and 0.535. Then you should be able to reproduce the result.

You can ignore num_epoch=480; I was just trying to set an effectively unbounded value without letting the server run excessively long. I might upload the training log, which would illustrate the argument settings more intuitively.
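
To make the schedule above concrete, here is an illustrative sketch of how the two late-stage settings could be expressed with mx.io.ImageRecordIter. The rec path and data shape are placeholders, the aug_level-to-augmentation mapping is an assumption following the convention of the resnet reproduction scripts, and the repo's own train_imagenet.py may wire these arguments differently:

```python
import mxnet as mx

def make_train_iter(batch_size, min_scale, max_scale, aug_level):
    # Assumed mapping: aug_level 1 = random crop + mirror only;
    # 2 adds color/aspect jitter; 3 adds rotation/shear as well.
    extra = {}
    if aug_level >= 2:
        extra.update(max_aspect_ratio=0.25, random_h=36, random_s=50, random_l=50)
    if aug_level >= 3:
        extra.update(max_rotate_angle=10, max_shear_ratio=0.1)
    return mx.io.ImageRecordIter(
        path_imgrec='data/train.rec',   # placeholder path
        data_shape=(3, 224, 224),       # placeholder input shape
        batch_size=batch_size,
        rand_crop=True,
        rand_mirror=True,
        min_random_scale=min_scale,
        max_random_scale=max_scale,
        **extra,
    )

# Around epoch 200: aug_level=3 with random scale in [0.533, 0.6]
finetune_iter = make_train_iter(batch_size=160, min_scale=0.533, max_scale=0.6, aug_level=3)

# Final 30-40 epochs: aug_level=1 with random scale in [0.533, 0.535]
final_iter = make_train_iter(batch_size=160, min_scale=0.533, max_scale=0.535, aug_level=1)
```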

@hetong007
Author

For multiplier=1.0, I didn't change the augmentation and it still got to 71.7%.

But since I failed with 0.75, I'll try your augmentation approach. Thanks again for sharing!

@liangfu
Owner

liangfu commented May 6, 2018

That sounds great, though I guess that might take a long time to train. How many epochs did it take to converge to 71.7%?

@hetong007
Author

hetong007 commented May 6, 2018

With the 80×2 batch size, this script hit 71.8% at epoch 261.

I'll let it run through the entire 480 epochs and publish the model and training logs to GluonCV.

@liangfu
Owner

liangfu commented May 6, 2018

Thank you for sharing. I will adjust my training strategy and try again later.

Even after you converge to 71.8% without changing aug_level, I still suggest trying the augmentation level and random scale range changes I mentioned above; they are really effective at the very end of training.

@hetong007
Author

Yes, I'm quite interested in seeing its effect; I'll definitely resume and try that out after my training with 0.75.

@AIROBOTAI

@liangfu Great work! Could you please share the training log?

@liangfu
Owner

liangfu commented May 16, 2018

The training logs have been uploaded; please look in the log folder.

@AIROBOTAI

@liangfu Thanks for sharing! I checked the logs: for multiplier=1.0 it achieves 71.7, and for multiplier=1.4 it achieves 73.0. The numbers reported in the original paper are 72.0 and 74.7, respectively. Any idea how to match the reported numbers? Thanks!
