
Questions regarding training parameters #4

Open
hetong007 opened this issue May 4, 2018 · 11 comments

@hetong007

First of all, thank you for providing the training script and parameters for MobileNetV2 (the first repo I've seen that does so).

I'm reproducing it for GluonCV and have a couple of questions regarding the training:

  1. How did you decide to set the number of epochs to 480 and the batch size to 160?
  2. Have you tried training other MobileNetV2 variants, e.g. with width multipliers 0.75 and 0.5?
  3. Have you found a significant difference between training with and without your PR for nnvm?

I appreciate your help with my questions.

@liangfu
Owner

liangfu commented May 5, 2018

Regarding your questions:

  1. For data augmentation I refer to resnet, which has been reproduced in this repo. I use 480xN because resnet training can be reproduced that way, and a batch size of 160 is the maximum that two GTX-1080 cards can handle.
  2. The training with multiplier=1.4 is ongoing, and I plan to release results for 0.75 and 0.5 as well. I don't have the accuracy numbers at the moment, but I think you can reproduce the results listed in the link as long as you handle data augmentation and the learning rate correctly.
  3. The PR is only for inference with nnvm, whose mxnet frontend lacks the clip operator. It has no effect on prediction accuracy (a short sketch follows at the end of this comment).

Hope the above notes help with your adventures with MobileNetV2.
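
For anyone wondering what the clip operator has to do with MobileNetV2: the network's ReLU6 activations are commonly expressed in MXNet with the clip symbol, which is the op the nnvm mxnet frontend was missing for inference. A minimal sketch (the relu6 helper is only for illustration, not code from this repo):

```python
import mxnet as mx

def relu6(data, name):
    # ReLU6(x) = min(max(x, 0), 6), written with MXNet's clip operator;
    # this is the op the nnvm frontend PR adds a converter for.
    return mx.sym.clip(data=data, a_min=0.0, a_max=6.0, name=name)

x = mx.sym.Variable('data')
y = relu6(x, name='relu6_demo')
```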

@hetong007
Author

Thank you so much for the quick reply!

I am curious about the num_epochs = 480 at https://github.com/liangfu/mxnet-mobilenet-v2/blob/master/train_imagenet.py#L56. What made you decide to train the model for 480 epochs?

Using your settings, I can reach 71.7% with multiplier=1.0. With the same settings for multiplier=0.75, however, I'm not getting close to the claimed 69.8% (68.7% so far at epoch 200). I will double-check my settings, and I look forward to your result for 1.4!

@liangfu
Owner

liangfu commented May 5, 2018

Good question; your success is just around the corner! At around epoch 200, I turned the augmentation level to 3 and set the random scale to between 0.533 and 0.6; this step fine-tunes the network to focus on the specific region and prevents overfitting. After 30 to 40 more epochs, I turned the aug_level down to 1 and set the random scale range to between 0.533 and 0.535. Then you should be able to reproduce the result.

You can ignore num_epoch=480; I was just trying to set an effectively unbounded value without letting the server run excessively long. I might upload the training log, which would illustrate the argument settings more intuitively.
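
To make the schedule above concrete, here is an illustrative sketch of how the two late-stage settings could be expressed with mx.io.ImageRecordIter. The rec path and data shape are placeholders, the aug_level-to-augmentation mapping is an assumption following the convention of the resnet reproduction scripts, and the repo's own train_imagenet.py may wire these arguments differently:

```python
import mxnet as mx

def make_train_iter(batch_size, min_scale, max_scale, aug_level):
    # Assumed mapping: aug_level 1 = random crop + mirror only;
    # 2 adds color/aspect jitter; 3 adds rotation/shear as well.
    extra = {}
    if aug_level >= 2:
        extra.update(max_aspect_ratio=0.25, random_h=36, random_s=50, random_l=50)
    if aug_level >= 3:
        extra.update(max_rotate_angle=10, max_shear_ratio=0.1)
    return mx.io.ImageRecordIter(
        path_imgrec='data/train.rec',   # placeholder path
        data_shape=(3, 224, 224),       # placeholder input shape
        batch_size=batch_size,
        rand_crop=True,
        rand_mirror=True,
        min_random_scale=min_scale,
        max_random_scale=max_scale,
        **extra,
    )

# Around epoch 200: aug_level=3 with random scale in [0.533, 0.6]
finetune_iter = make_train_iter(batch_size=160, min_scale=0.533, max_scale=0.6, aug_level=3)

# Final 30-40 epochs: aug_level=1 with random scale in [0.533, 0.535]
final_iter = make_train_iter(batch_size=160, min_scale=0.533, max_scale=0.535, aug_level=1)
```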

@hetong007
Author

For multiplier=1.0, I didn't change the augmentation and it still got to 71.7%.

But since I failed with 0.75, I'll try your augmentation approach. Thanks again for sharing!

@liangfu
Owner

liangfu commented May 6, 2018

That sounds great, though I guess that might take a long time to train. How many epochs did it take to converge to 71.7%?

@hetong007
Author

hetong007 commented May 6, 2018

With the 80×2 batch size, this script hit 71.8% at epoch 261.

I'll let it run through the entire 480 epochs and publish the model and training logs to GluonCV.

@liangfu
Owner

liangfu commented May 6, 2018

Thank you for sharing. I will adjust my training strategy and try again later.

Even after you converge to 71.8% without changing aug_level, I still suggest trying the augmentation level and random scale range changes I mentioned above; they are really effective at the very end of training.

@hetong007
Author

Yes, I'm quite interested in seeing its effect; I'll definitely resume and try that out after my training with 0.75.

@AIROBOTAI

@liangfu Great work! Could you please share the training log?

@liangfu
Owner

liangfu commented May 16, 2018

The training logs have been uploaded; please look in the log folder.

@AIROBOTAI

@liangfu Thanks for sharing! I checked the logs: for multiplier=1.0 it achieves 71.7, and for multiplier=1.4 it achieves 73.0. The numbers reported in the original paper are 72.0 and 74.7, respectively. Any idea how to match the reported numbers? Thanks!
