
Tips and Tricks for training classification convolutional neural networks #11

zaccharieramzi opened this issue Apr 20, 2022 · 3 comments
zaccharieramzi commented Apr 20, 2022

Data augmentation:

  • Random Resized Crop
  • Horizontal Flip
  • Random Augment (see code, paper), e.g. with 7/0.5. Basically it samples operations from a fixed set of data augmentation functions.
  • Color Jitter
  • PCA lighting
  • Random Erasing
  • Mixup
  • Cutmix
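
For reference, a minimal torchvision-based sketch of how several of these could be combined (the magnitudes and probabilities are placeholders, not a vetted recipe):

```python
import torch
from torchvision import transforms

# Training-time pipeline covering crop, flip, RandAugment, color jitter and
# random erasing; normalization uses the usual ImageNet statistics.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=7),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),  # works on tensors, hence after ToTensor
])
```

Mixup, Cutmix and PCA lighting are not per-image transforms in this sense: the first two are usually applied at the batch level (e.g. with timm's Mixup helper, which also produces the soft targets), and I don't think PCA lighting has an off-the-shelf transform in torchvision.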

Regularization:

  • Label smoothing (basically making the classification target not (1, 0, 0, 0) but (1-ε, ε/3, ε/3, ε/3))
  • Repeated Augmentation (multiple instances of a sample in the same batch, with different augmentation)
  • stochastic depth (basically drop some blocks and replace them with identity stochastically). It sounds very complicated to use as part of a solver in our setup. Moreover, the "ResNet strikes back" paper claims that this is only a good idea for very big networks, so it might not be our priority.
  • weight decay (according to "Bag of tricks" this might only be useful for non-BN params). I have seen somewhere (but don't remember where exactly) that it should also only be applied to the weights and not the biases.
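
For the weight decay grouping specifically, a possible sketch in PyTorch (the helper name and the 1-D-parameter heuristic are mine):

```python
import torch

def split_decay_groups(model: torch.nn.Module, weight_decay: float = 1e-4):
    """Apply weight decay only to the weights, not to biases or norm parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and BN/LN scales and offsets are 1-D: put them in the no-decay group.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.SGD(split_decay_groups(model), lr=0.1, momentum=0.9)
```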

Learning rate:

  • Warmup: linearly scale the learning rate from 0 to its initial value over the first few (5) epochs.
  • Scheduling: step or cosine. The problem with the cosine schedule is that it needs to know the total number of epochs, which in the current setting is not available to the solver. Could we work around that, @tomMoral?
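
For the warmup + cosine combination, a small PyTorch sketch (it does not solve the issue above: T_max still has to be known up front; the epoch counts are placeholders):

```python
import torch

model = torch.nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs, total_epochs = 5, 90  # assumed values
# LinearLR cannot start exactly at 0, so start_factor is a small fraction instead.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

# for epoch in range(total_epochs):
#     ...train one epoch...
#     scheduler.step()
```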

Modeling (to me these are out of our scope):

  • Zero gamma: initialize the learned scale (gamma) of the BN layers to 0.
  • Layer Scale (basically learn a multiplicative factor per channel at the end of residual blocks)
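
For reference anyway: torchvision's ResNets already expose zero gamma as a constructor flag, and Layer Scale is just a small learnable per-channel multiplier (the init value below is an assumption, pick whatever the reference recipe uses):

```python
import torch
from torchvision.models import resnet50

# Zero gamma: zero-initializes the weight of the last BN in each residual block.
model = resnet50(zero_init_residual=True)

class LayerScale(torch.nn.Module):
    """Learnable per-channel multiplier applied at the end of a residual block."""
    def __init__(self, dim: int, init_value: float = 1e-6):
        super().__init__()
        self.gamma = torch.nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes channels-last features of shape (..., dim).
        return self.gamma * x
```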

Other:

  • mixed precision
  • weight averaging (SWA or EMA)
  • Binary Xent rather than categorical Xent, coupled with Cutmix and Mixup, in a 1-vs-all fashion
  • gradient clipping (not used in "ResNet strikes back")
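
A rough sketch of how mixed precision, gradient clipping and binary Xent could fit together in a single PyTorch training step (the helper and its defaults are mine, and autocast only does real fp16 work on GPU; weight averaging is left out):

```python
import torch
import torch.nn.functional as F

def train_step(model, images, targets, optimizer, scaler, num_classes,
               max_grad_norm=1.0):
    """One step: mixed precision + gradient clipping + binary cross-entropy."""
    if targets.ndim == 1:  # hard labels -> one-hot, so BCE acts 1-vs-all
        targets = F.one_hot(targets, num_classes).float()
    # Soft targets from Mixup/Cutmix can be passed directly as (batch, num_classes).
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(images)
        loss = F.binary_cross_entropy_with_logits(logits, targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # gradients must be unscaled before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

# scaler = torch.cuda.amp.GradScaler()
```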

zaccharieramzi commented:

The highest-priority items to me are:

  • Crop
  • flip
  • schedule
  • WD
  • RandAug
  • mixed precision
  • Binary Xent (with Mixup cutmix)

@tomMoral @pierreablin wdyt?

zaccharieramzi commented:

RE LR scheduling: there is a big difference in how TF and PL implement it.
Basically, TF implements it at the per-optimizer-step level, while PL implements it at the per-epoch level (see here), which is similar to what is done here or in timm.

I might just go with the PL way of doing it, since it's the least flexible.
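
To make the difference concrete, in PyTorch terms it boils down to where scheduler.step() is called (dummy model and schedule, purely illustrative):

```python
import torch

num_epochs, steps_per_epoch = 3, 5
model = torch.nn.Linear(4, 2)

# Per-optimizer-step scheduling (the TF way): the schedule advances on every update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()
        scheduler.step()  # once per batch

# Per-epoch scheduling (PL's default, similar to timm): once per epoch only.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()
    scheduler.step()  # once per epoch
```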

zaccharieramzi commented Apr 20, 2022

RE LR scheduling / weight decay: I am not sure what the canonical way of updating the weight decay is, given the LR schedule.

In TF, it is specified that the weight decay should be updated, in this case manually:

> Note: when applying a decay to the learning rate, be sure to manually apply the decay to the weight_decay as well. For example:

But in PL or torch, I didn't see any mention of this update, so it might not be used. I am going to verify this.
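
For the TF side, my understanding is that the manual update goes through callables, roughly like this (a sketch assuming tensorflow_addons' AdamW; the schedule and values are placeholders, and `step` has to be kept in sync with the training loop by hand):

```python
import tensorflow as tf
import tensorflow_addons as tfa

base_lr, base_wd = 1e-1, 1e-4
# A multiplicative schedule going from 1 to 0 over 10k steps (placeholder choice).
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1.0, decay_steps=10_000)
step = tf.Variable(0, trainable=False)

# Passing callables means the weight decay is re-evaluated with the current step,
# so it follows the same decay as the learning rate.
lr = lambda: base_lr * schedule(step)
wd = lambda: base_wd * schedule(step)
optimizer = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)
```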

EDIT

OK, so the problem with WD is actually the following; I understood it by reading the original decoupled weight decay paper as well as the docs of Adam and AdamW in PyTorch.
There are two ways of applying the weight decay:

  • coupled: this is what PyTorch does off-the-shelf with the classic optimizers (like here for Adam). This basically corresponds to L2 regularization of all the weights (with a factor of 2 somewhere). In TensorFlow, this would correspond to adding an L2 regularization of half the value.
  • decoupled: this is what both TF and PT do with AdamW and SGDW. According to the decoupled weight decay paper, it is better than coupling in the case of adaptive gradient methods.

I will name both types explicitly in the solvers.
We can still have coupled weight decay for PyTorch and TensorFlow, but for TensorFlow, the problem is that we need to hack it in a bit of an ugly way... I will make a proposal and we will see.
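
In PyTorch the two options look like this (tiny sketch with a placeholder model and values):

```python
import torch

model = torch.nn.Linear(16, 4)
wd = 1e-4

# Coupled: Adam folds weight_decay into the gradient, i.e. plain L2 regularization
# (in TF this would be an L2 penalty of wd / 2 on each variable).
coupled = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=wd)

# Decoupled: AdamW subtracts lr * wd * param directly from the parameters,
# independently of the adaptive gradient statistics.
decoupled = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
```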
