NOTE discussion w/ pablin #2

Open · 4 of 6 tasks
tomMoral opened this issue Mar 25, 2022 · 4 comments

Comments

tomMoral (Member) commented Mar 25, 2022

These are notes from our discussion w/ @pierreablin on the design of the benchmark for NN. Feel free to comment/add/edit stuff.

Critical steps for CIFAR10 training

There are a few critical steps to watch for good performance when training a neural net on CIFAR:

  • Data augmentation is often needed. Training both with and without it would be a plus.
  • The learning rate scheduler can be quite important. Making sure we use the same one across the different frameworks is important.

A source of inspiration for these choices can be kuangliu/pytorch-cifar; a minimal sketch of the corresponding setup is given below.
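As a starting point, here is a minimal PyTorch sketch of the augmentation pipeline and learning-rate schedule in the spirit of kuangliu/pytorch-cifar. The exact values (padding, normalization statistics, schedule length) are assumptions to be confirmed, and the same settings would have to be reproduced in the other frameworks.

```python
import torch
import torchvision.transforms as transforms

# Standard CIFAR-10 augmentation (in the spirit of kuangliu/pytorch-cifar):
# random crop with padding + horizontal flip, then per-channel normalization.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Evaluation keeps only the normalization, no augmentation.
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Placeholder model so the snippet runs; any CIFAR architecture goes here.
model = torch.nn.Linear(3 * 32 * 32, 10)

# Learning-rate schedule: SGD + cosine annealing over the training budget.
# Every framework would have to reproduce the exact same schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```

Training with and without augmentation would then just be a matter of swapping transform_train for transform_test on the training split.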

Sources of variation between frameworks

There are a few sources of variation between the frameworks that will be hard to control:

  • Implementation of the transforms
  • Layer initializations
  • Architecture choices (BatchNorm not always in the same spot, different dropout, ...)

It is probably fine not to control them completely, as this can highlight the differences due to some design choices. But it is important to list them carefully in the paper.

Implementations to do

  • Comparable loss: it is fundamental that we make sure the loss is the same for all frameworks. A way to make sure we completely control this is to wrap each architecture in a class that provides a predict_proba(X: np.array) -> np.array function returning the class probabilities for each sample. That way, we are sure we feed in the same thing and compute the loss the same way (a minimal sketch of such a wrapper is given after this list).
  • GPU training: I plan to implement multi-GPU training using submitit. I think this should be my priority as it will impact the whole chain. I will start from ENH add Parallelized benchmark benchopt#265 and improve from there.
  • Multi-framework dataset: a natural way to support datasets across frameworks is to provide, for each dataset, an implementation for each supported framework. That way, we don't hack our way into converting a dataset loaded in pytorch to tf, and there is no unfair advantage of one framework over the other. This could for instance be controlled with a framework parameter, and datasets with a mismatched framework can be skipped with benchopt.BaseSolver.skip. To make the plots possible, see the next point.
  • Multi-parameter plots: for some parameters in Objective/Dataset, one could want to plot all the curves in the same plot. For instance, this is the case with framework (see above) or whether we use data augmentation or not. Another possibility would be if we have several architectures. Technically, this is something to do at plot time, by simply changing the filtering of what to put on a plot. The big question is on the API side: how to tell which parameters are to be ignored when merging plots together.
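As a rough illustration of the first item, here is a hypothetical wrapper for a pytorch model; the class name and shape conventions are assumptions, and an equivalent wrapper would be needed for each supported framework.

```python
import numpy as np
import torch


class TorchClassifierWrapper:
    """Hypothetical wrapper exposing a framework-agnostic predict_proba.

    The objective only ever manipulates numpy arrays, so the loss can be
    computed once, identically, whatever framework produced the model.
    """

    def __init__(self, model, device="cpu"):
        self.model = model.to(device).eval()
        self.device = device

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # X is assumed to be (n_samples, 3, 32, 32) float32, already normalized.
        with torch.no_grad():
            logits = self.model(torch.as_tensor(X, device=self.device))
            proba = torch.softmax(logits, dim=1)
        return proba.cpu().numpy()
```

A tf/jax model would get an equivalent wrapper with the same signature, so the objective never needs to know which framework sits underneath.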

Next steps?

If we have more time for this benchmark, a few ideas that we could try:

  • We could do a second benchmark with a single framework but different architectures (resnet18/34/50) to compare them easily (a rough sketch is given after this list).
  • ....
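If we go that way, the architecture could simply be exposed as a solver parameter. A rough sketch, assuming a benchopt-style solver and torchvision constructors (the training loop is omitted and the exact benchopt hooks should be double-checked):

```python
from benchopt import BaseSolver
from torchvision import models


class Solver(BaseSolver):
    """Hypothetical single-framework solver comparing ResNet depths."""

    name = "pytorch-sgd"
    # Each value yields a separate curve, so resnet18/34/50 end up on the same plot.
    parameters = {"architecture": ["resnet18", "resnet34", "resnet50"]}

    def set_objective(self, dataset):
        # Pick the torchvision constructor matching the parameter value.
        self.model = getattr(models, self.architecture)(num_classes=10)
        self.dataset = dataset

    def run(self, n_iter):
        # Training loop omitted; this only illustrates the parametrization.
        pass

    def get_result(self):
        return self.model
```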

pierreablin (Collaborator) commented

Regarding "comparable loss", I don't think it can be made properly, because we need to ensure that the frameworks all differentiate w.r.t. the same loss. We can always compute the same loss for each model afterwards, but we cannot force them to optimize this loss through a predict_proba function.

Anyway, the most important metric is arguably accuracy.
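In that spirit, the comparable part could be limited to the reporting. A small sketch (numpy only, names hypothetical) of recomputing the cross-entropy and the accuracy from the probabilities returned by predict_proba:

```python
import numpy as np


def evaluate(proba: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> dict:
    """Metrics recomputed from predicted probabilities, independently of the
    loss each framework actually differentiated during training."""
    log_loss = -np.log(proba[np.arange(len(y)), y] + eps).mean()
    accuracy = (proba.argmax(axis=1) == y).mean()
    return {"log_loss": log_loss, "accuracy": accuracy}
```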

zaccharieramzi (Collaborator) commented May 2, 2022

One very interesting aspect of all the CIFAR10 scripts is that they use the CIFAR10 version of the ResNet18 (and bigger), which basically doesn't downsample the images 4-fold in the first layers.

In the ImageNet version, the first convolution downsamples the image 2-fold, and it is followed by a 2-fold max-pooling op.
None of this happens in the CIFAR version, as explained in the paper (Section 4.2).

As explained here, this means that with the ImageNet version the network would only see an 8x8 image from the very start.

As a side-note, the momentumnet ResNet also features this difference.
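For reference, a minimal sketch of the two stems side by side (layer sizes follow the standard torchvision / kuangliu/pytorch-cifar definitions):

```python
import torch.nn as nn

# ImageNet-style ResNet stem: 7x7 stride-2 conv followed by 2-fold max pooling,
# so a 32x32 CIFAR image is already reduced to 8x8 before the first residual block.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-style stem (as in kuangliu/pytorch-cifar): a single 3x3 stride-1 conv,
# keeping the full 32x32 resolution for the residual blocks.
cifar_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```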

tomMoral (Member, Author) commented Oct 11, 2022 via email
