NOTE discussion w/ pablin #2

Open · 4 of 6 tasks
tomMoral opened this issue Mar 25, 2022 · 4 comments

Comments

tomMoral (Member) commented Mar 25, 2022

These are notes from our discussion w/ @pierreablin on the design of the benchmark for NN. Feel free to comment/add/edit stuff.

Critical steps for CIFAR10 training

There are a few critical steps to watch for good performance when training a neural net on CIFAR:

  • Data augmentation is often needed. Training both with and without it would be a plus.
  • The learning rate scheduler can be quite important. Making sure we use the same one across the different frameworks is important.

A source of inspiration for these choices can be kuangliu/pytorch-cifar; a minimal sketch of the corresponding setup is given below.
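As a starting point, here is a minimal PyTorch sketch of the augmentation pipeline and learning-rate schedule in the spirit of kuangliu/pytorch-cifar. The exact values (padding, normalization statistics, schedule length) are assumptions to be confirmed, and the same settings would have to be reproduced in the other frameworks.

```python
import torch
import torchvision.transforms as transforms

# Standard CIFAR-10 augmentation (in the spirit of kuangliu/pytorch-cifar):
# random crop with padding + horizontal flip, then per-channel normalization.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Evaluation keeps only the normalization, no augmentation.
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Placeholder model so the snippet runs; any CIFAR architecture goes here.
model = torch.nn.Linear(3 * 32 * 32, 10)

# Learning-rate schedule: SGD + cosine annealing over the training budget.
# Every framework would have to reproduce the exact same schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```

Training with and without augmentation would then just be a matter of swapping transform_train for transform_test on the training split.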

Sources of variation between frameworks

There are a few sources of variation between the frameworks that will be hard to control:

  • Implementation of the transforms
  • Layer initializations
  • Architecture choices (BatchNorm not always in the same spot, different dropout, ...)

It is probably fine not to control them completely, as this can highlight the differences due to some design choices. But it is important to list them carefully in the paper.

Implementations to do

  • Comparable loss: it is fundamental that we make sure the loss is the same for all frameworks. A way to make sure we completely control this is to wrap each architecture in a class that provides a predict_proba(X: np.array) -> np.array function returning the class probabilities for each sample. That way, we are sure we feed in the same thing and compute the loss the same way (a minimal sketch of such a wrapper is given after this list).
  • GPU training: I plan to implement multi-GPU training using submitit. I think this should be my priority as it will impact the whole chain. I will start from ENH add Parallelized benchmark benchopt#265 and improve from there.
  • Multi-framework dataset: a natural way to support datasets across frameworks is to provide, for each dataset, an implementation for each supported framework. That way, we don't hack our way into converting a dataset loaded in pytorch to tf, and there is no unfair advantage of one framework over the other. This could for instance be controlled with a framework parameter, and datasets with a mismatched framework can be skipped with benchopt.BaseSolver.skip. To make the plots possible, see the next point.
  • Multi-parameter plots: for some parameters in Objective/Dataset, one could want to plot all the curves in the same plot. For instance, this is the case with framework (see above) or whether we use data augmentation or not. Another possibility would be if we have several architectures. Technically, this is something to do at plot time, by simply changing the filtering of what to put on a plot. The big question is on the API side: how to tell which parameters are to be ignored when merging plots together.
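As a rough illustration of the first item, here is a hypothetical wrapper for a pytorch model; the class name and shape conventions are assumptions, and an equivalent wrapper would be needed for each supported framework.

```python
import numpy as np
import torch


class TorchClassifierWrapper:
    """Hypothetical wrapper exposing a framework-agnostic predict_proba.

    The objective only ever manipulates numpy arrays, so the loss can be
    computed once, identically, whatever framework produced the model.
    """

    def __init__(self, model, device="cpu"):
        self.model = model.to(device).eval()
        self.device = device

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # X is assumed to be (n_samples, 3, 32, 32) float32, already normalized.
        with torch.no_grad():
            logits = self.model(torch.as_tensor(X, device=self.device))
            proba = torch.softmax(logits, dim=1)
        return proba.cpu().numpy()
```

A tf/jax model would get an equivalent wrapper with the same signature, so the objective never needs to know which framework sits underneath.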

Next steps?

If we have more time for this benchmark, a few ideas that we could try:

  • We could do a second benchmark with a single framework but different architectures (resnet18/34/50) to compare them easily (a rough sketch is given after this list).
  • ....
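If we go that way, the architecture could simply be exposed as a solver parameter. A rough sketch, assuming a benchopt-style solver and torchvision constructors (the training loop is omitted and the exact benchopt hooks should be double-checked):

```python
from benchopt import BaseSolver
from torchvision import models


class Solver(BaseSolver):
    """Hypothetical single-framework solver comparing ResNet depths."""

    name = "pytorch-sgd"
    # Each value yields a separate curve, so resnet18/34/50 end up on the same plot.
    parameters = {"architecture": ["resnet18", "resnet34", "resnet50"]}

    def set_objective(self, dataset):
        # Pick the torchvision constructor matching the parameter value.
        self.model = getattr(models, self.architecture)(num_classes=10)
        self.dataset = dataset

    def run(self, n_iter):
        # Training loop omitted; this only illustrates the parametrization.
        pass

    def get_result(self):
        return self.model
```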

pierreablin (Collaborator) commented

Regarding "comparable loss", I don't think it can be made properly, because we need to ensure that the frameworks all differentiate w.r.t. the same loss. We can always compute the same loss for each model afterwards, but we cannot force them to optimize this loss through a predict_proba function.

Anyway, the most important metric is arguably accuracy.
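In that spirit, the comparable part could be limited to the reporting. A small sketch (numpy only, names hypothetical) of recomputing the cross-entropy and the accuracy from the probabilities returned by predict_proba:

```python
import numpy as np


def evaluate(proba: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> dict:
    """Metrics recomputed from predicted probabilities, independently of the
    loss each framework actually differentiated during training."""
    log_loss = -np.log(proba[np.arange(len(y)), y] + eps).mean()
    accuracy = (proba.argmax(axis=1) == y).mean()
    return {"log_loss": log_loss, "accuracy": accuracy}
```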

zaccharieramzi (Collaborator) commented May 2, 2022

One very interesting aspect of all the CIFAR10 scripts is that they use the CIFAR10 version of the ResNet18 (and bigger), which basically doesn't downsample the images 4-fold in the first layers.

In the ImageNet version, the first convolution downsamples the image 2-fold, and it is followed by a 2-fold max-pooling op.
None of this happens in the CIFAR version, as explained in the paper (Section 4.2).

As explained here, this means that with the ImageNet version the network would only see an 8x8 image from the very start.

As a side-note, the momentumnet ResNet also features this difference.
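For reference, a minimal sketch of the two stems side by side (layer sizes follow the standard torchvision / kuangliu/pytorch-cifar definitions):

```python
import torch.nn as nn

# ImageNet-style ResNet stem: 7x7 stride-2 conv followed by 2-fold max pooling,
# so a 32x32 CIFAR image is already reduced to 8x8 before the first residual block.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-style stem (as in kuangliu/pytorch-cifar): a single 3x3 stride-1 conv,
# keeping the full 32x32 resolution for the residual blocks.
cifar_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```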

tomMoral (Member, Author) commented Oct 11, 2022 via email
