Averaged versions need to use larger learning rates. #4

Open
ptoulis opened this issue Dec 8, 2014 · 5 comments
ptoulis commented Dec 8, 2014

Also, experiment with optimal rate schedules for AI-SGD?

@ptoulis ptoulis added the bug label Dec 8, 2014
dustinvtran (Member) commented:
Here are two quick plots following the learning rate defined in Ruppert:

a_n = D * n^(-alpha),

where alpha is in (1/2, 1) and D is a chosen constant. I ran the following on the Poisson-generated data from exp_poisson_n4p1.R, using n=1e5 observations instead and the above learning rate function. The first plot fixes D at 1 over the minimum eigenvalue of the Fisher information (1/0.01) and varies alpha. Recall that in the univariate case, D=1/lambda and alpha=1 is optimal (optimal in the sense that it minimizes the trace of the asymptotic variance).

The second plot fixes D=1000 (an arbitrarily chosen value made to be greater than 1/0.01) and varies alpha.
[Plots: exp_poisson_n4p1_d1, exp_poisson_n4p1_d2]
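For reference, a minimal R sketch of this schedule (the function name lr_ruppert, the constant lambda_min, and the grid of alpha values are mine, chosen for illustration, not taken from the repo); it just evaluates a_n = D * n^(-alpha) with the D = 1/lambda_min choice used in the first plot:

```r
## Ruppert-style rate a_n = D * n^(-alpha); all names and values below are illustrative.
lr_ruppert <- function(n, D, alpha) {
  stopifnot(alpha > 1/2, D > 0)
  D * n^(-alpha)
}

lambda_min <- 0.01           # minimum eigenvalue of the Fisher information in the example
D <- 1 / lambda_min          # first plot: D fixed at 1/0.01 = 100
rates <- sapply(c(0.51, 0.67, 0.75, 1.0),
                function(a) lr_ruppert(1:1e5, D = D, alpha = a))
matplot(rates, type = "l", log = "xy", xlab = "iteration n", ylab = "a_n")
```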

I plan to work on this more rigorously after further review of the proof and after incorporating some real data sets to experiment on.

ptoulis commented Dec 9, 2014

Great, thanks. These are interesting results.
Were we expecting alpha=1 to be optimal for AI-SGD?
I can see it for the non-averaged version, but I am not sure about the averaged one.

ptoulis commented Dec 14, 2014

I have run a few tests locally on various models for AI-SGD.

It seems that the exact exponent in the learning rate depends crucially on the eigenvalues of the Fisher information. More generally, I have also seen that averaging is bad when non-averaged SGD is doing well. This is not very surprising, since averaging was introduced for slowly converging approximations; however, it is not often discussed in practice, where everyone seems to think that averaging is best. We need to investigate this further.
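To make the comparison easy to play with, here is a small self-contained R sketch (my own toy code, not the repo's) that runs plain SGD and its Polyak-Ruppert average on a normal-mean problem; varying the exponent alpha shows when the average helps and when the plain iterate is already doing well:

```r
## Toy comparison (illustrative, not from the repo): plain SGD vs. its
## Polyak-Ruppert average for estimating a normal mean under squared loss.
set.seed(42)
y <- rnorm(1e4, mean = 2, sd = 1)

run_sgd <- function(y, D, alpha) {
  theta <- 0; theta_bar <- 0
  out <- matrix(NA, length(y), 2, dimnames = list(NULL, c("sgd", "averaged")))
  for (i in seq_along(y)) {
    a_i <- D * i^(-alpha)
    theta <- theta + a_i * (y[i] - theta)             # SGD step on squared loss
    theta_bar <- theta_bar + (theta - theta_bar) / i  # running average of iterates
    out[i, ] <- c(theta, theta_bar)
  }
  out
}

## A slow exponent; moving alpha toward 1 shows where averaging stops helping.
est <- run_sgd(y, D = 1, alpha = 2/3)
matplot(abs(est - 2), type = "l", log = "xy", xlab = "iteration", ylab = "|error|")
```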

dustinvtran (Member) commented:
I've been spending more time on this yesterday and today and cannot find a recognizable pattern. Will look into the papers again. Meanwhile, 3c328bb implements more code for running experiments. See theory/optimal_aisgd_experiments.R for examples using the tuned parameters compared against D=1/lambda0, alpha=-1.

dustinvtran (Member) commented:
On constant rates for Normal(0,A) data:
Parameter setting #1 corresponds to a constant learning rate of 0.0025 and setting #2 to a constant rate of 0.005. AI-SGD continues to decrease while explicit SGD levels off. The latter is also seen in the MNIST example of Johnson and Zhang (2013), who remark on the same pattern for explicit SGD.

AI-SGD is also robust to larger learning rates: as I increase the rate above 0.005, the curve is precisely the same as the 0.005 curve, whereas explicit SGD continues to worsen due to instability.
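As a rough sketch of this setup (my own illustrative code and parameter choices, not the repo's), this compares explicit SGD with averaged implicit SGD at a fixed rate on Normal(0, A) regressors; for squared loss the implicit update has the closed form used below:

```r
## Sketch under a least-squares model y = x' theta + noise, x ~ Normal(0, A);
## the covariance A, rates, and sample sizes here are illustrative.
set.seed(1)
p <- 10; n <- 2e4
A <- diag(seq(0.01, 1, length.out = p))
x <- matrix(rnorm(n * p), n, p) %*% chol(A)
theta_star <- rep(1, p)
y <- drop(x %*% theta_star) + rnorm(n)

run <- function(rate, implicit = FALSE, average = FALSE) {
  theta <- rep(0, p); theta_bar <- rep(0, p); err <- numeric(n)
  for (i in 1:n) {
    xi <- x[i, ]; resid <- y[i] - sum(xi * theta)
    ## implicit update for squared loss solves a fixed point, giving an
    ## effective step of rate / (1 + rate * ||x_i||^2)
    step <- if (implicit) rate / (1 + rate * sum(xi^2)) else rate
    theta <- theta + step * resid * xi
    theta_bar <- theta_bar + (theta - theta_bar) / i
    err[i] <- sum(((if (average) theta_bar else theta) - theta_star)^2)
  }
  err
}

matplot(cbind(explicit = run(0.005),
              aisgd    = run(0.005, implicit = TRUE, average = TRUE)),
        type = "l", log = "xy", xlab = "iteration", ylab = "squared error")
```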

Will post images for constant rates on the MNIST data set once I finish implementing SVRG and SAG.
