Averaged versions need to use larger learning rates. #4

Open
ptoulis opened this issue Dec 8, 2014 · 5 comments
ptoulis commented Dec 8, 2014

Also, experiment with optimal rate schedules for AI-SGD?

@ptoulis ptoulis added the bug label Dec 8, 2014
dustinvtran (Member) commented:
Here are two quick plots following the learning rate defined in Ruppert:

a_n = D * n^(-alpha),

where alpha is in (1/2, 1) and D is a chosen constant. I ran the following on the Poisson-generated data from exp_poisson_n4p1.R, using n=1e5 observations instead and the above learning rate function. The first plot fixes D at 1 over the minimum eigenvalue of the Fisher information (1/0.01) and varies alpha. Recall that in the univariate case, D=1/lambda and alpha=1 is optimal (optimal in the sense that it minimizes the trace of the asymptotic variance).

The second plot fixes D=1000 (an arbitrarily chosen value made to be greater than 1/0.01) and varies alpha.
[Plots: exp_poisson_n4p1_d1, exp_poisson_n4p1_d2]
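For reference, a minimal R sketch of this schedule (the function name lr_ruppert, the constant lambda_min, and the grid of alpha values are mine, chosen for illustration, not taken from the repo); it just evaluates a_n = D * n^(-alpha) with the D = 1/lambda_min choice used in the first plot:

```r
## Ruppert-style rate a_n = D * n^(-alpha); all names and values below are illustrative.
lr_ruppert <- function(n, D, alpha) {
  stopifnot(alpha > 1/2, D > 0)
  D * n^(-alpha)
}

lambda_min <- 0.01           # minimum eigenvalue of the Fisher information in the example
D <- 1 / lambda_min          # first plot: D fixed at 1/0.01 = 100
rates <- sapply(c(0.51, 0.67, 0.75, 1.0),
                function(a) lr_ruppert(1:1e5, D = D, alpha = a))
matplot(rates, type = "l", log = "xy", xlab = "iteration n", ylab = "a_n")
```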

I plan to work on this more rigorously after further review of the proof and after incorporating some real data sets to experiment on.

ptoulis commented Dec 9, 2014

Great, thanks. These are interesting results.
Were we expecting alpha=1 to be optimal for AI-SGD?
I can see it for the non-averaged version, but I am not sure about the averaged one.

ptoulis commented Dec 14, 2014

I have run a few tests locally on various models for AI-SGD.

It seems that the exact exponent in the learning rate depends crucially on the eigenvalues of the Fisher information. More generally, I have also seen that averaging is bad when non-averaged SGD is doing well. This is not very surprising, since averaging was introduced for slowly converging approximations; however, it is not often discussed in practice, where everyone seems to think that averaging is best. We need to investigate this further.
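To make the comparison easy to play with, here is a small self-contained R sketch (my own toy code, not the repo's) that runs plain SGD and its Polyak-Ruppert average on a normal-mean problem; varying the exponent alpha shows when the average helps and when the plain iterate is already doing well:

```r
## Toy comparison (illustrative, not from the repo): plain SGD vs. its
## Polyak-Ruppert average for estimating a normal mean under squared loss.
set.seed(42)
y <- rnorm(1e4, mean = 2, sd = 1)

run_sgd <- function(y, D, alpha) {
  theta <- 0; theta_bar <- 0
  out <- matrix(NA, length(y), 2, dimnames = list(NULL, c("sgd", "averaged")))
  for (i in seq_along(y)) {
    a_i <- D * i^(-alpha)
    theta <- theta + a_i * (y[i] - theta)             # SGD step on squared loss
    theta_bar <- theta_bar + (theta - theta_bar) / i  # running average of iterates
    out[i, ] <- c(theta, theta_bar)
  }
  out
}

## A slow exponent; moving alpha toward 1 shows where averaging stops helping.
est <- run_sgd(y, D = 1, alpha = 2/3)
matplot(abs(est - 2), type = "l", log = "xy", xlab = "iteration", ylab = "|error|")
```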

dustinvtran (Member) commented:
I've been spending more time on this yesterday and today and cannot find a recognizable pattern. Will look into the papers again. Meanwhile, 3c328bb implements more code for running experiments. See theory/optimal_aisgd_experiments.R for examples using the tuned parameters compared against D=1/lambda0, alpha=-1.

dustinvtran (Member) commented:
On constant rates for Normal(0,A) data:
Parameter setting #1 corresponds to a constant learning rate of 0.0025 and setting #2 to a constant rate of 0.005. AI-SGD continues to decrease while explicit SGD levels off. The latter is also seen in the MNIST example of Johnson and Zhang (2013), who remark on the same pattern for explicit SGD.

AI-SGD is also robust to larger learning rates: as I increase the rate above 0.005, the curve is precisely the same as the 0.005 curve, whereas explicit SGD continues to worsen due to instability.
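As a rough sketch of this setup (my own illustrative code and parameter choices, not the repo's), this compares explicit SGD with averaged implicit SGD at a fixed rate on Normal(0, A) regressors; for squared loss the implicit update has the closed form used below:

```r
## Sketch under a least-squares model y = x' theta + noise, x ~ Normal(0, A);
## the covariance A, rates, and sample sizes here are illustrative.
set.seed(1)
p <- 10; n <- 2e4
A <- diag(seq(0.01, 1, length.out = p))
x <- matrix(rnorm(n * p), n, p) %*% chol(A)
theta_star <- rep(1, p)
y <- drop(x %*% theta_star) + rnorm(n)

run <- function(rate, implicit = FALSE, average = FALSE) {
  theta <- rep(0, p); theta_bar <- rep(0, p); err <- numeric(n)
  for (i in 1:n) {
    xi <- x[i, ]; resid <- y[i] - sum(xi * theta)
    ## implicit update for squared loss solves a fixed point, giving an
    ## effective step of rate / (1 + rate * ||x_i||^2)
    step <- if (implicit) rate / (1 + rate * sum(xi^2)) else rate
    theta <- theta + step * resid * xi
    theta_bar <- theta_bar + (theta - theta_bar) / i
    err[i] <- sum(((if (average) theta_bar else theta) - theta_star)^2)
  }
  err
}

matplot(cbind(explicit = run(0.005),
              aisgd    = run(0.005, implicit = TRUE, average = TRUE)),
        type = "l", log = "xy", xlab = "iteration", ylab = "squared error")
```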

Will post images for constant rates on the MNIST data set once I finish implementing SVRG and SAG.
