Incorrect Solutions with Theseus compared to Ceres. #499

Open
Muon2 opened this issue Apr 12, 2023 · 10 comments

@Muon2

Muon2 commented Apr 12, 2023

❓ Questions and Help

I'm currently using Ceres with the following options:

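ceres::Solver::Options options;
options.linear_solver_type = ceres::DENSE_NORMAL_CHOLESKY;
options.minimizer_type = ceres::TRUST_REGION;
options.trust_region_strategy_type = ceres::LEVENBERG_MARQUARDT;
options.line_search_direction_type = ceres::LBFGS;
options.line_search_type = ceres::WOLFE;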
...

While Ceres gives me exact solutions, Theseus does not converge to the correct solution. Here's the code I'm using with Theseus:

optimizer = th.LevenbergMarquardt(
    objective,
    linear_solver_cls=th.CholeskyDenseSolver,
    linearization_cls=th.DenseLinearization,
    linear_solver_kwargs={'check_singular': False},
    vectorize=True,
    max_iterations=50,
    step_size=1,
)

Do you have any ideas on what could be causing this issue? Is it possible that Theseus is not as efficient as Ceres?
I cannot share the specific code, but as can be seen in the following pictures, Ceres successfully recovers the ground-truth solution of 0.2. I have tried my best to align the optimizer hyperparameters (such as the number of iterations and the numerical precision), but I still only get an approximate solution from Theseus, not the correct one.
Ceres's solution:
image
Theseus's solution:
image

@luisenp
Contributor

luisenp commented Apr 12, 2023

Hi @Muon2. Thanks for the issue. It's worth pointing out that we don't yet support line search, which makes direct comparison with Ceres more difficult. In this case, one possibility is that Theseus is jumping around the local optimum because of the large step size. Two easy things to try first are:

  1. Lowering the step size and increasing the number of iterations in the LM constructor call.
  2. Passing "adaptive_damping: True" to the optimizer_kwargs when you call the forward() method.

Also, can you add "verbose: True" and paste here the log outputs you get?

Thanks.

@Muon2
Author

Muon2 commented Apr 13, 2023

Thanks for your advice. Here is the updated setting:

optimizer = th.LevenbergMarquardt(
    objective,
    linear_solver_cls=th.CholeskyDenseSolver,
    linearization_cls=th.DenseLinearization,
    linear_solver_kwargs={'check_singular': False},
    vectorize=True,
    max_iterations=1000,
    step_size=0.01,
    abs_err_tolerance=1e-12,
    rel_err_tolerance=1e-10,
)
...
with torch.no_grad():
    updated_inputs, info = theseus_optim.forward(
        theseus_inputs,
        optimizer_kwargs={
            "track_best_solution": True,
            "verbose": True,
            "adaptive_damping": True,
        },
    )

Output:

/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_functorch/deprecated.py:80: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.jacrev is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.func.jacrev instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('jacrev')
/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_functorch/deprecated.py:58: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vmap is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.vmap instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vmap', 'torch.vmap')
Nonlinear optimizer. Iteration: 0. Error: 0.0010422127207706806
Nonlinear optimizer. Iteration: 1. Error: 0.0010222213553004384
Nonlinear optimizer. Iteration: 2. Error: 0.001002057790090582
Nonlinear optimizer. Iteration: 3. Error: 0.0009822290825502863
Nonlinear optimizer. Iteration: 4. Error: 0.000962786557471795
Nonlinear optimizer. Iteration: 5. Error: 0.0009437290275676746
Nonlinear optimizer. Iteration: 6. Error: 0.000925049633503353
Nonlinear optimizer. Iteration: 7. Error: 0.0009067409116943864
Nonlinear optimizer. Iteration: 8. Error: 0.0008887955118850926
Nonlinear optimizer. Iteration: 9. Error: 0.0008712062297745013
Nonlinear optimizer. Iteration: 10. Error: 0.0008539660041216934
Nonlinear optimizer. Iteration: 11. Error: 0.000837067913908255
Nonlinear optimizer. Iteration: 12. Error: 0.0008205051755570392
Nonlinear optimizer. Iteration: 13. Error: 0.0008042711402055894
Nonlinear optimizer. Iteration: 14. Error: 0.0007883592910335395
Nonlinear optimizer. Iteration: 15. Error: 0.0007727632406426059
Nonlinear optimizer. Iteration: 16. Error: 0.000757476728488027
Nonlinear optimizer. Iteration: 17. Error: 0.0007424936183605924
Nonlinear optimizer. Iteration: 18. Error: 0.0007278078959179416
Nonlinear optimizer. Iteration: 19. Error: 0.0007134136662641566
Nonlinear optimizer. Iteration: 20. Error: 0.0006993051515765083
Nonlinear optimizer. Iteration: 21. Error: 0.0006854766887784247
Nonlinear optimizer. Iteration: 22. Error: 0.0006719227272574588
Nonlinear optimizer. Iteration: 23. Error: 0.0006586378266272316
Nonlinear optimizer. Iteration: 24. Error: 0.0006456166545323487
Nonlinear optimizer. Iteration: 25. Error: 0.0006328539844952056
Nonlinear optimizer. Iteration: 26. Error: 0.0006203446938033828
...
Nonlinear optimizer. Iteration: 997. Error: 6.596894190383936e-09
Nonlinear optimizer. Iteration: 998. Error: 6.571442528752883e-09
Nonlinear optimizer. Iteration: 999. Error: 6.556629966990791e-09
Nonlinear optimizer. Iteration: 1000. Error: 6.530363298906896e-09
Best solution: tensor([[1.2067e-01, 1.0061e-01, 1.5763e-01, 1.5327e-01, 1.8629e-01, 1.7980e-01,
         1.8828e-01, 1.6603e-01, 2.0621e-01, 2.0190e-01, 1.7889e-01, 1.5373e-01,
         1.2379e-02, 6.2110e-03, 3.1353e-02, 0.0000e+00, 1.0907e-01, 1.3611e-01,
         1.5076e-01, 1.5238e-01, 2.5867e-01, 2.6267e-01, 4.6731e-02, 2.3365e-02,
         4.0898e-02, 3.8744e-02, 0.0000e+00, 1.1120e-02, 2.8723e-02, 1.9866e-02,
         2.5063e-02, 2.5171e-02, 6.5149e-02, 8.6186e-02, 0.0000e+00, 6.3612e-02,
         1.2028e-01, 1.8336e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         3.5393e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 2.0549e-01, 2.0362e-01, 0.0000e+00,
         2.3224e-04, 2.1013e-01, 2.2653e-01, 0.0000e+00, 0.0000e+00, 5.9432e-02,
         8.6928e-02, 1.4458e-01, 1.2281e-01, 7.8795e-02, 8.5664e-02, 1.7547e-01,
         1.8110e-01, 2.3212e-01, 2.1093e-01, 1.9067e-01, 1.9089e-01, 9.7236e-02,
         4.5496e-02, 1.8003e-01, 1.9204e-01, 8.5821e-02, 1.0255e-01, 2.1708e-01,
         2.0921e-01, 1.6690e-01, 1.9001e-01, 1.2343e-03, 0.0000e+00, 4.4516e-02,
         4.8386e-02, 1.7181e-01, 1.3538e-01, 2.0551e-01, 1.9909e-01, 1.8031e-01,
         2.0654e-01, 9.4129e-02, 1.1386e-01, 1.4140e-01, 1.1618e-01, 1.8554e-01,
         1.4200e-01, 1.2141e-01, 1.3473e-01, 1.9721e-01, 1.9239e-01, 1.5729e-01,
         1.8359e-01, 6.0573e-02, 5.1711e-02, 6.2819e-02, 7.1884e-02, 2.0113e-01,
         1.9105e-01, 2.1986e-01, 2.0703e-01, 6.8948e-02, 7.3220e-02, 2.2587e-02,
         8.0984e-02, 2.8575e-01, 2.4508e-01, 2.5235e-01, 2.3108e-01, 1.2910e-02,
         1.8622e-02, 2.5956e-02, 2.9515e-02, 2.5202e-02, 6.5059e-02, 6.6690e-02,
         7.4381e-02, 6.0721e-02, 6.7487e-02, 1.2403e-01, 1.7885e-01, 1.9871e-01,
         1.4261e-01, 1.7522e-01, 1.9300e-01, 0.0000e+00, 0.0000e+00, 3.0852e-02,
         2.6307e-02, 5.6983e-02, 4.0329e-02, 0.0000e+00, 0.0000e+00, 3.8374e-02,
         3.4696e-02, 8.6945e-02, 7.2741e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 7.8142e-02, 4.1291e-02, 1.3528e-01, 1.2924e-01, 1.8291e-02,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 1.7563e-01, 1.7723e-01, 1.1552e-01,
         1.2811e-01, 3.9329e-02, 6.8840e-02, 5.6913e-02, 8.4606e-02, 5.1716e-02,
         8.0458e-03, 0.0000e+00, 0.0000e+00, 3.2180e-02, 4.8571e-02, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 4.8851e-03, 3.4442e-02, 1.8920e-02, 5.6717e-02,
         2.2043e-02, 1.5284e-02, 4.1004e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 1.8279e-01, 2.1345e-01, 2.1420e-01, 1.9960e-01,
         2.6176e-01, 4.2549e-02, 9.5060e-02, 1.7921e-01, 2.2926e-01, 7.2069e-02,
         4.8288e-02, 3.1475e-02, 2.8571e-02, 5.3800e-02, 1.9706e-01, 2.1428e-01,
         5.1281e-02, 1.5674e-01, 2.0720e-01, 1.7855e-01, 1.3070e-01, 1.7558e-01,
         0.0000e+00, 1.1966e-01, 3.4206e-02, 3.4872e-02, 1.4749e-01, 2.3156e-01,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         1.4715e-01, 1.2651e-01, 1.7327e-01, 1.7597e-01, 1.2417e-01, 1.9807e-01,
         1.7696e-02, 2.1160e-01, 2.2127e-01, 1.0615e-01, 2.2034e-01, 2.3861e-01,
         3.2378e-02, 2.1747e-01, 1.9784e-01, 5.5160e-02, 1.9465e-01, 1.7218e-01]])
tensor(0., device='cuda:0')

Although Theseus gets closer to the ground-truth solution, it requires many more iterations than Ceres, which reaches exactly zero error in only a few steps (~5).
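To put a number on "many more iterations", the per-iteration error ratios can be read straight off the log above. The sketch below is only arithmetic on the logged values. The helper names `ratios` and `iterations_per_decade` are made up for illustration and are not part of Theseus. It suggests the error shrinks by a factor of about 0.98 per iteration at the start and about 0.996 near iteration 1000, which looks like slow linear convergence rather than the few-step convergence reported for Ceres.

```python
import math

# Error values copied from the verbose log above (iterations 0-2 and 997-1000).
early = [0.0010422127207706806, 0.0010222213553004384, 0.001002057790090582]
late = [
    6.596894190383936e-09,
    6.571442528752883e-09,
    6.556629966990791e-09,
    6.530363298906896e-09,
]


def ratios(errors):
    """Return error[k + 1] / error[k] for consecutive logged errors."""
    return [nxt / cur for cur, nxt in zip(errors, errors[1:])]


def iterations_per_decade(ratio):
    """Iterations needed to cut the error by 10x at a constant per-step ratio."""
    return math.log(0.1) / math.log(ratio)


early_ratios = ratios(early)
late_ratios = ratios(late)

print("early ratios:", early_ratios)
print("late ratios:", late_ratios)
print("iterations per 10x reduction (late):", iterations_per_decade(late_ratios[0]))
```

At the late-stage ratio, each further order of magnitude costs hundreds of iterations, assuming the rate stays constant, which the log does not guarantee.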

@xphsean12

I also encountered a similar problem.

@luisenp
Contributor

luisenp commented Apr 13, 2023

As mentioned above, please keep in mind that we haven't yet added support for line search methods, so we only have basic control over the step sizes taken by the optimizers. It's expected that Ceres will be much more efficient, considering that it provides many sophisticated line search methods that can have a huge impact on performance.

Also, keep in mind that Theseus is not meant as a replacement for Ceres. Our target applications are models that need differentiable optimization layers as part of larger neural architectures. In such cases, optimization accuracy might be less important, since the parameters of your optimization problem constantly change throughout learning.

That being said, if this error is problematic for your application we can consider increasing priority for adding some form of line search. But if you are concerned about potential bugs and just want to validate with Ceres, you could try disabling line search in Ceres (@fantaosha might have some pointers here), and see how we compare. We would definitely be interested in knowing the outcome of this comparison!

@luisenp
Contributor

luisenp commented Apr 13, 2023

Also, looking at your logs, I'd say using both step_size=0.01 and adaptive_damping: True would probably make optimization really slow. I would start by turning adaptive damping off and try a few different step sizes (e.g., 0.01, 0.03, 0.1, 0.3) to see what works best for that setting. And with adaptive damping turned on you can set the step size to 1.0 and play a bit with the damping parameter. Don't expect this to be as good as Ceres, but I think you should be able to do better than what you are currently getting.
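A toy model may help show why a small step size combined with damping is slow. The sketch below is not Theseus code and makes no claim about its internals. It assumes a one-dimensional linear residual r(x) = a * x - b and a damped Gauss-Newton update scaled by a fixed step size. The function name `lm_errors` and all values are hypothetical. Under these assumptions the error shrinks by (1 - step_size * a^2 / (a^2 + damping))^2 per iteration, which is 0.9801 for step_size=0.01 with no damping. The first ratio in the log above (about 0.981) is consistent with that, though it does not prove the same mechanism is at work.

```python
def lm_errors(a, b, x0, step_size, damping, num_iters):
    """Errors 0.5 * r^2 for r(x) = a * x - b under scaled, damped Gauss-Newton steps."""
    x = float(x0)
    errors = []
    for _ in range(num_iters + 1):
        r = a * x - b
        errors.append(0.5 * r * r)
        # Damped normal equation in 1-D: (a^2 + damping) * delta = -a * r
        delta = -a * r / (a * a + damping)
        # Fixed step size scales the update; no line search is performed.
        x += step_size * delta
    return errors


# Hypothetical problem whose minimizer is x = b / a = 0.2.
small = lm_errors(a=2.0, b=0.4, x0=1.0, step_size=0.01, damping=0.0, num_iters=3)
full = lm_errors(a=2.0, b=0.4, x0=1.0, step_size=1.0, damping=0.0, num_iters=3)
damped = lm_errors(a=2.0, b=0.4, x0=1.0, step_size=0.01, damping=1.0, num_iters=3)

ratio_small = small[1] / small[0]
ratio_damped = damped[1] / damped[0]

print("step_size=0.01, damping=0:", ratio_small)
print("step_size=0.01, damping=1:", ratio_damped)
print("step_size=1.0, damping=0, error after one step:", full[1])
```

In this toy setting a full step with no damping solves the problem in one iteration, while any damping on top of a small step size pushes the ratio even closer to 1. Real problems are nonlinear, so this only motivates trying larger step sizes as suggested above.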

@Muon2
Author

Muon2 commented Apr 14, 2023

Thank you for your reply. I'm currently trying to use Theseus in neural networks, and optimization accuracy is crucial in my setting because my forward function is sensitive to the input values (it's related to skinning in computer graphics). If Theseus cannot achieve similar accuracy, it will result in unacceptable geometry artifacts. I'm looking forward to some form of line search so that I can fully differentiate my pipeline.

@luisenp
Contributor

luisenp commented Apr 14, 2023

Got it, thanks for clarifying. Did you try any of the suggestions in my last post? Curious if any combination of hyperparameters would suffice for your application.

@luisenp
Contributor

luisenp commented May 8, 2023

@Muon2 did you notice any improvements with other optimization parameters?

@Muon2
Author

Muon2 commented May 10, 2023

@luisenp, sorry for the late reply. I tried multiple combinations of hyperparameters, but optimization still takes a long time and does not converge to an acceptable precision. I will keep an eye on future updates.

@tvercaut

I also stumbled onto this issue while trying to port the basic example from scipy:
https://scipy-cookbook.readthedocs.io/items/robust_regression.html

I wasn't expecting to have to tune any hyperparameters to get this to converge. It might be good to add a note on this in the main README until a line search is implemented (I guess this is related to #153).
