
Update ClimaTimeSteppers to 0.8.1 #3542

Merged: 2 commits into main on Jan 30, 2025

Conversation

@dennisYatunin (Member) commented Jan 25, 2025

Purpose

This PR updates ClimaTimeSteppers to version 0.8.1, which differs from 0.7.40 in that the post_explicit! and post_implicit! callbacks are renamed to cache! and cache_imp!. In the future, we can explore splitting the cache in a similar manner to the autodiff debugging PR.
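For reference, a minimal sketch of where the rename shows up. The keyword names follow the PR description, but the constructor form is an assumption based on the ClimaODEFunction API, and the placeholder functions are hypothetical (the real versions live in ClimaAtmos):

```julia
using ClimaTimeSteppers

# Placeholder tendency and cache functions, just to show the new callback names.
T_exp!(Yₜ, Y, p, t) = Yₜ .= zero(eltype(Yₜ))  # explicit tendency
cache!(Y, p, t) = nothing      # was `post_explicit!` in v0.7.40
cache_imp!(Y, p, t) = nothing  # was `post_implicit!` in v0.7.40

func = ClimaODEFunction(; T_exp!, cache!, cache_imp!)
```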


  • I have read and checked the items on the review checklist.

@dennisYatunin requested a review from @szy21 on January 25, 2025 03:51
@szy21 (Member) commented Jan 25, 2025

Some jobs failed because of the zero function: e.g. https://buildkite.com/clima/climaatmos-ci/builds/22412#01949b97-4f3c-477b-8ea5-b0a0e49491d7.

@charleskawczynski (Member)

> Some jobs failed because of the zero function: e.g. https://buildkite.com/clima/climaatmos-ci/builds/22412#01949b97-4f3c-477b-8ea5-b0a0e49491d7.

Xref.

I can open a new PR in ClimaCore, but we’ll need another release.

@charleskawczynski (Member)

We need to upgrade to the latest ClimaCore, too.
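In practice that means bumping both packages together. A hedged sketch of the upgrade using Pkg (the specific compatible ClimaCore version is not stated in this thread, so it is left to the resolver):

```julia
import Pkg

# ClimaTimeSteppers 0.8 requires the latest ClimaCore, so bump both.
Pkg.add(name = "ClimaTimeSteppers", version = "0.8.1")
Pkg.update("ClimaCore")  # pull in the newest compatible ClimaCore
```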

@szy21 (Member) commented Jan 25, 2025

> > Some jobs failed because of the zero function: e.g. https://buildkite.com/clima/climaatmos-ci/builds/22412#01949b97-4f3c-477b-8ea5-b0a0e49491d7.
>
> Xref.
>
> I can open a new PR in ClimaCore, but we’ll need another release.

Sounds good, thanks!

@charleskawczynski (Member)

Actually, it turned out that we just weren't using the latest ClimaCore, which is required for the latest version of ClimaTimeSteppers. I'm upgrading all of the dependencies in #3543.

@charleskawczynski (Member)

We could alternatively do that here, but it might be nice to confirm that nothing else changes behavior, or at least to merge the PRs separately.

@szy21 (Member) commented Jan 25, 2025

> Actually, it turned out that we just weren't using the latest ClimaCore, which is required for the latest version of ClimaTimeSteppers. I'm upgrading all of the dependencies in #3543.

Do we still need to define zero here after updating ClimaCore?

@charleskawczynski (Member)

> Do we still need to define zero here after updating ClimaCore?

Yes, that’s still needed.
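For readers unfamiliar with the failure above: the CI jobs broke because ClimaTimeSteppers calls `zero` on objects whose types didn't support it, and in Julia the fix is to extend `Base.zero` for those types. A toy illustration of that pattern (the `Flux2D` type is hypothetical; the actual definition lives in ClimaAtmos and is not shown in this thread):

```julia
# Toy struct standing in for a custom state/cache type.
struct Flux2D{T}
    u::T
    v::T
end

# Extend `Base.zero` so generic code (e.g. a time stepper) can zero it out.
Base.zero(::Type{Flux2D{T}}) where {T} = Flux2D(zero(T), zero(T))
Base.zero(x::Flux2D) = zero(typeof(x))

@assert zero(Flux2D(1.0, 2.0)) == Flux2D(0.0, 0.0)
```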

@dennisYatunin (Member, Author)

Testing GPU longruns here: https://buildkite.com/clima/climaatmos-gpulongruns/builds/479

@szy21 (Member) commented Jan 26, 2025

Any idea why only one reproducibility test fails? I would assume this changes MSE for all jobs?

@szy21 (Member) commented Jan 26, 2025

One of the longruns that was stable before now breaks: https://buildkite.com/clima/climaatmos-gpulongruns/builds/479#0194a03f-a47b-4279-80ad-3b078615026d. I still want to get this in, though, because it is much better than what we currently have (with ClimaTimeSteppers 0.7.39).

@szy21 (Member) commented Jan 27, 2025

I'm still confused about the reproducibility test, though. For example, it doesn't change the MSE for the moist baroclinic wave in CI, but it changes the stability of the moist baroclinic wave in the longrun, which is just a higher-resolution version of the CI job.

@dennisYatunin (Member, Author)

I'm not sure why this change would hurt longrun stability, but my impression is that it will only have a significant effect when the implicit tendency is large or there are many Newton iterations. If those conditions aren't satisfied, it might take a very long time for the conservation errors at element boundaries to build up? I think the only way to know for sure if something is wrong with the way we do DSS in v0.8.0 is to try it out with the analytic mountain test next week.

For now, though, do you think it's safe to bump the ref counter and version number and merge this in? It's definitely more stable than with v0.7.39.

@szy21 (Member) commented Jan 27, 2025

Yes, I'm fine with merging this in. One GPU job keeps running out of memory though.

@dennisYatunin (Member, Author)

I think that might be related to one of @charleskawczynski's recent ClimaCore changes? It doesn't seem to be too much of a memory increase, so for now I'll just double the memory allocation for that job.

@charleskawczynski (Member)

> I think that might be related to one of @charleskawczynski's recent ClimaCore changes? It doesn't seem to be too much of a memory increase, so for now I'll just double the memory allocation for that job.

I don't think so; the ClimaCore updates were already merged in #3543, and that didn't require a ref counter increment.

I'm fine with bumping the ref counter if CTS 0.8.0 is more stable.

> Any idea why only one reproducibility test fails? I would assume this changes MSE for all jobs?

This is definitely puzzling. I would expect changes for all the sphere cases:

  • no lim baroclinic wave (ρe) equilmoist
  • no lim baroclinic wave (ρe) equilmoist (deep sphere)
  • no lim held suarez (ρe) equilmoist hightop sponge
  • aquaplanet (ρe_tot) equil allsky monin_obukhov varying insol gravity wave (gfdl_restart) high top 1-moment
  • aquaplanet (ρe_tot) equil allsky monin_obukhov varying insol gravity wave (raw_topo) high top zonally asymmetric
  • Diagnostic EDMFX aquaplanet with TKE

I think I would need to review the CTS dss PR more carefully to understand how things are changing.

@szy21 (Member) commented Jan 27, 2025

> > Any idea why only one reproducibility test fails? I would assume this changes MSE for all jobs?
>
> This is definitely puzzling. I would expect changes for all the sphere cases:
>
>   • no lim baroclinic wave (ρe) equilmoist
>   • no lim baroclinic wave (ρe) equilmoist (deep sphere)
>   • no lim held suarez (ρe) equilmoist hightop sponge
>   • aquaplanet (ρe_tot) equil allsky monin_obukhov varying insol gravity wave (gfdl_restart) high top 1-moment
>   • aquaplanet (ρe_tot) equil allsky monin_obukhov varying insol gravity wave (raw_topo) high top zonally asymmetric
>   • Diagnostic EDMFX aquaplanet with TKE
>
> I think I would need to review the CTS dss PR more carefully to understand how things are changing.

Ah, I think one difference is that only Diagnostic EDMFX aquaplanet with TKE is run with ARS343. All others are run with SSPKnoth. Could that be the reason?

@charleskawczynski (Member)

> Ah, I think one difference is that only Diagnostic EDMFX aquaplanet with TKE is run with ARS343. All others are run with SSPKnoth. Could that be the reason?

That definitely seems relevant. I'd bet so.
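For context, a rough sketch of how the two schemes discussed here are constructed in ClimaTimeSteppers. The constructor forms are assumptions based on the package's API, and `max_iters = 2` is an illustrative value, not necessarily what CI uses:

```julia
import ClimaTimeSteppers as CTS

# IMEX ARK scheme, as used by the "Diagnostic EDMFX aquaplanet with TKE" job.
ark_algo = CTS.IMEXAlgorithm(CTS.ARS343(), CTS.NewtonsMethod(; max_iters = 2))

# Rosenbrock scheme, as used by the other listed jobs. Per the discussion
# below, Rosenbrock stepping did not change between v0.7.39 and v0.8.x,
# which would explain why those jobs show no MSE changes.
rosenbrock_algo = CTS.RosenbrockAlgorithm(CTS.tableau(CTS.SSPKnoth()))
```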

@dennisYatunin (Member, Author)

Hmm, if the issue isn't with ClimaCore, I'm a bit surprised that the ClimaTimeSteppers update would cause a memory increase. All I did in the switch to v0.8.0 was rename some things and add an extra call to DSS, and I changed the order of some computations. Maybe the modifications to the ClimaTimeSteppers.update_stage! method made its LLVM code longer, and that's causing slightly more memory overhead during compilation? Regardless, in the interest of moving things along I'll just increase that job's allocation for now.

As for your observation, @szy21, you're right that the lack of MSE changes is because Rosenbrock didn't change from v0.7.39 to v0.8.0. Do the current Rosenbrock results look correct to you? I don't think we need an extra DSS call there, but I also don't have a very good intuition for the Rosenbrock method.

In addition, I've thought of a slight variation to the ARK algorithm from v0.8.0. I'm currently testing it out with the GPU longruns here: https://buildkite.com/clima/climaatmos-gpulongruns/builds/481. If it looks more stable than the current version, we can make one final update to ClimaTimeSteppers.
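Since the stability differences in this thread hinge on the extra DSS call added in v0.8.0, here is a rough sketch of the shape of the dss! hook that ClimaTimeSteppers invokes between stages. The state components `Y.c` (cell centers) and `Y.f` (cell faces) are assumptions modeled on ClimaAtmos's state layout, not copied from this PR:

```julia
using ClimaCore: Spaces

# Sketch of a dss! callback: restore continuity at spectral-element
# boundaries by applying weighted direct stiffness summation to the state.
function dss!(Y, p, t)
    Spaces.weighted_dss!(Y.c)
    Spaces.weighted_dss!(Y.f)
    return nothing
end
```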

@szy21 (Member) commented Jan 27, 2025

> As for your observation, @szy21, you're right that the lack of MSE changes is because Rosenbrock didn't change from v0.7.39 to v0.8.0. Do the current Rosenbrock results look correct to you? I don't think we need an extra DSS call there, but I also don't have a very good intuition for the Rosenbrock method.

They look good to me. But given that we are running dycore simulations with ARS, I will change CI to use ARS as well. I will do that after this PR is in.

> In addition, I've thought of a slight variation to the ARK algorithm from v0.8.0. I'm currently testing it out with the GPU longruns here: https://buildkite.com/clima/climaatmos-gpulongruns/builds/481. If it looks more stable than the current version, we can make one final update to ClimaTimeSteppers.

The one job that failed before hasn't failed yet, so I guess this version is slightly more stable.

@dennisYatunin changed the title from "Update ClimaTimeSteppers to 0.8.0" to "Update ClimaTimeSteppers to 0.8.1" Jan 28, 2025
@dennisYatunin (Member, Author)

PR has been updated to use 0.8.1, whose GPU longruns were tested here: https://buildkite.com/clima/climaatmos-gpulongruns/builds/481

I'm going to wait until #3548 is done so that coupler work isn't delayed, and then I'll rebase and merge this in.

@trontrytel (Member)

Can this be merged now?

@dennisYatunin added this pull request to the merge queue Jan 30, 2025
@Sbozzolo (Member)

Merging this manually (#3556 should fix the problem with the non-responding test).

@Sbozzolo removed this pull request from the merge queue due to a manual request Jan 30, 2025
@Sbozzolo merged commit 2612ba7 into main Jan 30, 2025
16 of 19 checks passed
@Sbozzolo deleted the dy/update_cts branch January 30, 2025 01:56
@trontrytel (Member)

Thank you!
