Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample saving for Site Dataset + presaved SiteDataset/DataModule for loading presaved samples #290

Merged
merged 25 commits into from
Jan 21, 2025

Conversation

Sukh-P
Copy link
Member

@Sukh-P Sukh-P commented Dec 17, 2024

Pull Request

Tackles #301

Description

Inclues:

  • Adding support for creating samples using the ocf-data-sampler Site torch dataset
  • Adding small torch Dataset for premade netcdf samples
  • Refactor Datamodules so there is a base datamodule and then SiteDatamodule which inherits from that (will move the UK regional Datamodule to this structure as well but it needs some updating to do with Batchkeys first, all this can be a separate PR)
  • Removing the datapipe specified datamodules which won't be used after moving to ocf-data-sampler
  • Small documentation updates in the README (may have missed some)

How Has This Been Tested?

Both the save_samples script and the loading premade samples have been tested locally (further changes required to actually get model training fully without errors)

Also added tests back in with new SiteDatamodule.

TODO Testing

  • Lots of the test and test data only work at the moment with the old BatchKey format, need to update them

pyproject.toml Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Contributor

@peterdudfield peterdudfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks all good. I wonder if we should od https://github.com/openclimatefix/ocf-data-sampler/issues/99 first, then add some test cases in here, just to be double sure

@Sukh-P Sukh-P marked this pull request as draft January 13, 2025 19:26
Copy link

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 45.61404% with 31 lines in your changes missing coverage. Please review.

Please upload report for BASE (dev-data-sampler@430e1d7). Learn more about missing BASE report.

Files with missing lines Patch % Lines
pvnet/data/base_datamodule.py 33.33% 22 Missing ⚠️
pvnet/data/site_datamodule.py 59.09% 9 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                 @@
##             dev-data-sampler     #290   +/-   ##
===================================================
  Coverage                    ?   59.67%           
===================================================
  Files                       ?       28           
  Lines                       ?     1860           
  Branches                    ?        0           
===================================================
  Hits                        ?     1110           
  Misses                      ?      750           
  Partials                    ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Sukh-P Sukh-P changed the title Add sample saving for Site Dataset Add sample saving for Site Dataset + SiteDataset/DataModule for loading presaved samples Jan 14, 2025
@Sukh-P Sukh-P changed the title Add sample saving for Site Dataset + SiteDataset/DataModule for loading presaved samples Add sample saving for Site Dataset + presaved SiteDataset/DataModule for loading presaved samples Jan 14, 2025
@Sukh-P Sukh-P requested a review from peterdudfield January 14, 2025 17:06
@@ -6,7 +6,7 @@ dynamic = ["version", "readme"]
license={file="LICENCE"}

dependencies = [
"ocf_data_sampler==0.0.32",
"ocf_data_sampler==0.0.43",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this update in ocf-data-sampler version will break things from working properly on this dev branch for a bit before we get to fixing/updating everything.

I see two options:

  1. We don't merge this in until we get everything working, pros is that we don't need another branch, con is this PR will get massive (already fairly big)
  2. Perhaps can use another dev branch, the original is kept as the working dev branch for ocf-data-sampler in PVNet and the new one is the one which is the interim one whilst we get all things working. Pros and cons basically the opposite of 1.

The reason for the caution is if we were to merge this in as is I think we lose any remote code which allows ocf-data-sampler to work with PVNet which feels risky?

@peterdudfield @dfulu let me know what you think, may be other options too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I am being too paranoid and it is fine to merge this in as we have local copies of code that work (plus would be able to pull out working code from git history)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yea I can see the decision here. Get the balance right between small PRs, and getting things working properly. I wouldnt commmit to main, but i know this is only dev.

What is there to do in this bit of work to make sure it is working? Do you think it is just updating from BatchKeys to SampleKeys, a bit more?

Without knowing this: I would probably still merge, and then fix afterwards. Its not the main branch, and people can pull a different commit if they need to. It also allows someone to fix issue X, why you doing issue Y.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's mainly moving away from the old BatchKey enum to the SampleKeys we use now I believe, it's used in a few places in PVNet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the tests need updating because they have test data in the old BatchKey format

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i think make a github issue with these things, and merge this one

Copy link
Member Author

@Sukh-P Sukh-P Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you happy to approve? No worries if you haven't had a chance to review yet

@Sukh-P Sukh-P requested a review from dfulu January 14, 2025 17:16
@Sukh-P Sukh-P marked this pull request as ready for review January 14, 2025 17:17
@Sukh-P Sukh-P merged commit 075ae31 into dev-data-sampler Jan 21, 2025
3 checks passed
@Sukh-P Sukh-P deleted the site-data-sampling branch January 21, 2025 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants