Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Planning explicit dependency on Zarr v3 #392

Open
4 tasks
sharkinsspatial opened this issue Jan 27, 2025 · 6 comments
Open
4 tasks

Planning explicit dependency on Zarr v3 #392

sharkinsspatial opened this issue Jan 27, 2025 · 6 comments
Labels
dependencies Updates a dependency v3-migration Required for migration to Zarr-Python 3.0 zarr-python Relevant to zarr-python upstream

Comments

@sharkinsspatial
Copy link
Collaborator

sharkinsspatial commented Jan 27, 2025

We have a few interdependent issues which discuss the next steps for completely migrating to Zarr v3 but I thought it may be helpful to outline a rough plan of attack which consolidates all of these issues in one spot so everyone can contribute to the discussion on how to achieve this. I'll list things sequentially, but several of these may need to be approached in a single PR due to dependency conflicts.

  • Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175 Replace our existing Zarray metadata representation with the new zarr-python v3 class. This change will touch a significant portion of the codebase as our serialization/deserialization logic and readers all rely on the existing metadata structure.
  • Use in-memory icechunk stores in roundtrip tests #376 Replace kerchunk in roundtrip tests with in-memory Icechunk stores. This would likely be made simpler by first updating our Zarray representation to remove some of the v3 specific logic needed when Icechunk support was introduced.
  • Switch tests to use HDF reader instead of kerchunk-based HDF5 reader #374 Switch our HDF5 roundtrip test to use the HDFVirtualBackend. Along with the previous steps, this would remove the bulk of our dependence on kerchunk and free us to depend on Zarr v3 explicitly with only a limited loss of existing functionality. There are several outstanding issues (and likely more) with the HDFVirtualBackend that were discovered and raised during the Pangeo hack day in December that will need to be addressed. I'll link to a separate issue with the plan for tackling these.
  • Vendor kerchunk readers? #377 Copy remaining kerchunk readers to Virtualizarr. To avoid losing support for netCDF3 and FITS we can temporarily copy and paste these kerchunk readers into Virtualizarr and use the dataset->kerchunk reader->virtualizarr kerchunk reader approach until dedicated Virtualizarr readers can be developed for these formats.

@TomNicholas It would be great to get your feedback on how you think these steps should be organized into PRs so that we can make manageable changes but still execute the necessary parts of the test suite. @abarciauskas-bgse and I have some availability now to start tackling #17 and I'm going to begin working to stabilize HDFVirtualBackend so that it is hopefully robust enough for the majority of current use cases.

@maxrjones
Copy link
Member

I'd also be glad to contribute to this effort. I'd like to confirm the goals:

  1. Support only Zarr-Python>=3 as a required dependency
  2. Produce zarr_format=3 compliant virtual stores.

If I'm correct in both being goals, I think we'll also need to figure out what to do with the kerchunk writer since that's based on zarr_format=2.

@TomNicholas
Copy link
Member

TomNicholas commented Jan 28, 2025

Thanks for raising this @sharkinsspatial, I was planning to raise an issue like this myself.

how you think these steps should be organized into PRs

I think that your bullets roughly correspond to the PRs we need, but we should actually do the steps in the opposite order listed.

The dependency on kerchunk can be re-introduced at any point, once a v3-compatible release of kerchunk becomes available.

However, I want to have this all done by the time Icechunk releases 1.0 (~late Feb), which means we can't really wait around for kerchunk.

Once we have stability in the dependencies and in the external API we can work on refactors at our leisure:

  1. Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175 Replace our existing Zarray metadata representation with the new zarr-python v3 class. This is technically optional - we never actually have to change this internal detail, but we should because we're currently duplicating logic. We can only change this with zarr-python pinned to >=3.0.0,and it will be easiest to change this starting from all green tests.
  2. Re-implement loadable_variables using FunctionalStore. (see Loading data from ManifestArrays without saving references to disk first #124)
  3. Refactor some more tests as described in Refactor tests around expected properties #394. The aim would be to make it clearer which tests are testing readers vs manifest internals vs integration, and what the definition of correct is (i.e. reader-specific tests comparing results of loading a file format against those from loading the same file format using an existing xarray backend).

That should put us on a solid ground for future feature additions, with everything passing with released versions of all dependencies, and a clearer separation of responsibilities within the test suite.

Any issues with the HDF reader can be tackled as necessary. It would be great to have another meta-issue to track those @sharkinsspatial. But I think that none of those fixes are actually pre-requisites to this plan, as there are currently no tests that pass with kerchunk's HDF5 reader but fail with your HDF reader.


@maxrjones yes we want to only support zarr-python>=3.0.0.

produce zarr_format=3-compliant virtual stores.

Zarr-python v3 supports reading from v2-compliant stores. But Icechunk only obeys the zarr v3 spec, so it's a non-issue for icechunk virtual stores.

For kerchunk that's a good point - right now a v3-compliant kerchunk store I guess is undefined, as kerchunk itself does not support v3. So the kerchunk writer will have to raise if zarr_format=3 is passed to it.

Whether or not we should support writing to zarr_format=2 kerchunk format is less obvious to me. We certainly could do that, that's what we already have support for. But once all the other dependencies are sorted out then there is much less reason to even need that feature.

@maxrjones maxrjones mentioned this issue Jan 29, 2025
@TomNicholas TomNicholas added zarr-python Relevant to zarr-python upstream dependencies Updates a dependency v3-migration Required for migration to Zarr-Python 3.0 labels Jan 29, 2025
@maxrjones
Copy link
Member

FYI I created a project board because I was having difficulties remembering the various dependency conflicts (especially with the alpha icechunk releases). The README will track the dependencies for Kerchunk, Icechunk, and VirtualiZarr, and the phases are broken down according to Tom's outline above. Here's the project board - https://github.com/orgs/zarr-developers/projects/6. I'd need a organization owner to make it public but hopefully everyone has access to the Zarr-Python org to see the board🤞

Image

@TomNicholas
Copy link
Member

Thanks @maxrjones!

@joshmoore I don't believe I have the necessary permissions to make this project public either - would you mind awfully either doing that (or better yet giving me such permissions)?

@joshmoore
Copy link
Member

👍 on figuring out the permissions. Pinged on zulip for a chat.

@joshmoore
Copy link
Member

Update: GitHub permissions require organization admin status to make projects public (😦) The same permissions are needed to add people to a team if they aren't in the org.

There appear to be a number of workarounds for pieces of this (like this app to add people to the org)

We set the project permission together and double checked team membership. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Updates a dependency v3-migration Required for migration to Zarr-Python 3.0 zarr-python Relevant to zarr-python upstream
Projects
None yet
Development

No branches or pull requests

4 participants