zarr.array from an existing zarr.Array #2622

Open
wants to merge 42 commits into
base: main

Conversation

brokkoli71
Member

@brokkoli71 brokkoli71 commented Jan 2, 2025

Added concurrent streaming of the source array into the new array.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@brokkoli71 brokkoli71 marked this pull request as draft January 2, 2025 16:54
@brokkoli71
Member Author

Do we also want concurrency for different chunk sizes?

@normanrz
Member

normanrz commented Jan 8, 2025

> Do we also want concurrency for different chunk sizes?

That would be nice, if the chunk sizes are somewhat compatible, i.e. one is a multiple of the other.
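
A minimal sketch of what "somewhat compatible" could mean in code, assuming the check is applied per axis and that chunk lengths are positive integers. The helper name `chunks_compatible` is hypothetical and not part of zarr-python.

```python
def chunks_compatible(src_chunks, dst_chunks):
    """Return True if, along every axis, one chunk length is a whole multiple of the other.

    Hypothetical helper for illustration; assumes positive integer chunk lengths.
    """
    if len(src_chunks) != len(dst_chunks):
        raise ValueError("chunk shapes must have the same number of dimensions")
    return all(
        max(s, d) % min(s, d) == 0  # larger length divisible by the smaller one
        for s, d in zip(src_chunks, dst_chunks)
    )
```

With this definition, source chunks `(10, 10)` and destination chunks `(5, 20)` count as compatible, while `(10,)` and `(3,)` do not.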

src/zarr/core/array.py (outdated review thread, resolved)
@d-v-b
Contributor

d-v-b commented Jan 8, 2025

  • (Is there some measure to prevent this that I am not aware of?)

If you are trying to write K input chunks into M output chunks, you can partition your K chunks into sets such that, within each set, every chunk can be written independently of the others. Then you write the sets one after another. In the worst case there will be one set per chunk, but you are guaranteed to avoid write collisions this way.
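
The partitioning idea above can be sketched roughly as follows. This is an illustrative greedy grouping on regular chunk grids, not the code in this PR; the function names and the choice of a greedy strategy are assumptions.

```python
from itertools import product


def touched_output_chunks(index, shape, src_chunks, dst_chunks):
    """Output-grid indices overlapped by the input chunk at grid position `index`."""
    ranges = []
    for i, n, s, d in zip(index, shape, src_chunks, dst_chunks):
        start = i * s
        stop = min((i + 1) * s, n)  # clip the last, possibly partial, chunk
        ranges.append(range(start // d, (stop - 1) // d + 1))
    return set(product(*ranges))


def partition_into_independent_sets(shape, src_chunks, dst_chunks):
    """Greedily group input chunks so that no two in a group share an output chunk."""
    grid = [range(-(-n // s)) for n, s in zip(shape, src_chunks)]  # ceil division
    groups = []  # each entry: (input chunks in the group, output chunks they claim)
    for index in product(*grid):
        outputs = touched_output_chunks(index, shape, src_chunks, dst_chunks)
        for members, claimed in groups:
            if claimed.isdisjoint(outputs):
                members.append(index)
                claimed |= outputs  # in-place update of the group's claimed set
                break
        else:
            groups.append(([index], set(outputs)))
    return [members for members, _ in groups]
```

Chunks inside one returned group could be written concurrently, and the groups themselves would be processed one after another. For example, a length-12 array with input chunks of 4 and output chunks of 6 yields two groups, because the middle input chunk straddles both output chunks.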

@dstansby dstansby added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jan 9, 2025
src/zarr/core/array.py (outdated review thread, resolved)
@d-v-b
Contributor

d-v-b commented Jan 14, 2025

one question to answer here is what "auto" means for chunks if the user passes in a chunked array, but they want to use zarr-python's auto-chunking instead of the chunks that came with the array.

We might want to use a separate value that means "copy the chunks this object already has", which is distinct from "generate some chunks using the chunking heuristics". maybe something like ChunksLike: Literal['auto'] | Literal['keep'] | ShapeLike?
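
A rough sketch of how the proposed distinction might look, assuming `ShapeLike` can be approximated by a tuple of ints. The `resolve_chunks` helper and its `guess` callback (standing in for the chunking heuristics) are hypothetical names used only for illustration.

```python
from typing import Callable, Literal, Optional, Union

Shape = tuple[int, ...]  # stand-in for ShapeLike in this sketch
ChunksLike = Union[Literal["auto", "keep"], Shape]


def resolve_chunks(
    chunks: ChunksLike,
    source_chunks: Optional[Shape],
    guess: Callable[[], Shape],
) -> Shape:
    """Turn a ChunksLike value into a concrete chunk shape."""
    if chunks == "keep":
        # "keep": copy the chunks the source object already has
        if source_chunks is None:
            raise ValueError('chunks="keep" requires a chunked source array')
        return source_chunks
    if chunks == "auto":
        # "auto": ignore the source chunks and use the heuristic
        return guess()
    # anything else is treated as an explicit chunk shape
    return tuple(chunks)
```

Keeping `"keep"` and `"auto"` as separate literals means a chunked input never silently changes what `"auto"` does.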

@brokkoli71
Member Author

brokkoli71 commented Jan 15, 2025

> one question to answer here is what "auto" means for chunks if the user passes in a chunked array, but they want to use zarr-python's auto-chunking instead of the chunks that came with the array.

Good point! I like the idea of distinguishing between keep and auto.

@github-actions github-actions bot removed the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jan 15, 2025
@brokkoli71 brokkoli71 marked this pull request as ready for review January 15, 2025 19:55
# Conflicts:
#	src/zarr/core/array.py
@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jan 30, 2025
@d-v-b
Contributor

d-v-b commented Jan 30, 2025

now that #2761 is in, could we use from_array inside create_array (after the data / dtype / shape validation)?

@brokkoli71
Member Author

brokkoli71 commented Jan 30, 2025

> now that #2761 is in, could we use from_array inside create_array (after the data / dtype / shape validation)?

@d-v-b Currently, in this PR from_array calls create_array. Is it redundant to have both from_array and create_array to create an array from another array? Or do you see a benefit in having both?

@d-v-b
Contributor

d-v-b commented Jan 30, 2025

> > now that #2761 is in, could we use from_array inside create_array (after the data / dtype / shape validation)?
>
> @d-v-b Currently, in this PR from_array calls create_array. Is it redundant to have both from_array and create_array to create an array from another array? Or do you see a benefit in having both?

create_array should call from_array if the user provided data; from_array should call the newly added init_array to persist the array metadata, and then store the array data.
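
The proposed layering can be sketched with toy stand-ins. The three functions below only share their names with the ones discussed here: the signatures, the dict-based store, and the 1-D list data are assumptions for illustration and do not reflect zarr-python's actual API.

```python
def init_array(store, name, shape, chunks):
    """Persist only the array metadata (toy stand-in)."""
    store[f"{name}/metadata"] = {"shape": tuple(shape), "chunks": tuple(chunks)}


def from_array(store, name, data, chunks):
    """Write the metadata via init_array first, then store the data chunk by chunk."""
    init_array(store, name, shape=(len(data),), chunks=chunks)
    step = chunks[0]
    for start in range(0, len(data), step):
        store[f"{name}/c/{start // step}"] = list(data[start:start + step])


def create_array(store, name, shape=None, chunks=(2,), data=None):
    """Entry point: validate the arguments, then delegate to from_array if data is given."""
    if data is not None:
        if shape is not None and tuple(shape) != (len(data),):
            raise ValueError("shape does not match data")
        from_array(store, name, data, chunks)
    else:
        if shape is None:
            raise ValueError("either shape or data is required")
        init_array(store, name, shape, chunks)
```

In this arrangement the dependency runs in one direction only (create_array, then from_array, then init_array), so metadata is written in a single place regardless of whether data was supplied.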

Labels
needs release notes (automatically applied to PRs which haven't added release notes)
Development

Successfully merging this pull request may close these issues.

[v3] zarr.array from an existing zarr.Array
4 participants