[v3] Hierarchy api #1912

Open
wants to merge 22 commits into
base: main

@d-v-b d-v-b commented May 26, 2024

This PR adds a declarative API for defining Zarr arrays and groups independently of storage. Using this API, users and developers can create and manipulate Zarr hierarchies, adding nodes and modifying their attributes, and serialize the hierarchy to storage with a single method call.

Implementation

This PR adds a module called hierarchy.py that contains two classes, ArrayModel and GroupModel, which model Zarr arrays and groups, respectively. "Model" is the important concept here: ArrayModel has all the array metadata attributes, like shape and dtype, but it has no connection to storage or chunks, so you can't use it to read or write array data. The same goes for GroupModel: it has all the static attributes of a Zarr group but no connection to storage, so you cannot access stored sub-groups or sub-arrays through it. (You can, however, access its sub-GroupModel and sub-ArrayModel instances, but these are just models.) The classes are pretty simple, so I will just paste the current code here:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Self

# Array, Group, ArrayV3Metadata, GroupMetadata, and StorePath come from
# the zarr-python v3 code base.


class ArrayModel(ArrayV3Metadata):
    """
    A model of a Zarr v3 array.
    """

    @classmethod
    def from_stored(cls: type[Self], node: Array) -> Self:
        """
        Create an array model from a stored array.
        """
        return cls.from_dict(node.metadata.to_dict())

    def to_stored(self, store_path: StorePath, exists_ok: bool = False) -> Array:
        """
        Create a stored version of this array.
        """
        # exists_ok kwarg is unhandled until we wire it up to the
        # array creation routines

        return Array.from_dict(store_path=store_path, data=self.to_dict())


@dataclass(frozen=True)
class GroupModel(GroupMetadata):
    """
    A model of a Zarr v3 group.
    """

    members: dict[str, GroupModel | ArrayModel] | None = field(default_factory=dict)

    @classmethod
    def from_stored(cls: type[Self], node: Group, *, depth: int | None = None) -> Self:
        """
        Create a GroupModel from a Group. This function is recursive. The depth of recursion is
        controlled by the `depth` argument, which is either None (no depth limit) or a finite natural number
        specifying how deep into the hierarchy to parse.
        """
        members: dict[str, GroupModel | ArrayModel] = {}

        if depth is None:
            new_depth = depth
        else:
            new_depth = depth - 1

        if depth == 0:
            # Depth exhausted: return this group without reading its members.
            return cls(**node.metadata.to_dict(), members=None)

        else:
            for name, member in node.members:
                item_out: ArrayModel | GroupModel
                if isinstance(member, Array):
                    item_out = ArrayModel.from_stored(member)
                else:
                    item_out = GroupModel.from_stored(member, depth=new_depth)

                members[name] = item_out

        return cls(attributes=node.metadata.attributes, members=members)
```
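
To make the `depth` semantics concrete, here is a minimal, storage-free sketch. It does not use zarr-python at all: `ToyArray`, `ToyGroup`, and `model_from_tree` are hypothetical stand-ins, and the "stored" hierarchy is just a nested dict. It assumes the behavior described in the docstring above, namely that a group reached with `depth == 0` is returned with `members=None`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyArray:
    """Hypothetical stand-in for ArrayModel: metadata only, no chunk data."""

    shape: tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class ToyGroup:
    """Hypothetical stand-in for GroupModel."""

    attributes: dict = field(default_factory=dict)
    # None means "members were not read"; {} means "read, and there were none".
    members: dict[str, ToyGroup | ToyArray] | None = field(default_factory=dict)


def model_from_tree(tree: dict, depth: int | None = None) -> ToyGroup:
    """Build a ToyGroup from a nested dict, descending at most `depth` levels."""
    attrs = tree.get("attributes", {})
    if depth == 0:
        # Depth budget exhausted: keep the group, skip its members.
        return ToyGroup(attributes=attrs, members=None)

    new_depth = None if depth is None else depth - 1
    members: dict[str, ToyGroup | ToyArray] = {}
    for name, child in tree.get("members", {}).items():
        if "shape" in child:  # toy convention: arrays carry a "shape" key
            members[name] = ToyArray(shape=tuple(child["shape"]), dtype=child["dtype"])
        else:
            members[name] = model_from_tree(child, depth=new_depth)
    return ToyGroup(attributes=attrs, members=members)


tree = {
    "attributes": {"title": "root"},
    "members": {
        "a": {"shape": [10, 10], "dtype": "uint8"},
        "sub": {
            "attributes": {},
            "members": {"b": {"shape": [4], "dtype": "float32"}},
        },
    },
}

full = model_from_tree(tree)  # no depth limit
shallow = model_from_tree(tree, depth=1)  # children of the root only
```

With `depth=1`, the root's direct children are modeled, but the `sub` group is returned with `members=None` rather than an empty dict, so callers can tell "not parsed" apart from "empty".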

Goals

  • This work is necessary for single-shot hierarchy creation with batched IO. If we can leverage batched IO operations, it should be possible to concurrently write (and read) all the zarr.json metadata documents in a large hierarchy, which should vastly speed up these interactions on high-latency storage.
  • A flattened, consolidated-metadata-like internal representation for easy hierarchy creation. A Zarr hierarchy can be represented as dict[str_that_obeys_path_semantics, ArrayModel | GroupModel]. This has been useful over in pydantic-zarr for a variety of things, and I think it would be useful here. It could also provide a serialization format for consolidated metadata in Zarr v3, which so far has not been defined.
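
Both goals can be illustrated together with a rough sketch. Nothing below is zarr-python API: `flatten`, `metadata_key`, `ToyStore`, and `write_hierarchy` are hypothetical names, the metadata documents are toy dicts rather than valid Zarr v3 metadata, and the store only simulates latency with `asyncio.sleep`. The idea is that a nested hierarchy is first flattened into a path-keyed dict, and then every `zarr.json` document is written concurrently with `asyncio.gather`.

```python
from __future__ import annotations

import asyncio
import json


def flatten(node: dict, path: str = "") -> dict[str, dict]:
    """Flatten a nested hierarchy into {path: metadata document}."""
    doc = {k: v for k, v in node.items() if k != "members"}
    flat = {path or "/": doc}
    for name, child in node.get("members", {}).items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


def metadata_key(path: str) -> str:
    """Map a hierarchy path such as "/sub/b" to a store key "sub/b/zarr.json"."""
    prefix = path.strip("/")
    return f"{prefix}/zarr.json" if prefix else "zarr.json"


class ToyStore:
    """Hypothetical in-memory store where each write pays a fixed latency."""

    def __init__(self, latency: float = 0.01) -> None:
        self.latency = latency
        self.data: dict[str, bytes] = {}

    async def set(self, key: str, value: bytes) -> None:
        await asyncio.sleep(self.latency)  # simulate a network round trip
        self.data[key] = value


async def write_hierarchy(store: ToyStore, flat: dict[str, dict]) -> None:
    """Write all metadata documents concurrently instead of one at a time."""
    await asyncio.gather(
        *(
            store.set(metadata_key(path), json.dumps(doc).encode())
            for path, doc in flat.items()
        )
    )


hierarchy = {
    "node_type": "group",
    "attributes": {"title": "root"},
    "members": {
        "a": {"node_type": "array", "shape": [10, 10], "data_type": "uint8"},
        "sub": {
            "node_type": "group",
            "attributes": {},
            "members": {
                "b": {"node_type": "array", "shape": [4], "data_type": "float32"}
            },
        },
    },
}

flat = flatten(hierarchy)
store = ToyStore()
asyncio.run(write_hierarchy(store, flat))
```

Because the writes overlap, the total wall time in this toy setup is close to one round trip rather than one round trip per node, which is the speedup the first goal is after.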

Process

Unlike a lot of other v3 efforts, this PR adds new functionality that was never in zarr-python before. I'm basing the design here on work I did over in pydantic-zarr, so there's some prior art, but I am happy to explore and experiment as needed. It might take a while before we have an API everyone is happy with.

@d-v-b d-v-b added the V3 Affects the v3 branch label May 30, 2024
@d-v-b d-v-b marked this pull request as ready for review May 30, 2024 15:19

```diff
 @classmethod
-def from_dict(
+async def from_dict(
```

This is a notable change that was needed to get the hierarchy API to work. Previously, from_dict was not async, but it should be.

return Array.from_dict(store_path=store_path, data=self.to_dict())

@classmethod
def from_array(

This is a convenience method that makes it less painful to create ArrayModel instances, because it uses as many defaults and inferred values as possible.
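
As a rough illustration of what "defaults and inferred values" can mean, here is a sketch that is not the PR's implementation. `ToyArrayModel`, its fields, and the `FakeArray` helper are hypothetical. The sketch assumes that shape and dtype are read from the input object, that the chunk shape defaults to the full array shape, and that codecs default to a single bytes codec (written here as the string `"bytes"`).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToyArrayModel:
    """Hypothetical stand-in for ArrayModel; fields and defaults are illustrative."""

    shape: tuple[int, ...]
    dtype: str
    chunk_shape: tuple[int, ...]
    codecs: tuple[str, ...] = ("bytes",)
    attributes: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_array(
        cls,
        array: Any,
        *,
        chunk_shape: tuple[int, ...] | None = None,
        attributes: dict[str, Any] | None = None,
    ) -> ToyArrayModel:
        """Infer shape and dtype from `array`; fall back to defaults elsewhere."""
        shape = tuple(array.shape)
        return cls(
            shape=shape,
            dtype=str(array.dtype),
            # Assumption: with no chunking given, use one chunk spanning the array.
            chunk_shape=shape if chunk_shape is None else tuple(chunk_shape),
            attributes={} if attributes is None else dict(attributes),
        )


@dataclass
class FakeArray:
    """Minimal array-like object exposing only .shape and .dtype."""

    shape: tuple[int, ...]
    dtype: str


source = FakeArray(shape=(100, 200), dtype="uint8")
model = ToyArrayModel.from_array(source)
chunked = ToyArrayModel.from_array(source, chunk_shape=(10, 20))
```

Any object with `.shape` and `.dtype` attributes, such as a NumPy array, would work in place of `FakeArray` here.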

codecs: Iterable[Codec | JSON],
attributes: None | dict[str, JSON],
dimension_names: None | Iterable[str],
codecs: Iterable[Codec | JSON] = (BytesCodec(),),

I added some default values here to make array creation easier. Happy to revert if this is controversial.

@d-v-b d-v-b requested review from normanrz and jhamman May 30, 2024 15:23
d-v-b and others added 17 commits June 1, 2024 13:45
* Run sphinx directly on readthedocs

* Update doc build script
Bumps the actions group with 6 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `3` | `4` |
| [github/codeql-action](https://github.com/github/codeql-action) | `2` | `3` |
| [actions/setup-python](https://github.com/actions/setup-python) | `4` | `5` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `3` | `4` |
| [actions/download-artifact](https://github.com/actions/download-artifact) | `3` | `4` |
| [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) | `1.8.10` | `1.8.14` |


Updates `actions/checkout` from 3 to 4
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

Updates `github/codeql-action` from 2 to 3
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v2...v3)

Updates `actions/setup-python` from 4 to 5
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

Updates `actions/upload-artifact` from 3 to 4
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v3...v4)

Updates `actions/download-artifact` from 3 to 4
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@v3...v4)

Updates `pypa/gh-action-pypi-publish` from 1.8.10 to 1.8.14
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.8.10...v1.8.14)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Joe Hamman <[email protected]>
* Apply  ruff rule RUF022

RUF022 `__all__` is not sorted

* Apply ruff rule RUF029

RUF029 Function is declared `async`, but doesn't `await` or use `async` features.
RUF009 Do not perform function call `cast` in dataclass defaults
* feature: group and array path/name/basename properties

* tests
* implement .chunks on v3 arrays

* remove noqa: B009

* make mypy happy

* only return chunks for regular chunk grids

---------

Co-authored-by: Davis Bennett <[email protected]>
Co-authored-by: Joseph Hamman <[email protected]>
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.4.5 → v0.4.7](astral-sh/ruff-pre-commit@v0.4.5...v0.4.7)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@jhamman jhamman added this to the 3.0.0.alpha milestone Jul 1, 2024
@jhamman jhamman modified the milestones: 3.0.0.alpha, After 3.0.0 Jul 2, 2024
@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:56
Labels
V3 Affects the v3 branch
Projects
Status: In Progress

6 participants