
Datatree prototyping #360

Open · wants to merge 11 commits into base: main

Conversation

@sjperkins commented Feb 11, 2025

This draft PR contains a script that re-arranges a Processing Set to produce a Datatree corresponding to the options present in this document.

Invoke the script with:

$ python src/xradio/datatree/dt_convert.py ~/data/VLBA_TL016B_split_lsrk.vis.zarr

There are multiple options for configuring the resulting Datatree; call the script with --help to discover them.

This PR has only been tested with the VLBA_TL016B_split_lsrk dataset.

Accessors were initially TODO, but have now been implemented for the various options; it might be useful to start getting a feel for the layout.

/cc @Jan-Willem @scpmw

@CLAassistant commented Feb 11, 2025

CLA assistant check
All committers have signed the CLA.

2.0,
2.5,
3.0
}
Collaborator:

You would have been forgiven for just picking your favourite one ;)

Author (@sjperkins):

Thanks :-) This does however allow us to explore multiple options and identify possible pitfalls in each one.

f"{self._destination_url} does not have a .zarr extension "
f"and will not be automatically recognised by xarray as a zarr store.\n"
f"The engine will need to be explicitly specified via "
f"xarray.open_datatree({self._destination_url}, engine='zarr')")
Collaborator:

I believe .ps is currently used? No strong opinion, though the collision with PostScript feels strange to me.

Collaborator:

... also might want to move this to with_destination, maybe?

Author (@sjperkins):

The destination URL can be named whatever the user wishes, but the warning is there to give the user a hint as to why a plain xarray.open_datatree(url) will fail (the zarr backend expects a .zarr extension at the zarr root group).

Author (@sjperkins):

> ... also might want to move this to with_destination, maybe?

Certainly possible; we could change this if the Builder were necessary going forward. I suspect it won't be once we settle on an option. The prototype exists to explore the design space.

    dest = os.path.join(destination_root, partition)
elif dataset == "antenna_xds":
    # The antenna dataset is a child of the tree root.
    # For prototyping purposes, we only write it once.
Collaborator:

Right, would normally need to check equivalence. Honestly not sure myself whether it's worth the extra effort.

Author (@sjperkins) commented Feb 13, 2025:

We can leave a note here that if we choose this option, equivalence tests are possible with ds1.identical(ds2), where ds1 and ds2 are xarray.Dataset variables. This should also work with xarray.DataTree, but checking the entire tree would IMO be overkill here.
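For reference, a small sketch of the identical() behaviour mentioned above (the dataset contents are made up):

```python
import xarray

# identical() compares values, dimensions, names and attributes,
# so it is a suitable equivalence test for duplicated sub-datasets.
ds1 = xarray.Dataset(
    {"name": ("antenna", ["ea01", "ea02"])},
    attrs={"type": "antenna"},
)
ds2 = ds1.copy(deep=True)
assert ds1.identical(ds2)

# equals() ignores attributes; identical() does not.
ds2.attrs["type"] = "weather"
assert ds1.equals(ds2)
assert not ds1.identical(ds2)
```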

pass


@xarray.register_datatree_accessor("msa")
Collaborator:

Hm, would at least drop the "a". This needs to be memorable - alternative might be msv4? Though not sure what we will do with image or calibration datamodels then.

Author (@sjperkins):

msa is an acronym for Measurement Set Accessor, which signals its purpose fairly clearly IMO. To me, ms or msv4 implies that the Measurement Set is an attribute of the main Datatree structure, when actually the Measurement Set is equivalent to the Datatree.

Regarding image and calibration data models, it should be possible to create separate accessors for them. That said, having written the previous paragraph, dt.cala and dt.imga seem aesthetically unappealing to me => dt.cal and dt.{img,image} are nicer IMO.

Maybe dt.ms is a good compromise for visibility data. Other ideas:

  • dt.msx
  • dt.imgx
  • dt.calx

which phonetically might imply the accessor idea.

Collaborator (@scpmw) commented Feb 13, 2025:

It's effectively a namespace, so I'm still somewhat partial to xrd. msx also has a nice ring to it.

Author (@sjperkins):

> It's effectively a namespace, so I'm still somewhat partial to xrd. msx also has a nice ring to it.

I would argue that xrd is an implementation of the spec, while ms{v4} is the spec.

Author (@sjperkins) commented Feb 13, 2025:

As the example details a geo accessor (presumably for geosciences), how about using a radio astronomy related acronym:

  • xra: xarray radio astronomy
  • rax: radio astronomy xarray/axxessor

ra may be too close to right ascension.


def __init__(self, datatree: DataTree):
    self._dt = datatree
    root = self._dt.root
Collaborator:

Wondering whether this would also be the place for a schema check... I could see both sides: it'd be nice if you could just go ds.msv4 and immediately have some likely assumptions checked. However, it might become bothersome if we'd end up artificially gating something behind it. We'd likely want to provide some sort of "skip schema check" environment.

Author (@sjperkins):

I suspect that this is called on every Datatree node, so there might be performance implications. Another place to do the checks might be on the accessor: datatree.ms.validate(), for example.

Collaborator:

I think the accessor gets created lazily - so the check would "only" happen the first time you use the accessor on any particular datatree node. Which would seem reasonable to me...

Author (@sjperkins):

Just did a quick check and it seems that it creates an accessor for each node. Filtering doesn't seem to create new nodes (or accessors), but admittedly I haven't tried anything complicated re: filtering.

$ python src/xradio/datatree/dt_convert.py ~/data/VLBA_TL016B_split_lsrk.vis.zarr/
Datatree Build Strategy
  * Datatree proposal option:'1.0'
  * Data in '/home/simon/data/VLBA_TL016B_split_lsrk.vis.zarr' will be copied to
    '/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr'
  * '/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr' will be overwritten
  * Metadata will be consolidated at the Datatree root
dt = xarray.open_datatree("/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr", engine="zarr")
Creating MS Accessor /
Creating MS Accessor /VLBA_TL016B_split_0
Creating MS Accessor /VLBA_TL016B_split_1
Creating MS Accessor /VLBA_TL016B_split_2
Creating MS Accessor /VLBA_TL016B_split_3
Creating MS Accessor /VLBA_TL016B_split_0/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_0/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_0/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_0/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_0/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_0/weather_xds
Creating MS Accessor /VLBA_TL016B_split_1/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_1/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_1/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_1/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_1/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_1/weather_xds
Creating MS Accessor /VLBA_TL016B_split_2/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_2/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_2/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_2/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_2/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_2/weather_xds
Creating MS Accessor /VLBA_TL016B_split_3/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_3/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_3/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_3/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_3/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_3/weather_xds

elif self._option == 2.0:
    antenna_ds = self._dt.children[name]
elif self._option == 2.5:
    antenna_ds = self._dt.root[name]
Collaborator:

To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds could appear either below the correlated data dataset or at any node above. That's why it's 2.5: it is a generalisation. And the same logic should apply to weather etc (a shared weather table doesn't sound crazy at all).

Author (@sjperkins):

Thanks for clarifying. My immediate thought here is that the flexibility is going to be confusing for software development, but it is probably solvable with links (which have not been explored in this PR yet).

Would it be possible to divide secondary datasets into shareable/non-shareable categories, where the former are located as children of the tree root, while the rest are children or siblings of the partition?

I like option 2.5 as it uses Datatree's coordinate inheritance, which should work seamlessly with @Jan-Willem's work on interpolating visibility coordinates in secondary datasets. This seems to work on the single VLBA dataset I've been testing with, but probably needs to be smoke tested on the rest of the MSv4 test data.

This might be better to discuss in person in the WG.

Author (@sjperkins):

> but probably solvable with links (which have not been explored in this PR yet).

One of the findings I take from this prototype is that it's fairly easy to navigate the various options using the standard DataTree methods (root, parent, children, siblings).

Author (@sjperkins):

> To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds could appear either below the correlated data dataset or at any node above. That's why it's 2.5: it is a generalisation. And the same logic should apply to weather etc (a shared weather table doesn't sound crazy at all).

Maybe this is possible without links. What about this idea?

  • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
  • Same logic applies for other secondary datasets

This could be enforced via the dt.msa.antenna accessor.

Author (@sjperkins) commented Feb 13, 2025:

> Maybe this is possible without links. What about this idea?
>
>   • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
>   • Same logic applies for other secondary datasets
>
> This could be enforced via the dt.msa.antenna accessor.

The danger with this approach may be global/partition datasets getting out of sync. Perhaps it should be enforced as global xor partition by both the schema checker and the accessor.

This would be a relaxation of the following:

> Would it be possible to divide secondary datasets into shareable/non-shareable categories where the former are located as a child of the tree root, while the rest are children or siblings of the partition?

Collaborator:

> Maybe this is possible without links. What about this idea?
>
>   • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
>   • Same logic applies for other secondary datasets

Yes, that was the proposal. I'm giving pseudo-code on Confluence:

# Option 2.5
for parent in pathlib.parents(path):
    if 'ANTENNA' in parent.children:
        antenna = parent.children['ANTENNA']

Collaborator:

And just to be clear - I would just propose to support this in the accessor, but not necessarily go out of our way to actually make systematic use of it. It's just to retain the optimisation option.

for node in dt.subtree:
    if node.attrs.get("type") in VISIBILITY_DATASET_TYPES:
        assert isinstance(node.msa.antennas, xarray.DataTree)

Author (@sjperkins) commented Feb 20, 2025:

This does some basic testing that datatree conversion and loading works for the different options on the stakeholder test set.

The following tests fail:

FAILED tests/stakeholder/test_measure_set_stakeholder.py::test_s3 - FileNotFoundError: s3://viper-test-data/Antennae_North.cal.lsrk.split.py39.v7.vis.zarr does not exist or is not a directory
FAILED tests/stakeholder/test_measure_set_stakeholder.py::test_preconverted_alma - FileNotFoundError: No such file or directory: '/home/simon/code/xradio/Antennae_North.cal.lsrk.split.py39.vis.zarr/Antennae_North.cal.lsrk.split.py39_1/correlated_xds'

The s3 test probably fails because I don't have s3 installed, while the preconverted alma test case probably fails because I did not download and preconvert it.

Collaborator (@FedeMPouzols) commented Feb 20, 2025:

When I run this test I get partitions with names like:

Antennae_North.cal.lsrk.split.py39.vis.zarr/Antennae_North.cal.lsrk.split_00/
Antennae_North.cal.lsrk.split.py39.vis.zarr/Antennae_North.cal.lsrk.split_01/
Antennae_North.cal.lsrk.split.py39.vis.zarr/Antennae_North.cal.lsrk.split_02/
...

which suggests that the first issue is with the names of the partition (MSv4) subdirs (the missing .py39 in the MSv4 dir names and the two-digit index at the end seem to differ from what is expected in this branch).
The open_processing_set in main seems not very strict about that and is fine with the name discrepancy between the PS level and the MSv4 level.

Author (@sjperkins):

Digging into this a bit more, it's because the DatatreeBuilder only uses the standard posix interface to access the PS. The s3 test then fails because it treats the s3 URL as a posix filename:

test_s3 backtrace
$ py.test -s -vvv -k test_s3 --pdb
========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0 -- /home/simon/venv/xradio/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.11.11', 'Platform': 'Linux-6.8.0-53-generic-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'cov': '6.0.0', 'metadata': '3.1.1', 'typeguard': '4.4.1', 'html': '4.1.1'}}
rootdir: /home/simon/code/xradio
configfile: pyproject.toml
testpaths: tests
plugins: cov-6.0.0, metadata-3.1.1, typeguard-4.4.1, html-4.1.1
collected 357 items / 356 deselected / 1 selected                                                                                                                                                                                       

tests/stakeholder/test_measure_set_stakeholder.py::test_s3 [2025-02-20 17:37:17,217]     INFO    viperlog:  Checking functions availability: 
[2025-02-20 17:37:17,217]     INFO    viperlog:  Loading module: slurm -- Success 
[2025-02-20 17:37:17,217]     INFO    viperlog:     dask_jobqueue is available 
[2025-02-20 17:37:17,217]     INFO    viperlog:  Loading module: dask_ssh -- Fail 
[2025-02-20 17:37:17,217]     INFO    viperlog:     asyncssh is unavailable 
[2025-02-20 17:37:17,217]     INFO    viperlog:     jupyter_server_proxy is unavailable 
[2025-02-20 17:37:17,217]     INFO    viperlog:     paramiko is unavailable 
[2025-02-20 17:37:17,217]     INFO    viperlog:  Loading module: CUDA -- Fail 
[2025-02-20 17:37:17,217]     INFO    viperlog:     dask_cuda is unavailable 
[2025-02-20 17:37:17,217]     INFO    viperlog:  Available functions of this environment: slurm 
FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> captured log >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
INFO     viperlog:logger.py:56 Checking functions availability:
INFO     viperlog:logger.py:56 Loading module: slurm -- Success
INFO     viperlog:logger.py:56    dask_jobqueue is available
INFO     viperlog:logger.py:56 Loading module: dask_ssh -- Fail
INFO     viperlog:logger.py:56    asyncssh is unavailable
INFO     viperlog:logger.py:56    jupyter_server_proxy is unavailable
INFO     viperlog:logger.py:56    paramiko is unavailable
INFO     viperlog:logger.py:56 Loading module: CUDA -- Fail
INFO     viperlog:logger.py:56    dask_cuda is unavailable
INFO     viperlog:logger.py:56 Available functions of this environment: slurm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

tmp_path = PosixPath('/tmp/pytest-of-simon/pytest-10/test_s30')

    def test_s3(tmp_path):
        # Similar to 'test_preconverted_alma' if this test fails on its own that
        # probably is because the schema, the converter or the schema cheker have
        # changed since the dataset was uploaded.
>       base_test(
            "s3://viper-test-data/Antennae_North.cal.lsrk.split.py39.v7.vis.zarr",
            tmp_path,
            190.0405216217041,
            is_s3=True,
            partition_schemes=[[]],
            do_schema_check=False,
        )

tests/stakeholder/test_measure_set_stakeholder.py:227: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/stakeholder/test_measure_set_stakeholder.py:135: in base_test
    datatree_convert_and_open_test(ps_name, folder)
tests/stakeholder/test_measure_set_stakeholder.py:94: in datatree_convert_and_open_test
    builder.build()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xradio.datatree.datatree_builder.DatatreeBuilder object at 0x7c6318d09650>

    def build(self):
      """ Build the Datatree from the configured input """
      if self._option not in VALID_OPTIONS:
        raise ValueError(f"{self._option} is a valid datatree struture option {VALID_OPTIONS}")
    
      if not self._url or not os.path.exists(self._url) or not os.path.isdir(self._url):
>       raise FileNotFoundError(f"{self._url} does not exist or is not a directory")
E       FileNotFoundError: s3://viper-test-data/Antennae_North.cal.lsrk.split.py39.v7.vis.zarr does not exist or is not a directory

src/xradio/datatree/datatree_builder.py:142: FileNotFoundError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/simon/code/xradio/src/xradio/datatree/datatree_builder.py(142)build()
-> raise FileNotFoundError(f"{self._url} does not exist or is not a directory")

and the other error in #360 (comment) is because I did not download the preconverted PS
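One way the posix-only existence check could be generalised is to go through fsspec, which zarr's s3 access already relies on. The helper below is a sketch under that assumption, not code from this PR:

```python
import fsspec

def store_exists(url: str) -> bool:
    """Check a local path or remote URL (e.g. s3://bucket/store.zarr)
    for existence. Unlike os.path.exists, fsspec resolves the URL
    scheme to the appropriate filesystem implementation."""
    fs, path = fsspec.core.url_to_fs(url)
    return fs.exists(path)

assert store_exists("/tmp")                      # local filesystem
assert not store_exists("/no/such/dir-xyz-123")  # missing local path
```

Calling this with an s3:// URL would additionally require s3fs to be installed, which may explain why the s3 test environment matters here.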

Author (@sjperkins):

I've signed the CLA and am happy for the code in this PR to be merged. However, as discussed in the WG meeting, most of the code is probably unnecessary now that a Datatree representation has been selected. I think the accessor is the only thing that may be useful going forward. Feel free to use it, or parts of it, here or elsewhere.
