Asdf read speed #514

SolarDrew · 2025-01-30T14:50:18Z

Fixes #500

Cadair · 2025-01-30T15:21:22Z

dkist/dataset/tiled_dataset.py

-        for i, (fm, wcs, headers) in enumerate(zip(file_managers, wcses, header_tables)):
+        all_headers = vstack(header_tables)
+        for i, (fm, wcs) in enumerate(zip(file_managers, wcses)):
+            headers = all_headers[i*len(fm):(i+1)*len(fm)]


That's some promises about ordering I was not intending to make.
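
To illustrate the ordering concern: slicing a stacked table back into per-tile chunks only recovers the right rows if the stacking order and the tile lengths are exactly what the slicing code assumes. Below is a minimal sketch with plain Python lists standing in for header tables. The helper `split_stacked` and the sample rows are hypothetical and not part of dkist.

```python
from itertools import accumulate


def split_stacked(stacked, lengths):
    """Slice a stacked sequence back into per-tile chunks, in stacking order."""
    # Cumulative row counts give the start/stop index of each tile.
    bounds = [0, *accumulate(lengths)]
    return [stacked[start:stop] for start, stop in zip(bounds, bounds[1:])]


# Hypothetical header rows for three tiles of unequal length.
tiles = [["a0", "a1"], ["b0"], ["c0", "c1", "c2"]]
stacked = [row for tile in tiles for row in tile]

chunks = split_stacked(stacked, [len(tile) for tile in tiles])
```

The `i*len(fm)` arithmetic in the diff matches these cumulative bounds only when every tile has the same number of rows. Either way, the result depends on the stacking order being preserved.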

Cadair · 2025-02-04T09:46:30Z

I just did a quick experiment locally: if we convert the Table to a NumPy structured array before saving it (Table.as_array()), asdf automatically writes a single binary block containing the full data and saves the slices in the tree as references to that block.

Full code

import asdf

import dkist
from dkist.data.sample import VBI_AJQWW

tds = dkist.load_dataset(VBI_AJQWW)
whole_table = tds.combined_headers

# Two slices of the same astropy Table
small1 = whole_table[0:10]
small2 = whole_table[10:20]

new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")

<duplicates the data>

whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1":whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")

<does not duplicate the data>

For a small example:

whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1":whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

yields this asdf:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice that all three arrays have source: 0, i.e. they point at the same binary block, and that small2 has offset: 1160.
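
The offset can be checked by hand. Each row holds 3 + 26 UCS-4 characters at 4 bytes each, so 116 bytes per row, and small2 starts at row 10. Here is a sketch that reproduces the arithmetic, assuming NumPy is available and using zero-filled placeholder rows rather than the real headers. The helper `byte_offset` is hypothetical.

```python
import numpy as np

# Same field layout as the small example: two little-endian unicode columns.
dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])
whole = np.zeros(27, dtype=dtype)

# Basic slicing returns views onto the parent's buffer, not copies.
small1 = whole[0:10]
small2 = whole[10:20]

row_bytes = dtype.itemsize  # (3 + 26) characters * 4 bytes each


def byte_offset(view, base):
    """Distance in bytes from the start of base to the start of view."""
    view_addr = view.__array_interface__["data"][0]
    base_addr = base.__array_interface__["data"][0]
    return view_addr - base_addr


print(row_bytes)                        # 116
print(byte_offset(small2, whole))       # 1160
print(np.shares_memory(whole, small2))  # True
```

Because the slices are views onto one buffer, a serializer that tracks the underlying buffer can write it once and describe each slice by its offset and shape, which is consistent with the YAML above.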

I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However, I think this approach has many advantages:

Obvious performance improvements for single tables (probably), but especially for what you are doing in this PR.
We can still convert up to a Table, either in the converter or in Dataset itself, without copying the memory.
This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.

dkist/io/asdf/resources/manifests/dkist-1.2.0.yaml

Cadair · 2025-02-04T11:32:06Z

dkist/io/asdf/converters/tiled_dataset.py


    def to_yaml_tree(cls, tiled_dataset, tag, ctx):
        tree = {}
        tree["inventory"] = tiled_dataset._inventory
        tree["datasets"] = tiled_dataset._data.tolist()
+        tree["headers"] = tiled_dataset.combined_headers.as_array()


We should do this for dataset too.

SolarDrew added 8 commits January 28, 2025 10:32

Add mechanism for datasets to know if they're a tile

40af8bf

Stack headers and store canonically on TiledDataset

2d35c18

Don't save out headers on mosaic tile Datasets

ec5aaef

Minor test upgrade

afd6f70

Pass headers to TiledDataset in simple_tiled_dataset fixture

d2d29c0

Need to stack the headers

a4d1efe

Make TiledDataset converter read and write headers

28f19ab

Schema nonsense

3601236

Cadair reviewed Jan 30, 2025

SolarDrew added 3 commits February 3, 2025 10:23

Merge branch 'main' of github.com:DKISTDC/dkist into asdf-read-speed

8bb16cc

Needed to point the manifest at the right schema

6343fa8

Replace changes to manifest with new file

2b2b125

SolarDrew added 2 commits February 4, 2025 11:21

More schema schenanigans

4e42f22

Save header table as rec array

374526d

Cadair reviewed Feb 4, 2025
