Asdf read speed #514

Open
wants to merge 13 commits into
base: main

SolarDrew
Contributor

Fixes #500

- for i, (fm, wcs, headers) in enumerate(zip(file_managers, wcses, header_tables)):
+ all_headers = vstack(header_tables)
+ for i, (fm, wcs) in enumerate(zip(file_managers, wcses)):
+     headers = all_headers[i*len(fm):(i+1)*len(fm)]
Member


Those are promises about ordering that I was not intending to make: this slicing assumes the rows of the stacked table stay in the same order as file_managers.
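
One reading of this concern, sketched in plain Python: lists stand in for header tables, and the names `stack` and `split_by_lengths` are hypothetical, not part of dkist. Slicing by `i*len(fm)` recovers each tile's rows only if the stacked rows keep tile order and every tile contributes the same number of rows. Recording per-tile lengths removes the second assumption.

```python
from itertools import accumulate

def stack(tables):
    """Concatenate per-tile row lists, standing in for vstack."""
    return [row for table in tables for row in table]

def split_by_lengths(all_rows, lengths):
    """Recover per-tile slices from explicit lengths rather than a fixed width."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [all_rows[s:e] for s, e in zip(starts, ends)]

# Two tiles with different numbers of rows (hypothetical data).
tiles = [["a0", "a1", "a2"], ["b0", "b1"]]
all_rows = stack(tiles)

# Slicing as in the diff: the offset is derived from the current tile's length.
fixed = [all_rows[i * len(t):(i + 1) * len(t)] for i, t in enumerate(tiles)]

# Length-aware slicing: offsets come from the cumulative lengths.
exact = split_by_lengths(all_rows, [len(t) for t in tiles])
```

With unequal tiles the first approach returns the wrong rows for the second tile, while the length-aware split round-trips. Neither approach protects against the stacked rows being reordered.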

@Cadair
Member

Cadair commented Feb 4, 2025

I just did a quick experiment locally: if we convert the Table to a numpy structured array before we save it (Table.as_array()), asdf automatically writes only one binary block containing the full data and saves the slices in the tree as references into that block.

Full code
import asdf
import dkist
from dkist.data.sample import VBI_AJQWW

tds = dkist.load_dataset(VBI_AJQWW)
whole_table = tds.combined_headers

# Case 1: save the astropy Table together with two slices of it.
small1 = whole_table[0:10]
small2 = whole_table[10:20]

new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")
# Result: duplicates the data.

# Case 2: convert to a numpy array first, then save it with two slices.
whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1": whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")
# Result: does not duplicate the data.
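
The array case plausibly works because a basic slice of a numpy array is a view on the parent's buffer, so a serializer can describe it as an offset into one block. A minimal numpy-only sketch to check the view behaviour, where the field names mirror the header table but the array and its values are made up:

```python
import numpy as np

# Hypothetical stand-in for whole_table.as_array(): 27 rows, two string fields.
dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])
whole = np.zeros(27, dtype=dtype)
whole["INSTRUME"] = "VBI"

small1 = whole[0:10]
small2 = whole[10:20]

# Basic slices are views: no data is copied and the parent is kept as .base.
shares1 = np.shares_memory(whole, small1)
shares2 = np.shares_memory(whole, small2)
is_view = small2.base is whole

# Writing through the view is visible in the parent at row 10.
small2["INSTRUME"][0] = "XYZ"
parent_value = str(whole["INSTRUME"][10])
```

This only demonstrates the numpy side; how asdf detects and stores shared buffers is not shown here.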

For a small example:

whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1": whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

This yields the following ASDF file:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice source: 0 for all the arrays, and offset: 1160 for small2: all three entries reference the same binary block, and small2 starts 1160 bytes into it.
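
The offset can be checked by hand, assuming ucs4 stores each character in 4 bytes and the records are packed without padding. Each row is then (3 + 26) × 4 = 116 bytes, so a slice starting at row 10 begins 1160 bytes into the block. As a small sketch:

```python
UCS4_BYTES_PER_CHAR = 4  # assumption: ucs4 uses 4 bytes per character

# Field widths in characters, taken from the datatype entries in the YAML above.
field_chars = {"INSTRUME": 3, "DATE-AVG": 26}

row_bytes = sum(n * UCS4_BYTES_PER_CHAR for n in field_chars.values())
small2_offset = 10 * row_bytes   # small2 starts at row 10
whole_bytes = 27 * row_bytes     # whole has shape [27]

print(row_bytes, small2_offset, whole_bytes)  # 116 1160 3132
```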


I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However I think this approach has many advantages:

  • Likely performance improvements for single tables, and especially for what you are doing in this PR.
  • We can still convert up to a table either in the converter or in Dataset itself, without copying the memory.
  • This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.


def to_yaml_tree(cls, tiled_dataset, tag, ctx):
    tree = {}
    tree["inventory"] = tiled_dataset._inventory
    tree["datasets"] = tiled_dataset._data.tolist()
    tree["headers"] = tiled_dataset.combined_headers.as_array()
Member


We should do this for Dataset too.


Successfully merging this pull request may close these issues.

Reading a DL-NIRSP ASDF is very slow
2 participants