This issue was moved to a discussion.

UI/UX improvements for zarr uploads (validation, feedback loop) #1811

Closed
aaronkanzer opened this issue Jan 11, 2024 · 6 comments
Labels
question Further information is requested zarr Issues with Zarr hosting/processing/etc.

Comments

@aaronkanzer
Member

aaronkanzer commented Jan 11, 2024

Summary

(I am a sample size of 1 opinion, so feel free to push back 😄 , nevertheless here are the details)

As of now, dandisets with only zarr files remain in draft state due to their inability to be versioned. This is intended. However, this isn't clearly communicated during upload, and the validation feedback shown to the end user is misleading. Fortunately, the upload of the zarr files to S3 and Dandi itself works successfully (the data/files are there after the user invokes an upload).

The initial goal of this Issue would be to discuss the appropriate UI/UX for this scenario (Cc @kabilar @waxlamp @yarikoptic @satra ). The issue was initially observed in linc-archive (a fork of dandi-archive) but then replicated in dandi-archive production environment. @kabilar and I are curious to get the rest of the Dandi team's thoughts before proceeding further.

Details

This was discovered via the following workflow:

  1. Create a new dandiset
  2. Call dandi download <dandiset-url/draft> to get the dandiset.yaml locally
  3. Associate a valid zarr in the same root directory as the dandiset.yaml
  4. Call dandi validate . to confirm that the file is valid
  5. Call dandi upload to upload the data to S3/dandi-archive
  6. Visit the dandiset in the UI -- you'll see asset validation issues; however, you'll also see the files present under Files
  7. Visit the relevant S3 bucket -- you'll see the files also present there (i.e. the upload of the file(s) itself seems to work).
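
The CLI part of the steps above could be scripted roughly as follows. This is only a sketch: the helper name `build_workflow_commands` is made up for illustration, the URL is a placeholder, and only the `dandi download`, `dandi validate .` and `dandi upload` invocations come from the steps themselves. Nothing is executed; the commands are just printed.

```python
import shlex


def build_workflow_commands(dandiset_draft_url):
    """Return the CLI steps from the workflow above as argument lists.

    Hypothetical helper: nothing is executed here. To run a step, pass the
    list to subprocess.run(cmd, check=True) from the appropriate directory
    (steps 4 and 5 are run next to dandiset.yaml).
    """
    return [
        ["dandi", "download", dandiset_draft_url],  # step 2: fetch dandiset.yaml
        ["dandi", "validate", "."],                 # step 4: validate local files
        ["dandi", "upload"],                        # step 5: upload to the archive
    ]


if __name__ == "__main__":
    # Placeholder URL; substitute the draft URL of your own dandiset.
    for cmd in build_workflow_commands("<dandiset-url/draft>"):
        print(shlex.join(cmd))
```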

The response during dandi upload is the following, where the dandiset.yaml is supposed to be edited online.

PATH                 SIZE    ERRORS      UPLOAD STATUS                        MESSAGE                  
dandiset.yaml        1.6 kB                     skipped                       should be edited online  
mydata.zarr          17.5 kB   0           100% done                                                   
Summary:             19.1 kB           4.3 kB/s 1 skipped                     1 should be edited online

Visiting the UI after getting that response via dandi upload, it isn't 100% apparent where the user would go to edit the dandiset.yaml

The next observation is a validation failure caused by assetsSummary.numberOfBytes & assetsSummary.numberOfFiles remaining at zero even though data and files have been uploaded -- the validation error stems from this line of code.

It seems this workflow/feedback loop could be improved for the end user -- I assume we want to give users certainty and confidence that their zarr uploads worked as intended.

As an aside, one of the inconsistencies we noticed was with dandisets that contain both zarr and non-zarr files. They appear to produce a false positive: the Asset Summary passes validation because both the byte and file counts are updated to be greater than zero.
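
To make the symptom concrete, here is a minimal sketch of a check that would behave the way described above. It is not the archive's actual validation code; the function name `assets_summary_errors`, the error strings and the example numbers are made up for illustration, and only the `assetsSummary.numberOfBytes` and `assetsSummary.numberOfFiles` field names come from this issue.

```python
def assets_summary_errors(metadata):
    """Return error strings for an empty-looking assetsSummary.

    Hypothetical helper: mimics a check that fails whenever the summary
    reports zero bytes or zero files, regardless of what was uploaded.
    """
    summary = metadata.get("assetsSummary", {})
    errors = []
    if summary.get("numberOfBytes", 0) <= 0:
        errors.append("assetsSummary.numberOfBytes must be greater than zero")
    if summary.get("numberOfFiles", 0) <= 0:
        errors.append("assetsSummary.numberOfFiles must be greater than zero")
    return errors


# Zarr-only dandiset as described above: the summary stays at zero after upload.
zarr_only = {"assetsSummary": {"numberOfBytes": 0, "numberOfFiles": 0}}
# Mixed dandiset: the non-zarr files push both counters above zero
# (illustrative numbers, not taken from a real dandiset).
mixed = {"assetsSummary": {"numberOfBytes": 4096, "numberOfFiles": 1}}

print(assets_summary_errors(zarr_only))
print(assets_summary_errors(mixed))
```

With these inputs the zarr-only case reports both fields and the mixed case reports nothing, which matches the false positive noted in the aside.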

Next Steps

@aaronkanzer to further investigate why assetsSummary.numberOfBytes & assetsSummary.numberOfFiles remain at zero even after successful upload (e.g. nwb files work in this case, so why not zarr...)

Appendix / References

To create the zarr asset in this workflow, the following Python script was used:

from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local
import zarr
import numpy as np

# Adjust these parameters to change the size of the Zarr file
array_shape = (100, 100)  # Size of the array
chunks = (10, 10)  # Chunk size
dtype = 'i4'  # Data type (4-byte integers)

# Create a large 2D array
z = zarr.array(np.random.randint(0, 1000, size=array_shape, dtype=dtype), chunks=chunks)

# Save the array to a Zarr file
zarr.save('mydata.zarr', z)

# Compute and print the checksum
checksum = compute_zarr_checksum(yield_files_local("mydata.zarr"))
print(checksum.digest)

# Load, modify, and save the array
z = zarr.load('mydata.zarr')
print("Original data:\n", z)
z += 10
zarr.save('mydata.zarr', z)
print("Modified data:\n", z)

# Compute and print the new checksum
checksum = compute_zarr_checksum(yield_files_local("mydata.zarr"))
print(checksum.digest)
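
For comparison with the assetsSummary values discussed above, the file count and total size of the local store can be tallied with the standard library alone. A small sketch, assuming `mydata.zarr` ends up as a plain directory on disk; the helper name `count_files_and_bytes` is hypothetical, and the totals need not match the SIZE column printed by `dandi upload` exactly.

```python
import os


def count_files_and_bytes(root):
    """Walk a directory tree and return (number_of_files, total_bytes)."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            n_files += 1
            n_bytes += os.path.getsize(os.path.join(dirpath, name))
    return n_files, n_bytes


if __name__ == "__main__":
    # Only report when the store from the script above exists locally.
    if os.path.isdir("mydata.zarr"):
        files, size = count_files_and_bytes("mydata.zarr")
        print(f"mydata.zarr: {files} files, {size} bytes")
```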

@aaronkanzer aaronkanzer changed the title from "Validation errors for zarr-only uploads seem to be misleading" to "UI/UX improvements for zarr-only uploads (validation, feedback loop)" Jan 11, 2024
@aaronkanzer aaronkanzer changed the title from "UI/UX improvements for zarr-only uploads (validation, feedback loop)" to "UI/UX improvements for zarr uploads (validation, feedback loop)" Jan 11, 2024
@waxlamp
Member

waxlamp commented Jan 11, 2024

I think the number-of-[bytes|files] inconsistency comes from how we track Zarr archives separately from all other file types. There are reasons for this, but it is clear that the decision has led to confusing things like this (if my attribution here is correct), which I would indeed consider a bug.
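
As a toy model of that attribution (and only that: it is not how dandi-archive actually computes the summary), the sketch below counts assets with and without the separately tracked Zarr entries. The `Asset` class, the `summarize` function, the `sub-01.nwb` file and its size are all hypothetical; the 17,500-byte figure simply echoes the 17.5 kB shown in the upload output.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    path: str
    size: int
    is_zarr: bool


def summarize(assets, include_zarr):
    """Return an assetsSummary-like dict for the given assets."""
    counted = [a for a in assets if include_zarr or not a.is_zarr]
    return {
        "numberOfFiles": len(counted),
        "numberOfBytes": sum(a.size for a in counted),
    }


assets = [
    Asset("mydata.zarr", 17_500, True),
    Asset("sub-01.nwb", 4_096, False),  # made-up non-zarr asset
]

# Zarr-only dandiset with Zarr assets left out of the tally: both counters are zero.
print(summarize(assets[:1], include_zarr=False))
# Mixed dandiset: only the non-zarr file is counted, so the counters are non-zero.
print(summarize(assets, include_zarr=False))
# Everything counted.
print(summarize(assets, include_zarr=True))
```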

Thanks for the detailed trail of activity you engaged in to reproduce the behaviors you're talking about. However, it's hard for me to be certain just which things you are reporting here as bugs or difficulties. I'd suggest that we meet if you want to show me interactively exactly where you ran into frustration, or update your issue description with a summary of incorrect behaviors and/or changes you'd like to see.

To kick things off, I agree that the "should be edited online" message is confusing. That one would be appropriate to file over in the dandi-cli repository (while leaving a reference to it in this issue).

And, I believe you are reporting inconsistent behavior in steps 6 and 7 of your repro workflow. One by one:

  6. Visit the dandiset in the UI -- you'll see asset validation issues; however, you'll also see the files present under Files

What were you expecting in place of this situation? I assume it's something like: "if the zarr archive passed validation, why is it showing validation errors after upload?"

But, I'm not sure why you wouldn't expect it to have been uploaded, given that the CLI passed it as valid. Still, I can sense the general contradiction here.

  7. Visit the relevant S3 bucket -- you'll see the files also present there (e.g. upload for the file(s) itself seem to work.

Connected to the above, what was your expectation here?

Some background may be helpful (subject to what you expected to see): we currently don't allow publishing of Zarr-containing Dandisets, as you mentioned. One mechanism to enforce this is validation--in this case, a "fake" validation that simply indicates that the Dandiset contains any assets at all. Once we figure out a way to deal with the size and peculiarities of Zarrs, we intend to remove that ad-hoc validation step, thus clearing all otherwise-valid, Zarr-containing Dandisets to become publishable.

Does that help to account for the oddities you're experiencing?

@aaronkanzer
Member Author

aaronkanzer commented Jan 12, 2024

Thanks for all the feedback @waxlamp -- it might be more effective to walk through interactively with @kabilar if possible to refine where we want to go with this in the short-term (and a sanity check for linc-archive as well 😂 )

To your question of whether these are bugs or confusion: we would classify this as confusion (we also noticed in the handbook that there isn't much explanation of how/where the handling of zarr files in the archive differs from other file types, etc.). The only buggy behavior we have observed thus far is validation failing at assetsSummary.numberOfBytes & assetsSummary.numberOfFiles (whereas, to your point, it should ideally fail validation elsewhere if the files were invalid, or not fail at all).

Apologies for the confusion around Steps 6 & 7 in the Issue description -- to simplify, those steps signified "hey, although the UI/UX feedback displayed some 'errors', upload in all the right places worked 😄 " -- I think our hope is to convey that to our end users if their zarrs are valid.

I'll start to draft some related Issues for dandi-cli, handbook and dandi-archive.

@yarikoptic
Member

I'll start to draft some related Issues for dandi-cli

something possibly obvious but worth reiterating -- you are most welcome to propose a PR. git grep might be handy to bring you to the right point on some desired code piece ;)

@yarikoptic yarikoptic added the zarr Issues with Zarr hosting/processing/etc. label Jan 31, 2024
@waxlamp waxlamp added the question Further information is requested label Feb 1, 2024
@waxlamp
Member

waxlamp commented Feb 1, 2024

@aaronkanzer, should we schedule a meeting to go through this stuff?

@yarikoptic
Member

FWIW: apparently we have a good number of broken zarrs in the archive/000108 per

with chatgpt we came up with this crude "checker of the structure": https://github.com/dandi/zarr-manifests/blob/master/validate_zarr.py which unfortunately doesn't trigger on that initial reported bad zarr unless we re-enable loading that "slice" ... so smth to be figured out about that. But the point is that we might want to look into some relatively speedy validation to be done on zarrs in the archive to validate their internal integrity.
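
As a rough idea of what a cheap structural check could look like (this is not the linked validate_zarr.py, just a sketch under stated assumptions): the function below assumes a Zarr v2 directory holding a single array with at least one dimension at its root and the default "." chunk separator, like the mydata.zarr created earlier in this issue. It reads only the `.zarray` metadata and the chunk file names and never decodes a chunk. The name `check_zarr_v2_array` and the problem strings are hypothetical.

```python
import json
import math
import os

# Subset of the .zarray keys that this sketch relies on.
REQUIRED_KEYS = {"zarr_format", "shape", "chunks", "dtype", "order"}


def check_zarr_v2_array(path):
    """Return a list of problems found in one Zarr v2 array directory."""
    try:
        with open(os.path.join(path, ".zarray")) as f:
            meta = json.load(f)
    except (OSError, ValueError) as exc:
        return [f"cannot read .zarray: {exc}"]

    if not isinstance(meta, dict):
        return [".zarray is not a JSON object"]
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        return [f"missing keys in .zarray: {sorted(missing)}"]

    # Number of chunks along each dimension.
    grid = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]

    problems = []
    for name in sorted(os.listdir(path)):
        if name.startswith("."):
            continue  # .zarray, .zattrs and similar metadata files
        parts = name.split(".")
        if len(parts) != len(grid) or not all(p.isdigit() for p in parts):
            problems.append(f"unexpected entry: {name}")
        elif any(int(p) >= g for p, g in zip(parts, grid)):
            problems.append(f"chunk outside grid: {name}")
    return problems
```

Missing chunks are deliberately not reported, since Zarr treats an absent chunk as filled with the fill value; catching truncated or corrupt chunk payloads would need a slower pass that actually reads the data.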

@dandi dandi locked and limited conversation to collaborators Feb 22, 2024
@waxlamp waxlamp converted this issue into discussion #1866 Feb 22, 2024

