Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata correction #2177

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

candleindark
Copy link
Member

This PR provides a solution to correct the corruptions in metadata documented in dandi/dandi-schema#276.

The solution is implemented in two parts.

  1. A general command, correct_metadata, to correct dandiset metadata.
  2. A specific helper function which the general command calls to do the actual correction.

Extensive tests are provided for the helper function and its supporting function. However, because I am not familiar with this repo and Django in general, I am not able to provide tests for the command which interacts with the database. Advice and additional tests are very much appreciated.

The command can be run on a targeted dandiset at a particular version and run on all versions of all dandisets. Running on all dandiset versions will only correct the corrupted dandisets. If running only on targeted dandiset version is preferable, please let me know, and I will provide the list of corrupted dandiset versions. (Additionally, would changing the interface of the command to a file consisting of corrupted dandiset versions be better?)

@yarikoptic yarikoptic requested a review from asmacdo February 10, 2025 16:52
Copy link
Member

@asmacdo asmacdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of questions to start off with-- I'll have another look tomorrow.

ver.save() # TODO: should I use `save()` in this case? It has some custom logic.

for ver, _ in changes:
write_manifest_files.delay(ver.id) # TODO: Please check if this is needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if any of the file writes fail or time out?

write_dandiset_yaml(version)
write_assets_yaml(version)
write_dandiset_jsonld(version)
write_assets_jsonld(version)
write_collection_jsonld(version)

Maybe we should at least keep the AsyncResults and monitor that they go as planned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't know if I need to call write_manifest_files.delay(ver.id) at all. I was trying to mimic

, but https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/management/commands/migrate_version_metadata.py didn't call write_manifest_files() at all.

Maybe we should at least keep the AsyncResults and monitor that they go as planned?

I think the answer is related to how important that statement gets executed successfully. In the example above, its execution status is not checked. If a success execution is important, I guess I can't interact with the DB the current way.

@dandi/dandiarchive

Note: This function corrects the corruptions described in
https://github.com/dandi/dandi-schema/issues/276
"""
unwanted_fields = ['contactPoint', 'includeInCitation', 'roleName']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the original intent of the change that led to this problem was to remove all fields but the allowed ones-- would it be beneficial to also have a list of allowed_fields to confirm that they are the only ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In concept, you are right. I considered do that as well. However, I want to be careful for now. I want to remove only the ones that I know are causing problems. I don't want to remove anything that we haven't analyzed previously.

@yarikoptic yarikoptic added internal Changes only affect the internal API maintenance Action to maintain the system (neither a bugfix nor an enhancement) metadata Issues of dandiset/asset metadata handling labels Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal Changes only affect the internal API maintenance Action to maintain the system (neither a bugfix nor an enhancement) metadata Issues of dandiset/asset metadata handling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants