Write a script to get delta of datasets and index them #1990

Open
artntek opened this issue Oct 22, 2024 · 4 comments

@artntek
Contributor

artntek commented Oct 22, 2024

See #1984

For moving deployments from legacy to k8s, we rsync the data to cephfs, then rsync again just before release, in order to get the datasets that were modified after the previous rsync.

To minimize downtime, we want to index this "delta" of datasets (instead of re-indexing all), so we need a script that:

  1. gets the pid of each dataset modified since a given date/time, and then
  2. calls the metacat admin API to reindex those specific datasets only, by pid.
@artntek artntek self-assigned this Oct 22, 2024
@artntek artntek removed their assignment Oct 22, 2024
@mbjones
Member

mbjones commented Oct 22, 2024

Are you rsyncing the original data and documents directories from /var/metacat, or the converted hashstore directories? If the former, the filename is a mangled version of the PID for each object. So simply saving the list of files transferred by rsync and then unmangling the PIDs would give the list of objects that need to be indexed. If the latter, then we'd probably need to look up the PIDs for each transferred object from hashstore's PID metadata. Either way, once you have the list of PIDs, is that sufficient?
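
One way to capture the list of transferred files mentioned above is rsync's `--out-format` option, whose `%n` escape prints the path of each item it transfers (directories carry a trailing `/`). The following is only a sketch under assumptions: `transferred_files` is a hypothetical helper name, and it stops at file names because the PID unmangling rule is not spelled out in this thread.

```shell
# Hypothetical helper: reads `rsync --out-format='%n'` output on stdin and
# prints the base name of each transferred file, one per line.
transferred_files() {
    # Drop directory entries (rsync marks them with a trailing "/") and
    # blank lines, then strip the leading directory part of each path.
    grep -v '/$' | sed -e '/^$/d' -e 's|.*/||'
}
```

It could be used along the lines of `rsync -a --out-format='%n' /var/metacat/data/ DEST/ | transferred_files > transferred.txt`, where `DEST` is a placeholder for the cephfs target; the resulting names would still need to be unmangled into PIDs.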

@artntek
Contributor Author

artntek commented Oct 22, 2024

Thanks for this info. Answers:

> Are you rsyncing the original data and documents directories from /var/metacat, or the converted hashstore directories?

Since we are now no longer upgrading metacat "in place" before moving to k8s, we're rsyncing the original data and documents directories from /var/metacat, and then the hashstore conversion will happen on k8s - see release plan

> Either way, once you have the list of PIDs, is that sufficient?

Yes - all we need is the list of pids. I've updated the description above to make this more clear. Thanks

@artntek
Contributor Author

artntek commented Oct 22, 2024

Additional potentially-useful info, from k8s-cluster-config/MetacatQuickRef.md:

API call to list objects in reverse order of modification:

https://test.arcticdata.io/metacat/d1/mn/v2/object
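
The endpoint above is the DataONE Member Node `listObjects` call, which per the DataONE API documentation accepts `fromDate`, `start` and `count` query parameters, so it may be able to produce the "modified since" list directly. A minimal sketch, with several assumptions: `extract_pids` and `list_pids_since` are hypothetical names, identifiers are assumed to appear as unprefixed `<identifier>` elements in the response, only one page of results is fetched, and non-public objects would additionally need an authorization header.

```shell
# Print the text of every <identifier> element found on stdin, one per line.
# Shortcut: pattern matching rather than real XML parsing, so XML entities
# such as &amp; are NOT decoded.
extract_pids() {
    grep -o '<identifier>[^<]*</identifier>' |
        sed -e 's|^<identifier>||' -e 's|</identifier>$||'
}

# Hypothetical wrapper: list PIDs modified on or after $2 (ISO 8601 date/time)
# on the node whose v2 base URL is $1. Fetches a single page of at most $3
# entries (default 1000); paging with `start` is left out of this sketch.
list_pids_since() {
    base_url=$1
    from_date=$2
    count=${3:-1000}
    # -G appends the url-encoded parameters to the URL as a query string.
    curl -sS -G "$base_url/object" \
        --data-urlencode "fromDate=$from_date" \
        --data-urlencode "count=$count" |
        extract_pids
}
```

For example, `list_pids_since https://test.arcticdata.io/metacat/d1/mn/v2 2024-10-15T00:00:00Z > pidsToReindex.txt` (the date is illustrative) would write one PID per line, in the form the indexing loop expects.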

API call to see if an object exists on the k8s instance yet:

https://arctic-dev.test.dataone.org/metacat/d1/mn/v2/object/<pid>
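
Because PIDs often contain characters such as `:` and `/`, the `<pid>` part of that URL needs percent-encoding. The sketch below shows one way to do that and to read back just the HTTP status; `urlencode` and `object_status` are hypothetical helper names, and it assumes the node answers HEAD requests on `/object/<pid>` (the DataONE `describe` call). A 200 would suggest the object is present and a 404 that it is not; other codes, for instance on non-public objects, may mean a token is required.

```shell
# Percent-encode every byte of $1 except RFC 3986 unreserved characters.
urlencode() {
    printf '%s' "$1" | od -An -v -tx1 | awk '
        # Map two-digit hex codes of printable ASCII to their characters.
        BEGIN { for (c = 33; c < 127; c++) ch[sprintf("%02x", c)] = sprintf("%c", c) }
        {
            for (i = 1; i <= NF; i++) {
                k = ch[$i]
                if (k ~ /^[A-Za-z0-9._~-]$/) out = out k
                else out = out "%" toupper($i)
            }
        }
        END { print out }'
}

# Print the HTTP status code returned for object $2 on the node at base URL $1.
object_status() {
    base_url=$1
    pid=$2
    # -I sends a HEAD request; -w prints only the status code.
    curl -s -o /dev/null -I -w '%{http_code}' \
        "$base_url/object/$(urlencode "$pid")"
}
```

Usage might look like `object_status https://arctic-dev.test.dataone.org/metacat/d1/mn/v2 "$pid"`.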

To index a list of objects:

$   TOKEN=$( kubectl get secret MYRELEASE-indexer-token \
        -o jsonpath="{.data.DataONEauthToken}" | base64 -d )

## Assuming pidsToReindex.txt contains a list of the identifiers to be indexed...
$   for pid in $(cat pidsToReindex.txt); do \
        curl -X PUT -H "Authorization: Bearer $TOKEN" \
            "https://MYHOST/MYCONTEXT/d1/mn/v2/index?pid=$pid"; \
    done
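
For a large delta it may help to know which requests failed. The following is a hedged variation on the loop above, not the documented procedure: `reindex_pids` is a hypothetical name, the file is assumed to hold one PID per line, and it relies on curl's `-G` with `--data-urlencode` to place a percent-encoded `pid` in the query string while `-X PUT` keeps the request method.

```shell
# Hypothetical helper: send one reindex request per PID listed in file $2
# (one PID per line) to the index endpoint $1. Expects $TOKEN to be set.
# Reports each failed PID on stderr; returns non-zero if any request failed.
reindex_pids() {
    index_url=$1
    pid_file=$2
    failed=0
    # The `|| [ -n "$pid" ]` part handles a last line without a newline.
    while IFS= read -r pid || [ -n "$pid" ]; do
        [ -n "$pid" ] || continue   # skip blank lines
        # --fail makes curl exit non-zero on HTTP error responses.
        if ! curl -sS --fail -o /dev/null -G -X PUT \
                -H "Authorization: Bearer $TOKEN" \
                --data-urlencode "pid=$pid" "$index_url"; then
            printf 'reindex failed: %s\n' "$pid" >&2
            failed=$((failed + 1))
        fi
    done < "$pid_file"
    [ "$failed" -eq 0 ]
}
```

It would be called as `reindex_pids "https://MYHOST/MYCONTEXT/d1/mn/v2/index" pidsToReindex.txt`, with `TOKEN` obtained as shown earlier.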

@taojing2002
Contributor

taojing2002 commented Oct 22, 2024 via email

@artntek artntek added this to the 3.1.0 milestone Oct 22, 2024
@artntek artntek self-assigned this Oct 29, 2024