Write a script to get delta of datasets and index them #1990

artntek · 2024-10-22T01:10:21Z

See #1984

For moving deployments from legacy to k8s, we rsync the data to cephfs, then rsync again just before release, in order to get the datasets that were modified after the previous rsync.

To minimize downtime, we want to index this "delta" of datasets (instead of re-indexing all) - so we need a script that:

gets the pid of each dataset modified since a given date/time, and then
calls the metacat admin API to reindex those specific datasets only, by pid.


mbjones · 2024-10-22T15:41:31Z

Are you rsyncing the original data and documents directories from /var/metacat, or the converted hashstore directories? If the former, the filename is a mangled version of the PID for each object. So, simply saving the list of files transferred by rsync and then unmangling the PIDs would give the list of objects that need to be indexed. If the latter, then we'd probably need to look up the PIDs for each transferred object from hashstore's PID metadata. Either way, once you have the list of PIDs, is that sufficient?
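As a rough illustration of the "save the list of files transferred by rsync" step, here is a minimal sketch. It assumes the rsync is run from /var/metacat/ with `--out-format='%n'`, so that each log line is a path relative to that directory; the function name `transferred_files` is hypothetical. The unmangling step is deliberately left out, because the filename-to-PID scheme is not described in this thread.

```shell
# Sketch only: filters an rsync --out-format='%n' log read from stdin.
# Assumes paths are relative to /var/metacat/ (e.g. "data/<file>").
transferred_files() {
    # Keep entries under data/ or documents/, drop directory entries
    # (rsync's %n prints those with a trailing "/"), then strip the
    # directory part so only the on-disk filename remains.
    grep -E '^(data|documents)/' | grep -v '/$' | sed 's:^.*/::'
}
```

Usage would look something like `rsync -a --out-format='%n' /var/metacat/ DEST/ | transferred_files > changedFiles.txt`, where DEST and changedFiles.txt are placeholders. Each resulting filename would still need to be converted back to its PID.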

artntek · 2024-10-22T16:30:07Z

Thanks for this info. Answers:

Are you rsyncing the original data and documents directories from /var/metacat, or the converted hashstore directories?

Since we are now no longer upgrading metacat "in place" before moving to k8s, we're rsyncing the original data and documents directories from /var/metacat, and then the hashstore conversion will happen on k8s - see release plan

Either way, once you have the list of PIDs, is that sufficient?

Yes - all we need is the list of pids. I've updated the description above to make this more clear. Thanks

artntek · 2024-10-22T16:41:56Z

Additional potentially-useful info, from k8s-cluster-config/MetacatQuickRef.md:

API call to list objects in reverse order of modification:

https://test.arcticdata.io/metacat/d1/mn/v2/object

API call to see if an object exists on the k8s instance yet:

https://arctic-dev.test.dataone.org/metacat/d1/mn/v2/object/<pid>

To index a list of objects:

$   TOKEN=$( kubectl get secret MYRELEASE-indexer-token \
        -o jsonpath="{.data.DataONEauthToken}" | base64 -d )

## Assuming pidsToReindex.txt contains a list of the identifiers to be indexed...
$   for pid in $(cat pidsToReindex.txt); do \
        curl -X PUT -H "Authorization: Bearer $TOKEN" \
            "https://MYHOST/MYCONTEXT/d1/mn/v2/index?pid=$pid"; \
    done
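The loop above splits the file on any whitespace and places each pid into the URL unencoded, which may misbehave for identifiers containing characters such as `&`, `#` or spaces. A slightly more defensive sketch is shown below. It assumes the same `index?pid=` endpoint quoted from the quick reference; the function name `reindex_pids` and its arguments are hypothetical.

```shell
# Sketch only: reads one pid per line from stdin and asks the index
# endpoint to reindex it.
#   $1 = base URL up to and including /d1/mn/v2 (assumption)
#   $2 = bearer token
reindex_pids() {
    base_url=$1
    token=$2
    failed=0
    # "|| [ -n "$pid" ]" also handles a last line without a newline.
    while IFS= read -r pid || [ -n "$pid" ]; do
        [ -n "$pid" ] || continue    # skip blank lines
        # -G moves the --data-urlencode value into the query string,
        # so the pid is percent-encoded; -X PUT keeps the PUT method.
        if ! curl -sS -f -G -X PUT \
                -H "Authorization: Bearer $token" \
                --data-urlencode "pid=$pid" \
                "$base_url/index" < /dev/null; then
            printf 'reindex failed: %s\n' "$pid" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every request succeeded.
    [ "$failed" -eq 0 ]
}
```

It could be called as `reindex_pids "https://MYHOST/MYCONTEXT/d1/mn/v2" "$TOKEN" < pidsToReindex.txt`. Failed pids are reported on stderr so they can be retried.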

taojing2002 · 2024-10-22T17:35:18Z

Matthew: My understanding is that the script only figures out the newly added objects. However, an object also needs to be indexed if it is not newly added but its system metadata was changed. So maybe directly looking up the modification date in the system metadata table is the fastest way. Jing
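One possible way to pick up system metadata changes without querying the table directly is the object-listing call mentioned earlier. The sketch below assumes that the DataONE `listObjects` call accepts `fromDate` and `count` query parameters, that `fromDate` filters on the system metadata modification date, and that the response contains plain `<identifier>` elements. The function names are hypothetical. XML entities in identifiers are not decoded, and paging through results beyond `count` is not handled.

```shell
# Sketch only: pull the text of every <identifier> element out of an
# object-list XML document read from stdin, one pid per line.
extract_pids() {
    grep -o '<identifier>[^<]*</identifier>' \
        | sed -e 's:^<identifier>::' -e 's:</identifier>$::'
}

# List pids modified since a given time.
#   $1 = base URL up to and including /d1/mn/v2 (assumption)
#   $2 = fromDate as a UTC timestamp, e.g. 2024-10-01T00:00:00Z
#   $3 = maximum number of entries to request (count)
list_modified_pids() {
    curl -sS -f -G \
        --data-urlencode "fromDate=$2" \
        --data-urlencode "count=$3" \
        "$1/object" | extract_pids
}
```

For example, `list_modified_pids "https://MYHOST/MYCONTEXT/d1/mn/v2" "2024-10-01T00:00:00Z" 1000 > pidsToReindex.txt` would produce a file in the form the indexing loop expects. The timestamp and count here are placeholders.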


artntek self-assigned this Oct 22, 2024

artntek mentioned this issue Oct 22, 2024

Metacat 3.1.0 Release Plan #1984

Open

artntek removed their assignment Oct 22, 2024

artntek added this to the 3.1.0 milestone Oct 22, 2024

artntek self-assigned this Oct 29, 2024
