Move arcticdata.io (Production) to Kubernetes #1954

Open
19 of 25 tasks
artntek opened this issue Aug 14, 2024 · 2 comments

artntek commented Aug 14, 2024

Similar to #1932.

Checklist:

  • Work with @nickatnceas to copy production data for testing:
    • Time how long it takes to...
      • Copy the production postgres data (arcticdata.io:/var/lib/postgresql) to the PROD ceph volume at /mnt/ceph/repos/arctic/postgresql (treat it like a hot backup).

        • NOTE: we do not need the /var/lib/postgresql/10 directory
      • Copy the following subset of production data from arcticdata.io:/var/metacat to the PROD ceph volume at /mnt/ceph/repos/arctic/metacat:

        # /var/metacat/...
        16K     ./certs
        63T     ./data
        8.0K    ./dataone
        3.9G    ./documents
        0       ./inline-data
        500K    ./logs
        • Actual times taken for /var/metacat/data:
          • initial rsync
            root@arctica:/var/metacat# time rsync -aHAX --delete /var/metacat/data/ /mnt/pdg/repos/arctic/metacat/data/
            
            real    14286m43.628s
            user    1131m15.740s
            sys     3907m38.871s
            ## -> 9.92 days
          • subsequent repeat rsync
            brooke@arctica:~$ time sudo rsync -rltDHX  /var/metacat/data/ /mnt/pdg/repos/arctic/metacat/data/
            [sudo] password for brooke:
            
            real	4m19.047s
            user	0m15.747s
            sys	0m34.912s
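
For reference, the 9.92-day figure above follows directly from the `real` value that `time` reported for the initial rsync. A minimal sketch of that conversion, assuming only a POSIX shell and awk; the helper name real_to_days is made up for illustration and is not part of any tooling used here:

```shell
# Convert a `real` value as printed by the shell's `time` (e.g. 14286m43.628s)
# into days, rounded to two decimal places.
real_to_days() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")        # parts[1] = minutes, parts[2] = seconds plus trailing "s"
    sub(/s$/, "", parts[2])     # drop the trailing "s"
    printf "%.2f\n", (parts[1] * 60 + parts[2]) / 86400
  }'
}

real_to_days 14286m43.628s   # prints 9.92
```

The same helper applied to the repeat run (4m19.047s) rounds to 0.00 days, which is the point: once the bulk copy is done, a catch-up rsync is a matter of minutes.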

Follow the Quick Reference: Metacat K8s Installation Steps. Supplementary TODOs below...

Persistent Volumes

  • Set up a PV to point to PROD cephfs .../repos/arctic/metacat for metacat
  • Set up a PV to point to PROD cephfs .../repos/arctic/postgres for postgres
  • Create a PVC for Postgresql; see prod_cluster/metacatarctic/pvc--metacatarctic-postgres.yaml
  • "csi-cephfs-sc-ephemeral" storageClass is missing on the prod cluster. Ask @nickatnceas to add it, as he did for the dev cluster:
    storageclass.storage.k8s.io "csi-cephfs-sc-ephemeral" not found

MetacatUI setup

  • Copy config (tokens) from adc server

Metacat Config

  • Add values.yaml overrides for non-default 2.19 settings (diff arcticdata.io $TOMCAT_HOME/webapps/metacat/WEB-INF/metacat.properties with default metacat.properties from 2.19 release)
  • Add values.yaml overrides for newly-introduced 3.0 settings (diff default metacat.properties from 3.0.0 release with default metacat.properties from 2.19 release)
  • Compare with test.arcticdata.io values overrides as a sanity check
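
One way to do the diffs above, as a rough sketch: strip comments and blank lines from each properties file and sort what is left, so that only genuine setting differences show up and ordering changes between releases are ignored. The function name props_diff and the file names in the usage comment are hypothetical placeholders, not the actual paths on arcticdata.io. It does not handle backslash-continued property values.

```shell
# Show settings that differ between two .properties files,
# ignoring comments, blank lines, and line order.
props_diff() {
  a=$(mktemp)
  b=$(mktemp)
  # Keep only non-comment, non-blank lines, sorted for a stable comparison.
  grep -Ev '^[[:space:]]*(#|$)' "$1" | sort > "$a"
  grep -Ev '^[[:space:]]*(#|$)' "$2" | sort > "$b"
  status=0
  diff "$a" "$b" || status=$?   # diff exits 1 when the files differ
  rm -f "$a" "$b"
  return "$status"
}

# Usage (placeholder file names):
#   props_diff metacat-2.19-default.properties metacat-prod.properties
```

Lines prefixed with `<` come from the first file and lines prefixed with `>` from the second, so each `>` line in a default-vs-production comparison is a candidate values.yaml override.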

First Deployment

  • Complete steps in "First Install - IMPORTANT IF MOVING DATA FROM AN EXISTING LEGACY DEPLOYMENT" BEFORE first startup!

  • Solr pods not starting. Root cause from logs:

    $ kc logs pod/metacatarctic-solr-1
      /scripts/setup.sh: line 8: /opt/bitnami/scripts/solr/entrypoint.sh: Permission denied

    SOLVED - was overriding extraVolumes values, and the override didn't include the permissions line

  • https://arctic-prod.test.dataone.org/catalog/ (trailing slash) works, but https://arctic-prod.test.dataone.org/catalog gives a 404 (nginx)

  • Ensure all data and documents files are group-writable (otherwise, the hashstore upgrader can't create hard links):

    sudo find /mnt/ceph/repos/arctic/metacat/data/ -type f ! -perm -g=w -exec chmod g+w {} +
  • chown -R 59997:59997 the ceph dir corresponding to /var/metacat, and update values.yaml to use this uid:gid

    brooke@datateam:/mnt/ceph/repos/arctic$ time sudo chown -R 59997:59997 metacat
    
    real	4m7.026s
    user	0m0.004s
    sys	0m0.027s
  • Hostname aliases and rewrite rules

    • Figure out how to do these with ingress; see all-sites-enabled.conf. Lots of complexity: e.g. http://aoncadis.org is aliased to adc.io, and the site conf has RewriteMaps with more than 3700 entries each.
    • EXPLANATION: aoncadis.org was the predecessor to the ADC site. These rewrite rules map existing, old dataset URLs to their new locations on ADC, so these rewrites need to be maintained somewhere.
    • Leave all the redirects/other sites on the current Apache host, and move only arcticdata.io.
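
Returning to the group-writable step above: before running the chmod across the whole tree, it can be useful to count how many files would actually be touched. A minimal sketch using the same find predicate as the command in the checklist; the function name count_not_group_writable is made up for illustration:

```shell
# Count regular files under a directory that lack the group-write bit.
count_not_group_writable() {
  find "$1" -type f ! -perm -g=w | wc -l | tr -d ' '
}

# Usage (path from the checklist above; needs sufficient read access):
#   count_not_group_writable /mnt/ceph/repos/arctic/metacat/data/
```

Re-running it after the chmod is a cheap way to confirm the step completed; the expected result at that point is 0.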

ATTENTION: Still To Do Before Final Deployment

  • Time hashstore conversion

  • Time reindex-all

  • MetacatUI + WordPress setup. How do we host it and link to k8s metacat?

    • ACTION: use a wordpress image/bitnami chart, deployed separately from the metacat helm chart
  • ACTION: Ask @nickatnceas for help with letsencrypt certs. Do we need to remove arcticdata.io from the wildcard cert on arctica? NOTE: we still need subdomain certs there (i.e. status.adc, beta.adc).

  • Skip 3.0.0 and deploy 3.1.0, but only after it's been running on less-trafficked hosts for a while. See the proposed release plan in issue #1984 (Metacat 3.1.0 Release Plan).

Testing - see Matt's comment below

@artntek artntek added the k8s Kubernetes/Helm Related label Aug 14, 2024
@artntek artntek self-assigned this Aug 14, 2024

mbjones commented Oct 15, 2024

For the Testing section, here's a quick rundown:

Get the R package dataone installed

  • Download and install R (required) and RStudio (nice but optional)
  • Check out rdataone and (probably) switch to the develop branch, depending on what you need to test
  • Open rdataone/dataone.RProj in RStudio
  • Run install.packages(c('remotes', 'devtools'))
  • Run devtools::load_all() to load the current dataone library code for testing
  • Run remotes::install_deps() to install all of the package dependencies

Run the tests against standard nodes

  • Log in to https://dev.nceas.ucsb.edu and copy your token for R from the web UI; paste the token options command into the R console and run it
  • Run devtools::test() to run the original tests against standard nodes

To run tests against a different node


artntek commented Nov 12, 2024

hashstore conversion notes

First conversion (with errors) took almost exactly 48 hours
