# GitHub and Self-Hosted Runners

We've run into the main obstacle of using the free, GitHub-hosted runners:
a single job must finish in under six hours. Instead of attempting
the difficult process of partitioning the image build into multiple
jobs that each take less than six hours, we've chosen to attach a self-hosted
runner to this repository, which allows us to expand the time limit up to 35 days
as we see fit. This document outlines the motivation for this change, how it
was implemented, how to maintain the current self-hosted runner, and how to
revert this change if desired.

## Motivation
As mentioned above, the build step for this build context lasts longer than six
hours for non-native architecture builds. Most (if not all) of GitHub's runners
have the amd64 architecture, but we desire to also build an image for the arm64
architecture since several of our collaborators have arm-based laptops they are
working on (most prominently, Apple's M series). A build for the native architecture
takes roughly three hours while a build for a non-native architecture, which
requires emulation with `qemu`, takes about ten times as long (about 30 hours).
This emulation build is well over the six-hour time limit of GitHub runners and
would require some serious development of the Dockerfile and build process in
order to cut it up into sub-jobs, each of which runs for at most six hours
(especially since some build _steps_ were taking longer than six hours on their own).

Putting all this information together, a natural choice was to move the building
of these images to a self-hosted runner in order to (1) support more architectures
besides amd64, (2) avoid an intricate (and potentially impossible) redesign of
the build process, and (3) expand the job time limit to include a slower emulation
build.

## Implementation
While individual builds take a very long time, we do not do a full build of the
image very frequently. In fact, besides the OS upgrade, there hasn't been a change
in dependencies for months at a time. Thus, we don't need a high-performing set
of runners; we simply need a single machine that can host a handful of runners,
each building the image for one architecture at a time in a single-threaded,
single-core manner.

Once we enter the space of self-hosted runners, there is a lot of room to explore
different customization options. The GitHub runner application runs as a user on
the machine it is launched from, so we could highly specialize that machine's
environment to make the actions it performs more efficient. I chose to _not_ go
this route because I am worried about the maintainability of self-hosted runners
for a relatively small collaboration like LDMX. For this reason, I chose to mimic
a GitHub runner as closely as possible in order to reduce the number of changes
necessary to the workflow definition - allowing future LDMX collaborators to stop
using self-hosted runners if they want or need to.

### Workflow Definition
In the end, I needed to change the [workflow definition](../.github/workflows/ci.yml)
in five ways; a rough sketch of how these changes fit together follows the list.

1. `runs-on: self-hosted` - tell GitHub to use the registered self-hosted runners rather than their own
2. `timeout-minutes: 43200` - increase the job time limit to 30 days so the emulation build can complete
3. Add `type=local` cache at a known location within the runner filesystem
4. Remove `retention-days` limit since the emulation build may take many days
5. Add the `linux/arm64` architecture to the list of platforms to build on
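
The sketch below shows roughly how changes (1), (2), and (5) could sit in a workflow
like ours. It is an illustration, not a copy of `ci.yml`: the matrix layout, action
versions, and image tag are placeholders, change (3) is sketched further below, and
change (4) touches a part of the workflow not shown here.

```yaml
jobs:
  build:
    # (1) use the registered self-hosted runners instead of GitHub-hosted ones
    runs-on: self-hosted
    # (2) 43200 minutes = 30 days, long enough for the emulated arm64 build
    timeout-minutes: 43200
    strategy:
      matrix:
        # (5) arm64 added alongside the native amd64
        platform: [linux/amd64, linux/arm64]
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v4
        with:
          platforms: ${{ matrix.platform }}
          push: true
          tags: ldmx/dev:latest  # placeholder tag, not the real one
```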

The local cache is probably the most complicated piece of the puzzle and
I will not attempt to explain it here since I barely understand it myself.
For future workflow developers, I'd point you to Docker's
[cache storage backends](https://docs.docker.com/build/cache/backends/)
and
[cache management with GitHub actions](https://docs.docker.com/build/ci/github-actions/cache/)
documentation.
The current implementation stores a cache of the layers created during
the build process on the local filesystem (i.e. on the disk of the runner).
These layers need to be separated by platform so that the different architectures
do not interfere with each other.
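
For concreteness, the cache options on the build step could look like the fragment
below. The cache directory and the choice to key it by `${{ matrix.platform }}` are
assumptions for illustration rather than a copy of the real workflow, but they show
one way to keep the two architectures' layers in separate directories.

```yaml
      - uses: docker/build-push-action@v4
        with:
          platforms: ${{ matrix.platform }}
          push: true
          # (3) layer cache on the runner's own disk, one directory per platform
          #     (the path here is illustrative, not the actual location on the runner)
          cache-from: type=local,src=/home/github/layer-cache/${{ matrix.platform }}
          cache-to: type=local,dest=/home/github/layer-cache/${{ matrix.platform }},mode=max
```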

### Self-Hosted Runner
GitHub has good documentation on
[Self-Hosted Runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners)
that you should look at if you want to learn more.
This section is merely here to document how the runners for this repository
were configured.

First, I should note that I put all of this setup inside a Virtual Machine on
the computer I was using in order to attempt to keep it isolated. This is meant
to provide some small amount of security since GitHub points out that a malicious
actor could fork this repository and
[run arbitrary code on the machine by making a PR](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#self-hosted-runner-security).[^1]

- VM with 2 cores and ~3/4 of the memory of the machine
- Ubuntu 22.04 Minimal Server
- Install OpenSSH so we can connect to the VM from the host machine
- Have `github` be the username (so the home directory corresponds to the directory in the workflow)
- [Install docker engine](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)
- Make sure `tmux` is installed so we can start up the runner and detach
- Follow [post-install instructions](https://docs.docker.com/engine/install/linux-postinstall/) to allow `docker` to be run by users
- Follow [Add a Self-Hosted Runner](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners)
treating _the VM_ as the runner and _not_ the host
- spawn a new runner for _each_ core, giving each runner its own working directory and its own copy of the actions-runner software
- add the `UMN` label to the runners during config so LDMX knows where they are

I'd like to emphasize how simple this procedure was.
GitHub has put a good amount of effort into making Self-Hosted runners easy to connect,
so I'd encourage other LDMX institutions to contribute a node if the UMN one goes down.

[^1]: Besides this isolation step, I further isolated this node by working with our
IT department to take control of the node - separating it from our distributed filesystem
hosting data as well as our login infrastructure.

## Maintenance
The maintenance for this runner is also relatively simple. Besides periodically
checking that it is up and running, someone at UMN[^2] should log into the node
from time to time to check how much disk space is available within the VM. This is
necessary because I have implemented the workflow and the runner to **not** delete
the layer cache **ever**: we do not build the image very frequently, so we'd happily
keep pre-built layers for months or longer if it means that adding a new layer at
the end builds faster.

A full build of the image from scratch uses ~1.8GB of cache and we have allocated ~70GB
to the cache inside the VM. This should be enough for the foreseeable future.

Future improvements to this infrastructure could include adding a workflow whose job
is to periodically connect to the runner, check the disk space, and - if space is
running low - attempt to clean out layers that aren't likely to be used again.
`docker buildx` has cache maintenance tools that we could leverage **if** we specialize
the build action more by having the runner be configured with pre-built docker builders
instead of using the `setup-buildx` action as a part of our workflow. I chose to _not_ go
this route so that it is easier to revert back to GitHub-hosted building, where persisting
docker builders between workflow runs is not possible.
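
As a starting point for such an improvement, a scheduled workflow along the lines of
the hypothetical sketch below could at least automate the periodic disk check; the
workflow name, cache path, and 90% threshold are all made up for illustration.

```yaml
name: check-runner-disk
on:
  schedule:
    - cron: '0 6 * * 1'  # weekly on Monday morning (UTC)
jobs:
  disk-check:
    runs-on: self-hosted
    steps:
      - name: report disk usage
        run: |
          # overall usage of the VM disk
          df -h /
          # size of the layer cache (path is a placeholder)
          du -sh /home/github/layer-cache || true
      - name: warn if the disk is nearly full
        run: |
          used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
          if [ "$used" -gt 90 ]; then
            echo "runner disk is ${used}% full - consider cleaning the layer cache" >&2
            exit 1
          fi
```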

[^2]: For UMN folks, the username/password for the node and the VM within it are written
down in the room the physical node is in. The node is currently in PAN 424 on the central
table.

## Revert
This is a somewhat large infrastructure change, so I made the conscious choice to keep
reversion easy and accessible. If a self-hosted runner becomes infeasible, or GitHub changes
its policies to allow longer job run times (perhaps through some scheme of a total run time
limit rather than a per-job time limit), we can go back to GitHub-hosted runners for
the building by updating the workflow; a sketch of the reverted job follows the list.

1. `runs-on: ubuntu-latest` - use GitHub's ubuntu-latest runners
2. remove `timeout-minutes: 43200`, which will drop the time limit back to whatever GitHub imposes (currently six hours)
3. remove caching parameters (GitHub's caching system is too short-lived and too small at the free tier to be useful for our builds)
4. remove the `linux/arm64` architecture from the list of platforms to build for (unless GitHub allows jobs a longer run time)
   - Images can still be built for the arm architecture, but those builds would need to happen manually, either by the user with that computer
     or by someone willing to run the emulation and figure out how to update the manifest for an image tag
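
In workflow terms, the reverted job would look roughly like the following sketch
(again illustrative, with placeholder action versions and image tag, not a drop-in
replacement):

```yaml
jobs:
  build:
    # (1) back to GitHub-hosted runners
    runs-on: ubuntu-latest
    # (2) no timeout-minutes, so GitHub's default six-hour job limit applies
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v4
        with:
          # (3) no cache-from/cache-to
          # (4) amd64 only, since an emulated arm64 build cannot finish in six hours
          platforms: linux/amd64
          push: true
          tags: ldmx/dev:latest  # placeholder tag
```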
