add more details to prediction and clarify other docs
cmelone committed Feb 5, 2024
1 parent 0ff4187 commit 7ff1993
Showing 4 changed files with 109 additions and 13 deletions.
7 changes: 2 additions & 5 deletions docs/arch.md
@@ -1,8 +1,5 @@
# Architecture

**To implement**:
- GitLab webhooks may occasionally fail to send, could poll the jobs once per hour and collect any stragglers.
- [Litestream](https://litestream.io) + SQLite for database
![flow chart showing spack-gantry architecture](./img/arch.png)
25 changes: 20 additions & 5 deletions docs/context.md
@@ -1,27 +1,42 @@
# Context

Build definitions compiled in Spack's CI are manually assigned to categories that reflect the resources they are expected to consume. These amounts are used to allocate multiple jobs as efficiently as possible on a CI runner. Without accurate information about how much memory or CPU a compile will use, jobs can be misallocated, which affects the following components of the CI system:

- Cost per job
- Build walltime
- Efficiency of VM packing
- Utilization of resources per job
- Build failures due to lack of memory
- Build error rate
- Overall throughput

Jobs with mismatched time estimates are also allocated to instances inappropriately, leading to situations where many small jobs complete quickly while a larger job holds the instance for the rest of its duration without using every available cycle. We currently retry jobs up to 3 times to work around stochastic CI failures, which creates more potential waste when the failure was legitimate. Instead, we would like to retry jobs only when the cause of termination was resource contention.

Due to problems of scale and the inability to manually determine the resource demand of a given build, we have decided that an automated framework that captures historical utilization and outputs predictions for future usage is the best course of action.

With this setup, we can transition to a system where build jobs will request the appropriate amount of resources, which will reduce waste and contention with other jobs within the same namespace. Additionally, by amassing a vast repository of build attributes and historical usage, we can further analyze the behavior of these jobs and perform experiments within the context of the CI. For instance, understanding why certain packages are more variable in their memory usage during compilation, or determining if there is a "sweet spot" that minimizes resource usage but leads to an optimal build time for a given configuration (i.e. a scaling study).

A corollary to this is building a system that handles job failures with some intelligence. For instance, if a build was OOM killed, `gantry` would resubmit the same job with more memory. Jobs that fail for other reasons would not be retried and would instead be resolved through the appropriate channels.
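
As a rough sketch of that retry policy (the exit-reason string, retry factor, and `resubmit` hook here are placeholders, not gantry's actual API):

```python
OOM_RETRY_FACTOR = 1.5  # assumption: bump memory by 50% after an OOM kill
MAX_RETRIES = 3

def handle_failed_job(job: dict, exit_reason: str, retries_so_far: int, resubmit) -> None:
    """Retry only failures caused by resource contention.

    `job`, `exit_reason`, and `resubmit` are hypothetical stand-ins for
    gantry's real job metadata and submission hook.
    """
    if exit_reason == "OOMKilled" and retries_so_far < MAX_RETRIES:
        # give the resubmitted job more memory than the attempt that was killed
        job["memory_request"] = int(job["memory_request"] * OOM_RETRY_FACTOR)
        resubmit(job)
    # any other failure is left for the appropriate channels to resolve
```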

## Current resource allocation

Each build job comes with a memory and CPU request. Kubernetes uses these numbers to allocate the job onto a specific VM. No limits are sent, meaning that a compilation can crowd out other jobs and that there are no consequences for exceeding its expected utilization.

-----

To illustrate the problem, let's go through some usage numbers (across all builds):

**Memory**

- avg/request = 0.26
- max/request = 0.64

**CPU**

- avg/request = 1.25
- max/request = 2.69

There is a lot of misallocation going on here. As noted above, limits are not enforced, so the request is the closest baseline we have for comparing against actual usage. Bottom line: we are using far less memory than we request, and more CPU than we ask for.
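
For a concrete reading of those ratios: a build that requests 8 GB but averages ~2 GB of usage lands at avg/request ≈ 0.26. A sketch of the fleet-wide calculation, with illustrative field names:

```python
def usage_ratios(builds: list[dict]) -> tuple[float, float]:
    """Compute fleet-wide avg/request and max/request ratios.

    `builds` is a list of records with `usage_avg`, `usage_max`, and
    `request` fields (hypothetical names), all in the same unit.
    """
    avg_ratio = sum(b["usage_avg"] / b["request"] for b in builds) / len(builds)
    max_ratio = sum(b["usage_max"] / b["request"] for b in builds) / len(builds)
    return avg_ratio, max_ratio
```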

## Cost per job

12 changes: 10 additions & 2 deletions docs/data-collection.md
@@ -2,9 +2,17 @@

Job metadata is retrieved through the Spack Prometheus service (https://prometheus.spack.io).

Gantry exposes a webhook handler at `/v1/collection` that accepts a job status payload from GitLab, collects build attributes and usage, and submits them to the database.

See `/db/schema.sql` for a full list of the data that is being collected.
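
A minimal sketch of that handler, assuming an aiohttp-style service; the payload field and `collect_build` helper are illustrative rather than gantry's exact implementation:

```python
from aiohttp import web

routes = web.RouteTableDef()

async def collect_build(payload: dict) -> None:
    """Hypothetical helper: fetch usage from Prometheus and insert the
    build attributes + usage into the database (see /db/schema.sql)."""
    ...

@routes.post("/v1/collection")
async def collect(request: web.Request) -> web.Response:
    payload = await request.json()
    # GitLab job webhooks fire on every status change; usage metrics only
    # exist once the job has finished, so ignore everything else.
    if payload.get("build_status") not in ("success", "failed"):
        return web.Response(status=200)
    await collect_build(payload)
    return web.Response(status=200)

app = web.Application()
app.add_routes(routes)
```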

## Units

Memory usage is stored in bytes, while CPU usage is stored in cores. Pay special attention to these units when performing calculations on the relevant fields or sending their values to Kubernetes or another external service, which may expect different units.
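
For example, Kubernetes quantities are usually expressed as mebibytes (`Mi`) for memory and millicores (`m`) for CPU, so a conversion step is needed before sending values onward. A sketch (the rounding choices here are ours):

```python
def bytes_to_mebibytes(mem_bytes: int) -> str:
    """Convert a byte count (how gantry stores memory) to a Kubernetes
    quantity string, e.g. 2147483648 -> "2048Mi"."""
    return f"{round(mem_bytes / (1024 ** 2))}Mi"

def cores_to_millicores(cores: float) -> str:
    """Convert a core count (how gantry stores CPU) to millicores,
    e.g. 1.5 -> "1500m"."""
    return f"{round(cores * 1000)}m"
```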

------

Links to documentation for metrics available:
- [node](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/node-metrics.md)
- [pod](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md)
- [container](https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md)

To programmatically access Prometheus, one must send a copy of a session cookie with the request. In the future, we should be able to authenticate with a generated API token.
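
A sketch of what that looks like with the `requests` library; the cookie name below is a placeholder copied from a browser session rather than a documented API:

```python
import requests

PROMETHEUS_URL = "https://prometheus.spack.io/api/v1/query"

def query_prometheus(query: str, session_cookie: str) -> list:
    """Run an instant PromQL query, authenticating with a copied session
    cookie (the cookie name `_oauth2_proxy` is a placeholder)."""
    resp = requests.get(
        PROMETHEUS_URL,
        params={"query": query},
        cookies={"_oauth2_proxy": session_cookie},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]
```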
78 changes: 77 additions & 1 deletion docs/prediction.md
@@ -1,3 +1,79 @@
# Prediction

The basic idea here is: given metadata about a future build, recommend resource requests and limits, as well as the number of build jobs.
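
Concretely, something along these lines (an illustrative shape, not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prediction:
    """Illustrative shape of what gantry would hand back for a future
    build, given its metadata (package, version, compiler, variants, ...)."""
    cpu_request: float          # cores
    mem_request: int            # bytes
    cpu_limit: Optional[float]  # cores; None until limits are implemented
    mem_limit: Optional[int]    # bytes; None until limits are implemented
    build_jobs: int             # parallelism to use for the build
```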

The goal is that this eventually becomes a self-learning system which can facilitate better predictions over time without much interference.

At the moment, this isn't accomplished through some super fancy model. Here's our approach for each type of prediction:

**Requests**

We optimize so that the ratio of mean usage to predicted mean is as close to 1 as possible.

Formula: `avg(mean_usage)` over the past N builds.
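
A sketch of that formula, assuming each historical build record carries its mean usage (the field name `mean_usage` is ours):

```python
def predict_request(past_builds: list[dict], n: int = 5) -> float:
    """Predict a resource request as the average of mean usage over the
    N most recent matching builds (assumes `past_builds` is non-empty)."""
    recent = past_builds[-n:]
    return sum(b["mean_usage"] for b in recent) / len(recent)
```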

**Limits**

- For CPU: optimize for the number of cores that gives the best efficiency
- For RAM: optimize to avoid OOM kills (max usage / predicted max < 1)

This one is a bit trickier to implement because we would like to avoid OOM kills as much as possible; we've thought about allocating 10-15% above the historical maximum usage.

We could also figure out the upper threshold by calculating the ratio of the skimpiest predicted limit to the maximum usage for that job.

However, when doing this, I've stumbled upon packages with unpredictable usage patterns that can swing between 200-400% of each other (with no discernible differences).

More research and care will be needed when we finally decide to implement limit prediction.
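
If we go the headroom route, the calculation itself is straightforward; the sketch below uses 15% as one point in the 10-15% range mentioned above, with illustrative field names:

```python
OOM_HEADROOM = 1.15  # assumption: 15% above the historical maximum

def predict_mem_limit(past_builds: list[dict]) -> int:
    """Tentative memory-limit prediction: historical max usage plus headroom,
    aiming to keep max_usage / predicted_max below 1 for most builds."""
    historical_max = max(b["max_usage"] for b in past_builds)
    return int(historical_max * OOM_HEADROOM)
```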

### Predictors

We've done some analysis to determine the best predictors of resource usage. Because we need to return a result regardless of the confidence we have in it, we've developed a priority list of predictors to match on.

1. `("pkg_name", "pkg_version", "compiler_name", "compiler_version")`
2. `("pkg_name", "compiler_name", "compiler_version")`
3. `("pkg_name", "pkg_version", "compiler_name")`
4. `("pkg_name", "compiler_name")`
5. `("pkg_name", "pkg_version")`
6. `("pkg_name",)`

Variants are always included as a predictor.

Our analysis shows that the optimal number of builds to include in the prediction function is five, though we prefer using four builds at the current level over dropping down to the next set in the list.

We do not use PR builds as part of the training data, as they are potential vectors for manipulation and can be error prone. The predictions will apply to both PR and develop jobs.
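
A sketch of the fallback logic described above; `fetch_matching_builds` is a hypothetical query helper, and the minimum sample size follows the four/five-build observation:

```python
PREDICTOR_SETS = [
    ("pkg_name", "pkg_version", "compiler_name", "compiler_version"),
    ("pkg_name", "compiler_name", "compiler_version"),
    ("pkg_name", "pkg_version", "compiler_name"),
    ("pkg_name", "compiler_name"),
    ("pkg_name", "pkg_version"),
    ("pkg_name",),
]

def training_sample(build_meta: dict, fetch_matching_builds, min_builds: int = 4) -> list:
    """Walk the predictor priority list and return the first sample with
    enough non-PR builds; fall back to whatever the loosest match returns.

    `fetch_matching_builds(filters)` is a hypothetical helper that queries
    the database for past builds matching the given fields (variants are
    always included in the filter)."""
    sample = []
    for predictor_set in PREDICTOR_SETS:
        filters = {key: build_meta[key] for key in predictor_set}
        sample = fetch_matching_builds(filters)
        if len(sample) >= min_builds:
            break
    return sample
```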

## Plan

1. In the pilot phase, we will only be implementing predictions for requests, and ensuring that they will only increase compared to current allocations.

2. If we see success in the pilot, we'll implement functionality which retries jobs with higher memory allocations if they've been shown to fail due to OOM kills.

3. Then, we will "drop the floor" and allow the predictor to allocate less memory than the package is used to. At this step, requests will be fully implemented.

4. Limits for CPU and memory will be implemented.

5. Next, we want to introduce some experimentation in the system and perform a [scaling study](#fuzzing).

6. Design a scheduler that decides which instance type a job should be placed on based on cost and expected usage and runtime.

## Evaluation

The success of our predictions can be evaluated against a number of factors:

- How much cheaper is the job?
- Closeness of request or limit to actual usage
- Jobs being killed due to resource contention
- Error distribution of prediction
- How much waste is there per build type?
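
Some of these factors fall straight out of the collected data; for example, a sketch of the error distribution of request predictions (illustrative field names):

```python
import statistics

def request_error_distribution(builds: list[dict]) -> dict:
    """Summarize how far predicted requests land from actual mean usage.
    An error of 0 is a perfect prediction; positive means we over-requested
    (waste), negative means we under-requested."""
    errors = [
        (b["request"] - b["mean_usage"]) / b["mean_usage"] for b in builds
    ]
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "stdev": statistics.pstdev(errors),
    }
```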

## Fuzzing

10-15% of builds would be randomly selected to have their CPU limit modified up or down. This would happen a few times per build, so we can see whether there is an optimal efficiency for the job, which would then be used to define the future CPU limit and number of build jobs.

We're essentially adding variance to the resource allocation and seeing how the system responds.

This is a strong scaling study, and the plot of interest is the efficiency curve.

Efficiency is defined as cores / build time.
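
A sketch of how the efficiency curve could be assembled from the fuzzed runs, using the cores / build time definition above (illustrative field names):

```python
from collections import defaultdict

def efficiency_curve(samples: list[dict]) -> dict:
    """Group fuzzed runs of one build by core count and compute the efficiency
    metric defined above (cores / build time). `samples` is a list of
    {"cores": int, "build_time_s": float} records."""
    by_cores = defaultdict(list)
    for s in samples:
        by_cores[s["cores"]].append(s["cores"] / s["build_time_s"])
    # average over repeated runs at each core count
    return {cores: sum(vals) / len(vals) for cores, vals in by_cores.items()}
```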
