Add k8s workload testing agent #669
Conversation
Force-pushed from 70fa21f to 37baac9
Cool
```diff
 plugins:
   - name: nvidia-workload
-    image: testsys-nvidia-workload-test:v0.0.3
+    image: <your k8s-workload-agent image URI>
```
Should we put the expected image URI?
We don't include that in any of the other test agent examples. Maybe after this we can go through and update them all once we have a concrete image URI for all agents.
bottlerocket/agents/src/workload.rs (outdated)
```rust
let kubeconfig_arg = vec!["--kubeconfig", kubeconfig_path];
let status = Command::new(SONOBUOY_BIN_PATH)
    .args(kubeconfig_arg.to_owned())
    .arg("run")
    .arg("--wait")
    .arg("--namespace")
    .arg("testsys-workload")
    .args(plugin_test_args)
    .status()
    .context(error::WorkloadProcessSnafu)?;
info!("Workload testing has completed, checking results");
```
It probably shouldn't be in this PR, but it would be nice to not use `--wait` on the run command in case the connection to the cluster is broken for a short period of time. It would be better to `sonobuoy run` and then routinely check `sonobuoy status`. (This also should be fixed in the sonobuoy agent.)
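For illustration, a minimal sketch of that run-then-poll shape (not code from this PR; the polling interval, the `is_finished` helper, and the exact layout of the `sonobuoy status --json` output are assumptions):

```rust
// Hypothetical helper: returns true once the `sonobuoy status --json` output
// reports an overall "complete" state. Real code would parse with serde_json;
// the exact JSON field layout here is an assumption.
fn is_finished(status_json: &[u8]) -> bool {
    String::from_utf8_lossy(status_json).contains(r#""status":"complete""#)
}

// Start the run without `--wait` so a dropped connection can't fail the test...
let _status = Command::new(SONOBUOY_BIN_PATH)
    .args(kubeconfig_arg.to_owned())
    .arg("run")
    .arg("--namespace")
    .arg("testsys-workload")
    .args(plugin_test_args)
    .status()
    .context(error::WorkloadProcessSnafu)?;

// ...then routinely check `sonobuoy status`, treating failures as transient.
loop {
    std::thread::sleep(std::time::Duration::from_secs(30));
    let output = Command::new(SONOBUOY_BIN_PATH)
        .args(kubeconfig_arg.to_owned())
        .arg("status")
        .arg("--json")
        .arg("--namespace")
        .arg("testsys-workload")
        .output();
    match output {
        Ok(output) if output.status.success() && is_finished(&output.stdout) => break,
        // A failed status call may just be a short network blip; keep polling.
        _ => continue,
    }
}
```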
```rust
let mut num_failed: u64 = 0;
let mut num_skipped: u64 = 0;
let mut progress = Vec::new();
let mut outcome_summary = HashMap::from([
```
I don't love this map. A struct that implements `Default` would be nicer imo.
("pass", 0), | ||
("passed", 0), | ||
("fail", 0), | ||
("failed", 0), | ||
("timeout", 0), | ||
("timed-out", 0), |
In the theoretical struct you wouldn't need separate fields for the alternate spellings.
I may need a clearer picture of what you're envisioning here. It looks like plugins can report either variation for each result, so somewhere we need to evaluate them.
My thought with this structure was that we wouldn't have to check each value every time; just collect it all up and get the totals at the end.
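For illustration, a sketch of what that struct could look like (hypothetical names, not code from this PR), folding the alternate spellings into a single counter at the point where each result is recorded:

```rust
/// Sketch of the suggested struct; the field and method names are hypothetical.
#[derive(Default)]
struct OutcomeSummary {
    passed: u64,
    failed: u64,
    timed_out: u64,
}

impl OutcomeSummary {
    /// Fold one raw plugin result status into the summary, collapsing the
    /// alternate spellings ("pass"/"passed", etc.) into a single counter.
    fn record(&mut self, status: &str) {
        match status {
            "pass" | "passed" => self.passed += 1,
            "fail" | "failed" => self.failed += 1,
            "timeout" | "timed-out" => self.timed_out += 1,
            _ => (),
        }
    }
}
```

The collect-then-total flow stays the same shape; only the alternate-spelling lookup moves into `record`, e.g. `outcome_summary.record(results_status)` inside the results loop.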
```rust
    .context(error::MissingSonobuoyStatusFieldSnafu { field: "plugins" })?;

for result in plugin_results {
    let plugin = result
```
Do we have somewhere in `TestResults` that we could report the list of plugins that reported failures? I guess we could put it in `OtherInfo` along with the progress report(s)?
So maybe something like:
```rust
if (results_status == "fail" || results_status == "failed") && progress_status.is_empty() {
    progress.push(format!("{}: Failed", plugin));
}
```
bottlerocket/agents/src/workload.rs (outdated)
```rust
);

// Write out the output to a file we can reference later
let mut f = File::create(plugin_yaml.clone()).context(error::WorkloadProcessSnafu)?;
```
Seeing this path could save somebody some troubleshooting time someday.
```diff
-let mut f = File::create(plugin_yaml.clone()).context(error::WorkloadProcessSnafu)?;
+let plugin_yaml = PathBuf::from(".")
+    .join(format!("{}-plugin.yaml", plugin.name))
+    .canonicalize()
+    .context(error::BadPathSnafu { path: plugin_yaml.clone() })?;
+let mut f = File::create(&plugin_yaml).context(error::FileWriteSnafu { path: plugin_yaml.clone() })?;
```
Done, with some slight modifications to make this work right. Please confirm though.
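Since `Path::canonicalize` fails for paths that don't exist yet, the modification presumably reorders the steps. A sketch of one way that could look (an assumption, not the code that actually landed; the error variants are taken from the suggestion above):

```rust
// Sketch: create the file first so the path exists on disk, then
// canonicalize it to an absolute path that is useful in logs.
let plugin_yaml = PathBuf::from(".").join(format!("{}-plugin.yaml", plugin.name));
let mut f = File::create(&plugin_yaml)
    .context(error::FileWriteSnafu { path: plugin_yaml.clone() })?;
let plugin_yaml = plugin_yaml
    .canonicalize()
    .context(error::BadPathSnafu { path: plugin_yaml.clone() })?;
```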
Force-pushed from 37baac9 to 7cfdb40
One small nit
Force-pushed from 7cfdb40 to 4184499
Looks like the "integration" tests are not happy.
This adds a test agent for running cluster workload tests. These are tests that run some sort of workload on the cluster to verify system functionality.

Signed-off-by: Sean McGinnis <[email protected]>

Force-pushed from 4184499 to 1e269d2
Was the failed integ test run just a fluke?
No, it was a legitimate unit test failure due to a quick "safe" change I had made in the last update. Sorry, I thought I had commented on this PR, but there's no comment here. (Where did I post that?!) If you look at the compare from the last force-push, it was a small change due to an optional value we look for in the results.
Issue number:
Closes: #616
Description of changes:
This adds a test agent for running cluster workload tests. These are tests that run some sort of workload on the cluster to verify system functionality.
Testing done:
Replaced the `example-test-agent` with `k8s-workload-agent` and its settings. Unable to run the NVIDIA tests against a local kind cluster, but this verified things were deployed and status was reported back. This, along with the previous step that verified the workload test itself, indicates things would perform correctly in an actual test (non-kind) cluster.

Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.