Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add k8s workload testing agent #669

Merged
merged 1 commit into from
Dec 6, 2022

Conversation

stmcginnis
Copy link
Contributor

@stmcginnis stmcginnis commented Nov 23, 2022

Issue number:

Closes: #616

Description of changes:

This adds a test agent for running cluster workload tests. These are tests that run some sort of workload on the cluster to verify system functionality.

Testing done:

  • Created test container image, manually stepped through the sonobuoy commands in the agent and verified the job runs and is successful on aws-k8s-nvidia nodes.
  • Followed the developer instructions for running a test, replacing example-test-agent with k8s-workload-agent and its settings. Unable to run the NVIDIA tests against a local kind cluster, but this verified things were deployed and status reported back. This along with the previous step that verified the workload test itself indicates things would perform correctly in an actual test (non-kind) cluster.
  • More testing to come!

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@stmcginnis stmcginnis marked this pull request as draft November 23, 2022 20:25
Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool

Dockerfile Show resolved Hide resolved
plugins:
- name: nvidia-workload
image: testsys-nvidia-workload-test:v0.0.3
image: <your k8s-workload-agent image URI>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put the expected image URI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't include that in any of the other test agent examples. Maybe after this we can go through and update them all of we have a concrete image URI for all agents.

bottlerocket/agents/src/sonobuoy.rs Outdated Show resolved Hide resolved
Comment on lines 52 to 72
let kubeconfig_arg = vec!["--kubeconfig", kubeconfig_path];
let status = Command::new(SONOBUOY_BIN_PATH)
.args(kubeconfig_arg.to_owned())
.arg("run")
.arg("--wait")
.arg("--namespace")
.arg("testsys-workload")
.args(plugin_test_args)
.status()
.context(error::WorkloadProcessSnafu)?;
info!("Workload testing has completed, checking results");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably shouldn't be in this pr, but it would be nice to not use --wait on the run command incase connection to the cluster is broken for a short period of time. It would be better to sonobuoy run and then routinely check sonobuoy status. (This also should be fixed in the sonobuoy agent)

bottlerocket/types/src/agent_config.rs Outdated Show resolved Hide resolved
bottlerocket/agents/src/bin/k8s-workload-agent/main.rs Outdated Show resolved Hide resolved
let mut num_failed: u64 = 0;
let mut num_skipped: u64 = 0;
let mut progress = Vec::new();
let mut outcome_summary = HashMap::from([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this map. A struct that implements default would be nicer imo.

Comment on lines +160 to +164
("pass", 0),
("passed", 0),
("fail", 0),
("failed", 0),
("timeout", 0),
("timed-out", 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the theoretical struct you wouldn't need separate fields for the alternate spellings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may need a clearer picture of what you're envisioning here. It looks like plugins can report either variation for each result, so somewhere we need to evaluate them.

My thought with this structure was we wouldn't have to check each value every time, just collect it all up and get the totals at the end.

.context(error::MissingSonobuoyStatusFieldSnafu { field: "plugins" })?;

for result in plugin_results {
let plugin = result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have somewhere in TestResults that we could report the list of plugins that reported failuers? I guess we could put in OtherInfo along with the progress report(s)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe something like:

if (results_status == "fail" || results_status == "failed") && progress_status.is_empty() {
    progress.push(format!("{}: Failed", plugin);
}

bottlerocket/agents/src/sonobuoy.rs Outdated Show resolved Hide resolved
);

// Write out the output to a file we can reference later
let mut f = File::create(plugin_yaml.clone()).context(error::WorkloadProcessSnafu)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing this path could save somebody some troubleshooting time someday.

Suggested change
let mut f = File::create(plugin_yaml.clone()).context(error::WorkloadProcessSnafu)?;
let plugin_yaml = PathBuf::from(".")
.join(format!("{}-plugin.yaml", plugin.name))
.canonicalize()
.context(error::BadPathSnafu{ path: plugin_yaml.clone())?;
let mut f = File::create(&plugin_yaml).context(error::FileWrite{ path: plugin_yaml.clone()})?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, with some slight modifications to make this work right. Please confirm though.

bottlerocket/agents/src/workload.rs Outdated Show resolved Hide resolved
bottlerocket/agents/src/workload.rs Outdated Show resolved Hide resolved
bottlerocket/agents/src/workload.rs Outdated Show resolved Hide resolved
bottlerocket/agents/src/workload.rs Outdated Show resolved Hide resolved
@stmcginnis stmcginnis marked this pull request as ready for review December 1, 2022 14:33
Copy link
Contributor

@ecpullen ecpullen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit

bottlerocket/agents/src/sonobuoy.rs Outdated Show resolved Hide resolved
bottlerocket/agents/src/workload.rs Show resolved Hide resolved
Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the "integration" tests are not happy.

This adds a test agent for running cluster workload tests. These are
tests that run some sort of workload on the cluster to verify system
functionality.

Signed-off-by: Sean McGinnis <[email protected]>
Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the failed integ test run just a fluke?

@stmcginnis
Copy link
Contributor Author

Was the failed integ test run just a fluke?

No, it was a legitimate unit test failure due to a quick "safe" change I had made in the last update. Sorry, I thought I had commented on this PR, but no comment here. (Where did I post that?!). If you look at the Compare from the last force-push, it was a small change due to an optional value we look for in the results.

@stmcginnis stmcginnis merged commit e25fc10 into bottlerocket-os:develop Dec 6, 2022
@stmcginnis stmcginnis deleted the workload-testing branch December 6, 2022 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

workload test agent for k8s/sonobuoy
4 participants