
kind, check-cluster-up: Enable Kubevirt CPUManager FG when SR-IOV provider is tested #1348

Open
wants to merge 1 commit into
base: main

ormergi
Contributor

@ormergi ormergi commented Jan 19, 2025

What this PR does / why we need it:
The "check-up-kind-sriov" lane was made optional and non-gating because it constantly fails, due to the following test failures:

SRIOV VMI connected to single SRIOV network should have cloud-init meta_data with tagged interface and aligned cpus to sriov interface numa node for VMIs with dedicatedCPUs
SRIOV VMI connected to single SRIOV network [test_id:3959]should create a virtual machine with sriov interface and dedicatedCPUs

These tests started failing the lane after the programmatic skips were removed from the kubevirt/kubevirt tests (kubevirt/kubevirt#13144).
Previously the tests were skipped silently (bad); now, following the removal of the programmatic skips, they fail loudly.
The root cause of the failures (or the previous skips) is that the tests depend on Kubevirt's CPUManager feature, which is not enabled at all; see the notes section below for more details *.

This PR fixes the lane by enabling Kubevirt's CPUManager feature when the SR-IOV provider is tested.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  • When Kubevirt's CPUManager feature is on, Kubevirt labels supporting nodes with the cpumanager=true label (done by the heartbeat controller).
    The failing tests create VMs with the dedicated-CPUs option, and Kubevirt gives each such VM's virt-launcher pod a node selector that requires the cpumanager=true label.
    The end result is that the tested VMs fail to become ready in time because scheduling is impossible: the VMs have a cpumanager=true node selector, but no node has the cpumanager=true label.
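
The mismatch described in the note above can be checked by hand. The following is a minimal sketch, not part of this PR: the function names are hypothetical, `${kubectl}` is assumed to point at a working kubectl (as it does in the cluster-up scripts), and the `kubevirt.io=virt-launcher` pod label is assumed to be the one KubeVirt puts on launcher pods.

```shell
#!/usr/bin/env bash
# Assumption: ${kubectl} is a kubectl binary or wrapper for the cluster under test.
kubectl=${kubectl:-kubectl}

# Hypothetical helper: print how many nodes carry the cpumanager=true label.
# A result of 0 means pods selecting cpumanager=true cannot be scheduled.
count_cpumanager_nodes() {
    ${kubectl} get nodes -l cpumanager=true -o name | wc -l | tr -d ' '
}

# Hypothetical helper: print each virt-launcher pod with its node selector,
# so a cpumanager=true selector can be spotted on pods that stay Pending.
launcher_node_selectors() {
    ${kubectl} get pods --all-namespaces -l kubevirt.io=virt-launcher \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeSelector}{"\n"}{end}'
}
```

If `count_cpumanager_nodes` prints 0 while `launcher_node_selectors` shows a cpumanager selector on a pending pod, that would match the failure mode described in the note.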

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

kind/check-cluster-up.sh enables Kubevirt's CPUManager feature when the SR-IOV provider is tested.

@kubevirt-bot kubevirt-bot added the dco-signoff: yes label Jan 19, 2025
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign brianmcarey for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ormergi
Contributor Author

ormergi commented Jan 19, 2025

/test check-up-kind-sriov

@ormergi ormergi force-pushed the check-kind-up-cpumanager branch from 3476f78 to ae4e2e5 Compare January 19, 2025 12:21
@ormergi ormergi force-pushed the check-kind-up-cpumanager branch from ae4e2e5 to 92b836a Compare January 19, 2025 12:43
@ormergi ormergi force-pushed the check-kind-up-cpumanager branch from 92b836a to 6a5991f Compare January 19, 2025 12:55
@ormergi
Contributor Author

ormergi commented Jan 19, 2025

@ormergi
Contributor Author

ormergi commented Jan 19, 2025

/cc @EdDev @orelmisan @nirdothan

ormergi added a commit to ormergi/project-infra that referenced this pull request Jan 19, 2025
… constantly failing (kubevirt#3878)"

This reverts commit 86cf6b7.

The PR kubevirt/kubevirtci#1348 fixes the issue
and stabilizes the lane.

Signed-off-by: Or Mergi <[email protected]>
@ormergi ormergi changed the title kind: Enable Kubevirt CPUManager FG when env supports it kind,check-cluster-up: Enable Kubevirt CPUManager FG when env supports it Jan 19, 2025
Member

@orelmisan orelmisan left a comment

Thank you for the PR @ormergi.
Could you please give a few words about why the PR is needed?

@@ -69,6 +69,11 @@ export CRI_BIN=${CRI_BIN:-$(detect_cri)}
fi
${kubectl} wait -n kubevirt kv kubevirt --for condition=Available --timeout 15m

if [[ "$KUBEVIRT_PROVIDER" =~ "sriov" ]]; then
    # Some SR-IOV tests require Kubevirt CPUManager feature
    ${kubectl} patch kubevirts -n kubevirt kubevirt --type=json -p='[{"op": "replace", "path": "/spec/configuration/developerConfiguration/featureGates","value": ["CPUManager"]}]'
Member

Please consider adding a feature gate to the end of the existing list, instead of replacing the whole list, as it will enable future expansion.

Contributor Author

@ormergi ormergi Jan 19, 2025

If additional FGs need to be enabled, I think there should still be a single patch call with all the necessary FGs.
The FG names can be aggregated and then passed to the patch call.
I didn't export the FG name to a variable because it's the only one at the moment.
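
The aggregation idea in the comment above could look roughly like the sketch below. It is an illustration only: `build_fg_replace_patch` is a hypothetical helper that does not exist in kind/check-cluster-up.sh, and it assumes feature gate names are plain identifiers that need no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one JSON patch document that sets the whole
# featureGates list to the names passed as arguments.
# Assumption: gate names contain no characters that require JSON escaping.
build_fg_replace_patch() {
    local gates="" gate
    for gate in "$@"; do
        # Join the quoted names with ", " (no separator before the first one).
        gates="${gates}${gates:+, }\"${gate}\""
    done
    printf '[{"op": "replace", "path": "/spec/configuration/developerConfiguration/featureGates", "value": [%s]}]\n' "${gates}"
}
```

With such a helper, the single patch call from the diff would become `${kubectl} patch kubevirts -n kubevirt kubevirt --type=json -p="$(build_fg_replace_patch CPUManager)"`, and further gates would only be extra arguments.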

Member

IMO this should not come as a hard dependency of the SR-IOV provider.
Do you see a problem with directly marking the need to have CPUManager as input from the caller?

Also, the patching is odd to me too.

  • Why do you use replace and not a simple add?
  • Why do you think it is better to assume there is only one FG? It will just make it harder for the next contributor to add other FGs in general.

Contributor Author

@ormergi ormergi Jan 20, 2025

I don't see a problem with setting the CPUManager feature to always on; in fact, in kubevirt/kubevirt it's always on.
We can go with that. Let me know what you think.
EDIT: In kubevirt/kubevirt tests, the CPUManager FG is enabled unless the env architecture is s390x (it used to be always on); it might be necessary to follow the same logic here.
I also updated the PR title & description to express that it affects the SR-IOV provider only.
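
The "enabled unless s390x" rule mentioned in the EDIT could be sketched as follows. This is a hypothetical helper, not code from kubevirt/kubevirt or from this PR; how kubevirtci would actually obtain the environment architecture is not shown in this thread, so the architecture is taken as a plain argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the feature gates to enable for an architecture.
# Prints "CPUManager" for every architecture except s390x, where it prints nothing.
feature_gates_for_arch() {
    local arch="$1"
    if [ "${arch}" = "s390x" ]; then
        return 0
    fi
    echo "CPUManager"
}
```

A caller could skip the patch entirely when the function prints nothing, for example by testing `[ -n "$(feature_gates_for_arch "${arch}")" ]` before patching, where `arch` is whatever variable holds the target architecture.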

Regarding the patch, I used "replace" because "add" didn't work for me in a way that lets me add the FG with a one-liner.
When the Kubevirt CR's developerConfiguration.featureGates is not initialized, one patch is required to initialize it and another to add the FG.
Using "replace" the way I did makes a one-liner possible.

Passing the provider name is the simplest way I could find to solve it and get the lane gating as soon as possible.
We can introduce an additional env var to hold an FG list in a follow-up PR if it is needed (I tried to keep things simple).
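
One more option for the one-liner concern, offered only as a sketch that was not proposed or tested in this thread: a JSON merge patch (`kubectl patch --type=merge`) creates missing intermediate objects, so it does not depend on developerConfiguration.featureGates already being initialized. Like "replace", it sets the whole list rather than appending to it. `build_fg_merge_patch` is a hypothetical helper, and gate names are assumed to need no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a JSON merge patch body that sets the whole
# featureGates list to the names passed as arguments.
# Assumption: gate names contain no characters that require JSON escaping.
build_fg_merge_patch() {
    local gates="" gate
    for gate in "$@"; do
        # Join the quoted names with "," (no separator before the first one).
        gates="${gates}${gates:+,}\"${gate}\""
    done
    printf '{"spec":{"configuration":{"developerConfiguration":{"featureGates":[%s]}}}}\n' "${gates}"
}
```

The call would then be `${kubectl} patch kubevirts -n kubevirt kubevirt --type=merge -p "$(build_fg_merge_patch CPUManager)"`. Because the list is overwritten, any gates already set in the CR would have to be passed again.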

Contributor

@dhiller dhiller Jan 20, 2025

@ormergi I am not convinced that this is the right place to patch kubevirt. IMHO it should go into https://github.com/kubevirt/kubevirt/tree/main/hack/cluster-deploy.sh . WDYT?

EDIT: My reasoning behind that is that kubevirt is not a component of kubevirtci.

Contributor

@brianmcarey FYI ^^

Contributor Author

@dhiller Changing cluster-deploy.sh affects the cluster-sync flow; I am not sure we should do that.
Please note that "check-up-kind-sriov" calls "kind/check-cluster-up.sh" directly, and "kind/check-cluster-up.sh" deploys kubevirt (from the nightly release YAMLs).

Contributor Author

@dhiller WDYT?

@ormergi
Contributor Author

ormergi commented Jan 19, 2025

Thank you for the PR @ormergi. Could you please give a few words about why the PR is needed?

Done

@ormergi
Contributor Author

ormergi commented Jan 19, 2025

/test check-up-kind-sriov

@orelmisan
Member

Thank you for the PR @ormergi. Could you please give a few words about why the PR is needed?

Done

Thank you.
Could you please add it to the commit message as well?

The "check-up-kind-sriov" lane was made optional and non-gating because it
constantly fails, due to the following test failures:
  SRIOV VMI connected to single SRIOV network
    should have cloud-init meta_data with tagged interface and
    aligned cpus to sriov interface numa node for VMIs with dedicatedCPUs
  SRIOV VMI connected to single SRIOV network
    [test_id:3959]should create a virtual machine with sriov interface and dedicatedCPUs

These tests started failing the lane after the programmatic skips were removed
from the kubevirt/kubevirt tests (kubevirt/kubevirt#13144).
Previously the tests were skipped silently (bad); now, following the removal
of the programmatic skips, they fail loudly.
The root cause of the failures (or the previous skips) is that the tests
depend on Kubevirt's CPUManager feature, which is not enabled at all.

Enable the Kubevirt CPUManager FG when the SR-IOV provider is tested.

Signed-off-by: Or Mergi <[email protected]>
@ormergi ormergi force-pushed the check-kind-up-cpumanager branch from 6a5991f to 56a5439 Compare January 20, 2025 00:27
@ormergi
Contributor Author

ormergi commented Jan 20, 2025

Thank you for the PR @ormergi. Could you please give a few words about why the PR is needed?

Done

Thank you. Could you please add it to the commit message as well?

Done

@ormergi ormergi changed the title kind,check-cluster-up: Enable Kubevirt CPUManager FG when env supports it kind, check-cluster-up: Enable Kubevirt CPUManager FG when SR-IOV provider is tested Jan 20, 2025
@kubevirt-bot
Contributor

@ormergi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
check-provision-k8s-1.31-s390x | 56a5439 | link | false | /test check-provision-k8s-1.31-s390x


Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/enhancement size/XS

5 participants