deviceSplitCount not honored #811

Open
cedricve opened this issue Jan 15, 2025 · 8 comments · May be fixed by #873
Labels
kind/bug Something isn't working

Comments

@cedricve

Context:

We are using NVIDIA GPUs and are trying to limit each GPU to a maximum of 2 pods. Reading through the documentation, we learned this should be possible with the deviceSplitCount parameter. After following the guide, changing the ConfigMaps, and restarting the containers, the parameter is not honored (even though it is correctly modified in the ConfigMap).

Issue:

We are able to schedule more than 2 pods per GPU. Conversely, when we try to allow more than 10 pods on a GPU, the limit also sticks to the default value of 10.

https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/templates/scheduler/device-configmap.yaml#L23C7-L23C23
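
For reference, a minimal sketch of the setting we are changing (assuming the linked template line is driven by the devicePlugin.deviceSplitCount Helm value; the key may differ in other chart versions):

# values override passed to the HAMi Helm chart (sketch, key name assumed)
devicePlugin:
  deviceSplitCount: 2   # intended: at most 2 pods per GPU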

@cedricve cedricve added the kind/bug Something isn't working label Jan 15, 2025
@Nimbus318
Contributor

Hi @cedricve,

Could you please provide the value of the hami.io/node-nvidia-register key in the node annotations after modifying the ConfigMap and restarting the hami-device-plugin? This will help us diagnose the issue.

Thanks!
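
If it helps, a command along these lines should print it (assuming <your-node-name> is the GPU node; the backslash escapes the dots in the annotation key):

kubectl get node <your-node-name> -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'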

@mrabiaa

mrabiaa commented Jan 23, 2025

Hi @Nimbus318

The value of the key is "hami.io/node-nvidia-register: 'GPU-3fe7a6f9-8bfb-3edd-daa9-610a511adf22,10,24564,100,NVIDIA-NVIDIA"

@Nimbus318
Contributor

The second field, 10, represents the maximum number of splits for this GPU. I would like to confirm the current issue: After setting deviceSplitCount in the ConfigMap to a value greater than 10 and restarting the hami-device-plugin, the second field in the hami.io/node-nvidia-register annotation remains 10. Is that correct?
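
For reference, each GPU in that annotation is encoded as one comma-separated record, roughly along these lines (our reading of the value format; treat the field names after the second one as approximate):

# <GPU UUID>,<split count>,<device memory MB>,<core share %>,<device type>,<numa>,<healthy>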

@cedricve
Author

That is correct. @mrabiaa, can you also confirm this?

@mrabiaa

mrabiaa commented Feb 15, 2025

Hello @Nimbus318, I just edited the ConfigMap and restarted the HAMi pods, but the issue remains the same:
$ kubectl get nodes -o jsonpath='{.items[*].metadata.annotations.hami\.io/node-nvidia-register}'
GPU-c1f6532b-9a21-0a33-3f52-b7a173c66ee8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true

$ kubectl get cm -n kube-system hami-device-plugin -o yaml
apiVersion: v1
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1.8,
                "devicesplitcount": 20,
                "migstrategy": "none",
                "filterdevices": {
                    "uuid": [],
                    "index": []
                }
            }
        ]
    }
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-11-21T02:00:45Z"
  labels:
    app.kubernetes.io/component: hami-device-plugin
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 2.4.1
    helm.sh/chart: hami-2.4.1
  name: hami-device-plugin
  namespace: kube-system
  resourceVersion: "34659059"
  uid: ea7385b9-8b1c-4b82-a3ce-dcf8328647f1

@Nimbus318
Contributor

@mrabiaa @cedricve Thank you for your investigation. I can confirm that the issue is related to the node name configuration in the ConfigMap: the name field in the nodeconfig section needs to exactly match the name of the target node (the node where you want devicesplitcount: 20 to apply).
To fix this, update your ConfigMap with the correct node name:

"nodeconfig": [
  {
    "name": "<your-actual-node-name>",  # Replace with your node name
    "devicememoryscaling": 1.8,
    "devicesplitcount": 20,
    "migstrategy": "none",
    "filterdevices": {
      "uuid": [],
      "index": []
    }
  }
]

After updating the ConfigMap with the correct node name, remember to restart the HAMi device plugin pods for the changes to take effect.
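
A possible sequence, assuming the device plugin runs as a DaemonSet named hami-device-plugin in the kube-system namespace (adjust the names to your release):

# restart the device plugin so it re-reads the ConfigMap (DaemonSet name assumed)
kubectl -n kube-system rollout restart daemonset hami-device-plugin

# then verify that the second field of the annotation reflects the new split count
kubectl get node <your-node-name> -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'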

@mrabiaa

mrabiaa commented Feb 16, 2025

@Nimbus318 Awesome, that solved it. Thank you!

@cedricve
Author

Great, thank you!
