deviceSplitCount not honored #811
Comments
Hi @cedricve, could you please provide the value of the hami.io/node-nvidia-register annotation on your node? Thanks!
Hi @Nimbus318, the value of the key is "hami.io/node-nvidia-register: 'GPU-3fe7a6f9-8bfb-3edd-daa9-610a511adf22,10,24564,100,NVIDIA-NVIDIA"
The second field, 10, represents the maximum number of splits for this GPU. I would like to confirm the current issue: after setting deviceSplitCount to 2 in the ConfigMap, the node annotation still reports 10 and more than 2 pods can be scheduled on the same GPU?
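For anyone checking the same thing, the registered value can be read straight from the node object. A minimal sketch, assuming kubectl access to the cluster; <node-name> is a placeholder:

# Print the registration annotation written by the device plugin;
# the second comma-separated field is the split count currently in effect.
kubectl get node <node-name> \
  -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'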
That is correct. @mrabiaa, can you also confirm this?
Hello @Nimbus318, I just edited the ConfigMap and restarted the HAMi pods, and the issue is still the same.
@mrabiaa @cedricve Thank you for your investigation. I can confirm that the issue is related to the node name configuration in the ConfigMap. The nodeconfig section must reference your actual node name, for example:

"nodeconfig": [
  {
    "name": "<your-actual-node-name>",   # Replace with your node name
    "devicememoryscaling": 1.8,
    "devicesplitcount": 20,
    "migstrategy": "none",
    "filterdevices": {
      "uuid": [],
      "index": []
    }
  }
]

After updating the ConfigMap with the correct node name, remember to restart the HAMi device plugin pods for the changes to take effect.
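As a rough sketch of the restart and verification steps (the kube-system namespace and the DaemonSet name hami-device-plugin are assumptions from a default Helm install and may differ in your cluster):

# Restart the device plugin so it re-reads the updated ConfigMap
kubectl -n kube-system rollout restart daemonset hami-device-plugin

# After the pods come back, confirm the node re-registered with the new split count
kubectl get node <your-actual-node-name> \
  -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'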
@Nimbus318 awesome, solved!
Great, thank you!
Context:
We are using NVIDIA GPUs and want to limit each GPU to a maximum of 2 pods. According to the documentation, this should be possible with the deviceSplitCount parameter. After following the guide, changing the ConfigMap, and restarting the containers, the parameter is not honoured (even though it is correctly modified in the ConfigMap).
Issue:
We are still able to schedule more than 2 pods on a GPU. On the other hand, if we try to schedule more than 10 pods on a GPU, scheduling is capped at 10, so the default value of 10 is still being applied.
https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/templates/scheduler/device-configmap.yaml#L23C7-L23C23
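For reference, a sketch of how the parameter can be changed in a running cluster. The ConfigMap name hami-scheduler-device and the kube-system namespace are assumptions based on the default Helm chart and may differ in other installs.

# Edit the device ConfigMap rendered from the template linked above
kubectl -n kube-system edit configmap hami-scheduler-device

# Set deviceSplitCount to 2 for the NVIDIA device, save, and then restart the HAMi pods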