Configurable multiple node pools [EPIC] #18709
Comments
Open questions
Impacts
|
Feedback from stakeholders: @ebensom :
@varbanv :
Currently supported worker parameter in |
Next steps / Action items:
|
In general, I see growing demand for GPUs, explicitly requested by different teams, some for AI, others for ML algorithms. In any case, the goal is always to have dedicated nodes to run specific tasks. It would also be reasonable to include m6g and m6in (or the currently available generation) for SAP for Me |
We will keep it simple on the KEB side. We will keep those (mandatory) parameters at the root for the system node pool (we will adjust the descriptions):
NOTE: this is always HA, with a minimum of 3 nodes. We need to decide how to name that worker node pool; we probably already use some name right now. In addition, there will be a new (optional) array of worker nodes for customer usage:
NOTE: for now, the same validation applies as for the system pool, which means HA is mandatory. For machineTypes we keep what we have for now and do not extend the list. The first reason is that we need to focus on running Kyma modules only in the system worker node pool. The second reason is that the existing KMC will keep working without any changes. Later, once this is released and we see that everything works, we can add new machine types, including GPU etc., which requires adjusting billing and so on. Cheers, PK |
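The validation rules described in this comment can be sketched roughly as follows. This is only an illustration, not the KEB implementation: the class and field names (`WorkerPool`, `autoscaler_min`, `autoscaler_max`), the function names, and the machine type names `type-a`/`type-b` are all hypothetical. Only the rules themselves (HA with at least 3 nodes, the same machine type list for every pool) come from the discussion above.

```python
from dataclasses import dataclass

HA_MIN_NODES = 3  # from the thread: HA means a minimum of 3 nodes


@dataclass
class WorkerPool:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    machine_type: str
    autoscaler_min: int
    autoscaler_max: int


def validate_pool(pool, allowed_machine_types):
    """Return a list of validation errors for a single worker pool."""
    errors = []
    if pool.machine_type not in allowed_machine_types:
        errors.append(f"{pool.name}: machine type {pool.machine_type!r} is not supported")
    if pool.autoscaler_min < HA_MIN_NODES:
        errors.append(f"{pool.name}: HA requires at least {HA_MIN_NODES} nodes")
    if pool.autoscaler_max < pool.autoscaler_min:
        errors.append(f"{pool.name}: autoscaler max must not be below min")
    return errors


def validate_request(system_pool, additional_pools, allowed_machine_types):
    """Apply the same rules to the system pool and to every additional pool."""
    errors = validate_pool(system_pool, allowed_machine_types)
    seen = {system_pool.name}
    for pool in additional_pools:
        errors += validate_pool(pool, allowed_machine_types)
        if pool.name in seen:
            errors.append(f"{pool.name}: duplicate worker pool name")
        seen.add(pool.name)
    return errors


if __name__ == "__main__":
    allowed = {"type-a", "type-b"}  # placeholder machine type names
    system = WorkerPool("system", "type-a", 3, 10)
    heavy = WorkerPool("heavy", "type-b", 1, 5)  # violates the HA minimum
    for problem in validate_request(system, [heavy], allowed):
        print(problem)
```

Keeping a single `validate_pool` for both the root parameters and the array entries mirrors the "same validation as for system ones" decision, so relaxing the HA rule for additional pools later would be a change in one place.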
Hello, we have some workloads that require a very large amount of memory in a single operation. The requirements can go up to 128 GB of RAM. Of course, we do not want to run all nodes of our cluster on 128 GB machines, because this would be very expensive. The operations themselves cannot be optimized with low effort (we are generating large export files for PowerPoint and PDF, and the third-party libraries we use for this do not support streamed or chunked exports; they need to hold everything in memory). So for us it would be important to have a system node pool with small machines (like 16 GB or 32 GB) and then an additional node pool for the heavy workloads (like 128 GB machines). It would be important for us to be able to scale the additional node pool down to zero, because we only need the expensive machines when there are heavy workloads. So the moment a user queues a heavy workload, we would spawn a pod on the additional node pool; the node pool should scale up, execute the workload (which typically takes some hours) and then scale down to zero once the workloads are done. We do not need new machine types, like GPU or ARM machines. I hope this is a state we can reach at some point. As I understand it, the current plan is to release additional node pools with HA, so they must have at least 3 nodes permanently, without the option to scale to zero? Kind regards, |
Hi @ChristophRothmeier, thanks for your request. The multiple worker pool feature is currently being implemented and will be rolled out by the end of this year. Initially, the list of supported machine types is not extended and includes the same machine types we offer when creating a new Kyma runtime via the BTP cockpit. Support for additional machine types is already agreed and will be added soon after the worker pool feature is productive. For go-live, we will also offer only worker pools with HA support (meaning 3 nodes are the minimum). Scaling to 0 nodes is not possible with an HA-supporting worker pool, but it can be achieved by dropping the worker pool and re-creating it afterwards. We are already in discussions to allow non-HA worker pools with < 3 nodes. Such pools would also allow scaling to 0 nodes. |
Hi Tobias, thanks for the response. Thanks |
This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. |
Hi @ChristophRothmeier, we are currently working on the worker-pool implementation. We are not sure if we can support scaling to 0 nodes, as it seems that Gardener expects that the |
Hi @tobiscr. For example, we are currently running 64 GB nodes in some locations, because some single heavy workloads need that much RAM. If additional worker pools could be scaled down to 1 node, we could downsize the main machines to 16 GB and add a worker pool with 1x 64 GB machine (scaling up dynamically as the workload grows). |
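To make the sizing argument in this comment concrete, here is a small back-of-the-envelope sketch. The main pool size of 3 nodes is an assumption (the HA minimum mentioned earlier in the thread; the actual node count is not stated), and `provisioned_memory_gb` is a hypothetical helper. It compares provisioned memory only, not price.

```python
def provisioned_memory_gb(pools):
    """Sum node_count * memory_per_node_gb over (node_count, memory_gb) pairs."""
    return sum(count * memory_gb for count, memory_gb in pools)


MAIN_NODES = 3  # assumption: the HA minimum from this thread, not a stated cluster size

# Today: every main node is sized for the heaviest single workload.
current = provisioned_memory_gb([(MAIN_NODES, 64)])

# Proposed: small main nodes plus one large node in an additional pool.
proposed = provisioned_memory_gb([(MAIN_NODES, 16), (1, 64)])

print(current, proposed)  # 192 112
```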
Description
Kyma clusters should support multiple machine types simultaneously, for example GPU and ARM nodes as well as network-, memory-, and CPU-optimized nodes.
Acceptance criteria:
metering is adjusted to different node types with a multiplying factor that reflects price differences (e.g. GPU has factor 2.0) -> descoped, as it is addressed in Enable consumption and configuration of specific hyperscaler resources [EPIC] #18195
Reasons
Our customers demand ARM and GPU nodes in Kyma clusters to run their workload on the architecture supporting their use cases. Examples:
Related issues