Configurable multiple node pools [EPIC] #18709

Open
3 of 5 tasks
pbochynski opened this issue Jun 12, 2024 · 11 comments
Labels
area/control-plane Related to all activities around Kyma Control Plane Epic

Comments

@pbochynski
Contributor

pbochynski commented Jun 12, 2024

Description
Kyma clusters should support multiple machine types simultaneously, for example GPU and ARM nodes, or network-, memory-, and CPU-optimized nodes.

Acceptance criteria:

Reasons
Our customers demand ARM and GPU nodes in Kyma clusters so they can run their workloads on the architecture that supports their use cases. Examples:

Related issues

@pbochynski pbochynski added Epic 2023-Q4 Planned for Q4 2023 2024-Q4 and removed 2023-Q4 Planned for Q4 2023 labels Jun 12, 2024
@tobiscr tobiscr self-assigned this Jul 1, 2024
@tobiscr tobiscr changed the title Configurable multiple node pools Configurable multiple node pools [EPIC] Jul 3, 2024
@tobiscr
Contributor

tobiscr commented Jul 8, 2024

Open questions

  • What are the customer needs for a second worker pool (cost saving?)
  • Are we allowing 0 worker pools?
  • What machine sizes are we going to support? Do we limit the supported types for additional worker pools?
  • Do we always have a mandatory system worker pool which is used for Kyma workloads, to ensure we have at least one compatible worker pool configured?
  • What config parameters are we exposing (arch, min-max pool size, container runtime etc.)?
  • How do we deal with a customer selecting ARM for a worker pool? We have to ensure our workloads won't be installed on this architecture (affinity for the pool required?).
  • Does Gardener have limitations for multiple worker pools (e.g. because it runs its own workloads within K8s, such as cert-manager)?

Impacts

  • We have to ensure our modules are compatible with all supported worker pools (e.g. Istio could be mandatory for particular workloads even on a second worker pool)
  • Exposing more configurable parameters increases testing efforts on our side!
  • Pod-Affinity is required to ensure Kyma workloads are scheduled on the "system worker pool" by default
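
A minimal sketch of how such a pinning could look, assuming it is expressed as a Kubernetes node affinity on the pod spec. The label key `worker.gardener.cloud/pool` and the pool name used in the usage note are assumptions for illustration and should be verified against the actual cluster; `pin_to_pool` is a hypothetical helper, not part of any Kyma component.

```python
import copy

# Assumption: every node carries a label naming its worker pool. Gardener
# commonly sets "worker.gardener.cloud/pool", but verify this on your cluster.
POOL_LABEL = "worker.gardener.cloud/pool"


def pin_to_pool(pod_spec, pool_name):
    """Return a copy of pod_spec that may only be scheduled on pool_name.

    Note: any affinity already present in pod_spec is replaced.
    """
    term = {
        "matchExpressions": [
            {"key": POOL_LABEL, "operator": "In", "values": [pool_name]}
        ]
    }
    pinned = copy.deepcopy(pod_spec)
    pinned["affinity"] = {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [term]
            }
        }
    }
    return pinned
```

For example, `pin_to_pool({"containers": [...]}, "system-pool")` (pool name hypothetical) yields a pod spec that the scheduler will only place on nodes labelled with that pool.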

@tobiscr
Contributor

tobiscr commented Jul 9, 2024

Feedback from stakeholders:

@ebensom :

  • From the operational side, we should not expose all worker-pool configurations of the Shoot spec (e.g. each worker can have different Linux image versions, which can have security implications).
  • Patching worker pools needs an enhancement of the current logic to support multiple worker pools. The upgrade has to happen pool-by-pool.
  • In the midterm, we should also consider supporting gVisor for our default worker pool (used by Kyma workloads).

@varbanv :

  • Customers want to be able to configure different worker pools with different configurations (e.g. arch, sizes, GPU etc.). Reasons include dealing with temporary load peaks and saving costs.
    Conclusion:
    1. expose everything(?) in regards to machine types (not needed from day 1 - we can add machine types when customers request them)
    2. deal properly with billing (costs differ widely between machine types and change frequently)
  • Workloads have to run on a particular worker pool (e.g. for cost saving purposes)
    Conclusion:
    1. worker pool affinity required
    2. Kyma always runs in its own worker pool (configurable by customers only to a limited extent) to separate Kyma workloads from customer workloads. But it's not a dedicated worker pool for Kyma - customers are still allowed to schedule workloads in this worker pool.
    3. customers can add additional worker pools which can be (fully) configured by them
    4. we accept the risk that a machine type is not guaranteed to be available in each region (depends on the hyperscaler)
    5. it's acceptable to make the worker pool configurable outside of BTP cockpit (e.g. via kubectl calls - technical feasibility has to be clarified)
    6. we have to deal properly with issues reported by Gardener and expect failure cases (e.g. machine type not supported in a particular region) which have to be reported to customers
    7. We start with a predefined list of machine types and extend it when a need becomes visible
    8. We have to make sure Kyma properly supports the offered worker pool configurations, such as ARM architecture (e.g. having daemonsets installed on worker pools, e.g. Istio etc.)
      • Gardener workloads also have to support these worker pool configurations
      • Assumption: Gardener supports everything they offer in the Shoot spec.
    9. Special configuration options for worker pools are:
      • The region is the same for all zones (cluster workers are not allowed to run in different regions), and the CIDR configuration has to reside in the same network
      • It has to be possible to configure the AZs for additional worker pools (e.g. having just 1 AZ for a worker pool)
      • Configuration has to be adjustable, e.g. the number of nodes can be set to 0.
      • It has to be clarified how to add support for NVIDIA GPUs (drivers are missing by default)

Currently supported worker parameters in the Runtime CR: https://github.com/kyma-project/infrastructure-manager/blob/main/config/samples/infrastructuremanager_v1_runtime.yaml#L56

@tobiscr
Contributor

tobiscr commented Jul 9, 2024

Next steps / Action items:

  1. @PK85 + @ebensom : decide on the configuration options we are exposing to customers and track them in this issue
  2. @PK85 : Inform @a-thaler about the results
  3. @zhoujing2022 has to be informed about adjusting the testing strategy to cover the new architectures (required at least for modules which need a daemonset). TBC whether we run them only on the Kyma dedicated worker pool (via affinity) or whether the daemonset has to be compatible with the new architectures

@marco-porru

In general, I see a bigger demand for GPUs, explicitly requested by different teams, some for AI, others for ML algorithms. In any case, the goal is always to have dedicated nodes to run specific tasks.

It would also be reasonable to include m6g and m6in (or the currently available generation) for SAP for Me.
One note on g5 and r7i: these are required for SAP Intelligent Product Recommendation.

@PK85
Contributor

PK85 commented Jul 25, 2024

@tobiscr

  1. @PK85 + @ebensom : decide on the configuration options we are exposing to customers and track them in this issue

We will keep it simple on the KEB side. We will keep these (mandatory) parameters at the root level for the system node pool (we will adjust the descriptions):

"autoScalerMax": ...
"autoScalerMin": ..
"machineType": ..

NOTE: this pool is always HA with a minimum of 3 nodes. We need to decide how to name that worker node pool; we probably already use some name right now.

and a new (optional) array of worker node pools for customer usage:

additionalWorkerNodePools [ 
{
"name": ?
"autoScalerMax": ...
"autoScalerMin": ..
"machineType": ..
}
]

NOTE: for now, the same validation applies as for the system pool, which means HA is mandatory.
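
The validation described in these notes could be sketched roughly as follows. This is an illustration under stated assumptions, not KEB code: the parameter names come from the proposal above, while `validate_pool`, `validate_parameters`, the allowed machine-type set, and the rules that a pool name is required and unique are hypothetical.

```python
HA_MIN_NODES = 3  # from the notes above: HA is mandatory, minimum 3 nodes


def validate_pool(pool, allowed_machine_types, label):
    """Return a list of validation errors for one pool (empty if valid)."""
    errors = []
    lo = pool.get("autoScalerMin")
    hi = pool.get("autoScalerMax")
    if not isinstance(lo, int) or not isinstance(hi, int):
        errors.append(f"{label}: autoScalerMin and autoScalerMax must be integers")
    else:
        if lo < HA_MIN_NODES:
            errors.append(f"{label}: autoScalerMin must be >= {HA_MIN_NODES}")
        if hi < lo:
            errors.append(f"{label}: autoScalerMax must be >= autoScalerMin")
    if pool.get("machineType") not in allowed_machine_types:
        errors.append(f"{label}: unsupported machineType")
    return errors


def validate_parameters(params, allowed_machine_types):
    """Validate the root (system pool) parameters and every additional pool."""
    errors = validate_pool(params, allowed_machine_types, "system pool")
    seen = set()
    for index, pool in enumerate(params.get("additionalWorkerNodePools", [])):
        name = pool.get("name")
        label = f"pool {name!r}" if name else f"pool #{index}"
        # Assumption: names are required and unique; the thread leaves this open.
        if not name:
            errors.append(f"{label}: name is required")
        elif name in seen:
            errors.append(f"{label}: duplicate name")
        seen.add(name)
        errors.extend(validate_pool(pool, allowed_machine_types, label))
    return errors
```

Applying the same `validate_pool` check to the root parameters and to each array entry mirrors the statement that additional pools get the same validation as the system pool; a later non-HA mode would only need a different minimum per pool.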

Regarding machineTypes, we keep what we have for now and do not extend the list. The first reason is that we need to focus on running Kyma modules only in the system worker node pool. The second reason is that the existing KMC will work without any changes.

Later, once we have released this and see that everything works, we can add new machine types including GPU etc., which requires adjusting billing etc.

Cheers, PK

@ChristophRothmeier

Hello,
my name is Christoph, I am a project manager for Ingentis, and we are using Kyma running on SAP BTP (currently running 10 clusters in 4 different landscapes).
We are also looking forward to having different node pools in Kyma, with the following use case:

We have some workloads that require a very high amount of memory in a single operation. The requirements can go up to 128 GB of RAM. Of course we do not want to run all nodes of our cluster on 128 GB machines, because this would be very expensive. The operations themselves cannot be optimized with low effort (we are generating large export files for PowerPoint and PDF, and the third-party libraries we use for this do not support streamed or chunked exports; they require holding everything in memory).

So for us it would be important to have a system node pool with small machines (like 16 GB or 32 GB) and then an additional node pool for the heavy workloads (like 128 GB machines). It would be important for us to be able to scale the additional node pool down to zero, because we only need the expensive machines when there are heavy workloads. So the moment a user queues a heavy workload, we would spawn a pod on the additional node pool; the node pool should scale up, execute the workload (which typically takes some hours), and then scale down to zero after the workloads are done.

We do not require new machine types, like GPU or ARM machines.

I hope this is a state we can reach at some point. As I understand, it's currently planned to release additional node pools with HA, so they have to have at least 3 nodes permanently, without the option to scale to zero?

Kind regards,
Christoph

@tobiscr
Contributor

tobiscr commented Sep 23, 2024

Hi @ChristophRothmeier, thanks for your request.

The multiple worker pool feature is currently being implemented and will be rolled out by the end of this year. Initially, the list of supported machine types is not extended and includes the same machine types we offer when creating a new Kyma runtime via BTP cockpit. Support for additional machine types is already agreed and will be added soon after the worker pool feature is productive.

For go-live, we will also offer only worker pools with HA support (meaning 3 nodes are the minimum). Scaling to 0 nodes is not possible with an HA-supporting worker pool, but it can be achieved by dropping the worker pool and re-creating it afterwards.

We are already in discussions to allow non-HA worker pools with < 3 nodes. Such pools would also allow scaling to 0 nodes.
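
As an illustration of the drop-and-re-create workaround, here is a rough sketch that only manipulates an in-memory parameter document. It assumes the `additionalWorkerNodePools` array proposed earlier in this thread; `drop_pool` and `restore_pool` are hypothetical helpers, and how the updated parameters are actually submitted is not shown.

```python
import copy


def drop_pool(params, name):
    """Return (new_params, removed_pool) with the named pool removed."""
    pools = params.get("additionalWorkerNodePools", [])
    removed = next((p for p in pools if p.get("name") == name), None)
    if removed is None:
        raise KeyError(f"no additional worker pool named {name!r}")
    new_params = copy.deepcopy(params)
    new_params["additionalWorkerNodePools"] = [
        p for p in new_params["additionalWorkerNodePools"] if p.get("name") != name
    ]
    return new_params, copy.deepcopy(removed)


def restore_pool(params, pool):
    """Return new params with a previously removed pool appended again."""
    new_params = copy.deepcopy(params)
    new_params.setdefault("additionalWorkerNodePools", []).append(copy.deepcopy(pool))
    return new_params
```

Keeping the removed pool definition around means it can be re-added unchanged once the heavy workload needs the nodes again.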

@ChristophRothmeier

Hi Tobias,

thanks for the response.
For us it would be huge to have the ability to scale additional non-HA worker pools down to zero. Could you post an update in this issue as soon as your discussions on this topic have progressed and it is clear if and when it will be implemented?

Thanks
Christoph


This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2024
@a-thaler a-thaler removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2024
@tobiscr tobiscr added area/control-plane Related to all activities around Kyma Control Plane and removed 2024-Q4 labels Dec 2, 2024
@tobiscr
Contributor

tobiscr commented Jan 15, 2025

Hi @ChristophRothmeier, we are currently working on the worker-pool implementation. We are not sure if we can support scaling to 0 nodes, as it seems that Gardener expects the maxNodes parameter to always be >= 1. But we are still validating the available options to support temporary shutdowns of worker pools.

@ChristophRothmeier

Hi @tobiscr
Being able to scale additional worker pools down to a minimum of 1 node would already be a huge benefit compared to being forced into HA with a minimum of 3 nodes.

For example, we are currently running 64 GB nodes in some locations, because some single heavy workloads need that much RAM. If additional worker pools could be scaled down to 1 node, we could downsize the main machines to 16 GB and add a worker pool with 1 x 64 GB machine (scaling up dynamically under more load).
So we could reduce from 192 GB in total down to 112 GB in total with 3 x 16 GB and 1 x 64 GB. That would already save us a lot of resources, and would be a huge benefit for us.
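
The arithmetic behind these totals can be checked with a short calculation. The node counts and sizes are taken from the comment above; `total_memory_gb` is just an illustrative helper.

```python
def total_memory_gb(pools):
    """Sum node_count * memory_per_node_gb over (count, gb) pairs."""
    return sum(count * gb for count, gb in pools)


current = total_memory_gb([(3, 64)])            # 3 x 64 GB HA pool
proposed = total_memory_gb([(3, 16), (1, 64)])  # 3 x 16 GB plus 1 x 64 GB
saving = current - proposed

print(current, proposed, saving)  # 192 112 80
```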


6 participants