Multiple worker groups [KIM/feature] #46

Open
3 tasks done
pbochynski opened this issue Sep 21, 2023 · 13 comments
Labels
area/control-plane (Related to all activities around Kyma Control Plane)
bv/functional-suitability (Business Value: Functional Suitability, see ISO 25010)
kind/feature (Categorizes issue or PR as related to a new feature)

Comments

@pbochynski
Contributor

pbochynski commented Sep 21, 2023

Description
Enable the creation of multiple worker groups with different machine types, volume types, node labels, annotations, and taints.

See Gardener specs:

Current example shoot from Provisioner:

workers:
  - cri:
      name: containerd
    name: cpu-worker-0
    machine:
      type: m5.xlarge
      image:
        name: gardenlinux
        version: 1.2.3
      architecture: amd64
    maximum: 1
    minimum: 1
    maxSurge: 1
    maxUnavailable: 0
    volume:
      type: gp2
      size: 50Gi
    zones:
      - eu-central-1a
    systemComponents:
      allow: true
workersSettings:
  sshAccess:
    enabled: true
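
For the node labels, annotations, and taints mentioned in the description, Gardener's worker spec carries per-pool fields for these. The entry below is only a sketch: the pool name, machine type, and all label/annotation/taint values are made up for illustration.

workers:
  - name: gpu-worker-0           # illustrative pool name
    machine:
      type: g4dn.xlarge          # illustrative machine type
    minimum: 1
    maximum: 3
    labels:                      # propagated to the nodes of this pool
      workload-class: gpu
    annotations:
      team: data-science
    taints:                      # keep general workloads off these nodes
      - key: dedicated
        value: gpu
        effect: NoSchedule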

AC:

  • The value of the field workers in the RuntimeCR represents the Kyma worker pool. It is always positioned as the FIRST element in the workers array of the Shoot-Spec.
  • The values of the field additionalWorkers in the RuntimeCR represent the customer worker pool(s). They are always appended after the workers entry in the Shoot-Spec (see the sketch below).
  • Before the worker pool feature is rolled out, we have to ensure that each cluster in KCP DEV + STAGE + PROD includes a worker pool with the name cpu-worker-1.
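
The resulting ordering in the Shoot-Spec would then look roughly like this (a sketch only; the customer pool names and machine types are placeholders):

spec:
  provider:
    workers:
    - name: cpu-worker-0        # Kyma worker pool, taken from the RuntimeCR field workers, always first
      machine:
        type: m5.xlarge         # placeholder
    - name: customer-pool-a     # customer pools from the RuntimeCR field additionalWorkers, appended in order
      machine:
        type: m5.2xlarge        # placeholder
    - name: customer-pool-b
      machine:
        type: c5.xlarge         # placeholder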

Reasons
One size doesn't fit all. Many applications require specific nodes for particular services.

Relates to

@pbochynski pbochynski added kind/feature Categorizes issue or PR as related to a new feature. area/control-plane Related to all activities around Kyma Control Plane Epic labels Sep 21, 2023
@kyma-bot
Contributor

This issue or PR has been automatically marked as stale due to the lack of recent activity.
Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 7d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

@kyma-bot kyma-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2023
@kyma-bot
Contributor

This issue or PR has been automatically closed due to the lack of activity.
Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 7d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle stale

If you think that I work incorrectly, kindly raise an issue with the problem.

/close

@kyma-bot
Contributor

@kyma-bot: Closing this issue.

In response to this:

This issue or PR has been automatically closed due to the lack of activity.
Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 7d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle stale

If you think that I work incorrectly, kindly raise an issue with the problem.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tobiscr tobiscr removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2023
@tobiscr tobiscr reopened this Dec 6, 2023
@tobiscr
Contributor

tobiscr commented Dec 6, 2023

@pbochynski : QQ - is this feature still relevant? If yes, I will start the alignment with the KEB guys, as it also needs their involvement.

@tobiscr tobiscr added bv/functional-suitability Business Value: Functional Suitability (see ISO 25010) and removed Epic labels Dec 18, 2023
@pbochynski
Contributor Author

The issue is part of a bigger Epic: kyma-project/kyma#18195

@tobiscr
Contributor

tobiscr commented Mar 21, 2024

We agreed with @varbanv and @PK85 to start with a minimal worker pool configuration, probably similar to the parameters we are currently providing to Google.

@tobiscr tobiscr changed the title from "Multiple worker groups" to "[KIM/feature] Multiple worker groups" on Jun 26, 2024
@tobiscr tobiscr changed the title from "[KIM/feature] Multiple worker groups" to "Multiple worker groups [KIM/feature]" on Jun 26, 2024
@tobiscr
Contributor

tobiscr commented Jul 14, 2024

JFYI:

It's important to set

    systemComponents:
      allow: true

to ensure the pool nodes get a label which indicates the related worker pool. This is important for later scheduling rules (via affinity configurations etc.).

It also worked without systemComponents, see #364.
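
With that pool label in place (the key worker.gardener.cloud/pool is the one quoted in the meeting notes further below), workloads can then be pinned to a specific pool, for example via a nodeSelector. The deployment below is only a sketch; its name, app label, and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pool-pinned-workload        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pool-pinned-workload
  template:
    metadata:
      labels:
        app: pool-pinned-workload
    spec:
      nodeSelector:
        worker.gardener.cloud/pool: cpu-worker-0   # schedule only onto nodes of this worker pool
      containers:
      - name: main
        image: alpine:3.20            # placeholder image
        command: ["sleep", "infinity"]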

@tobiscr
Contributor

tobiscr commented Sep 23, 2024

To cover billing requirements, we have to extend the contract of the RuntimeCR with 2 new fields: kyma-project/kyma-metrics-collector#89

@tobiscr
Contributor

tobiscr commented Sep 23, 2024

Today we aligned the next steps for rolling out multiple worker pools, and it makes sense to distinguish in the RuntimeCR between the Kyma worker pool (primarily used by Kyma workloads) and customer worker pools.

This differentiation should also be reflected in the RuntimeCR by using a dedicated field for the Kyma worker pool and a separate field (an array) for the customer worker pools.

Proposal:

  • The workers field is used for describing the Kyma worker pool.
  • additionalWorkers (new field) is an array describing the worker pools created by the customer (see the sketch below).
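
A minimal RuntimeCR fragment following this proposal could look as follows (a sketch only; the exact nesting inside the CR is omitted, and all names and sizes are placeholders):

workers:                         # Kyma worker pool
- name: cpu-worker-0
  machine:
    type: m5.xlarge              # placeholder
  minimum: 3
  maximum: 20
additionalWorkers:               # customer worker pools (new field, array)
- name: customer-pool-1          # placeholder
  machine:
    type: m5.2xlarge             # placeholder
  minimum: 1
  maximum: 10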

@tobiscr
Contributor

tobiscr commented Sep 27, 2024

Requires #396

@tobiscr
Contributor

tobiscr commented Dec 17, 2024

From the KIM side, we would expect the following structure from KEB:

...
      workers:
      - cri:
          name: containerd
        machine:
          architecture: amd64
          image:
            name: gardenlinux
            version: 1443.15.0
          type: m5.large
        maxSurge: 3
        maxUnavailable: 0
        maximum: 20
        minimum: 3
        name: cpu-worker-0
        providerConfig:
          apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
          instanceMetadataOptions:
            httpPutResponseHopLimit: 2
            httpTokens: required
          kind: WorkerConfig
        systemComponents:
          allow: true
        volume:
          size: 50Gi
          type: gp2
        zones:
        - eu-central-1b
        - eu-central-1a
        - eu-central-1c

      additionalWorkers:   #<< NEW FIELD containing the customer worker pools as array !
      - cri:
          name: containerd
        machine:
          architecture: amd64
          image:
            name: gardenlinux
            version: 1443.15.0
          type: m5.large
        maxSurge: 3
        maxUnavailable: 0
        maximum: 20
        minimum: 3
        name: fancy-workerpool-1
        providerConfig:
          apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
          instanceMetadataOptions:
            httpPutResponseHopLimit: 2
            httpTokens: required
          kind: WorkerConfig
        systemComponents:
          allow: true
        volume:
          size: 50Gi
          type: gp2
        zones:
        - eu-central-1b
        - eu-central-1a
        - eu-central-1c
      - cri:
          name: containerd
        machine:
          architecture: amd64
          image:
            name: gardenlinux
            version: 1443.15.0
          type: m5.large
        maxSurge: 3
        maxUnavailable: 0
        maximum: 20
        minimum: 3
        name: fancy-workerpool-2
        providerConfig:
          apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
          instanceMetadataOptions:
            httpPutResponseHopLimit: 2
            httpTokens: required
          kind: WorkerConfig
        systemComponents:
          allow: true
        volume:
          size: 50Gi
          type: gp2
        zones:
        - eu-central-1b
        - eu-central-1a
        - eu-central-1c

@koala7659
Contributor

koala7659 commented Jan 21, 2025

Proposed the following logic for provider config create/update with multiple workers:

On shoot create:

  • There must be a single main worker and an optional set of additional workers in the additionalWorkers collection
  • Additional workers can have a different set of networking zones specified than the main worker
  • infrastructureConfig and controlPlaneConfig are created to include ALL zones used in ALL provided workers
  • If infrastructureConfig and controlPlaneConfig are provided in the RuntimeCR, they will be used to create the Shoot
    • If the zones in such a predefined infrastructureConfig cannot be matched with the worker zones, the operation will fail

On shoot update:

  • There must be a single main worker and an optional set of additional workers
  • Users can add new workers to, or remove existing workers from, the additionalWorkers collection. This operation will not affect the existing infrastructureConfig and controlPlaneConfig
  • The infrastructureConfig and controlPlaneConfig Shoot data are immutable
  • For Azure and AWS Shoots, every zone used by each worker must be included in the set of zones used in the infrastructureConfig (see the sketch below)
  • As networking zones for updated workers cannot be changed, they will be overridden on the fly with the zones from Gardener, based on the worker name
  • If infrastructureConfig and controlPlaneConfig are provided in the RuntimeCR, they will be used to create the Shoot
    • If the zones in such a predefined infrastructureConfig cannot be matched with the worker zones, the operation will fail
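
To illustrate the zone handling on AWS, an infrastructureConfig covering all zones used by all worker pools could look like this (a sketch only; the CIDR values are purely illustrative):

infrastructureConfig:
  apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
  kind: InfrastructureConfig
  networks:
    vpc:
      cidr: 10.250.0.0/16
    zones:                       # must contain every zone referenced by any worker pool
    - name: eu-central-1a
      workers: 10.250.0.0/19
      public: 10.250.32.0/20
      internal: 10.250.48.0/20
    - name: eu-central-1b
      workers: 10.250.64.0/19
      public: 10.250.96.0/20
      internal: 10.250.112.0/20
    - name: eu-central-1c
      workers: 10.250.128.0/19
      public: 10.250.160.0/20
      internal: 10.250.176.0/20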

@tobiscr
Contributor

tobiscr commented Jan 24, 2025

Final agreement with @kyma-project/gopher regarding worker pool sizing (meeting minutes from 2025-01-23):

  • Decision: We support HA for worker pools ONLY if the Kyma worker pool has HA support enabled. Plans which do not support HA won't be able to create additional worker pools with HA (only non-HA ones).
  • Information: In the future, KIM could create an infrastructureConfig entry for all existing zones by default. This would make it simpler to create additional worker pools, and KEB won't have to provide this data.
  • Decision: KEB will provide the list of zones (incl. names) to KIM via the RuntimeCR. KEB will also provide the list of zones per worker. KEB will never change the list of initially provided zones (-> KIM will never change the infrastructureConfig, which stores the zone CIDRs etc.). The list of zones assigned to a worker pool will never be reduced by KEB (switching from HA to non-HA is not possible).
  • Information: Nodes include a label with the worker-pool name: worker.gardener.cloud/pool: "cpu-worker-0"
  • Information: Immutable worker-pool names are not supported by KEB. But we accept that a change of the name will re-create a worker pool on the customer side. This will be described in the description of the field + documentation.

The current implementation does not allow modifying worker-pool zones (these decide whether a pool is running in HA or non-HA mode). This change will be introduced after the first testing and stabilisation cycle is completed on KCP DEV.

@tobiscr tobiscr added this to the KIM 1.22.0 milestone Jan 24, 2025