-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fargate tests are flakey #1041
Labels
Comments
rquitales
added
needs-triage
Needs attention from the triage team
kind/engineering
Work that is not visible to an external user
and removed
needs-triage
Needs attention from the triage team
labels
Feb 22, 2024
Closed
@rquitales I think we also need to disable Fargate tests until this is resolved, otherwise we'll constantly get new P1s like this #1043. |
thomas11
added a commit
that referenced
this issue
Feb 22, 2024
flostadler
added a commit
that referenced
this issue
Oct 16, 2024
To ease the impact of the breaking API changes caused by generating the node SDK, we decided to add additional scalar inputs that simplify UX across all SDKs (for more details [see internal doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)). This change adds the scalar properties mentioned in the doc and adds acceptance tests for them. While adding the acceptance tests I noticed that running pods on Fargate doesn't work deterministically. In some cases the cluster fails to get healthy (coredns stuck in pending). This was caused by a race-condition between coredns starting and the fargate profile being created. If the fargate profile deployed after coredns, the pods got stuck in pending because they got assigned to the `default-scheduler` instead of the `fargate-scheduler`. The fix is relatively easy; making coredns depend on the fargate profile. I'll separately update the migration guide. ### New properties | Existing Resource | | New Top Level Property | Description | | :---- | :---- | :---- | :---- | | `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `clusterSecurityGroupId: Output<string>` | Only really useful property of a security group. Used to add additional ingress/egress rules. Default to `the EKS created security group id` | | `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `nodeSecurityGroupId: Output<string>` | | | `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \| undefined>` | | `clusterIngressRuleId: Output<string>` | Only really useful property of a rule. Default to `””` | | `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | | `defaultNodeGroupAsgName: Output<string>` | The only useful property of the default node group is the auto scaling group. Exposing its name allows users to reference it in IAM roles, tags, etc. Default to `””` | | `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>` | `fargateProfileId: Output<string>` | The id of the fargate profile. Can be used to reference it. Default to `””` | | | | `fargateProfileStatus: Output<string>` | The status of the fargate profile. Default to `””` | | | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` | `oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` & `oidcIssuer: Output<string` | Arn and Url are properties needed to set up IAM identities for pods (required for the assume role policy of the IAM role). Users currently need to trim the `https://` part of the url to actually use it. We should expose `oidcProvider` with that already done to ease usage. | Fixes #1041
flostadler
added a commit
that referenced
this issue
Oct 17, 2024
To ease the impact of the breaking API changes caused by generating the node SDK, we decided to add additional scalar inputs that simplify UX across all SDKs (for more details [see internal doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)). This change adds the scalar properties mentioned in the doc and adds acceptance tests for them. While adding the acceptance tests I noticed that running pods on Fargate doesn't work deterministically. In some cases the cluster fails to get healthy (coredns stuck in pending). This was caused by a race-condition between coredns starting and the fargate profile being created. If the fargate profile deployed after coredns, the pods got stuck in pending because they got assigned to the `default-scheduler` instead of the `fargate-scheduler`. The fix is relatively easy; making coredns depend on the fargate profile. I'll separately update the migration guide. ### New properties | Existing Resource | | New Top Level Property | Description | | :---- | :---- | :---- | :---- | | `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `clusterSecurityGroupId: Output<string>` | Only really useful property of a security group. Used to add additional ingress/egress rules. Default to `the EKS created security group id` | | `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `nodeSecurityGroupId: Output<string>` | | | `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \| undefined>` | | `clusterIngressRuleId: Output<string>` | Only really useful property of a rule. Default to `””` | | `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | | `defaultNodeGroupAsgName: Output<string>` | The only useful property of the default node group is the auto scaling group. Exposing its name allows users to reference it in IAM roles, tags, etc. Default to `””` | | `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>` | `fargateProfileId: Output<string>` | The id of the fargate profile. Can be used to reference it. Default to `””` | | | | `fargateProfileStatus: Output<string>` | The status of the fargate profile. Default to `””` | | | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` | `oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` & `oidcIssuer: Output<string` | Arn and Url are properties needed to set up IAM identities for pods (required for the assume role policy of the IAM role). Users currently need to trim the `https://` part of the url to actually use it. We should expose `oidcProvider` with that already done to ease usage. | Fixes #1041
This issue has been addressed in PR #1445 and shipped in release v3.0.0-beta.2. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
What happened?
Surfaced in #1039. Our daily cron job fails about 33% of the time due to our Fargate tests.
The error from the test is:
It is unclear why only 2 of the 3 pods can become live and requires further debugging.
Example
#1039
Output of
pulumi about
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).
The text was updated successfully, but these errors were encountered: