Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fargate tests are flakey #1041

Closed
rquitales opened this issue Feb 22, 2024 · 2 comments
Closed

Fargate tests are flakey #1041

rquitales opened this issue Feb 22, 2024 · 2 comments
Assignees
Labels
kind/engineering Work that is not visible to an external user resolution/fixed This issue was fixed

Comments

@rquitales
Copy link
Member

What happened?

Surfaced in #1039. Our daily cron job fails about 33% of the time due to our Fargate tests.

The error from the test is:

 utils.go:290: 
          	Error Trace:	/home/runner/work/pulumi-eks/pulumi-eks/examples/utils/utils.go:290
          	            				/home/runner/work/pulumi-eks/pulumi-eks/examples/utils/utils.go:225
          	            				/home/runner/work/pulumi-eks/pulumi-eks/examples/utils/utils.go:84
          	            				/home/runner/work/pulumi-eks/pulumi-eks/examples/utils/utils.go:90
          	            				/opt/hostedtoolcache/go/1.21.6/x64/src/runtime/asm_amd64.s:1650
          	Error:      	Not equal: 
          	            	expected: 3
          	            	actual  : 2
          	Test:       	TestAccFargate
          	Messages:   	2 out of 3 Pods are ready

It is unclear why only 2 of the 3 pods can become live and requires further debugging.

Example

#1039

Output of pulumi about

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@rquitales rquitales added needs-triage Needs attention from the triage team kind/engineering Work that is not visible to an external user and removed needs-triage Needs attention from the triage team labels Feb 22, 2024
@thomas11
Copy link
Contributor

@rquitales I think we also need to disable Fargate tests until this is resolved, otherwise we'll constantly get new P1s like this #1043.

thomas11 added a commit that referenced this issue Feb 23, 2024
Disable Fargate tests until #1041 is resolved, see that issue for context.

Fixes #1043
flostadler pushed a commit that referenced this issue Sep 4, 2024
Disable Fargate tests until #1041 is resolved, see that issue for context.

Fixes #1043
flostadler added a commit that referenced this issue Oct 16, 2024
To ease the impact of the breaking API changes caused by generating the
node SDK, we decided to add additional scalar inputs that simplify UX
across all SDKs (for more details [see internal
doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)).

This change adds the scalar properties mentioned in the doc and adds
acceptance tests for them.
While adding the acceptance tests I noticed that running pods on Fargate
doesn't work deterministically. In some cases the cluster fails to get
healthy (coredns stuck in pending).
This was caused by a race-condition between coredns starting and the
fargate profile being created. If the fargate profile deployed after
coredns, the pods got stuck in pending because they got assigned to the
`default-scheduler` instead of the `fargate-scheduler`.
The fix is relatively easy; making coredns depend on the fargate
profile.

I'll separately update the migration guide.

### New properties

| Existing Resource |  | New Top Level Property | Description |
| :---- | :---- | :---- | :---- |
| `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | |
`clusterSecurityGroupId: Output<string>` | Only really useful property
of a security group. Used to add additional ingress/egress rules.
Default to `the EKS created security group id` |
| `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | |
`nodeSecurityGroupId: Output<string>` | |
| `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \|
undefined>` | | `clusterIngressRuleId: Output<string>` | Only really
useful property of a rule. Default to `””` |
| `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | |
`defaultNodeGroupAsgName: Output<string>` | The only useful property of
the default node group is the auto scaling group. Exposing its name
allows users to reference it in IAM roles, tags, etc. Default to `””` |
| `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>`
| `fargateProfileId: Output<string>` | The id of the fargate profile.
Can be used to reference it. Default to `””` |
| | | `fargateProfileStatus: Output<string>` | The status of the fargate
profile. Default to `””` |
| | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` |
`oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` &
`oidcIssuer: Output<string` | Arn and Url are properties needed to set
up IAM identities for pods (required for the assume role policy of the
IAM role). Users currently need to trim the `https://` part of the url
to actually use it. We should expose `oidcProvider` with that already
done to ease usage. |


Fixes #1041
flostadler added a commit that referenced this issue Oct 17, 2024
To ease the impact of the breaking API changes caused by generating the
node SDK, we decided to add additional scalar inputs that simplify UX
across all SDKs (for more details [see internal
doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)).

This change adds the scalar properties mentioned in the doc and adds
acceptance tests for them.
While adding the acceptance tests I noticed that running pods on Fargate
doesn't work deterministically. In some cases the cluster fails to get
healthy (coredns stuck in pending).
This was caused by a race-condition between coredns starting and the
fargate profile being created. If the fargate profile deployed after
coredns, the pods got stuck in pending because they got assigned to the
`default-scheduler` instead of the `fargate-scheduler`.
The fix is relatively easy; making coredns depend on the fargate
profile.

I'll separately update the migration guide.

### New properties

| Existing Resource |  | New Top Level Property | Description |
| :---- | :---- | :---- | :---- |
| `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | |
`clusterSecurityGroupId: Output<string>` | Only really useful property
of a security group. Used to add additional ingress/egress rules.
Default to `the EKS created security group id` |
| `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | |
`nodeSecurityGroupId: Output<string>` | |
| `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \|
undefined>` | | `clusterIngressRuleId: Output<string>` | Only really
useful property of a rule. Default to `””` |
| `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | |
`defaultNodeGroupAsgName: Output<string>` | The only useful property of
the default node group is the auto scaling group. Exposing its name
allows users to reference it in IAM roles, tags, etc. Default to `””` |
| `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>`
| `fargateProfileId: Output<string>` | The id of the fargate profile.
Can be used to reference it. Default to `””` |
| | | `fargateProfileStatus: Output<string>` | The status of the fargate
profile. Default to `””` |
| | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` |
`oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` &
`oidcIssuer: Output<string` | Arn and Url are properties needed to set
up IAM identities for pods (required for the assume role policy of the
IAM role). Users currently need to trim the `https://` part of the url
to actually use it. We should expose `oidcProvider` with that already
done to ease usage. |


Fixes #1041
@pulumi-bot
Copy link
Contributor

This issue has been addressed in PR #1445 and shipped in release v3.0.0-beta.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/engineering Work that is not visible to an external user resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

4 participants