Excessive calls to AssumeRoleWithWebIdentity w/ IRSA #997

Closed
haarchri opened this issue Dec 1, 2023 · 4 comments · Fixed by #1235
Labels: bug (Something isn't working), needs:triage

Comments

haarchri (Member) commented Dec 1, 2023:

What happened?

Reduce the frequency of STS calls for AssumeRole/AssumeRoleWithWebIdentity by using cached credentials. This will minimize the number of audit events in CloudTrail, and the credentials will only be renewed when necessary, e.g. shortly before expiration.

See the implementation in the community provider: crossplane-contrib/provider-aws#828.
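For illustration only, here is a minimal sketch of the desired behavior, assuming the AWS SDK for Go v2 (this is not the provider's code; the role ARN and token path are placeholders): wrap the web identity provider in an aws.CredentialsCache so AssumeRoleWithWebIdentity is only called on a cache miss or shortly before expiry.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// IRSA-style web identity provider; role ARN and token path are placeholders.
	provider := stscreds.NewWebIdentityRoleProvider(
		sts.NewFromConfig(cfg),
		"arn:aws:iam::123456789012:role/example",
		stscreds.IdentityTokenFile("/var/run/secrets/eks.amazonaws.com/serviceaccount/token"),
	)

	// CredentialsCache only calls AssumeRoleWithWebIdentity when the cached
	// credentials are missing or within the expiry window.
	cached := aws.NewCredentialsCache(provider, func(o *aws.CredentialsCacheOptions) {
		o.ExpiryWindow = 5 * time.Minute
	})
	cfg.Credentials = cached

	creds, err := cached.Retrieve(ctx) // first call hits STS, later calls hit the cache
	if err != nil {
		log.Fatal(err)
	}
	log.Println("credentials valid until:", creds.Expires)
}
```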

How can we reproduce it?

What environment did it happen in?

  • Crossplane Version:
  • Provider Version:
  • Kubernetes Version:
  • Kubernetes Distribution:
haarchri (Member Author) commented:

@erhancagirici we have no cache enabled. Should we reopen this issue or create a new one?

stevendborrelli (Contributor) commented:

The fix in #1003 was a partial fix. We are still seeing a significant number of calls to AssumeRoleWithWebIdentity, which propagate across audit and logging systems.

haarchri (Member Author) commented Mar 4, 2024:

Good to check crossplane-contrib/provider-aws#1148.

erhancagirici (Collaborator) commented:
@haarchri As a first step, the refactoring at #1204 will reduce the STS calls by 50%.

I am working on a cache implementation and I've checked crossplane-contrib/provider-aws#1148. I've been using a similar approach, but there seem to be several things to consider with that approach and with the ongoing work, especially for some edge cases:

The community provider implementation relies on caching the constructed default AWS config.

  • This only covers plain IRSA. Other configurations, like IRSA+AssumeRole, WebIdentity, and WebIdentity+AssumeRole, still do not benefit from the cache.
  • The default AWS config is unconditionally cached and there is no concept of invalidating that cached config. While this holds most of the time (for IRSA), it rests on the big assumption that the configuration never changes at runtime.

Some background on the implementation details:
The AWS config struct has a Credentials field, which is a CredentialsProvider interface exposing a Retrieve() method.
By default, when the AWS SDK constructs an aws.Config (e.g. via LoadDefaultConfig), it wraps the resolved CredentialsProvider with a simple CredentialsCache, which is itself a credentials provider. That cache only covers the wrapped CredentialsProvider, i.e. that particular credential config.
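A minimal sketch of that layering, assuming the AWS SDK for Go v2 (not the provider's code); it only inspects what LoadDefaultConfig returns:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	ctx := context.Background()

	// Resolves the credential chain: env vars, web identity token (IRSA),
	// shared config files, etc.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// cfg.Credentials is an aws.CredentialsProvider. The default resolver
	// wraps the concrete provider in *aws.CredentialsCache, so Retrieve()
	// only hits the underlying provider (and STS) on a miss or near expiry.
	if _, ok := cfg.Credentials.(*aws.CredentialsCache); ok {
		fmt.Println("default config already carries a CredentialsCache")
	}

	creds, err := cfg.Credentials.Retrieve(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("source:", creds.Source, "expires:", creds.Expires)

	// Caching the whole aws.Config on top of this (as the community provider
	// does) therefore layers a second cache over this CredentialsCache.
}
```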

When we cache the config itself on top of that, we end up with a layered cache: we cache the underlying CredentialsCache and, as a side effect, the wrapped CredentialsProvider.

From this point on, speaking about all auth cases in a generic manner:
In the context of a particular credential config, AWS's CredentialsCache can handle expirations fine on its own.

However, the tricky part is handling potential changes in the AWS "config" at runtime, or more generally any "parameter" that would result in a different credential setup/CredentialsProvider. In such cases, we cannot use the cached config, as it still points to an obsolete CredentialsCache and CredentialsProvider.

This means we need to cache configs according to their content: each config that results in a separate credential set should have its own cache entry (see the sketch after this list). The challenges here are:

  • We need to come up with a cache key that exhaustively covers every parameter that has an effect on the resulting credentials.
    • The relevant parameters differ by auth method and need to be handled case by case.
      • CredentialsProvider is an interface and the implementation is abstracted away, so the parameters are arbitrary.
        • There can be custom CredentialsProvider implementations.
      • We need to track the implementation details of each CredentialsProvider to extract the parameters.
    • Some parameters do not show up explicitly in the provider config. For example, LoadDefaultConfig has a default set of resolvers that read environment variables, etc.
    • In some cases, theoretically, even knowing the exhaustive list of (adjustable) parameters might not be sufficient: the same parameter set can result in different credentials.
      • Say the config has a parameter that specifies a file path, and the file content changes at runtime. From the config's point of view it is the same config, but since the file content has changed, it is implicitly a different config.
  • The general concern here is not unnecessarily invalidating the cache; it is rather operating with a stale cache entry.
    • A stale config can still point to a valid credential. In that case, we would be operating with an unintended credential set.
    • A stale config can have an invalid credential set. If we fail to detect the stale entry (say, due to a parameter missing from the cache key), we are stuck with invalid credentials until we prune it.
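To make the content-based caching idea concrete, here is a hypothetical sketch, not the provider's actual code; the names configCacheKey and awsConfigCache are illustrative, and the key would have to include every parameter that influences credential resolution.

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
)

// configCacheKey hashes the parameters that feed credential resolution
// (auth method, role ARN, region, token file path, ...). Omitting a relevant
// parameter is exactly the staleness risk described above: two different
// credential setups would share one cache entry.
func configCacheKey(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
	}
	return hex.EncodeToString(h.Sum(nil))
}

// awsConfigCache stores fully constructed aws.Config values keyed by content.
type awsConfigCache struct {
	mu      sync.Mutex
	entries map[string]aws.Config
}

func newAWSConfigCache() *awsConfigCache {
	return &awsConfigCache{entries: map[string]aws.Config{}}
}

// getOrBuild returns the cached config for key, or builds and stores one.
func (c *awsConfigCache) getOrBuild(key string, build func() (aws.Config, error)) (aws.Config, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cfg, ok := c.entries[key]; ok {
		return cfg, nil
	}
	cfg, err := build()
	if err != nil {
		return aws.Config{}, err
	}
	c.entries[key] = cfg
	return cfg, nil
}
```

Note that a key built only from explicit parameters does not address the file-content example above; covering that would require hashing the referenced file contents or attaching a TTL to cache entries.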

From a dev/maintainability point of view, the above points imply:

  • The cache implementation will be tightly coupled with CredentialsProvider implementations, which increases the complexity of the flow.
  • Devs need to track existing auth method implementations for upstream parameter changes.
  • Devs need to track new auth methods and implement the relevant cache handling.
  • Missing any of these can silently result in bugs.

With all that said, we will take a bit more time to carefully evaluate the options for how to move forward. I'll post further updates here.
