-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-enable shared TF provider usage #86
Comments
I could reliably reproduce the underlying issue without getting Crossplane provider involved as follows:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "4.35.0"
}
}
}
provider "aws" {
region = "us-west-1"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "HTSharedTest-us-west-1"
}
}
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "4.35.0"
}
}
}
provider "aws" {
region = "us-west-2"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "HTSharedTest-us-west-2"
}
}
terraform apply -auto-approve -input=false -lock=false -json&
terraform apply -auto-approve -input=false -lock=false -json&
terraform apply -auto-approve -input=false -lock=false -json&
terraform apply -auto-approve -input=false -lock=false -json&
terraform apply -auto-approve -input=false -lock=false -json& Observed different behaviours on different attempts as follows:
If I run them sequentially, only 1 VPC per region is created as expected. Digging in further to see what is really going on... |
TL;DR; This seems to be the expected behavior when you run Terraform in test mode, i.e., shared GRPC mode with The problem here is in this mode of operation, we have a shared provider instance, and
The PR adding support for this mode in Terraform binary mentions the following caveat, which is relevant here:
Another example of this is the expected behavior; this PR adds some warning if an already configured provider is reconfigured during some concurrent runs against a shared provider instance. Finally a relevant note from the Terraform documentation:
@ulucinar checked your shared provider implementation in Upjet and noticed that you're running the provider with |
Currently, I am seeing the following options: A. Keep shared GRPC mode disabled and use terraform just like an end user. For now, my vote would be for option A and focusing on other possible improvements on performance. @ulucinar, I would really appreciate if you could check and let me know if I am missing something. |
Hi @turkenh,
IIRC, I had to include the |
One other thing that we need to be careful about is that disabling the shared gRPC server (i.e., running the native plugin as a shared server between multiple clients) seems to have resolved certain duplication issues reported for provider-aws but we are still running provider-{gcp,azure} in that mode. My feeling is that the issues are provider specific (this is also supported by the fact that similar issues have not yet been observed with the other providers, or maybe we are not yet aware or not yet connected the dots), but still, we may decide to act cautiously and consider disabling this mode for others also. I think we should seriously consider this (as this mode of operation is not what Terraform expects for production use) if we can improve performance by tweaking some other parameters and we find ourselves in a position where we think the performance gain we get from the shared gRPC server mode (on top of some future improvements) is not worth taking the risk. I believe this is also inline with what you propose in |
Yes, this is how I was running the plugin: ./terraform-provider-aws_v4.35.0_x5 -debug
{"@level":"debug","@message":"plugin address","@timestamp":"2022-11-03T21:50:02.758995+03:00","address":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin27659334","network":"unix"}
Provider started. To attach Terraform CLI, set the TF_REATTACH_PROVIDERS environment variable with the following:
TF_REATTACH_PROVIDERS='{"registry.terraform.io/hashicorp/aws":{"Protocol":"grpc","ProtocolVersion":5,"Pid":2373,"Test":true,"Addr":{"Network":"unix","String":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin27659334"}}}' And, IIUC this is how we are running in upjet: TF_PLUGIN_MAGIC_COOKIE=d602bf8f470bc67ca7faa0386276bbdd4330efaf76d1a219cb4d6991ca9872b2 ./terraform-provider-aws_v4.35.0_x5
{"@level":"debug","@message":"plugin address","@timestamp":"2022-11-03T21:50:14.119454+03:00","address":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin3440378070","network":"unix"}
1|5|unix|/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin3440378070|grpc| |
Based on my findings above, I would expect it to happen in others as well, however, we didn't see them because they were not tested/used under triggering conditions. |
@ulucinar it looks like it is already disabled for provider-azure and provider-gcp as well: https://github.com/upbound/provider-azure/blob/main/cluster/images/provider-azure/Dockerfile#L45 |
Thanks @turkenh, I was not aware that we had disabled it for Azure and GCP. I wonder whether it was intentional: It seems like we have disabled shared gRPC server while opensourcing those providers. |
Thank you for digging in @turkenh ! I am also leaning towards dropping shared provider altogether given that there is an inherent risk of a config/creds used in another resource due to the architecture. But at the same time, I received feedback from the community that it becomes too expensive to run these providers when the shared mode is disabled. I wonder if we'd gain some CPU/memory/time if we spin up a provider instance in Either way though, I agree that we can start by increasing the poll-interval to something like 10 minutes and making it configurable to alleviate some of the pain. |
Closing this issue as the default poll interval and the ability to configure it has been addressed with #140 |
What problem are you facing?
As per crossplane/terrajet#233 , there is memory and CPU usage improvements when we use a single instance of TF provider to be shared by all
terraform
CLI calls. However, we had to disable it once we discovered that TF AWS provider uses credentials of another resource in some cases, which causesresource not found
orresource already exists
errors. First seen in crossplane-contrib/provider-jet-aws#220How could Official AWS Provider help solve your problem?
We need to fix the problem in TF AWS provider, bump the version and run Uptest to make sure all works. There is currently no issue in https://github.com/hashicorp/terraform-provider-aws that talks about the problem as far as I'm aware since shared mode is used only in tests by TF community but hashicorp/terraform-provider-aws#26130 and hashicorp/go-plugin#215 could be related which are mentioned in crossplane/terrajet#305 .
I think the next steps are as following:
uptest-all
action to test all resources.The text was updated successfully, but these errors were encountered: