[Bug]: Throttling errors after migrating services to aws-sdk-go-v2 #34669

Closed
mlynch1985 opened this issue Nov 30, 2023 · 30 comments
Labels
bug Addresses a defect in current functionality. service/athena Issues and PRs that pertain to the athena service. service/codebuild Issues and PRs that pertain to the codebuild service. service/codepipeline Issues and PRs that pertain to the codepipeline service. service/controltower Issues and PRs that pertain to the controltower service.

Comments

@mlynch1985

Terraform Core Version

1.6.4

AWS Provider Version

5.28.0

Affected Resource(s)

aws_controltower_control

Expected Behavior

The Terraform plan should complete the refresh process successfully without error and allow for the apply stage to execute.

Actual Behavior

The refresh was interrupted by throttling errors, which prevented the plan/apply from completing.

Relevant Error/Panic Output Snippet

Error: reading ControlTower Control (arn:aws:organizations::000000000000:ou/o-abcdefghijk/ou-abcd-efghijklmno,arn:aws:controltower:us-east-1::control/BKEEVLXJOIZI): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
│
│   with module.ct_managed_controls.aws_controltower_control.vpc["BKEEVLXJOIZI/ou-abcd-efghijklmno"],
│   on modules\ct_managed_controls\main.tf line 122, in resource "aws_controltower_control" "api_gateway":
│  122: resource "aws_controltower_control" "api_gateway" {

Terraform Configuration Files

terraform {
  required_version = ">= 1.6.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region  = "us-east-1"
}

data "aws_region" "current" {}
data "aws_organizations_organization" "this" {}

data "aws_organizations_organizational_units" "level_one" {
  parent_id = data.aws_organizations_organization.this.roots[0].id
}

data "aws_organizations_organizational_units" "level_two" {
  for_each  = local.level_one_ous
  parent_id = each.value.id
}

#  ...  #

locals {
  level_one_ous = { for ou in data.aws_organizations_organizational_units.level_one.children : ou.name => ou }

  level_two_ous = merge([
    for parent_name, ou in data.aws_organizations_organizational_units.level_two :
    { for child in ou.children : "${parent_name}/${child.name}" => child }
  ]...)

  #  ...  #

  all_ous = merge(local.level_one_ous, local.level_two_ous, local.level_three_ous, local.level_four_ous, local.level_five_ous)
}

locals {
  api_gateway = {
    # [SH.APIGateway.1] API Gateway REST and WebSocket API execution logging should be enabled
    "OOTDCUSIKIZZ" = {
      "${local.all_ous["Deployments"].id}"    = local.all_ous["Deployments"].arn,
      "${local.all_ous["Infrastructure"].id}" = local.all_ous["Infrastructure"].arn,
      "${local.all_ous["Sandbox"].id}"        = local.all_ous["Sandbox"].arn,
      "${local.all_ous["Workloads"].id}"      = local.all_ous["Workloads"].arn
    }

    #  ...  #
  }

  #  ...  #
}

resource "aws_controltower_control" "api_gateway" {
  for_each = merge([for control, ou_map in local.api_gateway :
    { for ou_id, ou_arn in ou_map : "${control}/${ou_id}" => { "control" = control, "ou_arn" = ou_arn } }
  ]...)

  control_identifier = "arn:aws:controltower:${data.aws_region.current.name}::control/${each.value.control}"
  target_identifier  = each.value.ou_arn
}

Steps to Reproduce

Set up AWS Control Tower and copy the above code into main.tf. You will need to create the OU structure and enable the CT control-to-OU associations, as the throttling seems to occur only after the initial apply.

Debug Output

No response

Panic Output

No response

Important Factoids

After upgrading to AWS provider v5.28.0 and attempting to execute a plan/apply containing 10+ instances of the "aws_controltower_control" resource, we received throttling errors. When adding a constraint to the provider block to downgrade the AWS provider to <5.28.0 the issue is resolved. Alternatively we can pass in the -refresh=false switch to complete the apply successfully.
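
The downgrade workaround described above can be written as a version constraint. This is a minimal sketch that reuses the `terraform` block from this report; the upper bound of 5.28.0 comes from the reporter's observation and may differ for other services, which appear to have been affected in different provider releases.

```hcl
terraform {
  required_version = ">= 1.6.0, < 2.0.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Workaround from this report: stay below 5.28.0, the first release
      # in which the reporter saw the throttling errors.
      version = "~> 5.0, < 5.28.0"
    }
  }
}
```

If a newer provider version is already recorded in the dependency lock file, running `terraform init -upgrade` is typically needed for the new constraint to take effect.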

References

[Enhancement]: Migrate controltower service to aws-sdk-go-v2

Would you like to implement a fix?

None

@mlynch1985 mlynch1985 added the bug Addresses a defect in current functionality. label Nov 30, 2023
@github-actions github-actions bot added service/controltower Issues and PRs that pertain to the controltower service. service/organizations Issues and PRs that pertain to the organizations service. labels Nov 30, 2023
@terraform-aws-provider terraform-aws-provider bot added the needs-triage Waiting for first response or review from a maintainer. label Nov 30, 2023
@RobbertDM

RobbertDM commented Dec 6, 2023

This seems broader than controltower. I also have this for Athena:

│ Error: reading Athena WorkGroup (...): operation error Athena: GetWorkGroup, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

But also

│ Error: listing tags for Athena WorkGroup (arn:aws:athena:...): operation error Athena: ListTagsForResource, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

There seems to be a related open issue at the aws-sdk-go-v2 repo:
aws/aws-sdk-go-v2#1665

I'm on aws provider 5.29.0

@ewbankkit ewbankkit removed service/organizations Issues and PRs that pertain to the organizations service. needs-triage Waiting for first response or review from a maintainer. labels Jan 16, 2024
@ewbankkit
Contributor

Relates #34409.

@terhirissa

We get a similar error with GetInlinePolicyForPermissionSet.

Error: reading SSO Permission Set Inline Policy (...): operation error SSO Admin: GetInlinePolicyForPermissionSet, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested

The error exists on aws provider version 5.29.0 and above.

@neogibson

neogibson commented Jan 23, 2024

We are also seeing this with CodePipeline Webhook resources on any version above 5.31.0. If we pin our provider version to 5.31.0 it's fine but 5.32.1 and 5.33.0 result in plan failures:

Error: reading CodePipeline Webhook (arn:aws:codepipeline:ca-central-1::webhook:example): operation error CodePipeline: ListWebhooks, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested

@shawnl-kb4

Same: We cannot use this for managing controls due to a "ThrottlingException" resulting from making the API call to "ListEnabledControls".

I just got off the phone with AWS Control Tower folks, who suggested updating the retry logic. It would be great to see a fix for this.

@ewbankkit
Contributor

My thinking on this is to add new provider configuration attribute(s) that customize the AWS SDK for Go v2 retryer:
https://github.com/aws/aws-sdk-go-v2/blob/4fce0fdec6c41822255f4c3ec17aa46a9b6e2ac3/aws/retry/standard.go#L160-L171
In particular, a RateLimiter with a configurable token bucket size (different from the default of 500).

@miguelaferreira
Contributor

We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.

@ewbankkit
Contributor

@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?
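
For anyone trying this, a minimal sketch of the suggested setting. The region is copied from the configuration in the original report and is only a placeholder; everything else in the provider block is assumed to stay unchanged.

```hcl
provider "aws" {
  region = "us-east-1"

  # Suggested experiment from this thread: switch the SDK retryer
  # to the "adaptive" retry mode.
  retry_mode = "adaptive"
}
```

Results reported in this thread are mixed: the setting removed the errors for some users, at the cost of longer plan times, and made no difference for others.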

@neogibson

neogibson commented Feb 5, 2024

@ewbankkit Thanks for the suggestion, setting that on the provider did work in my case, a plan was generated without those rate limit errors. However, on one of our workspaces that consistently plans in ~3 minutes on provider version 5.31.0, this setting seems to have increased the plan time to around 9 minutes on the latest provider version 5.35.0.

@ewbankkit
Contributor

@neogibson Thanks for looking into this.
My guess is that we could fine-tune some of the options to get the behavior closer to AWS SDK for Go v1.
The maintainers have this on the agenda to discuss for this week's tech debt review.

@ewbankkit ewbankkit changed the title [Bug]: Throttling error after migrating controltower service to aws-sdk-go-v2 [Bug]: Throttling errors after migrating services to aws-sdk-go-v2 Feb 12, 2024
@github-actions github-actions bot added the service/organizations Issues and PRs that pertain to the organizations service. label Feb 12, 2024
@ewbankkit ewbankkit added service/athena Issues and PRs that pertain to the athena service. service/codepipeline Issues and PRs that pertain to the codepipeline service. service/codebuild Issues and PRs that pertain to the codebuild service. and removed service/organizations Issues and PRs that pertain to the organizations service. labels Feb 12, 2024
@mlynch1985
Author

@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?

I tested with this option and unfortunately the error is still present.

Error: reading ControlTower Control (arn:aws:organizations::012345678912:ou/o-abcdefghij/ou-abcd-abcdefgh,arn:aws:controltower:us-west-2::control/PBGUIXCOFNGC): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

@ewbankkit
Contributor

hashicorp/aws-sdk-go-base#918, incorporated into the Terraform AWS Provider via #35817, should address the "failed to get rate limit token, retry quota exceeded" errors.
As we have not been able to reproduce the throttling errors in our testing we cannot guarantee that all error cases have been dealt with, so I will leave this issue open for comments.
The fix will be available in Terraform AWS Provider v5.37.0, likely released tomorrow.

@jwh-exerp

Unfortunately we are still seeing this issue even with AWS provider version v5.37.0, with our project which manages controls and their mappings across our organization.

Terraform configuration:

╰─ terraform version
Terraform v1.6.6
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.37.0
+ provider registry.terraform.io/hashicorp/local v2.4.1
terraform plan

...

Planning failed. Terraform encountered an error while generating this plan.

╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_EC2_VOLUME_INUSE_CHECK): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 2 available, 5 requested
│ 
│   with aws_controltower_control.detective["Build_AWS-GR_EC2_VOLUME_INUSE_CHECK"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_RDS_STORAGE_ENCRYPTED): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│ 
│   with aws_controltower_control.detective["Workloads/Shared_AWS-GR_RDS_STORAGE_ENCRYPTED"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│ 
│   with aws_controltower_control.detective["DR_AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 

... and many more similar quota exceeded examples

An earlier comment reported that v5.31.0 didn't have this issue, but it does for our project. We are pinned on v5.26.0 until a solution can be found.

@mlynch1985
Author

I retested today with TF v1.7.3 and AWS Provider v5.37.0 but still encountered the same errors. Reverting to v5.27.0 continues to be the workaround.

@sixdaysandy

I retested with the 5.37.0 update today after experiencing errors with the 5.36.0 provider, then reverted to the 5.35.0 provider, as that throws no errors.
We're seeing it in CloudWatch: ListTagsForResource & CloudWatch: DescribeAlarms but only on very large states.

@kieran-lowe
Contributor

kieran-lowe commented Feb 19, 2024

Yeah we're also experiencing this for CodeBuild.

Edit: pinning to 5.27.0 as suggested by @mlynch1985 worked for us. Will test with setting retry_mode.

@dthvt
Contributor

dthvt commented Feb 19, 2024

FYI, I had some luck changing the provider configuration to include retry_mode = "adaptive" after the update to SDK v2. This resolved the throttling issues I was encountering w/ the Workspaces API.

@ewbankkit
Contributor

For the next pass at a solution, we will add the ability to configure the token bucket capacity for the retry throttling rate limiter (e.g. aws/aws-sdk-go-v2#1665 (comment)). This configured value will be used to initialize the capacity of every API client's token bucket.

@ewbankkit
Contributor

With the very soon to be released v5.38.0 of the Terraform AWS provider we have added a new provider-level configuration parameter token_bucket_rate_limiter_capacity:

provider "aws" {
  token_bucket_rate_limiter_capacity = 5000
}

which allows the capacity of the rate limiter token bucket to be set.
The default is 500 tokens, so if you are experiencing throttling errors then please configure a larger value.

@mlynch1985
Author

I tested with the suggested value of 5000 and still encountered the error. What is the downside to increasing this value? I don't want to set a ridiculously high number without understanding the potential risks. If it helps, I can set up a code dump so you can test the same code as me.

@ewbankkit
Contributor

@mlynch1985 No additional resources are consumed by increasing the value.

@mlynch1985
Author

@ewbankkit I had to set token_bucket_rate_limiter_capacity in my provider to 50,000 before it worked; with that change I was able to complete the plan/apply. I will close this issue now. Thank you!
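
For reference, a sketch of the configuration that this comment describes. The region is a placeholder taken from the original report, and the value 50000 is simply what worked in this one environment; other setups may need a different value, and later provider releases may treat this argument differently.

```hcl
provider "aws" {
  region = "us-east-1"

  # Available from provider v5.38.0 according to this thread.
  # The default capacity is 500 tokens; 5000 was not enough for the
  # reporter, while 50000 allowed the plan/apply to complete.
  token_bucket_rate_limiter_capacity = 50000
}
```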

@richgreen-moj

We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.

We had issues with this over the last few weeks, but today it has started to work again, which seems to coincide with updating the provider to v5.41.0.

The last provider version it worked with was v5.38.0. Since then I've been trying some of the suggested workarounds, e.g. setting retry_mode to adaptive and token_bucket_rate_limiter_capacity to a very large number, but neither helped. We'll keep an eye on it.

@dandelo

dandelo commented Mar 28, 2024

Fixed for us in v5.42.0, specifically looks like this fix:

provider: Change the default AWS SDK for Go v2 API client RateLimiter to ratelimit.None so that services migrated to AWS SDK for Go v2 maintain behavioral compatibility with AWS SDK for Go v1 (#36467)
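
A minimal sketch of a version constraint that picks up this change, assuming v5.42.0 is the first release containing it, as this comment suggests. The upper bound is an illustrative assumption to stay within the 5.x series.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # v5.42.0 is the release reported here as changing the default
      # RateLimiter to ratelimit.None.
      version = ">= 5.42.0, < 6.0.0"
    }
  }
}
```

With this in place, the retry_mode and token_bucket_rate_limiter_capacity workarounds discussed earlier are presumably no longer needed, though that has not been confirmed for every service mentioned in this thread.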

@AbAvramidis

We are still facing some issues related to this. We noticed strange behavior where the TF plan freezes while refreshing the state of several resources, times out after 40 minutes, and leaves the state locked.
Is anyone facing something similar, even with the latest version?
We notice this behavior on any version higher than 5.32.

@neogibson

Thanks for fixing this!

@ewbankkit
Contributor

@AbAvramidis Do you know which services are exhibiting this behavior?

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2024