BFD-2856 : BFD Server ASG Network Scaling Policies are causing erroneous capacity oscillation (#1901)

Co-authored-by: Mitch Alessio <[email protected]>
Co-authored-by: Mitchell Alessio <[email protected]>
3 people authored Aug 25, 2023
1 parent 2e34b93 commit a11eefa
Showing 3 changed files with 128 additions and 32 deletions.
58 changes: 58 additions & 0 deletions ops/terraform/services/server/modules/bfd_server_asg/README.md
@@ -43,3 +43,61 @@ module "asg" {
}
}
```

<!-- BEGIN_TF_DOCS -->
## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_aws"></a> [aws](#provider\_aws) | n/a |
| <a name="provider_external"></a> [external](#provider\_external) | n/a |

## Modules

No modules.

## Resources

| Name | Type |
|------|------|
| [aws_autoscaling_group.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group) | resource |
| [aws_autoscaling_notification.asg_notifications](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_notification) | resource |
| [aws_autoscaling_policy.filtered_networkin_high_scaling](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_policy) | resource |
| [aws_autoscaling_policy.filtered_networkin_low_scaling](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_policy) | resource |
| [aws_cloudwatch_metric_alarm.filtered_networkin_high](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_metric_alarm) | resource |
| [aws_cloudwatch_metric_alarm.filtered_networkin_low](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_metric_alarm) | resource |
| [aws_launch_template.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource |
| [aws_security_group.app](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
| [aws_security_group.base](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
| [aws_security_group_rule.allow_db_access](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group_rule) | resource |
| [aws_kms_key.master_key](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/kms_key) | data source |
| [aws_rds_cluster.rds](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/rds_cluster) | data source |
| [aws_subnet.app_subnets](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet) | data source |
| [external_external.rds](https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_asg_config"></a> [asg\_config](#input\_asg\_config) | n/a | `object({ min = number, max = number, max_warm = number, desired = number, sns_topic_arn = string, instance_warmup = number })` | n/a | yes |
| <a name="input_db_config"></a> [db\_config](#input\_db\_config) | Setup a db ingress rules if defined | `object({ db_sg = string, role = string, db_cluster_identifier = string })` | `null` | no |
| <a name="input_env_config"></a> [env\_config](#input\_env\_config) | All high-level info for the whole vpc | `object({ default_tags = map(string), vpc_id = string, azs = list(string) })` | n/a | yes |
| <a name="input_jdbc_suffix"></a> [jdbc\_suffix](#input\_jdbc\_suffix) | boolean controlling logging of detail SQL values if a BatchUpdateException occurs; false disables detail logging | `string` | `"?logServerErrorDetail=false"` | no |
| <a name="input_kms_key_alias"></a> [kms\_key\_alias](#input\_kms\_key\_alias) | Key alias of environment's KMS key | `string` | n/a | yes |
| <a name="input_launch_config"></a> [launch\_config](#input\_launch\_config) | n/a | `object({ instance_type = string, volume_size = number, ami_id = string, key_name = string, profile = string, user_data_tpl = string, account_id = string })` | n/a | yes |
| <a name="input_layer"></a> [layer](#input\_layer) | app or data | `string` | n/a | yes |
| <a name="input_lb_config"></a> [lb\_config](#input\_lb\_config) | Load balancer information | `object({ name = string, port = number, sg = string })` | `null` | no |
| <a name="input_mgmt_config"></a> [mgmt\_config](#input\_mgmt\_config) | n/a | `object({ vpn_sg = string, tool_sg = string, remote_sg = string, ci_cidrs = list(string) })` | n/a | yes |
| <a name="input_role"></a> [role](#input\_role) | n/a | `string` | n/a | yes |
| <a name="input_scaling_networkin_interval_mb"></a> [scaling\_networkin\_interval\_mb](#input\_scaling\_networkin\_interval\_mb) | The interval value in megabytes for evaluating the asg scaling capacity, based on the metric FilteredNetworkIn | `number` | `100000000` | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_asg_id"></a> [asg\_id](#output\_asg\_id) | n/a |
<!-- END_TF_DOCS -->
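
To make the effect of `scaling_networkin_interval_mb` concrete, the following minimal sketch reproduces the `scaleout_asg_capacities` local that this commit adds to `main.tf`, as a standalone configuration. It assumes three availability zones (the zone names are placeholders, not values from this commit) and the default interval of `100000000`, which appears to be a byte count (100 × 1,000,000) despite the `_mb` suffix.

```hcl
# Standalone sketch; evaluate with `terraform console` or `terraform apply`
# in an empty directory. The AZ names are placeholder assumptions.
variable "azs" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

variable "scaling_networkin_interval_mb" {
  type    = number
  default = 100000000
}

locals {
  # Same shape as the local in main.tf, with var.env_config.azs replaced by var.azs.
  scaleout_asg_capacities = [
    { capacity = length(var.azs) * 2, metric_lower_bound = 1 * var.scaling_networkin_interval_mb, metric_upper_bound = 2 * var.scaling_networkin_interval_mb },
    { capacity = length(var.azs) * 3, metric_lower_bound = 2 * var.scaling_networkin_interval_mb, metric_upper_bound = 4 * var.scaling_networkin_interval_mb },
    { capacity = length(var.azs) * 4, metric_lower_bound = 4 * var.scaling_networkin_interval_mb, metric_upper_bound = null }
  ]
}

# With three AZs and the default interval, the bands work out to:
#   capacity 6  for FilteredNetworkIn above 100000000 and up to 200000000
#   capacity 9  for FilteredNetworkIn above 200000000 and up to 400000000
#   capacity 12 for FilteredNetworkIn above 400000000 (no upper bound)
output "scaleout_bands" {
  value = local.scaleout_asg_capacities
}
```

Raising the interval widens every band proportionally, so the group would need more inbound traffic before each scale-out step is requested.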
96 changes: 64 additions & 32 deletions ops/terraform/services/server/modules/bfd_server_asg/main.tf
@@ -5,6 +5,11 @@ locals {
rds_reader_endpoint = data.external.rds.result["CustomEndpoint"] == "" ? data.external.rds.result["ReaderEndpoint"] : data.external.rds.result["CustomEndpoint"]

additional_tags = { Layer = var.layer, role = var.role }
+ scaleout_asg_capacities = [
+ { capacity = length(var.env_config.azs) * 2, metric_lower_bound = 1 * var.scaling_networkin_interval_mb, metric_upper_bound = 2 * var.scaling_networkin_interval_mb },
+ { capacity = length(var.env_config.azs) * 3, metric_lower_bound = 2 * var.scaling_networkin_interval_mb, metric_upper_bound = 4 * var.scaling_networkin_interval_mb },
+ { capacity = length(var.env_config.azs) * 4, metric_lower_bound = 4 * var.scaling_networkin_interval_mb, metric_upper_bound = null }
+ ]
}

## Security groups
@@ -184,8 +189,8 @@ resource "aws_autoscaling_group" "main" {
resource "aws_cloudwatch_metric_alarm" "filtered_networkin_low" {
alarm_name = "bfd-${var.role}-${local.env}-networkin-low"
comparison_operator = "LessThanThreshold"
- datapoints_to_alarm = 5
- evaluation_periods = 5
+ datapoints_to_alarm = 10
+ evaluation_periods = 10
threshold = 400 * 1000000 # 400 megabytes
treat_missing_data = "ignore"
alarm_actions = [aws_autoscaling_policy.filtered_networkin_low_scaling.arn]
@@ -231,7 +236,7 @@ resource "aws_cloudwatch_metric_alarm" "filtered_networkin_low" {
resource "aws_autoscaling_policy" "filtered_networkin_low_scaling" {
name = "bfd-${var.role}-${local.env}-networkin-low-scalein"
autoscaling_group_name = aws_autoscaling_group.main.name
- adjustment_type = "ChangeInCapacity"
+ adjustment_type = "ExactCapacity"
metric_aggregation_type = "Average"
policy_type = "StepScaling"

@@ -246,33 +251,34 @@ resource "aws_autoscaling_policy" "filtered_networkin_low_scaling" {
# and the .0 precision modifier to ensure that Terraform's formatter does not pad the decimal
# part with 0s
metric_interval_upper_bound = format("%.0e", -300 * 1000000) # 300 megabytes
- scaling_adjustment = -(length(var.env_config.azs) * 3)
+ scaling_adjustment = length(var.env_config.azs)
}

step_adjustment {
metric_interval_lower_bound = format("%.0e", -300 * 1000000) # 300 megabytes
metric_interval_upper_bound = format("%.0e", -200 * 1000000) # 200 megabytes
- scaling_adjustment = -(length(var.env_config.azs) * 2)
+ scaling_adjustment = length(var.env_config.azs) * 2
}

step_adjustment {
metric_interval_lower_bound = format("%.0e", -200 * 1000000) # 200 megabytes
metric_interval_upper_bound = 0 # 0 megabytes
- scaling_adjustment = -length(var.env_config.azs)
+ scaling_adjustment = length(var.env_config.azs) * 3
}
}

resource "aws_cloudwatch_metric_alarm" "filtered_networkin_high" {
alarm_name = "bfd-${var.role}-${local.env}-networkin-high"
- comparison_operator = "GreaterThanThreshold"
+ comparison_operator = "GreaterThanOrEqualToThreshold"
datapoints_to_alarm = 1
evaluation_periods = 1
- threshold = 100 * 1000000 # 100 megabytes
- treat_missing_data = "ignore"
+ threshold = 1
+ treat_missing_data = "notBreaching"
alarm_actions = [aws_autoscaling_policy.filtered_networkin_high_scaling.arn]

metric_query {
id = "m1"
+ period = 0
return_data = false

metric {
@@ -285,9 +291,9 @@ resource "aws_cloudwatch_metric_alarm" "filtered_networkin_high" {
stat = "Average"
}
}
-
metric_query {
id = "m2"
+ period = 0
return_data = false

metric {
@@ -301,10 +307,49 @@ resource "aws_cloudwatch_metric_alarm" "filtered_networkin_high" {
}
}

+ metric_query {
+ id = "m3"
+ period = 0
+ return_data = false
+
+ metric {
+ dimensions = {
+ AutoScalingGroupName = aws_autoscaling_group.main.name
+ }
+ metric_name = "GroupDesiredCapacity"
+ namespace = "AWS/AutoScaling"
+ period = 60
+ stat = "Average"
+ }
+ }
+
metric_query {
expression = "IF(m2/m1 > 0.01, m1, 0)"
- id = "e1"
+ id = "networkin"
label = "FilteredNetworkIn"
+ period = 0
+ return_data = false
+ }
+
+ dynamic "metric_query" {
+ for_each = local.scaleout_asg_capacities
+ content {
+ id = "e${metric_query.key}"
+ label = "Set to ${metric_query.value.capacity} capacity units"
+ expression = "IF(${join(" && ", compact([
+ "networkin > ${metric_query.value.metric_lower_bound}",
+ metric_query.value.metric_upper_bound != null ? "networkin <= ${metric_query.value.metric_upper_bound}" : null,
+ "m3 < ${metric_query.value.capacity}"
+ ]))}, ${metric_query.key + 1})"
+ return_data = false
+ }
+ }
+
+ metric_query {
+ expression = "MAX([${join(",", [for i in range(length(local.scaleout_asg_capacities)) : "e${i}"])}])"
+ id = "e${length(local.scaleout_asg_capacities)}"
+ label = "ScalingCapacityScalar"
+ period = 0
return_data = true
}
}
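
The generated metric math in the alarm above is easier to follow once rendered. The sketch below is a standalone approximation, not part of the commit: it hardcodes the three bands for an assumed three-AZ environment with the default interval, then builds the same expression strings as the `dynamic "metric_query"` block and the final `MAX` query. The local and output names are illustrative.

```hcl
# Standalone sketch; evaluate with `terraform console` or `terraform apply`.
# Band values assume three AZs and the default interval of 100000000.
locals {
  scaleout_asg_capacities = [
    { capacity = 6, metric_lower_bound = 100000000, metric_upper_bound = 200000000 },
    { capacity = 9, metric_lower_bound = 200000000, metric_upper_bound = 400000000 },
    { capacity = 12, metric_lower_bound = 400000000, metric_upper_bound = null }
  ]

  # One expression per band, built the same way as in the dynamic block:
  # traffic must fall inside the band AND desired capacity (m3) must still
  # be below the band's target capacity.
  band_expressions = [
    for idx, band in local.scaleout_asg_capacities :
    "IF(${join(" && ", compact([
      "networkin > ${band.metric_lower_bound}",
      band.metric_upper_bound != null ? "networkin <= ${band.metric_upper_bound}" : null,
      "m3 < ${band.capacity}"
    ]))}, ${idx + 1})"
  ]

  # The alarm's returned query takes the maximum across all band expressions.
  max_expression = "MAX([${join(",", [for i in range(length(local.scaleout_asg_capacities)) : "e${i}"])}])"
}

# These should render as:
#   e0: IF(networkin > 100000000 && networkin <= 200000000 && m3 < 6, 1)
#   e1: IF(networkin > 200000000 && networkin <= 400000000 && m3 < 9, 2)
#   e2: IF(networkin > 400000000 && m3 < 12, 3)
#   e3: MAX([e0,e1,e2])
output "band_expressions" {
  value = local.band_expressions
}

output "max_expression" {
  value = local.max_expression
}
```

The `m3 < capacity` term appears to be what addresses the oscillation named in the commit title: once the group's desired capacity has reached a band's target, that band no longer yields a scalar, so the alarm has no reason to request the same capacity again.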
@@ -313,34 +358,21 @@ resource "aws_autoscaling_policy" "filtered_networkin_high_scaling" {
name = "bfd-${var.role}-${local.env}-networkin-high-scaleout"
autoscaling_group_name = aws_autoscaling_group.main.name
estimated_instance_warmup = var.asg_config.instance_warmup
- adjustment_type = "ChangeInCapacity"
+ adjustment_type = "ExactCapacity"
metric_aggregation_type = "Average"
policy_type = "StepScaling"

- # All metric interval bounds are calculated by _adding_ the value of the bound to the threshold
- # of the alarm that this scaling policy operates on. For example, if the alarm threshold is 100MB
- # and the upper bound and lower bounds for a step adjustment are 300MB and 100MB, the step
- # adjustment executes if the metric is greater than 200MB and less than 400MB
- step_adjustment {
- metric_interval_lower_bound = format("%.0e", 300 * 1000000) # 300 megabytes
- scaling_adjustment = length(var.env_config.azs) * 3
- }
-
- step_adjustment {
- metric_interval_lower_bound = format("%.0e", 100 * 1000000) # 100 megabytes
- metric_interval_upper_bound = format("%.0e", 300 * 1000000) # 300 megabytes
- scaling_adjustment = length(var.env_config.azs) * 2
- }
-
- step_adjustment {
- metric_interval_lower_bound = 0 # 0 megabytes
- metric_interval_upper_bound = format("%.0e", 100 * 1000000) # 100 megabytes
- scaling_adjustment = length(var.env_config.azs)
+ dynamic "step_adjustment" {
+ for_each = local.scaleout_asg_capacities
+ content {
+ metric_interval_lower_bound = step_adjustment.key
+ metric_interval_upper_bound = step_adjustment.key + 1 != length(local.scaleout_asg_capacities) ? step_adjustment.key + 1 : null
+ scaling_adjustment = step_adjustment.value.capacity
+ }
}
}

## Autoscaling Notifications
- #
resource "aws_autoscaling_notification" "asg_notifications" {
count = var.asg_config.sns_topic_arn != "" ? 1 : 0

@@ -46,3 +46,9 @@ variable "jdbc_suffix" {
description = "boolean controlling logging of detail SQL values if a BatchUpdateException occurs; false disables detail logging"
type = string
}
+
+ variable "scaling_networkin_interval_mb" {
+ description = "The interval value in megabytes for evaluating the asg scaling capacity, based on the metric FilteredNetworkIn"
+ type = number
+ default = 100000000
+ }
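
As a closing illustration of how the scale-out policy's step adjustments line up with the alarm, the sketch below computes the absolute scalar range each step covers. It relies on the rule stated in the removed comment in `main.tf` (interval bounds are added to the alarm threshold) and assumes the new threshold of `1` and three AZs, so capacities of 6, 9 and 12. All local and output names are illustrative, not taken from the commit.

```hcl
# Standalone sketch; evaluate with `terraform console` or `terraform apply`.
locals {
  alarm_threshold = 1           # threshold of the filtered_networkin_high alarm
  capacities      = [6, 9, 12]  # assumed: three AZs multiplied by 2, 3 and 4

  # For each step, the bounds are offsets from the alarm threshold, mirroring
  # step_adjustment.key and step_adjustment.key + 1 in the dynamic block.
  step_ranges = [
    for idx, capacity in local.capacities : {
      exact_capacity = capacity
      scalar_from    = local.alarm_threshold + idx
      scalar_below   = idx + 1 != length(local.capacities) ? local.alarm_threshold + idx + 1 : null
    }
  ]
}

# Expected mapping under these assumptions:
#   scalar in [1, 2) sets capacity to 6
#   scalar in [2, 3) sets capacity to 9
#   scalar 3 or more sets capacity to 12
output "step_ranges" {
  value = local.step_ranges
}
```

Because the alarm returns a small integer scalar rather than a raw byte count, the step bounds stay fixed at 0, 1, 2 regardless of the interval variable; only the metric math expressions change when the interval is tuned.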
