Set up read/write auto scaling for mobile-save-for-later-*-articles #96

mbd0910 · 2023-11-22T09:48:14Z

What does this change?

This PR adds read and write autoscaling to the saved articles DynamoDB table. In the past we've been tweaking the provisioned capacity up and down in response to the demands on the table. During spikes in traffic (such as when the Queen died), the table can become grossly under-provisioned for a short period of time. We also see S4L usage fluctuate throughout the day. Both the spikes and the hourly fluctuation mean the "High 5XX error % from mobile-save-for-later (ApiGateway) in PROD" alarm fires more often than we'd like.

It therefore makes sense to set up read/write autoscaling so that the table provisioning moves in response to S4L usage throughout the day. It's been configured to scale out quickly (a 10-second cooldown, so it can bump up capacity every 10 seconds if required) and to scale back in more slowly (a 60-second cooldown) to ensure we maintain a decent level of service. I've set target utilisation to 70%, so there should always be a bit of wiggle room in the consumed:provisioned ratio. I set the minimum capacity values based on graphs in the AWS console, and the maximum capacity values based on the peaks we provisioned for big news events in the past. These figures are educated guesses rather than scientifically derived, but they seem sensible enough.
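As a rough illustration of the configuration described above, a target-tracking policy for writes might look something like the sketch below. It is not the exact content of `cfn.yaml`: the logical ID `SaveForLaterWritesScalingPolicy` and the `PolicyName` are hypothetical, and it assumes the 10-second and 60-second intervals are expressed as the scale-out and scale-in cooldowns.

```yaml
# Sketch only: SaveForLaterWritesScalingPolicy and the PolicyName are hypothetical names.
SaveForLaterWritesScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: save-for-later-writes-target-tracking   # hypothetical
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref SaveForLaterWritesScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      # Target percentage of provisioned write capacity that is consumed
      TargetValue: 70.0
      # Cooldowns are in seconds: scale out quickly, scale back in more slowly
      ScaleOutCooldown: 10
      ScaleInCooldown: 60
      PredefinedMetricSpecification:
        PredefinedMetricType: DynamoDBWriteCapacityUtilization
```

A read policy would presumably follow the same shape, using `DynamoDBReadCapacityUtilization` and its own scalable target.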

I found some examples throughout the Guardian codebase for inspiration:

How to test

I've tested the setup of the minimum and maximum capacity units in the CODE environment by deploying this branch. Note that the values in the screenshot below differ as obviously we don't want to pay for unnecessarily high capacity in CODE. As it worked against the CODE version of the table, I am assuming we'll have no issues in production.

How can we measure success?

We should see fewer "High 5XX error % from mobile-save-for-later (ApiGateway) in PROD" alarms fired in the P&E/Apps/ServerAlerts Google Chat space. We'll also be able to look at the graphs in AWS and check that the provisioned capacity moves in line with traffic.

Have we considered potential risks?

The main risk is that the scale-out cooldown is too long for the provisioning to increase quickly enough in response to a massive spike in traffic. But having seen other teams use similar values, I think the risk is low, and it's very easy to tweak these values if we want.

lindseydew

Looks great, nice one on introducing autoscaling

mobile-save-for-later/conf/cfn.yaml

lindseydew · 2023-11-23T10:58:00Z

+      PolicyType: TargetTrackingScaling
+      ScalingTargetId: !Ref SaveForLaterWritesScalableTarget
+      TargetTrackingScalingPolicyConfiguration:
+        TargetValue: 70.0


Might we want a larger target value for the writes compared to the reads? I may be wrong, but I think this one has a higher usage than reads

Oh, I see this is a percentage rather than capacity units

My understanding is that this percentage is the utilisation threshold: if consumed capacity exceeds this percentage of the provisioned capacity, AWS will start to scale out the read/write capacity. The minimum/maximum capacity values are those set at the top of the file. Whilst the read/write demands are indeed slightly different, the historical peaks seem to be for reads rather than writes, whereas the daily demand is for higher read capacity. I'm satisfied that using the same min/max values is a good enough starting point.

Yeah makes sense
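To make the distinction in this thread concrete (the target value is a percentage, whereas the bounds are in capacity units), here is a hedged sketch of what the scalable target referenced by the policy could look like. The parameter names `MinWriteCapacity` and `MaxWriteCapacity` and the table logical ID `SaveForLaterArticlesTable` are hypothetical, and the service-linked role ARN is an assumption about how the role is supplied.

```yaml
# Sketch only: parameter names and the table logical ID are hypothetical.
SaveForLaterWritesScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: dynamodb
    ScalableDimension: dynamodb:table:WriteCapacityUnits
    # Ref on a DynamoDB table resolves to the table name
    ResourceId: !Sub table/${SaveForLaterArticlesTable}
    # Bounds are in write capacity units; autoscaling never goes outside them
    MinCapacity: !Ref MinWriteCapacity
    MaxCapacity: !Ref MaxWriteCapacity
    # Assumption: the Application Auto Scaling service-linked role for DynamoDB is used
    RoleARN: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/dynamodb.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_DynamoDBTable
```

With a 70% target, the policy tries to keep consumed capacity at roughly 70% of whatever is provisioned, moving the provisioned value within these bounds.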

tkgnm

Thanks for this @mbd0910 - looks good to me!

mbd0910 added 3 commits November 21, 2023 13:44

Define S4L articles table auto scaling policy

35dbec4

Updated Snapshots

46491e6

Also scale read capacity and update snapshots

6fe0b1e

mbd0910 changed the title ~~Auto scaling s4l dynamodb~~ Set up read/write auto scaling for mobile-save-for-later-*-articles Nov 22, 2023

mbd0910 requested review from frankie297, vlbee and lindseydew November 22, 2023 13:25

mbd0910 marked this pull request as ready for review November 22, 2023 13:25

lindseydew approved these changes Nov 23, 2023

mbd0910 added 2 commits November 23, 2023 11:01

Add comment to add clarity around the units for scale in/out cooldown

8e9563b

Second comment to explain target percentage of consumed throughput

cf70a8e

tkgnm approved these changes Nov 23, 2023

mbd0910 merged commit 85ede47 into main Nov 23, 2023
2 checks passed

mbd0910 deleted the auto-scaling-s4l-dynamodb branch November 23, 2023 13:56
