Set up read/write auto scaling for mobile-save-for-later-*-articles #96

Merged: mbd0910 merged 5 commits into main from auto-scaling-s4l-dynamodb on Nov 23, 2023

Conversation


@mbd0910 mbd0910 commented Nov 22, 2023

What does this change?

This PR adds read and write autoscaling to the saved articles DynamoDB table. In the past we've been tweaking the provisioning up and down in response to the demands on the table. During spikes in traffic (such as when the Queen died), the table can become grossly under-provisioned for a short period of time. We also see Save for Later (S4L) usage fluctuate throughout the day. Both the spikes and the hourly fluctuation mean the "High 5XX error % from mobile-save-for-later (ApiGateway) in PROD" alarm fires more often than we'd like.

It therefore makes sense to set up read/write autoscaling so that the table provisioning moves in response to S4L usage throughout the day. It's been configured to scale out quickly (it can bump up capacity every 10 seconds if required) and to scale back in more slowly (every 60 seconds), to ensure we maintain a decent level of service. I've set target utilisation to 70%, so there should always be a bit of wiggle room in the consumed:provisioned ratio. I set the minimum capacity values based on the graphs in the AWS console, and the maximum capacity values based on the peaks we provisioned for big news events in the past. These figures are educated guesses rather than scientifically derived, but they seem sensible enough.
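For readers unfamiliar with Application Auto Scaling, the write side of a setup like this might look roughly like the sketch below. It is illustrative only, not the merged template: the `Stage` parameter, the policy's logical name, and the capacity numbers are placeholders, and mapping the 10 and 60 second figures onto `ScaleOutCooldown` and `ScaleInCooldown` is an assumption based on the description above.

```yaml
# Illustrative sketch only. Capacity numbers, the Stage parameter and the
# policy's logical name are placeholders, not the values merged in this PR.
Parameters:
  Stage:
    Type: String
    AllowedValues: [CODE, PROD]

Resources:
  SaveForLaterWritesScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: dynamodb
      ScalableDimension: dynamodb:table:WriteCapacityUnits
      ResourceId: !Sub table/mobile-save-for-later-${Stage}-articles
      MinCapacity: 5     # placeholder minimum
      MaxCapacity: 100   # placeholder maximum
      # RoleARN omitted, assuming the service-linked role is used.

  SaveForLaterWritesScalingPolicy:   # hypothetical logical name
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: SaveForLaterWritesScalingPolicy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref SaveForLaterWritesScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0        # target utilisation, as a percentage
        ScaleOutCooldown: 10     # seconds between scale-out activities
        ScaleInCooldown: 60      # seconds between scale-in activities
        PredefinedMetricSpecification:
          PredefinedMetricType: DynamoDBWriteCapacityUtilization
```

The read side would mirror this with `dynamodb:table:ReadCapacityUnits` as the scalable dimension and `DynamoDBReadCapacityUtilization` as the predefined metric.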

I found some examples throughout the Guardian codebase for inspiration:

How to test

I've tested the setup of the minimum and maximum capacity units in the CODE environment by deploying this branch. Note that the values in the screenshot below differ from the PROD values, as we don't want to pay for unnecessarily high capacity in CODE. As it worked against the CODE version of the table, I am assuming we'll have no issues in production.

image

How can we measure success?

We should see fewer "High 5XX error % from mobile-save-for-later (ApiGateway) in PROD" alarms fired in the P&E/Apps/ServerAlerts Google Chat space. We'll also be able to look at the graphs in AWS and check that the provisioned capacity moves in line with traffic.

Have we considered potential risks?

The main risk is that the scale-out interval is too long for the provisioning to increase quickly enough in response to a massive spike in traffic. Having seen other teams use similar values, I think the risk is low, and it's very easy to tweak these values if we want.

@mbd0910 mbd0910 changed the title Auto scaling s4l dynamodb Set up read/write auto scaling for mobile-save-for-later-*-articles Nov 22, 2023
@mbd0910 mbd0910 marked this pull request as ready for review November 22, 2023 13:25

@lindseydew lindseydew left a comment

Looks great, nice one on introducing autoscaling

mobile-save-for-later/conf/cfn.yaml
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref SaveForLaterWritesScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0

Might we want a larger target value for the writes compared to the reads? I may be wrong, but I think this one has a higher usage than reads.

Oh, I see this is a percentage rather than capacity units

Contributor Author

My understanding is that this percentage represents the threshold at which AWS will start to scale out the read/write capacity if this percentage utilisation is breached. The minimum/maximum capacity values are those set at the top of the file. Whilst the read/write demands are indeed slightly different, the historical peaks seem to be for reads rather than writes, whereas the daily demand is for higher read capacity. I'm satisfied that using the same min/max values is a good enough starting point.

Yeah makes sense

@tkgnm tkgnm left a comment

Thanks for this @mbd0910 - looks good to me!

@mbd0910 mbd0910 merged commit 85ede47 into main Nov 23, 2023
2 checks passed
@mbd0910 mbd0910 deleted the auto-scaling-s4l-dynamodb branch November 23, 2023 13:56