Monitoring update #66

ryanjjung · 2024-12-09T21:00:47Z

Description of the Change

This PR accomplishes several things, but the biggest of them is that a substantial chunk of high-priority alarms for our environments has been defined, and that the previously defined monitors have been improved upon. We now monitor for the following metrics going awry:

Increased 4xx errors in the CloudFront Distribution, indicating large numbers of attempts to access files that don't exist.
CloudFront Function CPU utilization, indicating that our rewrite function (which, for example, points requests for / to /index.html) is doing more work than it ought to be.
Target groups detecting unhealthy hosts, which tells us of problems in the Fargate service.
Target group errors, showing elevations of 5xx errors in a load balanced application.
Load balancer errors, showing 5xx errors coming from the load balancer itself, not inclusive of application errors.
Increased response times for load balanced applications.
CPU and memory utilization for ECS services, which would suggest we need to take some kind of scaling action (or solve a memory leak or infinite loop or something like that).

I was testing a lot of the documentation stuff while working through the new code, which led me to also take care of Issue #53 at the same time. I discovered a number of issues with parts of the documentation being inaccessible, so I fixed that as well.

Some other minor repairs are documented inline.

Benefits

Better insights into the operating state of our applications. A more streamlined and complete documentation experience. Bugs squashed.

Applicable Issues

#57, #53

ryanjjung · 2024-12-09T21:01:20Z

tb_pulumi/ci.py

@@ -131,6 +131,7 @@ def __init__(
                                'ecr:CompleteLayerUpload',
                                'ecr:DescribeImages',
                                'ecr:InitiateLayerUpload',
+                                'ecr:GetDownloadUrlForLayer',


This is needed for automation to pull images from ECR.

ryanjjung · 2024-12-09T21:02:41Z

tb_pulumi/ec2.py

        # Build a security group that allows ingress on our listener port
        security_group_with_rules = tb_pulumi.network.SecurityGroupWithRules(
            f'{name}-sg',
            project,
-            vpc_id=subnets[0].vpc_id,
+            vpc_id=primary_subnet.vpc_id,


This change comes from feedback from @MelissaAutumn about referencing this by ID constantly instead of having a more representative name for the value.

MelissaAutumn

Glanced over the PR, looks good but I don't know this codebase well enough. However I've added some comments!

MelissaAutumn · 2024-12-10T07:07:57Z

tb_pulumi/cloudwatch.py

+        alb_5xx_opts.update(self.overrides['alb_5xx'] if 'alb_5xx' in self.overrides else {})
+        alb_5xx_enabled = alb_5xx_opts['enabled']
+        del alb_5xx_opts['enabled']
+        alb_5xx = pulumi.Output.all(res_name=resource.name, res_suffix=resource.arn_suffix).apply(


Thanks for the rename, it's far more easier to determine that this means 500 errors than fivexx.

MelissaAutumn · 2024-12-10T07:08:33Z

tb_pulumi/cloudwatch.py


        # Alert if response time is elevated over time
        response_time_opts = {
            'enabled': True,
            'evaluation_periods': 2,
-            'period': 300,
+            'period': 60,


These seems like a tweakable option. Would it make sense to pass this in as a param with a default val of 60?

If not maybe a constant would do well here.

These are per-alarm defaults, but I could probably reorg a little bit. Say, create a global constant providing a default dict which then gets copied and tweaked per alarm. I'll poke around with that today and see if it makes the code read a little cleaner.

That works for me. My concern here was more so having like defaults that we may want to enforce differently without having to dig through each python module to tweak.

I took a look at this. Let's say we create a default dict in the constants.py file that looks something like this:

CLOUDWATCH_METRIC_ALARM_DEFAULTS = { 'enabled': True, 'evaluation_periods': 2, 'period': 60, 'statistic': 'Average', 'threshold': 10 }

The statistic and threshold values will differ frequently, especially threshold. The statistic might be common enough to warrant an overridable constant, but the threshold will vary on almost every alarm. So on the usage side of this, the start of this pattern would go from one line defining the full default dict to multiple lines overriding the default dict:

target_5xx_opts = constants.CLOUDWATCH_METRIC_ALARM_DEFAULTS.copy() target_5xx_opts.update({'statistic': 'Sum', 'threshold': 20}) target_5xx_opts.update(self.overrides['target_5xx'] if 'target_5xx' in self.overrides else {}) ...

And then all the rest. So overall, changing the pattern in this way would add additional complexity.

I did go back over the docs and saw there was module-level documentation for this stuff, but not really any user-friendly explanations of how this all works. So I ended up writing a great deal of docs describing how to use this system, how to find the values it wants you to provide, etc.

I'm also going to add some tagging to these resources that will tell you explicitly what they are. I see a usage pattern where we build a monitoring module at defaults, then go figure out what monitors need tweaking. Those monitors could have tags on them with the necessary data for writing the configs.

Sounds good to me, we use defines.py in Appointment but constants also works!

Also I think you might be able to replace the double update with:

.update({ 'statistic': 'Sum', 'threshold': 20, **self.overrides.get('target_5xx', {}) }

That does look cleaner. I just pushed all the tagging stuff. I'll revisit this now.

That refactor works really well, and makes quite a bit of this code cleaner. Great idea!

MelissaAutumn · 2024-12-10T07:09:02Z

tb_pulumi/cloudwatch.py

+        unhealthy_hosts_opts = {
+            'enabled': True,
+            'evaluation_periods': 2,
+            'period': 60,


likewise tweakable option

fwiw, all of these options are tweakable through the YAML file. The line beneath this about "overrides" is checking the YAML config to see if a config for this particular resource has been provided. If so, it merges the options in. To me, the issue is one of readability with this being a more or less default dict with minimal changes from one alarm to the next. I'll push a proposed fix today.

Ah right that's what I get for skimming the PR at 11pm 😅
Thanks for the info!

ryanjjung · 2024-12-10T18:52:47Z

tb_pulumi/cloudwatch.py

        project: tb_pulumi.ThunderbirdPulumiProject,
        resource: aws.lb.load_balancer.LoadBalancer,
+        monitoring_group: CloudWatchMonitoringGroup,


I only moved this to make it look like all the other function defintions.

ryanjjung · 2024-12-10T18:55:39Z

tb_pulumi/ec2.py

@@ -101,7 +104,7 @@ def __init__(
        )

        # Build the load balancer first, as other resources must be attached to it later
-        nlb = aws.alb.LoadBalancer(
+        nlb = aws.lb.LoadBalancer(


These are actually the same thing under the hood, such that aws.lb is the authoritative library and aws.alb is just a reference to it. This leads to no actual change in the resource.

MelissaAutumn

Looks good to me! Added a discussion point but it's not a blocker.

MelissaAutumn · 2024-12-10T19:00:02Z

docs/conf.py

-html_theme = 'insegel'  # Clean black-on-white theme
-if os.environ.get('TBPULUMI_DARK_MODE', False):
-    html_theme = 'furo'  # Dark theme, easy on the eyes
+html_theme = 'furo'  # Clean, organized theme with dark and light modes


I'm okay with furo, but our other readthedocs pages use the standard boring readthedocs theme. (In tb-accounts case mermaid extension doesn't mesh well with furo without some tweaks.) Should we discuss standardizing this in the next code / infra meeting?

We can. It's easy enough to swap out to that theme. The lack of a dark mode drives me bonkers. My eyes can't handle all those white pixels. Totally up for discussion during an infra meeting.

no worries, I also hate light themes! It's kind of bonkers that the readthedocs theme doesn't have a dark mode via prefers-color-scheme.

ryanjjung added 12 commits December 2, 2024 14:06

Allow users to specify CF function associations

d9cd946

Add GetDownloadUrlForLayer permission to CI for deployments

d1ce7e1

Set convenience variable for primary subnet

a49d37c

aws.alb.LB is the same as aws.lb.LB

604d515

Fix dependency bug

cf932c2

Fix docs

b615e19

Add CloudFrontAlarmGroup

d763793

Add missing docs pages

ba735d8

Remove insegel docs theme, use Furo exclusively

d9f39f9

Add CloudFrontFunctionAlarmGroup

7bbf1ed

Add Target Group monitoring

398cb41

Plenty of docs updates

ba6b8bf

ryanjjung commented Dec 9, 2024

View reviewed changes

Clarify documentation links

090c88f

MelissaAutumn reviewed Dec 10, 2024

View reviewed changes

Add clarity to monitors documentation

884f9bb

ryanjjung force-pushed the issue-57 branch from c260425 to 884f9bb Compare December 10, 2024 17:20

Add alarm tagging

4388898

ryanjjung force-pushed the issue-57 branch from a85b2c1 to 4388898 Compare December 10, 2024 18:24

Refactor metric option override pattern

8f00746

ryanjjung force-pushed the issue-57 branch from 7ec1508 to 8f00746 Compare December 10, 2024 18:48

ryanjjung requested a review from MelissaAutumn December 10, 2024 18:48

ryanjjung commented Dec 10, 2024

View reviewed changes

MelissaAutumn approved these changes Dec 10, 2024

View reviewed changes

ryanjjung merged commit f12b3e8 into main Dec 10, 2024
3 checks passed

ryanjjung deleted the issue-57 branch December 10, 2024 20:20

ryanjjung added this to the v0.0.9 milestone Dec 10, 2024

ryanjjung self-assigned this Dec 10, 2024

ryanjjung added documentation Improvements or additions to documentation enhancement New feature or request M: Pulumi Monitoring labels Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monitoring update #66

Monitoring update #66

ryanjjung commented Dec 9, 2024

ryanjjung Dec 9, 2024

ryanjjung Dec 9, 2024

MelissaAutumn left a comment

MelissaAutumn Dec 10, 2024

MelissaAutumn Dec 10, 2024

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024 •

edited

Loading

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024

ryanjjung Dec 10, 2024

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024

ryanjjung Dec 10, 2024

MelissaAutumn left a comment

MelissaAutumn Dec 10, 2024

ryanjjung Dec 10, 2024

MelissaAutumn Dec 10, 2024

Monitoring update #66

Monitoring update #66

Conversation

ryanjjung commented Dec 9, 2024

Description of the Change

Benefits

Applicable Issues

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MelissaAutumn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryanjjung Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MelissaAutumn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryanjjung Dec 10, 2024 •

edited

Loading