Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring update #66

Merged
merged 16 commits into from
Dec 10, 2024
Merged

Monitoring update #66

merged 16 commits into from
Dec 10, 2024

Conversation

ryanjjung
Copy link
Collaborator

Description of the Change

This PR accomplishes several things, but the biggest of them is that a substantial chunk of high-priority alarms for our environments has been defined, and that the previously defined monitors have been improved upon. We now monitor for the following metrics going awry:

  • Increased 4xx errors in the CloudFront Distribution, indicating large numbers of attempts to access files that don't exist.
  • CloudFront Function CPU utilization, indicating that our rewrite function (which, for example, points requests for / to /index.html) is doing more work than it ought to be.
  • Target groups detecting unhealthy hosts, which tells us of problems in the Fargate service.
  • Target group errors, showing elevations of 5xx errors in a load balanced application.
  • Load balancer errors, showing 5xx errors coming from the load balancer itself, not inclusive of application errors.
  • Increased response times for load balanced applications.
  • CPU and memory utilization for ECS services, which would suggest we need to take some kind of scaling action (or solve a memory leak or infinite loop or something like that).

I was testing a lot of the documentation stuff while working through the new code, which led me to also take care of Issue #53 at the same time. I discovered a number of issues with parts of the documentation being inaccessible, so I fixed that as well.

Some other minor repairs are documented inline.

Benefits

Better insights into the operating state of our applications. A more streamlined and complete documentation experience. Bugs squashed.

Applicable Issues

#57, #53

@@ -131,6 +131,7 @@ def __init__(
'ecr:CompleteLayerUpload',
'ecr:DescribeImages',
'ecr:InitiateLayerUpload',
'ecr:GetDownloadUrlForLayer',
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for automation to pull images from ECR.

# Build a security group that allows ingress on our listener port
security_group_with_rules = tb_pulumi.network.SecurityGroupWithRules(
f'{name}-sg',
project,
vpc_id=subnets[0].vpc_id,
vpc_id=primary_subnet.vpc_id,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change comes from feedback from @MelissaAutumn about referencing this by ID constantly instead of having a more representative name for the value.

Copy link
Member

@MelissaAutumn MelissaAutumn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glanced over the PR, looks good but I don't know this codebase well enough. However I've added some comments!

alb_5xx_opts.update(self.overrides['alb_5xx'] if 'alb_5xx' in self.overrides else {})
alb_5xx_enabled = alb_5xx_opts['enabled']
del alb_5xx_opts['enabled']
alb_5xx = pulumi.Output.all(res_name=resource.name, res_suffix=resource.arn_suffix).apply(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rename, it's far more easier to determine that this means 500 errors than fivexx.


# Alert if response time is elevated over time
response_time_opts = {
'enabled': True,
'evaluation_periods': 2,
'period': 300,
'period': 60,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seems like a tweakable option. Would it make sense to pass this in as a param with a default val of 60?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not maybe a constant would do well here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are per-alarm defaults, but I could probably reorg a little bit. Say, create a global constant providing a default dict which then gets copied and tweaked per alarm. I'll poke around with that today and see if it makes the code read a little cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me. My concern here was more so having like defaults that we may want to enforce differently without having to dig through each python module to tweak.

Copy link
Collaborator Author

@ryanjjung ryanjjung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this. Let's say we create a default dict in the constants.py file that looks something like this:

CLOUDWATCH_METRIC_ALARM_DEFAULTS = {
    'enabled': True,
    'evaluation_periods': 2,
    'period': 60,
    'statistic': 'Average',
    'threshold': 10
}

The statistic and threshold values will differ frequently, especially threshold. The statistic might be common enough to warrant an overridable constant, but the threshold will vary on almost every alarm. So on the usage side of this, the start of this pattern would go from one line defining the full default dict to multiple lines overriding the default dict:

target_5xx_opts = constants.CLOUDWATCH_METRIC_ALARM_DEFAULTS.copy()
target_5xx_opts.update({'statistic': 'Sum', 'threshold': 20})
target_5xx_opts.update(self.overrides['target_5xx'] if 'target_5xx' in self.overrides else {})
 ...

And then all the rest. So overall, changing the pattern in this way would add additional complexity.

I did go back over the docs and saw there was module-level documentation for this stuff, but not really any user-friendly explanations of how this all works. So I ended up writing a great deal of docs describing how to use this system, how to find the values it wants you to provide, etc.

I'm also going to add some tagging to these resources that will tell you explicitly what they are. I see a usage pattern where we build a monitoring module at defaults, then go figure out what monitors need tweaking. Those monitors could have tags on them with the necessary data for writing the configs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, we use defines.py in Appointment but constants also works!

Also I think you might be able to replace the double update with:

.update({
'statistic': 'Sum',
'threshold': 20,
**self.overrides.get('target_5xx', {})
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does look cleaner. I just pushed all the tagging stuff. I'll revisit this now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That refactor works really well, and makes quite a bit of this code cleaner. Great idea!

unhealthy_hosts_opts = {
'enabled': True,
'evaluation_periods': 2,
'period': 60,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise tweakable option

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, all of these options are tweakable through the YAML file. The line beneath this about "overrides" is checking the YAML config to see if a config for this particular resource has been provided. If so, it merges the options in. To me, the issue is one of readability with this being a more or less default dict with minimal changes from one alarm to the next. I'll push a proposed fix today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right that's what I get for skimming the PR at 11pm 😅
Thanks for the info!

project: tb_pulumi.ThunderbirdPulumiProject,
resource: aws.lb.load_balancer.LoadBalancer,
monitoring_group: CloudWatchMonitoringGroup,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only moved this to make it look like all the other function defintions.

@@ -101,7 +104,7 @@ def __init__(
)

# Build the load balancer first, as other resources must be attached to it later
nlb = aws.alb.LoadBalancer(
nlb = aws.lb.LoadBalancer(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are actually the same thing under the hood, such that aws.lb is the authoritative library and aws.alb is just a reference to it. This leads to no actual change in the resource.

Copy link
Member

@MelissaAutumn MelissaAutumn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Added a discussion point but it's not a blocker.

html_theme = 'insegel' # Clean black-on-white theme
if os.environ.get('TBPULUMI_DARK_MODE', False):
html_theme = 'furo' # Dark theme, easy on the eyes
html_theme = 'furo' # Clean, organized theme with dark and light modes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with furo, but our other readthedocs pages use the standard boring readthedocs theme. (In tb-accounts case mermaid extension doesn't mesh well with furo without some tweaks.) Should we discuss standardizing this in the next code / infra meeting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. It's easy enough to swap out to that theme. The lack of a dark mode drives me bonkers. My eyes can't handle all those white pixels. Totally up for discussion during an infra meeting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, I also hate light themes! It's kind of bonkers that the readthedocs theme doesn't have a dark mode via prefers-color-scheme.

@ryanjjung ryanjjung merged commit f12b3e8 into main Dec 10, 2024
3 checks passed
@ryanjjung ryanjjung deleted the issue-57 branch December 10, 2024 20:20
@ryanjjung ryanjjung added this to the v0.0.9 milestone Dec 10, 2024
@ryanjjung ryanjjung self-assigned this Dec 10, 2024
@ryanjjung ryanjjung added documentation Improvements or additions to documentation enhancement New feature or request M: Pulumi Monitoring labels Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request M: Pulumi Monitoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants