-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monitoring update #66
Conversation
@@ -131,6 +131,7 @@ def __init__( | |||
'ecr:CompleteLayerUpload', | |||
'ecr:DescribeImages', | |||
'ecr:InitiateLayerUpload', | |||
'ecr:GetDownloadUrlForLayer', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for automation to pull images from ECR.
# Build a security group that allows ingress on our listener port | ||
security_group_with_rules = tb_pulumi.network.SecurityGroupWithRules( | ||
f'{name}-sg', | ||
project, | ||
vpc_id=subnets[0].vpc_id, | ||
vpc_id=primary_subnet.vpc_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change comes from feedback from @MelissaAutumn about referencing this by ID constantly instead of having a more representative name for the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glanced over the PR, looks good but I don't know this codebase well enough. However I've added some comments!
alb_5xx_opts.update(self.overrides['alb_5xx'] if 'alb_5xx' in self.overrides else {}) | ||
alb_5xx_enabled = alb_5xx_opts['enabled'] | ||
del alb_5xx_opts['enabled'] | ||
alb_5xx = pulumi.Output.all(res_name=resource.name, res_suffix=resource.arn_suffix).apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the rename, it's far more easier to determine that this means 500 errors than fivexx.
tb_pulumi/cloudwatch.py
Outdated
|
||
# Alert if response time is elevated over time | ||
response_time_opts = { | ||
'enabled': True, | ||
'evaluation_periods': 2, | ||
'period': 300, | ||
'period': 60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seems like a tweakable option. Would it make sense to pass this in as a param with a default val of 60?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not maybe a constant would do well here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are per-alarm defaults, but I could probably reorg a little bit. Say, create a global constant providing a default dict which then gets copied and tweaked per alarm. I'll poke around with that today and see if it makes the code read a little cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me. My concern here was more so having like defaults that we may want to enforce differently without having to dig through each python module to tweak.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at this. Let's say we create a default dict in the constants.py
file that looks something like this:
CLOUDWATCH_METRIC_ALARM_DEFAULTS = {
'enabled': True,
'evaluation_periods': 2,
'period': 60,
'statistic': 'Average',
'threshold': 10
}
The statistic
and threshold
values will differ frequently, especially threshold
. The statistic
might be common enough to warrant an overridable constant, but the threshold
will vary on almost every alarm. So on the usage side of this, the start of this pattern would go from one line defining the full default dict to multiple lines overriding the default dict:
target_5xx_opts = constants.CLOUDWATCH_METRIC_ALARM_DEFAULTS.copy()
target_5xx_opts.update({'statistic': 'Sum', 'threshold': 20})
target_5xx_opts.update(self.overrides['target_5xx'] if 'target_5xx' in self.overrides else {})
...
And then all the rest. So overall, changing the pattern in this way would add additional complexity.
I did go back over the docs and saw there was module-level documentation for this stuff, but not really any user-friendly explanations of how this all works. So I ended up writing a great deal of docs describing how to use this system, how to find the values it wants you to provide, etc.
I'm also going to add some tagging to these resources that will tell you explicitly what they are. I see a usage pattern where we build a monitoring module at defaults, then go figure out what monitors need tweaking. Those monitors could have tags on them with the necessary data for writing the configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, we use defines.py in Appointment but constants also works!
Also I think you might be able to replace the double update with:
.update({
'statistic': 'Sum',
'threshold': 20,
**self.overrides.get('target_5xx', {})
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does look cleaner. I just pushed all the tagging stuff. I'll revisit this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That refactor works really well, and makes quite a bit of this code cleaner. Great idea!
tb_pulumi/cloudwatch.py
Outdated
unhealthy_hosts_opts = { | ||
'enabled': True, | ||
'evaluation_periods': 2, | ||
'period': 60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise tweakable option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, all of these options are tweakable through the YAML file. The line beneath this about "overrides" is checking the YAML config to see if a config for this particular resource has been provided. If so, it merges the options in. To me, the issue is one of readability with this being a more or less default dict with minimal changes from one alarm to the next. I'll push a proposed fix today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right that's what I get for skimming the PR at 11pm 😅
Thanks for the info!
project: tb_pulumi.ThunderbirdPulumiProject, | ||
resource: aws.lb.load_balancer.LoadBalancer, | ||
monitoring_group: CloudWatchMonitoringGroup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only moved this to make it look like all the other function defintions.
@@ -101,7 +104,7 @@ def __init__( | |||
) | |||
|
|||
# Build the load balancer first, as other resources must be attached to it later | |||
nlb = aws.alb.LoadBalancer( | |||
nlb = aws.lb.LoadBalancer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are actually the same thing under the hood, such that aws.lb
is the authoritative library and aws.alb
is just a reference to it. This leads to no actual change in the resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Added a discussion point but it's not a blocker.
html_theme = 'insegel' # Clean black-on-white theme | ||
if os.environ.get('TBPULUMI_DARK_MODE', False): | ||
html_theme = 'furo' # Dark theme, easy on the eyes | ||
html_theme = 'furo' # Clean, organized theme with dark and light modes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with furo, but our other readthedocs pages use the standard boring readthedocs theme. (In tb-accounts case mermaid extension doesn't mesh well with furo without some tweaks.) Should we discuss standardizing this in the next code / infra meeting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. It's easy enough to swap out to that theme. The lack of a dark mode drives me bonkers. My eyes can't handle all those white pixels. Totally up for discussion during an infra meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worries, I also hate light themes! It's kind of bonkers that the readthedocs theme doesn't have a dark mode via prefers-color-scheme.
Description of the Change
This PR accomplishes several things, but the biggest of them is that a substantial chunk of high-priority alarms for our environments has been defined, and that the previously defined monitors have been improved upon. We now monitor for the following metrics going awry:
/
to/index.html
) is doing more work than it ought to be.I was testing a lot of the documentation stuff while working through the new code, which led me to also take care of Issue #53 at the same time. I discovered a number of issues with parts of the documentation being inaccessible, so I fixed that as well.
Some other minor repairs are documented inline.
Benefits
Better insights into the operating state of our applications. A more streamlined and complete documentation experience. Bugs squashed.
Applicable Issues
#57, #53