Add initial runbook entries for AM's alerts #577

Merged 3 commits on Aug 21, 2023

@douglascamata (Member) commented Aug 21, 2023

Some additional reformatting happened because the automatic TOC generation changed the way it writes markdown lists.

This is partially based on https://runbooks.prometheus-operator.dev/runbooks/alertmanager.

@douglascamata marked this pull request as ready for review August 21, 2023 10:11

### Impact

For users this means that their most recent update to alerts might not be currently in use. Ultimately, this means some of the alerts they have configured may not be firing as expected. Subsequente updates to Alertmanager configuration won't be picked up until the reload succeeds.

Contributor

Subsequente -> subsequent (language typo? :) )
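
As a side note on the Impact text in this thread: whether the last configuration reload succeeded can be read from Alertmanager's own metrics. The sketch below assumes the instance exposes the `alertmanager_config_last_reload_successful` gauge on its `/metrics` endpoint and that the metrics text has already been fetched by some other means; the function name and the sample input are illustrative and not part of this PR.

```python
from typing import Optional

# Gauge assumed to be exposed by Alertmanager: 1 after a successful reload, 0 after a failed one.
METRIC = "alertmanager_config_last_reload_successful"


def reload_succeeded(metrics_text: str) -> Optional[bool]:
    """Return True/False for the last reload, or None if the gauge is absent.

    `metrics_text` is the plain-text body of a /metrics response.
    Label values containing '}' are not handled by this simple parser.
    """
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line.startswith(METRIC):
            continue  # comments (starting with '#') and other metrics are skipped
        rest = line[len(METRIC):]
        if rest.startswith("{"):
            # Drop an optional label block such as {instance="..."}.
            rest = rest.split("}", 1)[-1]
        elif not rest[:1].isspace():
            continue  # a different metric that merely shares the prefix
        fields = rest.split()
        if fields:
            return float(fields[0]) == 1.0
    return None


if __name__ == "__main__":
    # Hypothetical sample, not taken from a real cluster.
    sample = (
        "# TYPE alertmanager_config_last_reload_successful gauge\n"
        "alertmanager_config_last_reload_successful 0\n"
    )
    print(reload_succeeded(sample))
```

A `False` result would correspond to the situation the Impact section describes: the most recent configuration update is not in use until a reload succeeds.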


### Steps

- In the OSD console for the affected cluster, find the Alertmanager Route. Check that it correctly points to the Alertmanager Service. Check that the Service correctly points to **all** the Alertmanager pods. Open the Route's address, go to the "Status" tab, and note the IP addresses of the discovered Alertmanager instances. Check that they match the addresses of **all** the Alertmanager pods; none should be missing or mismatched.

Contributor

is the Route really relevant here? I would have thought the impact here would be due to issues with the internal network and peering?

Member Author

The mention of the Route there is only to find and open the Alertmanager UI, nothing else.

Contributor

OK, I think that makes more sense. There is probably some kubectl command we could use to achieve the same, but this is fine too in that case.

Member Author

FYI I'm making the text clearer that the purpose of finding the route is to open the AM UI.
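
The comparison discussed in this thread (peer addresses shown on the Alertmanager "Status" tab versus the addresses of the Alertmanager pods) could also be scripted. The sketch below is only an illustration: it assumes the JSON returned by Alertmanager's `/api/v2/status` endpoint carries the peers under `cluster.peers[].address` in `host:port` form, and that the pod IPs were collected separately (for example from `kubectl get pods -o wide`). The function names and all sample addresses are made up.

```python
from typing import Dict, Iterable, List, Set


def peer_hosts(status: dict) -> Set[str]:
    """Extract the host part of every peer address in a status document.

    Assumes addresses look like "10.0.0.1:9094"; the port is dropped.
    """
    peers = (status.get("cluster") or {}).get("peers") or []
    return {peer["address"].rsplit(":", 1)[0] for peer in peers}


def compare_peers(status: dict, pod_ips: Iterable[str]) -> Dict[str, List[str]]:
    """Report pods missing from the peer list and peers that match no pod."""
    seen = peer_hosts(status)
    expected = set(pod_ips)
    return {
        "missing": sorted(expected - seen),      # pods no instance has discovered
        "unexpected": sorted(seen - expected),   # peers that do not map to a pod
    }


if __name__ == "__main__":
    # Hypothetical status document and pod IPs, for illustration only.
    status = {
        "cluster": {
            "status": "ready",
            "peers": [
                {"name": "peer-a", "address": "10.128.2.10:9094"},
                {"name": "peer-b", "address": "10.128.2.11:9094"},
            ],
        }
    }
    pods = ["10.128.2.10", "10.128.2.11", "10.128.2.12"]
    print(compare_peers(status, pods))
```

Both lists being empty would match the runbook's expectation that no pod address is missing or mismatched.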


### Summary

One of the Alertmanager instances in the cluster cannot send alerts to integrations.

Contributor

integrations -> receivers

Signed-off-by: Douglas Camata <[email protected]>

@saswatamcode (Member) left a comment

LGTM mod @philipgough's comments.

Really wish we could import this somehow, similar to mixin.

Signed-off-by: Douglas Camata <[email protected]>
@douglascamata merged commit 0a98f7c into main Aug 21, 2023
1 check passed
@douglascamata deleted the alertmanager-alert-runbooks branch August 21, 2023 11:01