systemd.go: Added systemd health metric #113

Open
wants to merge 1 commit into main

Conversation

@jpds (Contributor) commented Nov 13, 2023

Fixes: #112
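
For context, a minimal sketch of what a health metric of this kind could look like: a collector that probes the D-Bus connection to systemd and reports 1 or 0 instead of failing silently. The metric name systemd_exporter_health, the package name, and the wiring below are illustrative assumptions, not the actual diff in this PR.

```go
package collector

import (
	"context"

	"github.com/coreos/go-systemd/v22/dbus"
	"github.com/prometheus/client_golang/prometheus"
)

// systemdHealthDesc uses a hypothetical metric name chosen for this sketch.
var systemdHealthDesc = prometheus.NewDesc(
	"systemd_exporter_health",
	"Whether the exporter can reach systemd over D-Bus (1 = reachable, 0 = not).",
	nil, nil,
)

type healthCollector struct{}

func (healthCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- systemdHealthDesc
}

func (healthCollector) Collect(ch chan<- prometheus.Metric) {
	// Probe the D-Bus connection and expose the result as a gauge.
	health := 1.0
	if conn, err := dbus.NewWithContext(context.Background()); err != nil {
		health = 0.0
	} else {
		conn.Close()
	}
	ch <- prometheus.MustNewConstMetric(systemdHealthDesc, prometheus.GaugeValue, health)
}
```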

@robryk commented Nov 15, 2023

Hmm.

I understand that you might prefer to still serve /metrics in that case for some reasons, but I would like to understand those reasons.

The disadvantage I see in not simply refusing to serve is that this approach requires awareness from everyone who writes alerts based on this exporter: besides the standard alert on up being false, they also need to write a special alert just for this exporter. If there were a convention across many exporters on how to report the health of the exporter itself (see prometheus-community/smartctl_exporter#91 (comment) for a similar issue, which I had forgotten about and have not yet opened a separate bug for as they requested), using that convention would avoid this problem. Perhaps we could even start such a convention.

Thoughts?

@jpds (Contributor, Author) commented Nov 17, 2023

> I understand that you might prefer to still serve /metrics in that case for some reasons, but I would like to understand those reasons.

To me, a 5xx error would mean that someone made a serious programming error.

> The disadvantage I see in not simply refusing to serve is that this approach requires awareness from everyone who writes alerts based on this exporter: besides the standard alert on up being false, they also need to write a special alert just for this exporter.

This is just a fact of life when different exporters expose different metrics, and the vast majority use their own name as a prefix: I can see http_requests_total (for Thanos), prometheus_http_requests_total, and caddy_http_requests_total, for example. You have to analyze what each exporter provides and build your alerts and Grafana dashboards from there.

}

ch <- prometheus.MustNewConstMetric(c.scrapeDurationDesc, prometheus.GaugeValue, time.Since(begin).Seconds(), namespace)
Contributor

This doesn't make sense to me, as it's not actually exposing metrics per collector as far as I can tell. namespace is always systemd, so the label would always be collector="systemd".

This means it's no different from the up and scrape_duration_seconds metrics recorded by Prometheus.

Contributor Author

We didn't have collectors up until this point; I will fix that in a bit. A sketch of the per-collector pattern is shown below.
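
For reference, here is a hedged sketch of the per-collector duration/success pattern the review comment points toward, in the style of node_exporter: each sub-collector is labelled by its own name rather than a fixed namespace. The interface, metric names, and helper below are illustrative assumptions, not this PR's code.

```go
package collector

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Collector is a hypothetical per-sub-collector interface in the style of
// node_exporter, where each sub-collector returns its own error.
type Collector interface {
	Update(ch chan<- prometheus.Metric) error
}

var (
	// Metric names below are illustrative.
	scrapeDurationDesc = prometheus.NewDesc(
		"systemd_scrape_collector_duration_seconds",
		"Duration of a sub-collector scrape.",
		[]string{"collector"}, nil,
	)
	scrapeSuccessDesc = prometheus.NewDesc(
		"systemd_scrape_collector_success",
		"Whether a sub-collector succeeded.",
		[]string{"collector"}, nil,
	)
)

// execute runs one sub-collector and records its duration and success,
// labelled with that sub-collector's name.
func execute(name string, c Collector, ch chan<- prometheus.Metric) {
	begin := time.Now()
	err := c.Update(ch)
	success := 1.0
	if err != nil {
		success = 0.0
	}
	ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, time.Since(begin).Seconds(), name)
	ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, name)
}
```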

@SuperQ (Contributor) commented Jan 11, 2024

> To me, a 5xx error would mean that someone made a serious programming error.

The standard for Prometheus is to return a 5xx error for failed scrapes. Some exporters do what you're proposing here, but in general it's not encouraged.
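
For illustration, a minimal sketch of the approach described here, where a failed collection surfaces as an HTTP 5xx on the /metrics endpoint itself rather than as a separate health metric. The registry wiring and listen address below are assumptions, not this repository's actual code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// reg.MustRegister(...) // register the exporter's collectors here

	// HTTPErrorOnError (the default) makes /metrics answer with a 500 when a
	// registered collector reports an error, for example by sending
	// prometheus.NewInvalidMetric(desc, err) from its Collect method.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.HTTPErrorOnError,
	}))

	log.Fatal(http.ListenAndServe(":9558", nil)) // listen address is illustrative
}
```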

@SuperQ (Contributor) left a comment

This needs a rebase following the change to the new logging library.

Successfully merging this pull request may close these issues.

Silent failure on inability to connect to systemd