-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: push prometheus metrics to pushgateway on-demand #1404
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review.. From my side this is almost ready for merging, I just left a view comments.
General question: What do you think about which env variables are used? The "standard" is to use RUSTIC_XXX
env variable for a xxx
option. We have the pushgateway_url
option and we have (only) env options for PUSHGATEWAY_USER
and PUSHGATEWAY_PASS
- I personally would have the later 2 also in the config and then we should decide about the env names. Or use RUSTIC_PUSHGATEWAY_*
(by clap) if present and the PUSHGATEWAY_*
as alternatives if they are commonly used variables for the pushgateway?
Thank you for your review! About the authentication environment variables, I actually thought about it. My argument against having CLI flags with user/pass is, it's essentially insecure to use this way, because any process on your machine can then read the CLI flags, and thus bypass auth. Especially when Pushgateway is likely often deployed on the same host it's keeping metrics for (and firewalled out of the internet), this seems to me like having its credentials in the CLI would basically completely kill any security the credentials would bring. I just renamed them Also, I would not at all be opposed to having these as a parameter in the config file, but TBH I have absolutely no idea how to actually do it without also exposing a clap argument — I haven't dived deep enough in rustic's config handling yet. WDYT? |
Hi @Ekleog - sorry for the late reply... I overworked this PR a bit:
Can you please test if this still works with my changes? If you are fine with what I was doing, I think the only open points are to adapt unit tests and docu / config examples accordingly. |
Hmm… for As for your other changes, they definitely make sense to me! I'll review and re-test as soon as you confirm where you want to go wrt. prometheus-push. I personally would not be willing to accept, in a project I maintain, a dependency with a first release 12 months ago, a last release 8 months ago, and that started for some reason at version 0.2.1, all that with less than 10k downloads. But that's your choice, so just let me know your decision and I'll follow! :) |
Also, no worries at all about the reply delay, and thank you for your very involved answer! |
Thanks for the discussion @Ekleog I changed it now by copy&pasting few lines from the prometheus codebase which allows to use a self-defined reqwest without any additional dependency.
Ok, in fact it is more about the need to have something like libssl-dev on the system compiling rustic - and the runtime library then needed to run it. This is all source of potential trouble. And when it come to cross-compiling, things may even go wild.. |
One thing regarding the job name: In case we want to monitor, lets say, And about labels: This is currently done such that users can give labels on the global level and addtional labels on the backup (and even specific to snapshots defined in the config profile) level. I hope this is general enough to suit everybody's need... |
The changes make sense to me! For the job name, I think (though I haven't checked) that the push gateway will delete old metrics upon getting new metrics with the same job, even if the old metrics was not the same. So if people want to run backup then prune with no interval, but keep seeing the metrics until the next run of each, we'll need to have a different job name per command. This being said it could still make sense to have it be in globaloptions. Thinking more, I'm actually thinking that we should not have it have any default value: forcing the user to configure it would solve all the questions, and it'd avoid users with multiple backup runs erasing their own metrics by default. Does that make sense? |
(as for the global/extra labels, TBH I'm thinking labels should be defined as CLI arguments usually because people may have multiple backup/forget jobs that'd want different labels, so I'd say I'm opinionless on this one) |
@Ekleog about the CLI-options: I just encountered one (hopefully last) thing: You set the label Also, can you please test if everything is working as expected? I didn't setup a prometheus or pushgateway, so I can't really test it.. ;-) |
I'm not sure I get what you mean by "no / allowed"? Prometheus label values can be any Unicode character I'll test it out hopefully by the end of weekend, I have a lot of things to do right now unfortunately! I'll make sure to report here as soon as I have tested it 😄 |
Also, thank you very much for all your involvement in this PR! Sorry I've been underwater to answer you over the last few days, I'll do my best to give you a proper test asap ❤️ |
@Ekleog I changed the labels to use base64 encoding as described in https://github.com/prometheus/pushgateway/blob/master/README.md. One thing: There are quite some default labels set like paths, uid, gid - and those can't be unset by users. I would actually prefer using a minimal set of labels - especially as users can always define their own labels. Maybe even use no label at all or restrict them to the ones which can be used for grouping: Also would be nice to have a tests against a running prometheus pushclient before we merge this. |
That's a great idea, to avoid issues related to eg. quotes in paths or other labels' values! As for limiting the default labels, I can confirm that hostname+label+paths would be enough for my personal use case! I actually only need an Just for the test against a running prometheus pushclient, do you mean a cargo unit test (which seems hard considering it'd require a running pushclient), or a CI job that'd run rustic once and confirm there are expected metrics? |
I restricted the default labels to hostname+label+paths+tags.
I added a unit test. I think the only thing missing is just one manual test to verify that the format is really what pushclient can work with... |
I installed a pushgateway and tested myself - needed a few corrections and added debug and a better testcase. @Ekleog Only one thing I stumbled over: We are using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, I'm impressed by how motivated you are on this PR, thank you for all the work you put in here!
For the metrics, here are the supported Prometheus metric types. In this case, I think gauge would be the right choice for all the metrics — because it's from a batch job, there's no meaningful "reset to 0" semantics for counters :)
Also, I did try switching my production system to the current tip of this PR (a65f2e4).
Everything seems to be working fine for me! 🎉 Though I still need to adjust the exact labels now that everything's going through the pushgateway, which broke my dashboards and alerts — hopefully I'll manage to get something working again soon, but it should not be an issue considering all metrics seem to properly have entered prometheus :)
I'll be very happy to switch back to a released version of rustic as soon as it lands 😄
.iter() | ||
.map(|x| (x.0.as_str(), x.1.as_str())) | ||
.chain(extra_labels.iter().map(|(a, b)| (a.as_str(), b.as_str()))) | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm… I think this order will lead to weird override mechanisms? Like it'd be, from least to most strongly binding:
- Global labels
- Hardcoded labels (paths, etc.)
- Backup-specific labels
It might be worth not adding the labels here, but only taking them from the function call, thus letting the caller deal with putting them with the right priority?
if let Err(err) = publish_metrics(&snap, self.prometheus_job, self.prometheus_labels) { | ||
warn!("error pushing prometheus metrics: {err}"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be an error if prometheus was configured but the feature was not enabled at compile-time?
(I added some comments in a review to not touch the code you had written, but I'd be happy to fix them myself if you confirm they sound reasonable to you, next time I find some time to work on my personal backup system! :)) |
Ok, I can confirm that a few days after, all my metrics dashboards are adjusted and look the same as before; and backups seem to be working without issues either! Just let me know if you want me to handle the comments I posted, I don't want to disrupt your workflow by pushing to the branch at an unexpected time 😄 |
Fixes #1403