Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: push prometheus metrics to pushgateway on-demand #1404

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Ekleog
Copy link

@Ekleog Ekleog commented Jan 20, 2025

Fixes #1403

Copy link
Member

@aawsome aawsome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review.. From my side this is almost ready for merging, I just left a view comments.

General question: What do you think about which env variables are used? The "standard" is to use RUSTIC_XXX env variable for a xxx option. We have the pushgateway_url option and we have (only) env options for PUSHGATEWAY_USER and PUSHGATEWAY_PASS - I personally would have the later 2 also in the config and then we should decide about the env names. Or use RUSTIC_PUSHGATEWAY_* (by clap) if present and the PUSHGATEWAY_* as alternatives if they are commonly used variables for the pushgateway?

@Ekleog
Copy link
Author

Ekleog commented Jan 27, 2025

Thank you for your review!

About the authentication environment variables, I actually thought about it. My argument against having CLI flags with user/pass is, it's essentially insecure to use this way, because any process on your machine can then read the CLI flags, and thus bypass auth.

Especially when Pushgateway is likely often deployed on the same host it's keeping metrics for (and firewalled out of the internet), this seems to me like having its credentials in the CLI would basically completely kill any security the credentials would bring.

I just renamed them RUSTIC_* for consistency with clap.

Also, I would not at all be opposed to having these as a parameter in the config file, but TBH I have absolutely no idea how to actually do it without also exposing a clap argument — I haven't dived deep enough in rustic's config handling yet.

WDYT?

@aawsome
Copy link
Member

aawsome commented Feb 6, 2025

Hi @Ekleog - sorry for the late reply...

I overworked this PR a bit:

  • now uses the crate prometheus_push which allows to fine-tune the used reqwest. Especially it allows to build without openssl which may be a blocker for building on some targets.
  • I moved most of the options to GlobalOptions as I think prometheus metrics may be also interesting for other commands in future (like forget, prune, maybe check). This also includes moving the push code as a method of this struct.
  • Regarding the user/password discussion: We already have the --password option which is also not advisable to use. I think we should be consistent here to also have the pushgateway user/pass as CLI arguments.

Can you please test if this still works with my changes?

If you are fine with what I was doing, I think the only open points are to adapt unit tests and docu / config examples accordingly.

@Ekleog
Copy link
Author

Ekleog commented Feb 7, 2025

Hmm… for prometheus-push, are you really sure you need it for one of your supported targets? It seems to be very young, especially compared to the main prometheus crate; and a target with zero native-tls-supported target sounds specific enough to likely not be supported by rustic anyway… I think?

As for your other changes, they definitely make sense to me! I'll review and re-test as soon as you confirm where you want to go wrt. prometheus-push. I personally would not be willing to accept, in a project I maintain, a dependency with a first release 12 months ago, a last release 8 months ago, and that started for some reason at version 0.2.1, all that with less than 10k downloads. But that's your choice, so just let me know your decision and I'll follow! :)

@Ekleog
Copy link
Author

Ekleog commented Feb 7, 2025

Also, no worries at all about the reply delay, and thank you for your very involved answer!

@aawsome
Copy link
Member

aawsome commented Feb 7, 2025

Thanks for the discussion @Ekleog

I changed it now by copy&pasting few lines from the prometheus codebase which allows to use a self-defined reqwest without any additional dependency.

Hmm… for prometheus-push, are you really sure you need it for one of your supported targets? It seems to be very young, especially compared to the main prometheus crate; and a target with zero native-tls-supported target sounds specific enough to likely not be supported by rustic anyway… I think?

Ok, in fact it is more about the need to have something like libssl-dev on the system compiling rustic - and the runtime library then needed to run it. This is all source of potential trouble. And when it come to cross-compiling, things may even go wild..

@aawsome
Copy link
Member

aawsome commented Feb 7, 2025

@Ekleog

One thing regarding the job name: In case we want to monitor, lets say, backup and prune both with their own metric, would a use want to have identical job names here (like rustic-backup/rustic-prune) or would a common name be used (rustic)?
Currently the job name is a global option, if it may differ, it should be rather moved to the command...

And about labels: This is currently done such that users can give labels on the global level and addtional labels on the backup (and even specific to snapshots defined in the config profile) level. I hope this is general enough to suit everybody's need...

@Ekleog
Copy link
Author

Ekleog commented Feb 7, 2025

The changes make sense to me!

For the job name, I think (though I haven't checked) that the push gateway will delete old metrics upon getting new metrics with the same job, even if the old metrics was not the same.

So if people want to run backup then prune with no interval, but keep seeing the metrics until the next run of each, we'll need to have a different job name per command.

This being said it could still make sense to have it be in globaloptions. Thinking more, I'm actually thinking that we should not have it have any default value: forcing the user to configure it would solve all the questions, and it'd avoid users with multiple backup runs erasing their own metrics by default.

Does that make sense?

@Ekleog
Copy link
Author

Ekleog commented Feb 7, 2025

(as for the global/extra labels, TBH I'm thinking labels should be defined as CLI arguments usually because people may have multiple backup/forget jobs that'd want different labels, so I'd say I'm opinionless on this one)

@aawsome
Copy link
Member

aawsome commented Feb 7, 2025

@Ekleog about the CLI-options:
Of course everything can be given as CLI options, however the question is about how to enable it in the config profile such that users can only run rustic backup or rustic prune with everything already defined. So, I moved the job name to the backup options and did some finalizing including the unit tests and docu, see the full.toml or the Readme for an example how a config profile may look like.

I just encountered one (hopefully last) thing: You set the label path to the path, but in labels no / is allowed, so this can produce errors in most cases. Any idea how we should handle/escape the /?

Also, can you please test if everything is working as expected? I didn't setup a prometheus or pushgateway, so I can't really test it.. ;-)

@Ekleog
Copy link
Author

Ekleog commented Feb 8, 2025

I'm not sure I get what you mean by "no / allowed"? Prometheus label values can be any Unicode character

I'll test it out hopefully by the end of weekend, I have a lot of things to do right now unfortunately! I'll make sure to report here as soon as I have tested it 😄

@Ekleog
Copy link
Author

Ekleog commented Feb 8, 2025

Also, thank you very much for all your involvement in this PR! Sorry I've been underwater to answer you over the last few days, I'll do my best to give you a proper test asap ❤️

@aawsome
Copy link
Member

aawsome commented Feb 10, 2025

@Ekleog I changed the labels to use base64 encoding as described in https://github.com/prometheus/pushgateway/blob/master/README.md.

One thing: There are quite some default labels set like paths, uid, gid - and those can't be unset by users. I would actually prefer using a minimal set of labels - especially as users can always define their own labels. Maybe even use no label at all or restrict them to the ones which can be used for grouping: hostname, label (the rustic label) and paths. WDYT?

Also would be nice to have a tests against a running prometheus pushclient before we merge this.

@Ekleog
Copy link
Author

Ekleog commented Feb 10, 2025

That's a great idea, to avoid issues related to eg. quotes in paths or other labels' values!

As for limiting the default labels, I can confirm that hostname+label+paths would be enough for my personal use case! I actually only need an id label, that I would define myself on the backup, so 🤷

Just for the test against a running prometheus pushclient, do you mean a cargo unit test (which seems hard considering it'd require a running pushclient), or a CI job that'd run rustic once and confirm there are expected metrics?

@aawsome
Copy link
Member

aawsome commented Feb 10, 2025

As for limiting the default labels, I can confirm that hostname+label+paths would be enough for my personal use case! I actually only need an id label, that I would define myself on the backup, so 🤷

I restricted the default labels to hostname+label+paths+tags.

Just for the test against a running prometheus pushclient, do you mean a cargo unit test (which seems hard considering it'd require a running pushclient), or a CI job that'd run rustic once and confirm there are expected metrics?

I added a unit test. I think the only thing missing is just one manual test to verify that the format is really what pushclient can work with...

@aawsome
Copy link
Member

aawsome commented Feb 14, 2025

I installed a pushgateway and tested myself - needed a few corrections and added debug and a better testcase.

@Ekleog Only one thing I stumbled over: We are using Gauge for all values. I don't have experience with prometheus, but it seems there are other metric types. I wonder if we really want to use Gauge for all values or if we might differentiate. E.g. the dates are now saved as seconds since 1970, if I am correct - doesn't prometheus support a datetime metric?

Copy link
Author

@Ekleog Ekleog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, I'm impressed by how motivated you are on this PR, thank you for all the work you put in here!

For the metrics, here are the supported Prometheus metric types. In this case, I think gauge would be the right choice for all the metrics — because it's from a batch job, there's no meaningful "reset to 0" semantics for counters :)

Also, I did try switching my production system to the current tip of this PR (a65f2e4).

Everything seems to be working fine for me! 🎉 Though I still need to adjust the exact labels now that everything's going through the pushgateway, which broke my dashboards and alerts — hopefully I'll manage to get something working again soon, but it should not be an issue considering all metrics seem to properly have entered prometheus :)

I'll be very happy to switch back to a released version of rustic as soon as it lands 😄

.iter()
.map(|x| (x.0.as_str(), x.1.as_str()))
.chain(extra_labels.iter().map(|(a, b)| (a.as_str(), b.as_str())))
.collect();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm… I think this order will lead to weird override mechanisms? Like it'd be, from least to most strongly binding:

  1. Global labels
  2. Hardcoded labels (paths, etc.)
  3. Backup-specific labels

It might be worth not adding the labels here, but only taking them from the function call, thus letting the caller deal with putting them with the right priority?

if let Err(err) = publish_metrics(&snap, self.prometheus_job, self.prometheus_labels) {
warn!("error pushing prometheus metrics: {err}");
}
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an error if prometheus was configured but the feature was not enabled at compile-time?

@Ekleog
Copy link
Author

Ekleog commented Feb 16, 2025

(I added some comments in a review to not touch the code you had written, but I'd be happy to fix them myself if you confirm they sound reasonable to you, next time I find some time to work on my personal backup system! :))

@Ekleog
Copy link
Author

Ekleog commented Feb 20, 2025

Ok, I can confirm that a few days after, all my metrics dashboards are adjusted and look the same as before; and backups seem to be working without issues either!

Just let me know if you want me to handle the comments I posted, I don't want to disrupt your workflow by pushing to the branch at an unexpected time 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON output as prometheus metric format?
2 participants