mixin: Fix cpu usage graph #3109
base: master
Signed-off-by: Johannes Ziemke <[email protected]>
43b03e3 to ebfbe0f
/ ignoring(cpu) group_left
count without (cpu, mode) (node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"})
)
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
I'm pretty sure we don't want to include `iowait` and `steal` here.
If we want this to be stacked CPU utilization, we probably want this:
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
clamp(
  avg without (mode) (
    1-rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])
  ),
  0,
  1
)
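As a rough illustration of what the suggested expression computes, here is a minimal Python sketch. It assumes the per-CPU `idle` rates have already been evaluated; the function names and the sample rates are hypothetical and not taken from the mixin. The sample includes a rate slightly above 1 to show why the `clamp` to the range 0 to 1 is useful.

```python
def clamp(value, lower, upper):
    """Limit value to the closed interval [lower, upper], like PromQL clamp()."""
    return max(lower, min(upper, value))


def utilization_per_cpu(idle_rate_by_cpu):
    """Per-CPU utilization as 1 - idle rate, clamped to [0, 1].

    idle_rate_by_cpu maps a CPU label to the per-second rate of
    node_cpu_seconds_total{mode="idle"} for that CPU (hypothetical input).
    """
    return {
        cpu: clamp(1.0 - idle_rate, 0.0, 1.0)
        for cpu, idle_rate in idle_rate_by_cpu.items()
    }


# Hypothetical sample: CPU 2 reports an idle rate just above 1.
idle_rates = {"0": 0.25, "1": 1.0, "2": 1.02}
utilization = utilization_per_cpu(idle_rates)
print(utilization)
```

Without the clamp, CPU 2 in this sample would show a small negative utilization.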
I vaguely remember a discussion about `steal`. But I don't remember the details. (But it feels like I have included `steal` for a reason.)
Why would you also toss `iowait`? Isn't the whole idea here to find out if you don't utilize your CPUs for whatever reason, including being stuck waiting for IO?
`iowait` is not CPU time. It's accounted for in CPU metrics, but it's actually phantom time when nothing is done. You can have both 100% `idle` and 100% `iowait`, leading to -100% CPU use in this calculation.
I think this was an attempt to do what the kubernetes-mixin is doing, but it has the regexp inverted.
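To make the arithmetic concrete, here is a small Python sketch of the calculation for a single CPU. The rates are hypothetical and picked to match the case described above, where `idle` and `iowait` are both at 100%; the helper name is made up for illustration.

```python
def busy_fraction(rates_by_mode, non_busy_modes):
    """Return 1 minus the summed per-second rates of the modes treated as not busy."""
    return 1.0 - sum(rates_by_mode.get(mode, 0.0) for mode in non_busy_modes)


# Hypothetical rates for one CPU: idle and iowait both report 100%.
rates = {"idle": 1.0, "iowait": 1.0, "steal": 0.0}

# Mirrors mode=~"idle|iowait|steal" in the query under review.
with_iowait_and_steal = busy_fraction(rates, ("idle", "iowait", "steal"))

# Mirrors mode="idle" only.
idle_only = busy_fraction(rates, ("idle",))

print(with_iowait_and_steal, idle_only)
```

With these inputs the three-mode variant goes negative, while the idle-only variant stays at zero.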
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/cputime.c?id=HEAD#n222 looks more like the wait time is either added to `idle` or to `iowait`. But I'm just shooting in the dark here. Maybe that's not the relevant code, or it gets changed later on its way to node_exporter.
See https://github.com/prometheus/node_exporter/pull/1448/files#r1741178030 for the normalization for the number of cores.
And in case that was misunderstood: The query is meant to go up to 1 for the stacked result. So on a 16-core system, every single number in the query output would only be up to 1/16th indeed.
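The normalization can be sketched in Python as follows, assuming each per-CPU value is simply divided by the number of CPUs. The function name and the fully busy 16-core sample are hypothetical.

```python
def normalize_by_core_count(busy_by_cpu):
    """Divide each per-CPU busy fraction by the number of CPUs.

    After normalization, stacking (summing) all CPUs tops out at 1.
    """
    core_count = len(busy_by_cpu)
    return {cpu: busy / core_count for cpu, busy in busy_by_cpu.items()}


# Hypothetical 16-core machine with every CPU fully busy.
busy = {str(cpu): 1.0 for cpu in range(16)}
shares = normalize_by_core_count(busy)
stacked_total = sum(shares.values())

print(max(shares.values()), stacked_total)
```

Each series contributes at most 1/16, and the stacked graph reaches 1 only when all cores are saturated.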