mixin: Fix cpu usage graph #3109

Open: wants to merge 1 commit into base: master

discordianfish
Member

No description provided.

Signed-off-by: Johannes Ziemke <[email protected]>
/ ignoring(cpu) group_left
count without (cpu, mode) (node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"})
)
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
Member

@SuperQ SuperQ Sep 3, 2024

I'm pretty sure we don't want to include iowait and steal here.

If we want this to be stacked CPU utilization, we probably want this:

Suggested change
(1 - sum without (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])))
clamp(
  avg without (mode) (
    1 - rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode="idle", instance="$instance", %(clusterLabel)s="$cluster"}[$__rate_interval])
  ),
  0,
  1
)
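
To make the effect of the clamp concrete, here is a minimal Python sketch of the per-series arithmetic. It is a simplification, not the mixin's code: the helper names are hypothetical, the aggregation step is left out, and the second input assumes an idle rate that overshoots 1 slightly.

```python
def clamp(value, lower, upper):
    """Limit value to the closed range [lower, upper]."""
    return max(lower, min(upper, value))


def busy_fraction(idle_rate):
    """Busy share of one CPU, derived from its idle rate only (hypothetical helper)."""
    return clamp(1.0 - idle_rate, 0.0, 1.0)


# Hypothetical idle rates in CPU-seconds per second, not measured data.
normal = busy_fraction(0.25)
overshoot = busy_fraction(1.02)

print(normal)     # 0.75
print(overshoot)  # 0.0, the negative intermediate value is clamped
```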

Member

I vaguely remember a discussion about steal. But I don't remember the details. (But it feels like I have included steal for a reason.)

Why would you also toss iowait? Isn't the whole idea here to find out if you don't utilize your CPUs for whatever reason, including being stuck waiting for IO?

Member

@SuperQ SuperQ Sep 3, 2024

iowait is not CPU time. It is accounted for in the CPU metrics, but it is actually phantom time during which nothing is done. You can have both 100% idle and 100% iowait, leading to -100% CPU use in this calculation.
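
As a rough illustration of that arithmetic, the following Python sketch plugs the extreme case into the calculation. The rates are hypothetical values chosen to match the claim above, not measurements, and the function name is made up for the example.

```python
# Hypothetical per-mode rates for one CPU, in CPU-seconds per second.
# They mirror the extreme case described above and are not measured data.
rates = {"idle": 1.0, "iowait": 1.0, "steal": 0.0, "user": 0.0, "system": 0.0}


def utilization_excluding(mode_rates, excluded_modes):
    """Return 1 minus the summed rate of the excluded modes."""
    return 1.0 - sum(mode_rates[mode] for mode in excluded_modes)


# Subtracting idle, iowait and steal, as in the line under review.
with_iowait = utilization_excluding(rates, ["idle", "iowait", "steal"])
# Subtracting idle only, as in the suggested change.
idle_only = utilization_excluding(rates, ["idle"])

print(with_iowait)  # -1.0
print(idle_only)    # 0.0
```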

I think this was an attempt to do what the kubernetes-mixin is doing, but it has the regexp inverted.

See: kubernetes-mixin/rules/node.libsonnet

Member

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/cputime.c?id=HEAD#n222 looks more like the wait time is either added to idle or to iowait. But I'm just shooting in the dark here. Maybe that's not the relevant code, or it gets changed later on its way to node_exporter.

Member

beorn7 commented Sep 3, 2024

See https://github.com/prometheus/node_exporter/pull/1448/files#r1741178030 for the normalization for the number of cores.

Member

beorn7 commented Sep 3, 2024

So this is a makeshift setup: my laptop with 16 "11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz" cores, running the Prometheus unit tests, using the original query from the mixin. As you can see, it nicely goes all the way up to 1.

image

Member

beorn7 commented Sep 3, 2024

And in case that was misunderstood: the query is meant to go up to 1 for the stacked result. So on a 16-core system, each individual series in the query output only goes up to 1/16.
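
A small Python sketch of that normalization, assuming the hypothetical case where all 16 cores are fully busy. The division by the core count stands in for the `count without (cpu, mode)` divisor in the query; the variable names are made up for the example.

```python
NUM_CPUS = 16  # matches the 16-core laptop mentioned above

# Hypothetical case: every core is fully busy, so each per-CPU idle rate is 0.
idle_rates = [0.0] * NUM_CPUS

# Each per-CPU busy fraction is divided by the number of CPUs.
per_cpu_series = [(1.0 - idle) / NUM_CPUS for idle in idle_rates]

# Stacking the series in the graph amounts to summing them.
stacked_total = sum(per_cpu_series)

print(max(per_cpu_series))  # 0.0625, i.e. 1/16
print(stacked_total)        # 1.0
```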
