Normalize metrics reported based on cgroup in all APM agents #814

Open
7 tasks
gregkalapos opened this issue Jun 29, 2023 · 0 comments


gregkalapos commented Jun 29, 2023

Description

Problem

Currently all our APM Agents report memory and CPU usage in some way. When the application runs directly on a physical or virtual machine, the reported metrics are typically correct.

However, our metrics story is much less clear when the monitored application runs within a container and we report memory/CPU usage based on cgroups.

Specific issues:

  1. Most agents currently send memory metrics based on cgroups (system.process.cgroup.memory.mem.limit.bytes and system.process.cgroup.memory.mem.usage.bytes), but not all of them send CPU related metrics based on cgroups (Java does, for example: it uses an API which takes container CPU usage into consideration). This causes 2 problems: 1) our charts will be inconsistent, because we may chart memory usage from the container's point of view but CPU usage from the host's point of view, and 2) APM Agents may report different CPU usage compared to other components, e.g. we know that metricbeat and some APM Agents report different CPU usage in Kubernetes. Here is a list of Kubernetes related CPU usage metrics.
  2. The UI team currently does a fairly complex calculation to derive the total memory usage, and the complexity again comes from the memory related metrics based on cgroups. The key issue is in percentCgroupMemoryUsedScript: system.process.cgroup.memory.mem.limit.bytes may not be set, which means a pod can take up all the memory of the host. In this case agents typically send the "magic value" of 9223372036854771712L, which is then "fixed" by the UI. While this works for our UI, it is almost impossible for a user to recreate a correct memory usage graph when an APM Agent sends memory related metrics based on cgroups.
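
To illustrate the kind of special-casing a consumer of these metrics has to do today, here is a minimal Python sketch. It is not the actual percentCgroupMemoryUsedScript; the function and parameter names are hypothetical, and falling back to system.memory.total when the limit is missing or equal to the magic value is an assumption about how such a "fix" would typically work.

```python
from typing import Optional

# "Magic value" sent when no cgroup memory limit is set (taken from this issue).
CGROUP_LIMIT_MAX_VALUE = 9223372036854771712


def percent_cgroup_memory_used(
    usage_bytes: int,
    limit_bytes: Optional[int],
    system_memory_total: int,
) -> float:
    """Return cgroup memory usage as a fraction of the effective limit.

    Hypothetical helper: the limit is only trusted when it is present and
    is not the magic "unlimited" value; otherwise the total system memory
    is assumed to be the effective limit.
    """
    use_cgroup_limit = (
        limit_bytes is not None and limit_bytes != CGROUP_LIMIT_MAX_VALUE
    )
    total = limit_bytes if use_cgroup_limit else system_memory_total
    return usage_bytes / total
```

Anyone building their own memory chart from the raw documents would need to know about the magic value and replicate this logic, which is the usability problem described above.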

Solution

  1. For issue 1 above, each APM Agent must make sure that, if the application runs within a container, the CPU usage reported is from the pod's point of view. So system.cpu.total.norm.pct and system.process.cpu.total.norm.pct should be aligned with the reported cgroup memory metrics. For some agents this may already be implemented, in which case no action is needed.
  2. For issue 2 above, decrease the complexity of the UI code by sending the "real" system.process.cgroup.memory.mem.limit.bytes. This means that if no memory limit is set for the pod, the agent won't send the "magic" 9223372036854771712L value; instead, it'll send the value of system.memory.total in system.process.cgroup.memory.mem.limit.bytes. In other words, the maximum memory that a pod can take is the same as the overall system memory. We also need to update the agent spec to describe this behaviour.
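
The agent-side behaviour proposed in point 2 could be sketched roughly as follows. This is only an illustration in Python, not code from any agent; the function name and parameters are hypothetical, and treating a missing limit the same way as the magic value is an assumption.

```python
from typing import Optional

# "Magic value" that agents currently send when no limit is set (from this issue).
CGROUP_NO_LIMIT = 9223372036854771712


def cgroup_memory_limit_to_report(
    raw_limit_bytes: Optional[int],
    system_memory_total: int,
) -> int:
    """Return the value to send in system.process.cgroup.memory.mem.limit.bytes.

    Hypothetical helper: if the cgroup has no memory limit (missing, or the
    magic "unlimited" value), report the total system memory instead, since
    that is the most the pod can actually use.
    """
    if raw_limit_bytes is None or raw_limit_bytes == CGROUP_NO_LIMIT:
        return system_memory_total
    return raw_limit_bytes
```

With this in place, usage divided by the reported limit gives a meaningful percentage without any special-casing on the consumer side.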

Spec Issue

Agent Issues
