Alloy memory utilization peaks by several GBs for a few minutes in "node_exporter/collector.(*filesystemCollector).GetStats" #1485

Open
pulchart opened this issue Aug 16, 2024 · 3 comments
Labels: bug (Something isn't working), needs-attention

pulchart commented Aug 16, 2024

What's wrong?

Hello,

I randomly see huge memory utilization spikes (a few extra GB) from the Alloy service in my environment. Sometimes it kills servers with a lower amount of RAM (~4-8 GB).

I see this pattern in container_memory_rss:
[screenshot: container_memory_rss]

According to Pyroscope, the "issue" is in github.com/prometheus/node_exporter/collector.(*filesystemCollector).GetStats:
[screenshot: memory alloc_space profile]

I do not see
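
For anyone who wants to reproduce a profile like the one above, here is a minimal sketch of scraping Alloy's own pprof endpoint into a Pyroscope backend. The backend URL, the service_name label, and the 127.0.0.1:12345 listen address (Alloy's default) are assumptions for illustration only, not values from this issue.

// Hypothetical Pyroscope backend to receive the profiles.
pyroscope.write "backend" {
  endpoint {
    url = "http://pyroscope.example:4040"
  }
}

// Scrape pprof profiles (including memory) from Alloy itself.
pyroscope.scrape "alloy_self" {
  targets = [
    {"__address__" = "127.0.0.1:12345", "service_name" = "alloy"},
  ]
  forward_to = [pyroscope.write.backend.receiver]
}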

Steps to reproduce

Run Alloy as a systemd service with the prometheus.exporter.unix component.

System information

CentOS Stream 9 with upstream Linux kernels 6.9.y and 6.10.y.

Software version

Grafana Alloy 1.3

Configuration

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "filesystem",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "uname",
    "pressure",
    "processes",
    "stat",
    "os",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

Logs

n/a
pulchart added the bug (Something isn't working) label on Aug 16, 2024

pulchart commented Aug 17, 2024

The node_filesystem metrics were not collected during the problematic period. Could the mount point have stalled?
[screenshot: node_filesystem_* metrics gap]

I see a bug in node_exporter: prometheus/node_exporter#3063, "fix filesystem mountTimeout not working".

Could it be related? It looks like the mount_timeout option does not work.
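
For reference, the filesystem block in prometheus.exporter.unix exposes a mount_timeout attribute, so it can be set explicitly while testing this theory. A minimal sketch, assuming the documented default of "5s" and a shortened exclude list; none of these values come from this issue:

// Hypothetical test component: only the filesystem collector, with
// mount_timeout set explicitly so a stalled mount (e.g. a hung NFS
// server) should be skipped instead of blocking GetStats.
prometheus.exporter.unix "filesystem_timeout_test" {
  set_collectors           = ["filesystem"]
  include_exporter_metrics = false
  filesystem {
    fs_types_exclude = "^(autofs|overlay|proc|sysfs|tmpfs)$" // shortened example list
    mount_timeout    = "5s"                                  // documented default; adjust to test
  }
}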

pulchart commented

I was able to find a configuration that helps with (or works around?) the memory utilization peaks.

I moved the filesystem collector out into its own component and scrape its metrics less often (15s -> 60s):

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "pressure",
    "processes",
    "stat",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
}

prometheus.exporter.unix "node_exporter_system_60s" {
  set_collectors = [
    "filesystem",
    "uname",
    "os",
  ]
  include_exporter_metrics = false
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system_15s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

prometheus.scrape "node_exporter_system_60s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_60s.targets
  scrape_interval = "60s"
}


github-actions bot commented Oct 6, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
