Alloy memory utilization peaks by several GBs for a few minutes in "node_exporter/collector.(*filesystemCollector).GetStats" #1485

Open
pulchart opened this issue Aug 16, 2024 · 3 comments
Labels: bug (Something isn't working), needs-attention

pulchart commented Aug 16, 2024

What's wrong?

Hello,

I randomly see huge memory utilization spikes (a few extra GB) from the Alloy service in my environment. Sometimes it kills servers with a lower amount of RAM (~4-8 GB).

I see this pattern in container_memory_rss:
[screenshot: container_memory_rss]

According to Pyroscope, the "issue" is in github.com/prometheus/node_exporter/collector.(*filesystemCollector).GetStats:
[screenshot: memory alloc_space profile]

I do not see
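
For anyone who wants to reproduce a profile like the one above, here is a minimal sketch of scraping Alloy's own pprof endpoint into a Pyroscope backend. The backend URL, the service_name label, and the 127.0.0.1:12345 listen address (Alloy's default) are assumptions for illustration only, not values from this issue.

// Hypothetical Pyroscope backend to receive the profiles.
pyroscope.write "backend" {
  endpoint {
    url = "http://pyroscope.example:4040"
  }
}

// Scrape pprof profiles (including memory) from Alloy itself.
pyroscope.scrape "alloy_self" {
  targets = [
    {"__address__" = "127.0.0.1:12345", "service_name" = "alloy"},
  ]
  forward_to = [pyroscope.write.backend.receiver]
}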

Steps to reproduce

Run Alloy as a systemd service with the prometheus.exporter.unix component.

System information

CentOS Stream 9 with upstream Linux kernels 6.9.y and 6.10.y.

Software version

Grafana Alloy 1.3

Configuration

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "filesystem",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "uname",
    "pressure",
    "processes",
    "stat",
    "os",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

Logs

n/a
pulchart added the bug (Something isn't working) label on Aug 16, 2024

pulchart commented Aug 17, 2024

The node_filesystem metrics were not collected during the problematic period. Could the mount point have stalled?
[screenshot: node_filesystem_* metrics gap]

I see a bug in node_exporter: prometheus/node_exporter#3063, "fix filesystem mountTimeout not working".

Could it be related? It looks like the mount_timeout option does not work.
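
For reference, the filesystem block in prometheus.exporter.unix exposes a mount_timeout attribute, so it can be set explicitly while testing this theory. A minimal sketch, assuming the documented default of "5s" and a shortened exclude list; none of these values come from this issue:

// Hypothetical test component: only the filesystem collector, with
// mount_timeout set explicitly so a stalled mount (e.g. a hung NFS
// server) should be skipped instead of blocking GetStats.
prometheus.exporter.unix "filesystem_timeout_test" {
  set_collectors           = ["filesystem"]
  include_exporter_metrics = false
  filesystem {
    fs_types_exclude = "^(autofs|overlay|proc|sysfs|tmpfs)$" // shortened example list
    mount_timeout    = "5s"                                  // documented default; adjust to test
  }
}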

pulchart commented

I was able to find a configuration that helps with (or works around?) the memory utilization peaks.

I moved the filesystem collector out into its own component and scrape its metrics less often (15s -> 60s):

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "pressure",
    "processes",
    "stat",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
}

prometheus.exporter.unix "node_exporter_system_60s" {
  set_collectors = [
    "filesystem",
    "uname",
    "os",
  ]
  include_exporter_metrics = false
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system_15s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

prometheus.scrape "node_exporter_system_60s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_60s.targets
  scrape_interval = "60s"
}


github-actions bot commented Oct 6, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
