
"eventfd failed" error on SPOD daemonset pods #2561

Open
gsstuart opened this issue Nov 14, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@gsstuart
Contributor

What happened:

About three hours ago, all of our spod daemonset pods started to crash-loop, all with similar error messages:

$ k logs spod-6zrg7
Defaulted container "security-profiles-operator" out of: security-profiles-operator, log-enricher, metrics, non-root-enabler (init)
runtime: eventfd failed with 18446744073709551615
fatal error: runtime: eventfd failed

What you expected to happen:

SPOD pods should run normally.

How to reproduce it (as minimally and precisely as possible):

I think this is perhaps related to a recent build? The pattern I see is that all of our older SPOD pods, which are all running image gcr.io/k8s-staging-sp-operator/security-profiles-operator@sha256:a9a912f30dc62baa229d6db54b9e49dcb87f7b7d4633958f481435180f1d8057, are running fine.

Replicas that have come up within the past three hours have a different image ID, and all of them are erroring.
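
For reference, a quick way to compare what each spod pod is actually running (the name=spod label selector is an assumption, adjust it to whatever labels your spod pods carry):

# Print each pod name and the resolved image digest of its security-profiles-operator container
$ kubectl -n security-profiles-operator get pods -l name=spod \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="security-profiles-operator")].imageID}{"\n"}{end}'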

Anything else we need to know?:

Full log output from one spod:

runtime: eventfd failed with 18446744073709551615
fatal error: runtime: eventfd failed

goroutine 1 gp=0xc0000061c0 m=0 mp=0x51ea4c0 [running, locked to thread]:
runtime.throw({0x31dece7?, 0xc0000061c0?})
runtime/panic.go:1067 +0x48 fp=0xc00009eb08 sp=0xc00009ead8 pc=0x477488
runtime.netpollinit()
runtime/netpoll_epoll.go:31 +0x156 fp=0xc00009eb78 sp=0xc00009eb08 pc=0x439c56
runtime.netpollGenericInit()
runtime/netpoll.go:224 +0x35 fp=0xc00009eb90 sp=0xc00009eb78 pc=0x439155
internal/poll.runtime_pollServerInit()
runtime/netpoll.go:215 +0xf fp=0xc00009eba0 sp=0xc00009eb90 pc=0x47654f
sync.(*Once).doSlow(0xffffffffffffff9c?, 0xc00009ec30?)
sync/once.go:76 +0xb4 fp=0xc00009ec00 sp=0xc00009eba0 pc=0x48d514
sync.(*Once).Do(...)
sync/once.go:67
internal/poll.(*pollDesc).init(0xc000148140, 0xc000148120)
internal/poll/fd_poll_runtime.go:39 +0x3c fp=0xc00009ec20 sp=0xc00009ec00 pc=0x5388bc
internal/poll.(*FD).Init(0xc000148120, {0x31ab7af?, 0x0?}, 0xb0?)
internal/poll/fd_unix.go:66 +0x45 fp=0xc00009ec40 sp=0xc00009ec20 pc=0x5397a5
os.newFile(0x3, {0xc000156000, 0x1b}, 0x1, 0x0)
os/file_unix.go:237 +0x165 fp=0xc00009ec80 sp=0xc00009ec40 pc=0x548ce5
os.openFileNolog({0xc000156000, 0x1b}, 0x0, 0x156000?)
os/file_unix.go:297 +0x192 fp=0xc00009ed18 sp=0xc00009ec80 pc=0x548f32
os.OpenFile({0xc000156000, 0x1b}, 0x0, 0x0)
os/file.go:385 +0x3e fp=0xc00009ed48 sp=0xc00009ed18 pc=0x5468fe
os.Open(...)
os/file.go:365
google.golang.org/protobuf/internal/detrand.binaryHash()
google.golang.org/[email protected]/internal/detrand/rand.go:46 +0x53 fp=0xc00009ee10 sp=0xc00009ed48 pc=0x139f8d3
google.golang.org/protobuf/internal/detrand.init()
google.golang.org/[email protected]/internal/detrand/rand.go:38 +0xf fp=0xc00009ee20 sp=0xc00009ee10 pc=0x139f86f
runtime.doInit1(0x50fe5e0)
runtime/proc.go:7290 +0xe8 fp=0xc00009ef50 sp=0xc00009ee20 pc=0x44f528
runtime.doInit(...)
runtime/proc.go:7257
runtime.main()
runtime/proc.go:254 +0x345 fp=0xc00009efe0 sp=0xc00009ef50 pc=0x440ce5
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00009efe8 sp=0xc00009efe0 pc=0x47ffe1

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc00008cfa8 sp=0xc00008cf88 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.forcegchelper()
runtime/proc.go:337 +0xb3 fp=0xc00008cfe0 sp=0xc00008cfa8 pc=0x440f73
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x47ffe1
created by runtime.init.7 in goroutine 1
runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc00008d780 sp=0xc00008d760 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.bgsweep(0xc0000ba000)
runtime/mgcsweep.go:277 +0x94 fp=0xc00008d7c8 sp=0xc00008d780 pc=0x4288b4
runtime.gcenable.gowrap1()
runtime/mgc.go:203 +0x25 fp=0xc00008d7e0 sp=0xc00008d7c8 pc=0x41cfe5
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008d7e8 sp=0xc00008d7e0 pc=0x47ffe1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc0000ba000?, 0x36c6dc8?, 0x1?, 0x0?, 0xc000007340?)
runtime/proc.go:424 +0xce fp=0xc00008df78 sp=0xc00008df58 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.(*scavengerState).park(0x51e7c20)
runtime/mgcscavenge.go:425 +0x49 fp=0xc00008dfa8 sp=0xc00008df78 pc=0x4262e9
runtime.bgscavenge(0xc0000ba000)
runtime/mgcscavenge.go:653 +0x3c fp=0xc00008dfc8 sp=0xc00008dfa8 pc=0x42685c
runtime.gcenable.gowrap2()
runtime/mgc.go:204 +0x25 fp=0xc00008dfe0 sp=0xc00008dfc8 pc=0x41cf85
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008dfe8 sp=0xc00008dfe0 pc=0x47ffe1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000104700 m=nil [finalizer wait]:
runtime.gopark(0xc00008c648?, 0x412725?, 0xb0?, 0x1?, 0xc0000061c0?)
runtime/proc.go:424 +0xce fp=0xc00008c620 sp=0xc00008c600 pc=0x4775ae
runtime.runfinq()
runtime/mfinal.go:193 +0x107 fp=0xc00008c7e0 sp=0xc00008c620 pc=0x41c067
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008c7e8 sp=0xc00008c7e0 pc=0x47ffe1
created by runtime.createfing in goroutine 1
runtime/mfinal.go:163 +0x3d

Environment:

  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.217-205.860.amzn2.x86_64
  • Others:
gsstuart added the kind/bug label Nov 14, 2024
@ccojocar
Contributor

The staging image hadn't been pushed for a while: https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/post-security-profiles-operator-push-image.

It has started working again now. Maybe you want to try one of the latest images pushed today.
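
If the spod image tag is latest with an Always pull policy (an assumption, check your daemonset spec), restarting the daemonset should be enough to pick up today's push:

# Roll the spod daemonset so the pods re-pull the image
$ kubectl -n security-profiles-operator rollout restart daemonset/spod
$ kubectl -n security-profiles-operator rollout status daemonset/spod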

@gsstuart
Contributor Author

Thanks, will take a look there... the interesting thing is that I'm not specifying any particular image to use, just the 0.8.4 Helm chart, but I think internally it specifies pulling a latest tag for the spod container. Will poke around on it some more...
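
One way to see which image actually ended up deployed is to read it straight out of the spod daemonset (namespace, daemonset, and container names taken from the logs in this thread):

# Show the image reference the spod daemonset specifies for the operator container
$ kubectl -n security-profiles-operator get ds spod \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="security-profiles-operator")].image}{"\n"}'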

@zestysoft

I'm seeing this too -- it looks like pulling latest is still not helping?

Normal   Pulled     7m10s                   kubelet            Successfully pulled image "gcr.io/k8s-staging-sp-operator/security-profiles-operator:latest" in 824ms (824ms including waiting)
Normal   Started    7m9s (x3 over 7m29s)    kubelet            Started container security-profiles-operator
Warning  BackOff    2m20s (x33 over 7m27s)  kubelet            Back-off restarting failed container security-profiles-operator in pod spod-vd8dd_security-profiles-operator(392ea67e-5c26-4fb3-85d7-9bd4be14b9ce)

@saschagrunert
Member

saschagrunert commented Nov 15, 2024

@gsstuart we updated the seccomp profile for the spod, do you mind updating the deployment to match?
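
For context, 18446744073709551615 is uint64(-1): the eventfd call the Go runtime makes in netpollinit (eventfd2 at the syscall level) returned -1, which is what you would expect if the container's seccomp profile does not allow eventfd2. A sketch of how to check which profile the spod container is confined with (names taken from this thread; the kubelet seccomp root below is the usual default and is an assumption, and if the field is empty check the pod-level securityContext too):

# 1) Which seccomp profile does the spod container reference?
$ kubectl -n security-profiles-operator get ds spod \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="security-profiles-operator")].securityContext.seccompProfile}{"\n"}'

# 2) If it is a Localhost profile, check on the node that it allows eventfd2
#    (replace <localhostProfile> with the value from step 1; /var/lib/kubelet/seccomp is the usual root)
$ sudo grep -c '"eventfd2"' /var/lib/kubelet/seccomp/<localhostProfile>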

@zestysoft

I think the larger problem is that we're using a Helm chart that is pinned to a specific version, but its values.yaml uses latest for the image tag by default; overriding this to v0.8.4 fixes the problem.
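
A minimal sketch of that override, assuming the chart exposes the tag as image.tag (check the chart's values.yaml for the exact key, and substitute <chart-ref> with however you reference the 0.8.4 chart):

# Re-run the usual install/upgrade with an explicit tag instead of the default "latest"
$ helm upgrade --install security-profiles-operator <chart-ref> \
    --namespace security-profiles-operator \
    --version 0.8.4 \
    --set image.tag=v0.8.4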
