
"eventfd failed" error on SPOD daemonset pods #2561

Open
gsstuart opened this issue Nov 14, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@gsstuart
Contributor

What happened:

About three hours ago, all of our spod daemonset pods started to crash-loop, all with similar error messages:

$ k logs spod-6zrg7
Defaulted container "security-profiles-operator" out of: security-profiles-operator, log-enricher, metrics, non-root-enabler (init)
runtime: eventfd failed with 18446744073709551615
fatal error: runtime: eventfd failed

What you expected to happen:

SPOD pods should run normally.

How to reproduce it (as minimally and precisely as possible):

I think this is perhaps related to a recent build? The pattern I see is that all of our older SPOD pods, which are all running image gcr.io/k8s-staging-sp-operator/security-profiles-operator@sha256:a9a912f30dc62baa229d6db54b9e49dcb87f7b7d4633958f481435180f1d8057, are running fine.

Replicas that have come up within the past three hours have a different image ID, and all of them are erroring.
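
For reference, a quick way to compare what each spod pod is actually running (the name=spod label selector is an assumption, adjust it to whatever labels your spod pods carry):

# Print each pod name and the resolved image digest of its security-profiles-operator container
$ kubectl -n security-profiles-operator get pods -l name=spod \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="security-profiles-operator")].imageID}{"\n"}{end}'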

Anything else we need to know?:

Full log output from one spod:

runtime: eventfd failed with 18446744073709551615
fatal error: runtime: eventfd failed

goroutine 1 gp=0xc0000061c0 m=0 mp=0x51ea4c0 [running, locked to thread]:
runtime.throw({0x31dece7?, 0xc0000061c0?})
runtime/panic.go:1067 +0x48 fp=0xc00009eb08 sp=0xc00009ead8 pc=0x477488
runtime.netpollinit()
runtime/netpoll_epoll.go:31 +0x156 fp=0xc00009eb78 sp=0xc00009eb08 pc=0x439c56
runtime.netpollGenericInit()
runtime/netpoll.go:224 +0x35 fp=0xc00009eb90 sp=0xc00009eb78 pc=0x439155
internal/poll.runtime_pollServerInit()
runtime/netpoll.go:215 +0xf fp=0xc00009eba0 sp=0xc00009eb90 pc=0x47654f
sync.(*Once).doSlow(0xffffffffffffff9c?, 0xc00009ec30?)
sync/once.go:76 +0xb4 fp=0xc00009ec00 sp=0xc00009eba0 pc=0x48d514
sync.(*Once).Do(...)
sync/once.go:67
internal/poll.(*pollDesc).init(0xc000148140, 0xc000148120)
internal/poll/fd_poll_runtime.go:39 +0x3c fp=0xc00009ec20 sp=0xc00009ec00 pc=0x5388bc
internal/poll.(*FD).Init(0xc000148120, {0x31ab7af?, 0x0?}, 0xb0?)
internal/poll/fd_unix.go:66 +0x45 fp=0xc00009ec40 sp=0xc00009ec20 pc=0x5397a5
os.newFile(0x3, {0xc000156000, 0x1b}, 0x1, 0x0)
os/file_unix.go:237 +0x165 fp=0xc00009ec80 sp=0xc00009ec40 pc=0x548ce5
os.openFileNolog({0xc000156000, 0x1b}, 0x0, 0x156000?)
os/file_unix.go:297 +0x192 fp=0xc00009ed18 sp=0xc00009ec80 pc=0x548f32
os.OpenFile({0xc000156000, 0x1b}, 0x0, 0x0)
os/file.go:385 +0x3e fp=0xc00009ed48 sp=0xc00009ed18 pc=0x5468fe
os.Open(...)
os/file.go:365
google.golang.org/protobuf/internal/detrand.binaryHash()
google.golang.org/[email protected]/internal/detrand/rand.go:46 +0x53 fp=0xc00009ee10 sp=0xc00009ed48 pc=0x139f8d3
google.golang.org/protobuf/internal/detrand.init()
google.golang.org/[email protected]/internal/detrand/rand.go:38 +0xf fp=0xc00009ee20 sp=0xc00009ee10 pc=0x139f86f
runtime.doInit1(0x50fe5e0)
runtime/proc.go:7290 +0xe8 fp=0xc00009ef50 sp=0xc00009ee20 pc=0x44f528
runtime.doInit(...)
runtime/proc.go:7257
runtime.main()
runtime/proc.go:254 +0x345 fp=0xc00009efe0 sp=0xc00009ef50 pc=0x440ce5
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00009efe8 sp=0xc00009efe0 pc=0x47ffe1

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc00008cfa8 sp=0xc00008cf88 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.forcegchelper()
runtime/proc.go:337 +0xb3 fp=0xc00008cfe0 sp=0xc00008cfa8 pc=0x440f73
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x47ffe1
created by runtime.init.7 in goroutine 1
runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc00008d780 sp=0xc00008d760 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.bgsweep(0xc0000ba000)
runtime/mgcsweep.go:277 +0x94 fp=0xc00008d7c8 sp=0xc00008d780 pc=0x4288b4
runtime.gcenable.gowrap1()
runtime/mgc.go:203 +0x25 fp=0xc00008d7e0 sp=0xc00008d7c8 pc=0x41cfe5
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008d7e8 sp=0xc00008d7e0 pc=0x47ffe1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc0000ba000?, 0x36c6dc8?, 0x1?, 0x0?, 0xc000007340?)
runtime/proc.go:424 +0xce fp=0xc00008df78 sp=0xc00008df58 pc=0x4775ae
runtime.goparkunlock(...)
runtime/proc.go:430
runtime.(*scavengerState).park(0x51e7c20)
runtime/mgcscavenge.go:425 +0x49 fp=0xc00008dfa8 sp=0xc00008df78 pc=0x4262e9
runtime.bgscavenge(0xc0000ba000)
runtime/mgcscavenge.go:653 +0x3c fp=0xc00008dfc8 sp=0xc00008dfa8 pc=0x42685c
runtime.gcenable.gowrap2()
runtime/mgc.go:204 +0x25 fp=0xc00008dfe0 sp=0xc00008dfc8 pc=0x41cf85
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008dfe8 sp=0xc00008dfe0 pc=0x47ffe1
created by runtime.gcenable in goroutine 1
runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000104700 m=nil [finalizer wait]:
runtime.gopark(0xc00008c648?, 0x412725?, 0xb0?, 0x1?, 0xc0000061c0?)
runtime/proc.go:424 +0xce fp=0xc00008c620 sp=0xc00008c600 pc=0x4775ae
runtime.runfinq()
runtime/mfinal.go:193 +0x107 fp=0xc00008c7e0 sp=0xc00008c620 pc=0x41c067
runtime.goexit({})
runtime/asm_amd64.s:1700 +0x1 fp=0xc00008c7e8 sp=0xc00008c7e0 pc=0x47ffe1
created by runtime.createfing in goroutine 1
runtime/mfinal.go:163 +0x3d

Environment:

  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.217-205.860.amzn2.x86_64
  • Others:
gsstuart added the kind/bug label Nov 14, 2024
@ccojocar
Contributor

The staging image hadn't been pushed for a while: https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/post-security-profiles-operator-push-image.

It has started working again now. Maybe you want to try one of the latest images pushed today.
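
If the spod image tag is latest with an Always pull policy (an assumption, check your daemonset spec), restarting the daemonset should be enough to pick up today's push:

# Roll the spod daemonset so the pods re-pull the image
$ kubectl -n security-profiles-operator rollout restart daemonset/spod
$ kubectl -n security-profiles-operator rollout status daemonset/spod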

@gsstuart
Contributor Author

Thanks, will take a look there... the interesting thing is that I'm not specifying any particular image to use, just the 0.8.4 Helm chart, but I think internally it specifies pulling a latest tag for the spod container. Will poke around on it some more...
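
One way to see which image actually ended up deployed is to read it straight out of the spod daemonset (namespace, daemonset, and container names taken from the logs in this thread):

# Show the image reference the spod daemonset specifies for the operator container
$ kubectl -n security-profiles-operator get ds spod \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="security-profiles-operator")].image}{"\n"}'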

@zestysoft

I'm seeing this too -- it looks like pulling latest is still not helping?

Normal   Pulled     7m10s                   kubelet            Successfully pulled image "gcr.io/k8s-staging-sp-operator/security-profiles-operator:latest" in 824ms (824ms including waiting)
Normal   Started    7m9s (x3 over 7m29s)    kubelet            Started container security-profiles-operator
Warning  BackOff    2m20s (x33 over 7m27s)  kubelet            Back-off restarting failed container security-profiles-operator in pod spod-vd8dd_security-profiles-operator(392ea67e-5c26-4fb3-85d7-9bd4be14b9ce)

@saschagrunert
Member

saschagrunert commented Nov 15, 2024

@gsstuart we updated the seccomp profile for the spod, do you mind updating the deployment to match?
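
For context, 18446744073709551615 is uint64(-1): the eventfd call the Go runtime makes in netpollinit (eventfd2 at the syscall level) returned -1, which is what you would expect if the container's seccomp profile does not allow eventfd2. A sketch of how to check which profile the spod container is confined with (names taken from this thread; the kubelet seccomp root below is the usual default and is an assumption, and if the field is empty check the pod-level securityContext too):

# 1) Which seccomp profile does the spod container reference?
$ kubectl -n security-profiles-operator get ds spod \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="security-profiles-operator")].securityContext.seccompProfile}{"\n"}'

# 2) If it is a Localhost profile, check on the node that it allows eventfd2
#    (replace <localhostProfile> with the value from step 1; /var/lib/kubelet/seccomp is the usual root)
$ sudo grep -c '"eventfd2"' /var/lib/kubelet/seccomp/<localhostProfile>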

@zestysoft

I think the larger problem is that we're using a Helm chart that is pinned to a specific version, but its values.yaml uses latest for the image tag by default; overriding this to v0.8.4 fixes the problem.
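
A minimal sketch of that override, assuming the chart exposes the tag as image.tag (check the chart's values.yaml for the exact key, and substitute <chart-ref> with however you reference the 0.8.4 chart):

# Re-run the usual install/upgrade with an explicit tag instead of the default "latest"
$ helm upgrade --install security-profiles-operator <chart-ref> \
    --namespace security-profiles-operator \
    --version 0.8.4 \
    --set image.tag=v0.8.4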
