Access node crashes with a panic log #6964

vishalchangrani · 2025-02-01T00:29:30Z

🐞 Bug Report

Node operator reported the following issue on the access node:

Panic: interface conversion: interface {} is , not string

goroutine 2377489 [running]:
github.com/onflow/flow-go/network/p2p/logging/internal.(*PeerIdCache).PeerIdString(0xc00009c710, {0xc141740900, 0x22})
        /app/network/p2p/logging/internal/peerIdCache.go:35 +0x105
github.com/onflow/flow-go/network/p2p/logging.PeerId(...)
        /app/network/p2p/logging/logging.go:26
github.com/onflow/flow-go/network/p2p/connection.(*ConnGater).InterceptPeerDial(0xc0009ae000, {0xc141740900, 0x22})
        /app/network/p2p/connection/connection_gater.go:75 +0x1d2
github.com/libp2p/go-libp2p/p2p/net/swarm.(*Swarm).dialPeer(0xc000ef6488, {0x307dcf8, 0xc0bdfd73b0}, {0xc141740900, 0x22})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/net/swarm/swarm_dial.go:256 +0x233
github.com/libp2p/go-libp2p/p2p/net/swarm.(*Swarm).DialPeer(0x0?, {0x307dcf8?, 0xc0bdfd73b0?}, {0xc141740900?, 0xc00182b5ff?})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/net/swarm/swarm_dial.go:228 +0x25
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).dialPeer(0xc000994600, {0x307dcf8, 0xc0bdfd73b0}, {0xc141740900, 0x22})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:740 +0x12c
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).Connect(0xc000994600, {0x307dcf8, 0xc0bdfd73b0}, {{0xc141740900, 0x22}, {0x0, 0x0, 0x0}})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:733 +0x105
github.com/libp2p/go-libp2p/p2p/discovery/backoff.(*BackoffConnector).Connect.func1({{0xc141740900, 0x22}, {0x0, 0x0, 0x0}})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/discovery/backoff/backoffconnector.go:82 +0xcb
created by github.com/libp2p/go-libp2p/p2p/discovery/backoff.(*BackoffConnector).Connect in goroutine 2851
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/discovery/backoff/backoffconnector.go:78 +0x4d6

What is the severity of this bug?

Important:

Critical - Urgent: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). Whole team should drop what they're doing and work on this.

Critical: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). One person should look at this right now.

Important: * We have to do this before we ship, but it can wait until the next sprint (product or feature won't function without it, but it's not blocking us or users right now). Team should do this in the next sprint.

Should have: * It would be better if we do this before we ship, but it's OK if we don't (product functions without this, but it's a better user experience). Consider adding to a future sprint.

Could have: It really doesn't matter if we do this (product functions without this, impact to user is minimal).

Reproduction steps

Expected behaviour

Specifications

current image version gcr.io/flow-container-registry/access:v0.37.26-rc.3

Additional context

The text was updated successfully, but these errors were encountered:

AlexHentschel · 2025-02-03T23:11:35Z

The PeerIdCache, which is panicking here (line 35) is an optimization that we have added to the networking layer. Its very very odd that line 35 panics, because
the cache lives in-memory only. We can rule out anything related to an old data model that might be stored in the data base or so. The only relevant problem could be a mismatching type during the runtime.
the only place where we ever add something to the underlying cache is lines 38-39, which guarantees that only strings are ever added to the cache.

also took at look at the PeerIdCache test and extended the test a bit, by adding the following sub-tests

t.Run("attempting to re-put already cached peer ID", func(t *testing.T) {
  pid := unittest.PeerIdFixture(t)
  pidString := pid.String()
  pidStr := cache.PeerIdString(pid) // on first call, we cache
  assert.NotEmpty(t, pidStr)
  assert.Equal(t, pidString, pidStr)

  s := strings.Clone(string(pid))
  var pidClone peer.ID = peer.ID(s)
  pidStr = cache.PeerIdString(pidClone) // second call attempting to put an already cached ID
  assert.NotEmpty(t, pidStr)
  assert.Equal(t, pid.String(), pidStr)

  gotPidStr, ok := cache.ByPeerId(pid)
  assert.True(t, ok, "expected pid to be in the cache")
  assert.Equal(t, pid.String(), gotPidStr)
})

t.Run("caching empty peer ID and attempting to re-put already cached peer ID", func(t *testing.T) {
  pid := peer.ID("")
  pidString := pid.String()
  pidStr := cache.PeerIdString(pid) // on first call, we cache
  assert.Equal(t, pidString, pidStr)

  s := strings.Clone(string(pid))
  var pidClone peer.ID = peer.ID(s)
  pidStr = cache.PeerIdString(pidClone) // second call attempting to put an already cached ID
  assert.Equal(t, pid.String(), pidStr)

  gotPidStr, ok := cache.ByPeerId(pid)
  assert.True(t, ok, "expected pid to be in the cache")
  assert.Equal(t, pid.String(), gotPidStr)
})

they all succeed. Given the simplicity of the code together with the testing and analysis, I really have a hard time understanding how this crash could be possible. I am wondering if maybe their container is corrupted?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Access node crashes with a panic log #6964

Access node crashes with a panic log #6964

vishalchangrani commented Feb 1, 2025

AlexHentschel commented Feb 3, 2025

Access node crashes with a panic log #6964

Access node crashes with a panic log #6964

Comments

vishalchangrani commented Feb 1, 2025

🐞 Bug Report

What is the severity of this bug?

Reproduction steps

Expected behaviour

Specifications

Additional context

AlexHentschel commented Feb 3, 2025