Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VLAB random setup-vpcs errors when configuring server VMs #292

Open
pau-hedgehog opened this issue Dec 30, 2024 · 10 comments · May be fixed by #354
Open

VLAB random setup-vpcs errors when configuring server VMs #292

pau-hedgehog opened this issue Dec 30, 2024 · 10 comments · May be fixed by #354
Assignees

Comments

@pau-hedgehog
Copy link
Contributor

pau-hedgehog commented Dec 30, 2024

There are known issues with the hhnet script: https://github.com/githedgehog/fabricator/blob/master/pkg/hhfab/hhnet.sh

23:54:56 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-08\": running hhnet \"bond 1008 enp2s1 enp2s2\": Process exited with status 1: out: Failed to get IP address for bond0.1008\n"

But as it's hitting the CI from time to time it needs to be addressed to make the CI more reliable and avoid having to retry the job:
https://github.com/githedgehog/fabricator/actions/runs/12530586269/job/34947281228

@pau-hedgehog pau-hedgehog self-assigned this Dec 30, 2024
@pau-hedgehog
Copy link
Contributor Author

Another hit:
https://github.com/githedgehog/fabricator/actions/runs/12589851894/job/35090547613#step:7:489

00:01:41 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-02\": running hhnet \"bond 1002 enp2s1 enp2s2\": Process exited with status 1: out: Failed to get IP address for bond0.1002\n"

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 4, 2025

So the error happens during this line execution:

bin/hhfab vlab up -v --ready setup-vpcs --ready test-connectivity --ready exit --mode=${{ matrix.buildmode }}

It would be helpful to capture the state (logs, outputs) of the VLAB just after the failure to be able to analyse the root cause.

As hhfab vlab up and successive commands run in the same context, currently there is no way to gather logs unless we implement the artifact gathering inside fabricator.

@Frostman , to progress on this one, what's your preferred approach? Expand hhfab helpers or allow hhfab to (also) run "detached" (eg. a Service interface) and allow artifact gathering externally?

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 20, 2025

I could reproduce this in env-3

14:45:47 INF Configuring VPCs and VPCAttachments
14:45:47 INF Configuring networking on servers
14:45:47 DBG Verified server=server-7
...
14:45:47 DBG Verified server=server-9
14:45:59 INF Configured server=server-5 addr=10.30.5.2/24 netconf="vlan 1005 enp2s1"
14:45:59 INF Configured server=server-7 addr=10.30.7.2/24 netconf="vlan 1007 enp2s1"
14:46:01 DBG Show tech collected successfully entry=server-4 output=/home/ubuntu/hhfab/show-tech-output/server-4-show-tech.log
14:46:01 DBG Show tech collected successfully entry=server-8 output=/home/ubuntu/hhfab/show-tech-output/server-8-show-tech.log
...
14:46:34 DBG Show tech collected successfully entry=as7712-01 output=/home/ubuntu/hhfab/show-tech-output/as7712-01-show-tech.log
14:46:36 DBG Show tech collected successfully entry=as7712-02 output=/home/ubuntu/hhfab/show-tech-output/as7712-02-show-tech.log
14:46:36 INF Show tech files saved in folder=/home/ubuntu/hhfab/show-tech-output
14:46:36 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-2\": running hhnet \"bond 1002 enp2s1 enp2s2\": context deadline exceeded: out: "

I could see this in the server logs:

-- Boot 247ddf04ef154360af24543a8535b9d6 --
Jan 20 14:30:38 server-2 systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 20 14:30:38 server-2 systemd-networkd[1250]: lo: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: lo: Gained carrier
Jan 20 14:30:38 server-2 systemd-networkd[1250]: Enumeration completed
Jan 20 14:30:38 server-2 systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Gained carrier
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s2: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s2: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s1: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s1: Link UP
Jan 20 14:30:39 server-2 systemd-networkd[1250]: enp2s0: DHCPv4 address 172.31.3.10/24 acquired from 172.31.3.2
Jan 20 14:30:40 server-2 systemd-networkd[1250]: enp2s0: Gained IPv6LL
Jan 20 14:42:38 server-2 systemd-networkd[1250]: enp2s2: Gained carrier
Jan 20 14:42:39 server-2 systemd-networkd[1250]: enp2s2: Gained IPv6LL
Jan 20 14:43:44 server-2 systemd-networkd[1250]: enp2s1: Gained carrier
Jan 20 14:43:46 server-2 systemd-networkd[1250]: enp2s1: Gained IPv6LL
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: Link DOWN
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: Lost carrier
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: DHCPv6 lease lost
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: Link DOWN
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: Lost carrier
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: DHCPv6 lease lost
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s1: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s2: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s1: Gained carrier
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Gained carrier
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s2: Gained carrier
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Link UP
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Gained carrier
Jan 20 14:45:55 server-2 systemd-networkd[1250]: bond0: Gained IPv6LL
Jan 20 14:45:56 server-2 systemd-networkd[1250]: bond0.1002: Gained IPv6LL

Around that time I see an error in the Agent of (s5248-05) and it looks like it restarts:

time=2025-01-20T14:44:23.745Z level=DEBUG msg=Action idx=269 weight=69 summary="Create DHCP relay Vlan1001" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1001]"
time=2025-01-20T14:44:23.870Z level=INFO msg="Action applied" idx=269 summary="Create DHCP relay Vlan1001"
time=2025-01-20T14:44:23.870Z level=DEBUG msg=Action idx=270 weight=69 summary="Create DHCP relay Vlan1002" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1002]"
time=2025-01-20T14:44:24.079Z level=INFO msg="Action applied" idx=270 summary="Create DHCP relay Vlan1002"
time=2025-01-20T14:44:24.079Z level=DEBUG msg=Action idx=271 weight=69 summary="Create DHCP relay Vlan1003" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1003]"
time=2025-01-20T14:44:24.227Z level=INFO msg="Action applied" idx=271 summary="Create DHCP relay Vlan1003"
time=2025-01-20T14:44:24.227Z level=DEBUG msg=Action idx=272 weight=69 summary="Create DHCP relay Vlan1010" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1010]"
time=2025-01-20T14:44:24.327Z level=INFO msg="Action applied" idx=272 summary="Create DHCP relay Vlan1010"
time=2025-01-20T14:44:24.328Z level=INFO msg="Config applied" name=s5248-05 gen=3 res=10712 took=1m30.48444858s
time=2025-01-20T14:44:24.328Z level=DEBUG msg="Ensuring alloy is installed and running"
time=2025-01-20T14:44:28.108Z level=DEBUG msg="Alloy ensured" took=946.87283ms
time=2025-01-20T14:44:28.109Z level=INFO msg="IDs ready" install=d2764927-9f79-4ac4-94e3-afd9f4739169 run=e9a891d4-4994-48fa-84ba-61ab66af4d4a
time=2025-01-20T14:44:28.109Z level=INFO msg="Starting watch for config changes in K8s"
time=2025-01-20T14:44:33.912Z level=DEBUG msg="Switch state updated" took=5.749965271s
time=2025-01-20T14:44:34.040Z level=DEBUG msg="Sending heartbeat" name=s5248-05
time=2025-01-20T14:44:39.766Z level=DEBUG msg="Switch state updated" took=5.726302612s
time=2025-01-20T14:44:39.900Z level=DEBUG msg="Sending heartbeat" name=s5248-05
time=2025-01-20T14:44:45.828Z level=DEBUG msg="Switch state updated" took=5.927379521s
time=2025-01-20T14:44:45.911Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-05\": th
e object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T14:44:52.275Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T14:44:52.311Z level=INFO msg="Config loaded from file" name=s5248-05 gen=3 res=10712
time=2025-01-20T14:44:52.433Z level=INFO msg="Processing agent config" name=s5248-05 gen=3 res=10712
time=2025-01-20T14:44:52.434Z level=INFO msg="Control link configuration applied"
time=2025-01-20T14:44:52.794Z level=DEBUG msg="Desired state generated"
time=2025-01-20T14:45:01.047Z level=DEBUG msg="Actual state loaded" took=8.252908701s
time=2025-01-20T14:45:01.049Z level=DEBUG msg="Actions calculated" count=19
time=2025-01-20T14:45:01.077Z level=DEBUG msg="Actual <> Desired" diff="--- Actual State"
time=2025-01-20T14:45:01.077Z level=DEBUG msg="Actual <> Desired" diff="+++ Desired State"

Does this look like any known issue @Frostman ?

@edipascale
Copy link
Contributor

I just saw this exact same issue in env-1 while testing the VRF scaling, for what it's worth:

admin@s5248-03:~$ grep -C 3 "has been modified" /var/log/agent.log
time=2025-01-20T16:37:30.491Z level=DEBUG msg="Switch state updated" took=57.59832286s
time=2025-01-20T16:37:30.764Z level=DEBUG msg="Sending heartbeat" name=s5248-03
time=2025-01-20T16:38:28.449Z level=DEBUG msg="Switch state updated" took=57.68468957s
time=2025-01-20T16:38:28.678Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-03\": the object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T16:38:35.124Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T16:38:35.257Z level=INFO msg="Config loaded from file" name=s5248-03 gen=278 res=1883321
time=2025-01-20T16:38:35.424Z level=INFO msg="Processing agent config" name=s5248-03 gen=278 res=1883321
--
time=2025-01-20T16:44:26.493Z level=DEBUG msg="Switch state updated" took=58.052304011s
time=2025-01-20T16:44:26.853Z level=DEBUG msg="Sending heartbeat" name=s5248-03
time=2025-01-20T16:45:23.611Z level=DEBUG msg="Switch state updated" took=56.757426164s
time=2025-01-20T16:45:23.780Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-03\": the object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T16:45:30.096Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T16:45:30.185Z level=INFO msg="Config loaded from file" name=s5248-03 gen=278 res=1883321
time=2025-01-20T16:45:30.338Z level=INFO msg="Processing agent config" name=s5248-03 gen=278 res=1883321

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 27, 2025

18:38:17 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-8\": running hhnet \"vlan 1008 enp2s1\": Process exited with status 1: out: Failed to get IP address for enp2s1.1008\n"

First hit with show-tech captured:
https://github.com/githedgehog/fabricator/actions/runs/12993975054/job/36237967537?pr=349

show-tech-hlab-spine-leaf-true-iso.zip

@pau-hedgehog
Copy link
Contributor Author

show-tech-hlab-spine-leaf-true-iso.zip

	=== systemd-networkd logs ===
Jan 27 18:13:02 localhost systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 27 18:13:02 localhost systemd-networkd[748]: lo: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: lo: Gained carrier
Jan 27 18:13:02 localhost systemd-networkd[748]: Enumeration completed
Jan 27 18:13:02 localhost systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Gained carrier
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Gained carrier
Jan 27 18:13:03 localhost systemd-networkd[748]: enp2s0: DHCPv4 address 172.31.8.10/24 acquired from 172.31.8.2
Jan 27 18:13:03 localhost systemd-networkd[748]: enp2s0: Gained IPv6LL
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s1: Gained IPv6LL
Jan 27 18:13:04 localhost systemd[1]: Stopping systemd-networkd.service - Network Configuration...
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s1: DHCPv6 lease lost
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s0: DHCPv6 lease lost
Jan 27 18:13:04 localhost systemd[1]: systemd-networkd.service: Deactivated successfully.
Jan 27 18:13:04 localhost systemd[1]: Stopped systemd-networkd.service - Network Configuration.
Jan 27 18:13:07 server-8 systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 27 18:13:08 server-8 systemd-networkd[1338]: lo: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: lo: Gained carrier
Jan 27 18:13:08 server-8 systemd-networkd[1338]: Enumeration completed
Jan 27 18:13:08 server-8 systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Configuring with /etc/systemd/network/99-lldp.network.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Configuring with /etc/systemd/network/99-lldp.network.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Gained carrier
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Gained carrier
Jan 27 18:13:09 server-8 systemd-networkd[1338]: enp2s0: DHCPv4 address 172.31.8.10/24 acquired from 172.31.8.2
Jan 27 18:13:10 server-8 systemd-networkd[1338]: enp2s0: Gained IPv6LL
Jan 27 18:13:10 server-8 systemd-networkd[1338]: enp2s1: Gained IPv6LL
Jan 27 18:25:37 server-8 systemd-networkd[1338]: enp2s1: Lost carrier
Jan 27 18:25:37 server-8 systemd-networkd[1338]: enp2s1: DHCPv6 lease lost
Jan 27 18:36:34 server-8 systemd-networkd[1338]: enp2s1: Link DOWN
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1: Link UP
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1.1008: Link UP
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1.1008: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:35:46.411363+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- doVlanTask: Vlan Vlan1008 already present
Jan 27 18:35:46.411384+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- doVlanTask: autostate enable vl.m_autostate enable
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface Vlan1008 MTU to 9100
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- findVlanOperState: admin_state UP vl.m_autostate enable up_mmbr_cnt 0 isUp DOWN
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- updateDbVlanOperStatus: vlan Vlan1008 status down
Jan 27 18:35:46.411743+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: Intf Vlan1008 mac 00:00:00:11:11:11 op: SET
Jan 27 18:35:46.411763+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: STATE VLAN TBL Vlan1008 oper down
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- ifChangeInformNextHop: ifChangeInformNextHop Vlan1008 up 0
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- flushFdbByPort: flushFdbByPort for Vlan1008 ; flush_static:0 flush_mclag/evpn mh 
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: Intf Vlan1008 mac 00:00:00:11:11:11 op: SET
Jan 27 18:35:46.412581+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: STATE VLAN TBL Vlan1008 oper down
Jan 27 18:35:46.412603+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- IntfTrackOrch:IntfTrackGroup:update:105 PORT_OPER_STATE_CHANGE Port:Vlan1008 Up:0
Jan 27 18:35:46.412829+00:00 2025 ds3000-03 NOTICE swss#vlanmgrd: :- adjustVlanMtu: No change in MTU of Vlan1008 required, it continues to be oper_mtu = 9036 cfg_mtu = 9100 new_mtu=9100
Jan 27 18:35:46.412829+00:00 2025 ds3000-03 NOTICE swss#vlanmgrd: :- doVlanStateNotificationTask: vlanmgr received port status notification state down vlan Vlan1008 id 1008

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 28, 2025

I hit this on env-3:

11:33:22 ERR setup-vpcs: configuring servers: configuring server "server-3": running hhnet "vlan 1003 enp2s1": Process exited with status 1: out: Failed to get IP address for enp2s1.1003

I notice the link going down and no IP after recovering:

core@server-3 ~ $ networkctl status
● Interfaces: 1, 2, 4, 3
       State: routable                                
Online state: online                                  
     Address: 172.31.4.10 on enp2s0
              fe80::e20:12ff:fefe:400 on enp2s0
              fe80::faf2:1eff:fe2a:21ac on enp2s1
              fe80::faf2:1eff:fe2a:21ac on enp2s1.1003

Jan 28 11:32:16 server-3 systemd-networkd[1256]: enp2s1: Link DOWN
Jan 28 11:32:16 server-3 systemd-networkd[1256]: enp2s1: Lost carrier
Jan 28 11:32:16 server-3 systemd-networkd[1256]: enp2s1: DHCPv6 lease lost
Jan 28 11:32:21 server-3 systemd-networkd[1256]: enp2s1: Link UP
Jan 28 11:32:21 server-3 systemd-networkd[1256]: enp2s1.1003: Link UP
Jan 28 11:32:21 server-3 systemd-networkd[1256]: enp2s1.1003: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 28 11:32:22 server-3 systemd-networkd[1256]: enp2s1: Gained carrier
Jan 28 11:32:22 server-3 systemd-networkd[1256]: enp2s1.1003: Gained carrier
Jan 28 11:32:23 server-3 systemd-networkd[1256]: enp2s1.1003: Gained IPv6LL
Jan 28 11:32:23 server-3 systemd-networkd[1256]: enp2s1: Gained IPv6LL
core@server-3 ~ $ ip addr show enp2s1.1003
4: enp2s1.1003@enp2s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether f8:f2:1e:2a:21:ac brd ff:ff:ff:ff:ff:ff
    inet6 fe80::faf2:1eff:fe2a:21ac/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever

Investigating the upstream switch:

core@server-3 ~ $ date
Tue Jan 28 11:39:32 UTC 2025
core@server-3 ~ $ networkctl lldp
LINK   CHASSIS-ID        SYSTEM-NAME CAPS        PORT-ID   PORT-DESCRIPTION                                 
enp2s1 68:21:5f:a6:60:72 as7326-01   ....r...... Ethernet3 Unbundled server-3 server-3--unbundled--as7326-01

I see a series of logs when Ethernet3 goes down:

Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- onPortStateChange: Enqueue port[oid:0x100000008] state [down] notification
Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to syncd: port_state_change [{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to OA: port_state_change [{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.645033+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.645131+00:00 2025 as7326-01 NOTICE syncd#syncd: :- ntf_process_function: Dequeued port event from priority queue: event (port_state_change) data([{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_DOW
N"}])
Jan 28 11:32:16.645187+00:00 2025 as7326-01 NOTICE syncd#syncd: :- process_on_port_state_change: Syncd port_state_change:[{"port_id":"oid:0x1000000000008","port_state":"SAI_PORT_OPER_STATUS_DOWN"}] Port VID oid:0x1000000000008 st
ate change notification oper-state:2
Jan 28 11:32:16.645585+00:00 2025 as7326-01 NOTICE swss#orchagent: :- notificationThreadFunction: Port state change notification consumer: op = port_state_change, data= [{"port_id":"oid:0x1000000000008","port_state":"SAI_PORT_OPE
R_STATUS_DOWN"}]
Jan 28 11:32:16.645585+00:00 2025 as7326-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000008 status:2
Jan 28 11:32:16.645585+00:00 2025 as7326-01 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet3 oper state set from up to down
Jan 28 11:32:16.647466+00:00 2025 as7326-01 NOTICE eventd#eventd[25]: [EVENT], %INTERFACE_OPER_STATUS: %%Ethernet3:  Oper state changed from up to down
Jan 28 11:32:16.649385+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:32:16.649385+00:00 2025 as7326-01 NOTICE swss#orchagent: :- flushFdbByPort: flushFdbByPort for Ethernet3 ; flush_static:0 flush_mclag/evpn mh 
Jan 28 11:32:16.649385+00:00 2025 as7326-01 NOTICE swss#orchagent: :- updatePortOperStatus: Vlan Vlan1003 Port Ethernet3 state down m_up_member_count 1
Jan 28 11:32:16.649438+00:00 2025 as7326-01 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet3
Jan 28 11:32:16.649475+00:00 2025 as7326-01 NOTICE swss#orchagent: :- ifChangeInformNextHop: ifChangeInformNextHop Ethernet3 up 0
Jan 28 11:32:16.649696+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes

And up:

Jan 28 11:32:17.156774+00:00 2025 as7326-01 NOTICE syncd#syncd: :- onPortStateChange: Enqueue port[oid:0x100000008] state [up] notification
Jan 28 11:32:17.156774+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.156774+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to syncd: port_state_change [{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.156774+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.156774+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to OA: port_state_change [{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.156871+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.156871+00:00 2025 as7326-01 NOTICE syncd#syncd: :- ntf_process_function: Dequeued port event from priority queue: event (port_state_change) data([{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"
}])
Jan 28 11:32:17.156871+00:00 2025 as7326-01 NOTICE syncd#syncd: :- process_on_port_state_change: Syncd port_state_change:[{"port_id":"oid:0x1000000000008","port_state":"SAI_PORT_OPER_STATUS_UP"}] Port VID oid:0x1000000000008 stat
e change notification oper-state:1
Jan 28 11:32:17.157039+00:00 2025 as7326-01 NOTICE swss#orchagent: :- notificationThreadFunction: Port state change notification consumer: op = port_state_change, data= [{"port_id":"oid:0x1000000000008","port_state":"SAI_PORT_OPE
R_STATUS_UP"}]
Jan 28 11:32:17.157039+00:00 2025 as7326-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000008 status:1
Jan 28 11:32:17.157055+00:00 2025 as7326-01 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet3 oper state set from down to up
Jan 28 11:32:17.157990+00:00 2025 as7326-01 NOTICE eventd#eventd[25]: [EVENT], %INTERFACE_OPER_STATUS: %%Ethernet3:  Oper state changed from down to up
Jan 28 11:32:17.158271+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000008","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.158271+00:00 2025 as7326-01 NOTICE syncd#syncd: [none] SAI_API_PORT:_brcm_sai_port_state_change_fast_convergence:20864 Port state change fast convergence: Port: 8, state: UP, port_id 100000008
Jan 28 11:32:17.158271+00:00 2025 as7326-01 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:_brcm_sai_port_state_change_nh_update:1604 Port 8 Type 0 Status Up
Jan 28 11:32:17.163836+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:32:17.163836+00:00 2025 as7326-01 NOTICE swss#orchagent: :- updatePortOperStatus: Vlan Vlan1003 Port Ethernet3 state up m_up_member_count 2
Jan 28 11:32:17.164101+00:00 2025 as7326-01 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status UP to host interface Ethernet3
Jan 28 11:32:17.164101+00:00 2025 as7326-01 NOTICE swss#orchagent: :- ifChangeInformNextHop: ifChangeInformNextHop Ethernet3 up 1
Jan 28 11:32:17.165351+00:00 2025 as7326-01 NOTICE iccpd#iccpd: [iccp_event_handler_obj_input_newlink.NOTICE]  Update local port Ethernet3 admin state up
Jan 28 11:32:17.165968+00:00 2025 as7326-01 NOTICE iccpd#iccpd: [iccp_event_handler_obj_input_newlink.NOTICE]  Update local port Ethernet3 state up
Jan 28 11:32:17.167826+00:00 2025 as7326-01 NOTICE kernel: [15325.194702] bkn_proc_link_write: Set operation status UP to host interface Ethernet3
Jan 28 11:32:17.172270+00:00 2025 as7326-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet3 admin:1 oper:0 addr:68:21:5f:a6:60:75 ifindex:18 master:8
Jan 28 11:32:17.172270+00:00 2025 as7326-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet3 admin:1 oper:1 addr:68:21:5f:a6:60:75 ifindex:18 master:8
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- onPortStateChange: Enqueue port[oid:0x100000057] state [up] notification
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to syncd: port_state_change [{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to OA: port_state_change [{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}]
Jan 28 11:32:17.415539+00:00 2025 as7326-01 NOTICE syncd#syncd: :- ntf_process_function: Dequeued port event from priority queue: event (port_state_change) data([{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_UP"}])
Jan 28 11:32:17.415658+00:00 2025 as7326-01 NOTICE syncd#syncd: :- process_on_port_state_change: Syncd port_state_change:[{"port_id":"oid:0x1000000000ab1","port_state":"SAI_PORT_OPER_STATUS_UP"}] Port VID oid:0x1000000000ab1 state change notification oper-state:1
Jan 28 11:32:17.416295+00:00 2025 as7326-01 NOTICE swss#orchagent: :- notificationThreadFunction: Port state change notification consumer: op = port_state_change, data= [{"port_id":"oid:0x1000000000ab1","port_state":"SAI_PORT_OPER_STATUS_UP"}]

I see other ports going down in the switch log:

Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to syncd: port_state_change [{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueue: Port state change Enqueued in Notification queue:[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- enqueueNotification: Enqueued port state change to OA: port_state_change [{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- ntf_process_function: Dequeued port event from priority queue: event (port_state_change) data([{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOW
N"}])
Jan 28 11:32:16.903635+00:00 2025 as7326-01 NOTICE syncd#syncd: :- process_on_port_state_change: Syncd port_state_change:[{"port_id":"oid:0x1000000000ab1","port_state":"SAI_PORT_OPER_STATUS_DOWN"}] Port VID oid:0x1000000000ab1 st
ate change notification oper-state:2
Jan 28 11:32:16.904136+00:00 2025 as7326-01 NOTICE swss#orchagent: :- notificationThreadFunction: Port state change notification consumer: op = port_state_change, data= [{"port_id":"oid:0x1000000000ab1","port_state":"SAI_PORT_OPE
R_STATUS_DOWN"}]
Jan 28 11:32:16.904136+00:00 2025 as7326-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000ab1 status:2
Jan 28 11:32:16.904361+00:00 2025 as7326-01 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet52 oper state set from up to down
Jan 28 11:32:16.905685+00:00 2025 as7326-01 NOTICE eventd#eventd[25]: [EVENT], %INTERFACE_OPER_STATUS: %%Ethernet52:  Oper state changed from up to down
Jan 28 11:32:16.906737+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet52-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:32:16.906737+00:00 2025 as7326-01 NOTICE swss#orchagent: :- flushFdbByPort: flushFdbByPort for Ethernet52 ; flush_static:0 flush_mclag/evpn mh 
Jan 28 11:32:16.906967+00:00 2025 as7326-01 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet52
Jan 28 11:32:16.907035+00:00 2025 as7326-01 NOTICE swss#orchagent: :- ifChangeInformNextHop: ifChangeInformNextHop Ethernet52 up 0
Jan 28 11:32:16.907524+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet52-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:32:16.908369+00:00 2025 as7326-01 NOTICE syncd#syncd: :- tryDequeue: Port state change Dequed for :[{"port_id":"oid:0x100000057","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]
Jan 28 11:32:16.908369+00:00 2025 as7326-01 NOTICE syncd#syncd: [none] SAI_API_PORT:_brcm_sai_port_state_change_fast_convergence:20864 Port state change fast convergence: Port: 87, state: DOWN, port_id 100000057
Jan 28 11:32:16.908935+00:00 2025 as7326-01 NOTICE iccpd#iccpd: [iccp_event_handler_obj_input_newlink.NOTICE]  Update local port Ethernet52 state down
Jan 28 11:32:16.909783+00:00 2025 as7326-01 NOTICE syncd#syncd: [none] SAI_API_PORT:_brcm_sai_port_state_change_fast_convergence:20881 Port fast convergence, updated knet interface: Ethernet52
Jan 28 11:32:16.909783+00:00 2025 as7326-01 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:_brcm_sai_port_state_change_nh_update:1604 Port 87 Type 0 Status Down
Jan 28 11:32:16.909783+00:00 2025 as7326-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet52 admin:1 oper:0 addr:68:21:5f:a6:60:75 ifindex:72 master:80
Jan 28 11:32:16.910891+00:00 2025 as7326-01 NOTICE iccpd#iccpd: [iccp_event_handler_obj_input_newlink.NOTICE]  Update local port Ethernet52 admin state down
Jan 28 11:32:16.911179+00:00 2025 as7326-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet52 admin:0 oper:0 addr:68:21:5f:a6:60:75 ifindex:72 master:80
Jan 28 11:32:16.911836+00:00 2025 as7326-01 NOTICE kernel: [15324.937983] bkn_proc_link_write: Set operation status DOWN urgent to host interface Ethernet52
Jan 28 11:32:16.911855+00:00 2025 as7326-01 NOTICE kernel: [15324.940863] bkn_proc_link_write: Set operation status DOWN urgent to host interface Ethernet52

@Frostman are you aware of any SONiC issue like this? or should we inspect the lab cabling/NICs. Local fault usually indicates loss of signal detected on the receive data path of a local port:

Jan 28 07:29:30.286152+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 07:29:30.290520+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 07:29:30.845875+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 07:29:35.658626+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 07:29:35.659611+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 07:29:36.186817+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:22:32.031489+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:22:32.038241+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:22:58.259932+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:22:59.041667+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:22:59.042150+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:24:10.824784+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:24:18.725538+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:24:18.726351+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:24:19.247898+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:24:23.364117+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:24:23.364674+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:24:23.888284+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:32:16.649385+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:32:16.649696+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:32:17.163836+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes
Jan 28 11:32:21.781043+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-phy-link-down update_reason: Yes
Jan 28 11:32:21.781875+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-PHY_LINK_DOWN-local-fault update_reason: Yes
Jan 28 11:32:22.309583+00:00 2025 as7326-01 NOTICE swss#orchagent: :- addEvent: Interface event Ethernet3-OPER_UP-phy-link-up update_reason: Yes

@edipascale
Copy link
Contributor

@pau-hedgehog
Copy link
Contributor Author

one more hit: https://github.com/githedgehog/fabricator/actions/runs/13013086819/job/36295608627?pr=338

Thanks, @edipascale . I'm getting this consistently in env-3, so I'm trying to improve hhnet script while working on another PR

@pau-hedgehog
Copy link
Contributor Author

OK. I took a look at the hhnet script and it does a ip link down in the cleanup function:

function cleanup() {
    for i in {0..3}; do
        sudo ip l d "bond$i" 2> /dev/null || true
    done

    for i in {1..9}; do
        sudo ip l s "enp2s$i" down 2> /dev/null || true
        for j in {1000..1020}; do
            sudo ip l d "enp2s$i.$j" 2> /dev/null || true
        done
    done

    sleep 1
}

I refactored hhnet to use networkctl and the first impression is it's a lot more stable. I'm facing some other issues in env-3 to test this, at the moment

@pau-hedgehog pau-hedgehog linked a pull request Jan 29, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants