Observed NT4 rapid connect/disconnect cycles during competition match #7847

Open
chauser opened this issue Mar 3, 2025 · 12 comments
Labels
type: bug Something isn't working.

Comments

@chauser
Contributor

chauser commented Mar 3, 2025

Describe the bug
I helped a team this weekend that could not get any dashboard to display on the field during competition, although it worked fine in the pits. I eventually looked at the wpilog and found what is shown in the attachment: the server repeatedly logs a Shuffleboard connection, followed half a second later by a disconnection, and then another connection half a second after that. This continued from before the match until it was over. Shuffleboard on the driver station never showed that it was connected. The team reported that they had no luck with Elastic or SmartDashboard either.

A second team reported similar inability to get their dashboard to connect but I do not have a wpilog for them.

I learned that both teams were showing driver camera streams on their driver station computers without any attempt to control the camera stream bandwidth. Both teams stated they could do without the camera streams and turned them off. They had no further problems with dashboard connection throughout the competition.

My best guess is that the on-field bandwidth limiting is triggering this behavior but I don't know enough about how either the bandwidth limiting works or how NT4 decides that a connection isn't viable (and therefore ends it) to be sure this is the reason. I don't have an adequate model that would explain how bandwidth-limiting would lead to this particular behavior of NT, but so far it is the only scenario I've managed to posit that matches the observations.

It would be helpful for CSAs to know if on-field bandwidth limiting is the cause of this behavior or if there is some other root cause that we should be looking for.

Image

@chauser chauser added the type: bug Something isn't working. label Mar 3, 2025
@PeterJohnson
Member

PeterJohnson commented Mar 3, 2025

What version of WPILib on the robot code and dashboard?

On the client, there's a 1 second timeout--if the client hasn't received any messages in 1 second, it disconnects. The client sends a ping message every 200 ms and the server responds to it with a pong message. This would result in a 1 second connection loop, which doesn't match what you're seeing. So it's not a timeout, but rather the client disconnecting for some other reason.
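A minimal sketch of the timing argument above, assuming only the two figures quoted in this comment (200 ms ping interval, 1 second receive timeout). The names are hypothetical and are not taken from the NT4 client source; the tolerance parameter is an assumption to allow for log timestamp jitter.

```cpp
#include <cstdint>

// Figures quoted in the comment above; names are hypothetical.
constexpr std::int64_t kPingIntervalMs = 200;
constexpr std::int64_t kReceiveTimeoutMs = 1000;

// Earliest time after connecting at which a pure "nothing received" timeout
// could fire, assuming the client hears nothing at all from the server.
constexpr std::int64_t earliestTimeoutDisconnectMs() {
  return kReceiveTimeoutMs;
}

// True if an observed connected duration is long enough to be explained by
// the receive timeout alone. toleranceMs absorbs timestamp jitter in the log.
constexpr bool explainableByReceiveTimeout(std::int64_t observedConnectedMs,
                                           std::int64_t toleranceMs) {
  return observedConnectedMs + toleranceMs >= earliestTimeoutDisconnectMs();
}
```

With the roughly 500 ms connected duration reported in the log, `explainableByReceiveTimeout(500, 50)` is false, which is the reasoning behind "it's not a timeout" here.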

Unfortunately the server (robot) log isn't that helpful here, as it doesn't tell us why the client closed the connection--that will only be discoverable from any error messages being output by the client. The fact it happens with Elastic as well (which is a completely independent client implementation) indicates there's probably something wrong with some of the data being sent by the server, but it's hard to tell what. A wireshark capture is probably needed to debug further.

@chauser
Contributor Author

chauser commented Mar 3, 2025

The robot code and shuffleboard were 2025.3.1

I agree, having error information from the client would be most helpful. But without radios configured as they are on the field, I don't know how to reproduce this in an environment where we'd be able to see the client error messages. The regularity of the disconnection after half a second and the reconnection half a second later strongly suggests that a timeout is involved somewhere. If it were merely a matter of receiving unexpected data, I don't think the timing would be that regular. Does NT4 limit its TCP segment sizes to the MTU, or does it rely on IP fragmentation?

I'm inferring the following from the wpilog data (in the absence of any tcpdump/wireshark data):

  1. Client attempts connection (TCP SYN)
  2. Server replies with SYN ACK
  3. Client replies with client info ("shuffleboard") and ACK
  4. Server accepts the connection, prints the connection message, and attempts to send the first NT4 data
  5. Client doesn't like what it has seen within 1/2 second and sends TCP RST
  6. Server logs the disconnection
  7. Repeat 1/2 second later

If bandwidth limits are involved it seems like the only thing that would be affected is the first real data sent from server to client. Is that likely to be large and spread across multiple TCP segments and/or IP fragments?
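The "regular timing suggests a timeout" argument can be made a little more concrete by measuring how far the gaps between logged connect/disconnect events stray from their mean. The sketch below is only illustrative: the function name is hypothetical, and both sample arrays are made-up timestamps shaped like the description in this issue, not values read from the actual wpilog.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Largest absolute deviation (ms) of any gap between consecutive events from
// the mean gap. A small result means the events are evenly spaced.
template <std::size_t N>
constexpr std::int64_t maxGapDeviationMs(
    const std::array<std::int64_t, N>& eventsMs) {
  static_assert(N >= 2, "need at least two events to form a gap");
  const std::int64_t meanGap =
      (eventsMs[N - 1] - eventsMs[0]) / static_cast<std::int64_t>(N - 1);
  std::int64_t worst = 0;
  for (std::size_t i = 1; i < N; ++i) {
    const std::int64_t gap = eventsMs[i] - eventsMs[i - 1];
    const std::int64_t deviation = gap > meanGap ? gap - meanGap : meanGap - gap;
    if (deviation > worst) {
      worst = deviation;
    }
  }
  return worst;
}

// Hypothetical timestamps (ms): alternating connect/disconnect events.
constexpr std::array<std::int64_t, 5> kIdealizedEventsMs = {0, 500, 1000, 1500, 2000};
constexpr std::array<std::int64_t, 5> kJitteryEventsMs = {0, 480, 1010, 1490, 2000};
```

A deviation of a few tens of milliseconds across many cycles would be consistent with a fixed timer; a data-dependent failure would be expected to vary more, although similar data rates on each attempt could also produce regular timing.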

@PeterJohnson
Member

PeterJohnson commented Mar 3, 2025

Yes, there's generally a fairly large initial burst of data on initial connection that depends on the number of topics created (a team using akit is more likely to have a large amount of logging / number of topics), as dashboard clients typically do a "subscribe all" on connection that results in the server sending announcement messages for every topic that exists on the server. In the case of some dashboards (shuffleboard and SmartDashboard) this also results in all current values being sent for every topic that exists on the server. The networking implementation and code path is quite different for SmartDashboard as it's an NT3 client rather than NT4, but the quantity of data is likely to be fairly similar for both.
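As a rough back-of-the-envelope sketch of how the initial burst scales with topic count: the per-topic byte sizes below are pure assumptions chosen for illustration (real announcement and value sizes depend on topic names, types, and properties), and the function names are hypothetical.

```cpp
#include <cstdint>

// Approximate size of the initial burst after a "subscribe all":
// one announcement plus one current value per topic.
// Both per-topic sizes are assumptions supplied by the caller.
constexpr std::int64_t burstBytes(std::int64_t topicCount,
                                  std::int64_t announceBytesPerTopic,
                                  std::int64_t valueBytesPerTopic) {
  return topicCount * (announceBytesPerTopic + valueBytesPerTopic);
}

// Time (ms, rounded down) to push that many bytes through a link that is
// limited to linkBitsPerSecond, ignoring protocol overhead and other traffic.
constexpr std::int64_t transmitTimeMs(std::int64_t bytes,
                                      std::int64_t linkBitsPerSecond) {
  return bytes * 8 * 1000 / linkBitsPerSecond;
}
```

For example, 2000 topics at an assumed 150 + 50 bytes each is 400,000 bytes, which takes about 457 ms on an otherwise idle 7 Mbit/s link; any competing camera traffic would stretch that further.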

There's fairly complex logic in NT4 to deal with backpressure from the network, so it's possible there's a bug in there somewhere that ends up creating an invalid WS frame if the link is sufficiently slow. Then when the client gets the invalid WS frame, it terminates the connection. I can do some more testing of this scenario. It's interesting that it's so close to exactly 500 ms on multiple connection attempts, although I guess with similar enough data rates, it could be pretty consistent?

Also, NT3 doesn't have this logic, so that wouldn't explain SmartDashboard having a similar issue.
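To illustrate why a client would drop the connection on a malformed WS frame, here is a sketch of the basic checks RFC 6455 lets a client apply to the first two bytes of a frame from the server. It is not the NT4 or Elastic parser; the function names are hypothetical, and it assumes no WebSocket extension was negotiated (so the reserved bits must be zero).

```cpp
#include <cstdint>

// Opcodes defined by RFC 6455: continuation, text, binary, close, ping, pong.
constexpr bool isKnownOpcode(std::uint8_t opcode) {
  return opcode == 0x0 || opcode == 0x1 || opcode == 0x2 ||
         opcode == 0x8 || opcode == 0x9 || opcode == 0xA;
}

// Checks the first two header bytes of a server-to-client frame.
constexpr bool isPlausibleServerFrameStart(std::uint8_t byte0,
                                           std::uint8_t byte1) {
  // RSV1-3 must be zero unless an extension defines them (assumed: none).
  const bool reservedBitsClear = (byte0 & 0x70) == 0;
  // Servers must not mask frames; clients close on a masked frame.
  const bool unmasked = (byte1 & 0x80) == 0;
  return reservedBitsClear &&
         isKnownOpcode(static_cast<std::uint8_t>(byte0 & 0x0F)) && unmasked;
}
```

If a backpressure bug ever caused the byte stream to resume in the middle of a payload, the client would interpret payload bytes as a header. For instance, the JSON characters `{"` (0x7B, 0x22) fail the reserved-bit check, so the client would terminate the connection.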

@chauser
Contributor Author

chauser commented Mar 3, 2025

Looking at the Vivid-Hosting site about the radios, I can't tell where the bandwidth limit is imposed. There's one mention of Team AP mode having a 7 Mbit/s limit in an earlier release. Maybe that has persisted. Given the way programming with the kiosk worked at the event, I suspect that the on-field limit is actually imposed by the access point there as well (and not the robot radio). So I will see what I can set up in the shop tomorrow evening and attempt to reproduce this.

@PeterJohnson
Member

Yes, the bandwidth limit is currently implemented on the AP side.

@ThadHouse
Member

> So I will see what I can set up in the shop tomorrow evening and attempt to reproduce this

You would need a full FMS setup to reproduce this. The bandwidth limit is only enabled with the offseason AP firmware, which requires an FMS setup with external DHCP servers to work. There's no method to enable the bandwidth limit with the normal AP firmware.

@chauser
Contributor Author

chauser commented Mar 3, 2025 via email

@chauser
Contributor Author

chauser commented Mar 3, 2025

Could it be the client websocket timeout that's driving this?

./cpp/NetworkClient.cpp:static constexpr uv::Timer::Time kWebsocketHandshakeTimeout{500};

@PeterJohnson
Member

Potentially? Hard to verify without a log of the client side. We get the "CONNECTED NT4" message on the server side, which only happens after the handshake completes on the server side. That means it's put out onto the wire the HTTP response for switching protocols. If that response really does get delayed by half a second, yes, it would time out... but that's an extraordinary delay for a transmit of a few dozen bytes.
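To put a number on "an extraordinary delay", the sketch below estimates how much traffic would have to be queued ahead of the handshake response for it to arrive after the timeout. The 500 value comes from the `kWebsocketHandshakeTimeout` line quoted above and is assumed to be in milliseconds; the function names and the simple single-queue model are assumptions, not a description of how the field AP actually shapes traffic.

```cpp
#include <cstdint>

// Value from the NetworkClient.cpp line quoted above; unit assumed to be ms.
constexpr std::int64_t kWebsocketHandshakeTimeoutMs = 500;

// True if a handshake response delayed by responseDelayMs would arrive too
// late for a client using that timeout.
constexpr bool handshakeTimesOut(std::int64_t responseDelayMs) {
  return responseDelayMs >= kWebsocketHandshakeTimeoutMs;
}

// Bytes that must already be queued on a link limited to linkBitsPerSecond
// for a newly enqueued packet to wait delayMs before being sent
// (single FIFO queue, no drops; a deliberately simple model).
constexpr std::int64_t backlogBytesForDelay(std::int64_t delayMs,
                                            std::int64_t linkBitsPerSecond) {
  return linkBitsPerSecond * delayMs / 8000;
}
```

Under this model, a 7 Mbit/s limit would need roughly 437,500 bytes of backlog (for example, buffered camera frames) ahead of the response to push it past 500 ms. Whether the AP buffers that much rather than dropping packets is not established in this thread.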

@chauser
Contributor Author

chauser commented Mar 3, 2025

I tried this morning to reproduce the problem using an old radio configured with a bandwidth limit and high-bandwidth background traffic. Elastic connected without difficulty every time. Details:

  • Robot code running in simulation on a computer wired to the RIO port of the radio (10.40.61.93)
  • Elastic dashboard running on a computer wirelessly connected to the radio (10.40.61.236)
  • 3 scp streams of a gigantic file being sent from the .93 machine to the .236 machine
  • Ethernet performance monitor in Task Manager (on the .236 machine) showed the wireless utilization varying slightly around 4 Mbps incoming

One thing about the story from the team bothers me: if the dashboard wouldn't start, how were they displaying the camera streams? They could have been connecting directly with a different program but I don't have confirmation of that.

At this point I still think the most likely scenario is running into the bandwidth limit when on-field, but I also think there's a good possibility it is something else. Who has a setup that could attempt to reproduce the problem in a more field-like setting? As you've said, Peter, without the ability to reproduce it and see what's going on in the dashboard program it's hard to make progress.

I don't think this is necessarily a bug -- if it is bandwidth related the cure is to reduce bandwidth; nothing that could be done in the code would make things good. But if the problem arises again it would be nice to be able to say with certainty "fix the bandwidth use and that will fix the problem."

@crueter
Contributor

crueter commented Mar 5, 2025

> One thing about the story from the team bothers me: if the dashboard wouldn't start, how were they displaying the camera streams? They could have been connecting directly with a different program but I don't have confirmation of that.

Camera streams are separate entities; they're just MJPEG streams from the RIO, PhotonVision, etc. They never go through NetworkTables.

A user reported similar issues when using QFRCDashboard that I've never been able to replicate. It seems like it was NOT a competition-only problem, though.

@chauser
Contributor Author

chauser commented Mar 5, 2025

Ah, yes. If the camera stream widget was already on the dashboard you wouldn't need NT for the streaming to start. Thanks!

4 participants