Observed NT4 rapid connect/disconnect cycles during competition match #7847
What version of WPILib is on the robot code and the dashboard?

On the client, there's a 1 second timeout: if the client hasn't received any messages in 1 second, it disconnects. The client sends a ping message every 200 ms and the server responds to each with a pong. A timeout would therefore produce a 1 second connection loop, which doesn't match what you're seeing, so it's not a timeout but rather the client disconnecting for some other reason.

Unfortunately the server (robot) log isn't much help here, as it doesn't tell us why the client closed the connection; that will only be discoverable from any error messages output by the client. The fact that it happens with Elastic as well (a completely independent client implementation) suggests there's probably something wrong with some of the data being sent by the server, but it's hard to tell what. A Wireshark capture is probably needed to debug further.
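The ping/pong and timeout behavior described above can be sketched as follows. This is an illustration of the described logic only, not WPILib's actual implementation; the class and method names are hypothetical.

```python
# Sketch of the client-side keepalive described above (hypothetical names,
# not WPILib's code): ping every 200 ms, disconnect after 1 s of silence.

PING_INTERVAL = 0.2   # seconds between client pings
TIMEOUT = 1.0         # disconnect if nothing received in this window

class KeepaliveTracker:
    def __init__(self, now: float):
        self.last_rx = now    # time of last message received from the server
        self.last_ping = now  # time the client last sent a ping

    def on_message(self, now: float) -> None:
        """Any message from the server (including a pong) resets the timeout."""
        self.last_rx = now

    def should_ping(self, now: float) -> bool:
        """True when it's time to send the next ping."""
        if now - self.last_ping >= PING_INTERVAL:
            self.last_ping = now
            return True
        return False

    def timed_out(self, now: float) -> bool:
        return now - self.last_rx >= TIMEOUT
```

With a responsive server, the pongs keep `last_rx` fresh and the connection never times out; only a full second of silence triggers a disconnect. That is why a pure timeout would produce a roughly 1 s reconnect loop rather than the observed half-second cycle.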
The robot code and Shuffleboard were 2025.3.1.

I agree that error information from the client would be most helpful, but without radios configured as they are on the field I don't know how to reproduce this in an environment where we'd be able to see the client error messages.

The regularity of the disconnection after 1/2 second and the reconnection 1/2 second later strongly suggests that a timeout is involved somewhere. If it were merely a matter of receiving unexpected data, I doubt the timing would be that regular.

Does NT4 limit its TCP segment sizes to the MTU, or does it rely on IP fragmentation? In the absence of any tcpdump/Wireshark data, I'm inferring the following from the wpilog: if bandwidth limits are involved, it seems like the only thing affected would be the first real data sent from server to client. Is that likely to be large and spread across multiple TCP segments and/or IP fragments?
Yes, there's generally a fairly large initial burst of data on connection, and its size depends on the number of topics created (a team using AdvantageKit is more likely to have a large amount of logging and a large number of topics). Dashboard clients typically do a "subscribe all" on connection, which results in the server sending announcement messages for every topic that exists on the server. For some dashboards (Shuffleboard and SmartDashboard) this also results in all current values being sent for every topic. The networking implementation and code path are quite different for SmartDashboard, as it's an NT3 client rather than NT4, but the quantity of data is likely to be fairly similar for both.

There's fairly complex logic in NT4 to deal with backpressure from the network, so it's possible there's a bug in there somewhere that ends up creating an invalid WebSocket frame if the link is sufficiently slow; when the client receives the invalid frame, it terminates the connection. I can do some more testing of this scenario. It's interesting that it's so close to exactly 500 ms on multiple connection attempts, although I suppose with similar enough data rates it could be fairly consistent. Also, NT3 doesn't have this logic, so that wouldn't explain SmartDashboard having a similar issue.
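A back-of-envelope calculation suggests why that initial burst is a plausible suspect under a bandwidth cap. All numbers below are illustrative assumptions, not measurements from this issue.

```python
# How long would the initial "subscribe all" burst take to drain through
# a bandwidth cap? All figures are hypothetical, for illustration only.

TOPIC_COUNT = 2000          # plausible for a heavily-logging robot program
BYTES_PER_ANNOUNCE = 200    # rough size of one announcement + initial value
BANDWIDTH_BPS = 4_000_000   # hypothetical 4 Mbit/s cap

burst_bytes = TOPIC_COUNT * BYTES_PER_ANNOUNCE
drain_seconds = burst_bytes * 8 / BANDWIDTH_BPS
print(f"{burst_bytes} bytes -> {drain_seconds:.2f} s to drain")  # 0.80 s
```

With these made-up numbers, the burst alone takes 0.8 s to drain through the cap, which is on the same order as the observed half-second cycles; a client that gives up while the burst is still in flight would reconnect and repeat the pattern.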
Looking at the Vivid Hosting site about the radios, I can't tell where the bandwidth limit is imposed. There's one mention of Team AP mode having a 7 Mbit/s limit in an earlier release; maybe that has persisted. Given the way programming with the kiosk worked at the event, I suspect that the on-field limit is actually imposed by the access point there as well (and not by the robot radio). So I will see what I can set up in the shop tomorrow evening and attempt to reproduce this.
Yes, the bandwidth limit is currently implemented on the AP side. |
You would need a full FMS setup to reproduce this. The bandwidth limit is only enabled with the offseason AP firmware, which requires an FMS setup with external DHCP servers to work. There's no method to enable the bandwidth limit with the normal AP firmware.
That's unfortunate and hopefully is something that can be changed, if not for this year then for next. Teams should be able to test in an environment that is as close as possible to the competition environment, and the bandwidth limit seems like something that is pretty fundamental to test (even apart from anything specific to this issue).
Could it be the client websocket timeout that's driving this?
Potentially? It's hard to verify without a log of the client side. We get the "CONNECTED NT4" message on the server side, which only happens after the handshake completes there. That means the server has put the HTTP response for switching protocols out onto the wire. If that response really does get delayed by half a second, then yes, it would time out... but that's an extraordinary delay for a transmit of a few dozen bytes.
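Some quick arithmetic supports that last point. Even under a tight bandwidth cap, the upgrade response is tiny; the sizes and rate below are assumptions for illustration.

```python
# Transmit time for the HTTP 101 "switching protocols" response under a
# hypothetical bandwidth cap. Sizes are rough assumptions.

RESPONSE_BYTES = 200        # a few header lines at most
BANDWIDTH_BPS = 4_000_000   # hypothetical 4 Mbit/s cap

tx_seconds = RESPONSE_BYTES * 8 / BANDWIDTH_BPS
print(f"{tx_seconds * 1000:.2f} ms")  # 0.40 ms
```

Even at a capped rate, transmitting the handshake response takes well under a millisecond, so a half-second delay of that response would have to come from queueing behind other traffic (such as camera streams), not from the response itself.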
I tried this morning to reproduce the problem using an old radio configured with a bandwidth limit and high-bandwidth background traffic. Elastic connected without difficulty every time.
One thing about the story from the team bothers me: if the dashboard wouldn't start, how were they displaying the camera streams? They could have been connecting directly with a different program, but I don't have confirmation of that.

At this point I still think the most likely scenario is running into the bandwidth limit on-field, but I also think there's a good possibility it is something else. Who has a setup that could attempt to reproduce the problem in a more field-like setting? As you've said, Peter, without the ability to reproduce it and see what's going on in the dashboard program, it's hard to make progress.

I don't think this is necessarily a bug: if it is bandwidth related, the cure is to reduce bandwidth use, and nothing that could be done in the code would fix that. But if the problem arises again, it would be nice to be able to say with certainty, "fix the bandwidth use and that will fix the problem."
Camera streams are separate entities; they're just MJPEG streams from the RIO or PhotonVision, etc., and never go through NetworkTables. A user reported similar issues when using QFRCDashboard that I've never been able to replicate. It seems like it was NOT a competition-only problem, though.
Ah, yes. If the camera stream widget was already on the dashboard you wouldn't need NT for the streaming to start. Thanks! |
Describe the bug

I helped a team this weekend that could get no dashboard displayed on-field during competition, even though it worked fine in the pits. Eventually I looked at the wpilog and found what is shown in the attachment: the server repeatedly logs a Shuffleboard connection, followed 1/2 second later by a disconnection, and then another connection a half second after that. This continued from before the match until it was over. Shuffleboard on the driver station never showed that it was connected. The team reported that they had no luck with Elastic or SmartDashboard either.
A second team reported similar inability to get their dashboard to connect but I do not have a wpilog for them.
I learned that both teams were showing driver camera streams on their driver station computers without any attempt to control the camera stream bandwidth. Both teams stated they could do without the camera streams and turned them off. They had no further problems with dashboard connections throughout the competition.
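A rough estimate shows why unthrottled camera streams matter here. All figures below are ballpark assumptions, not measurements from these teams; MJPEG frame sizes vary widely with content and compression quality.

```python
# Estimate MJPEG stream bandwidth: compressed frame size x frame rate.
# All figures are ballpark assumptions for illustration.

FRAME_KB = 30   # ~30 KB per 640x480 JPEG frame at moderate quality
FPS = 30        # unthrottled frame rate
STREAMS = 2     # two driver cameras

bits_per_second = FRAME_KB * 1024 * 8 * FPS * STREAMS
print(f"{bits_per_second / 1e6:.1f} Mbit/s")  # 14.7 Mbit/s
```

Under these assumptions, two unthrottled streams alone would exceed even a 7 Mbit/s cap. Cutting resolution, frame rate, or JPEG quality brings the number down quickly, which is consistent with the teams' connection problems disappearing once the streams were turned off.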
My best guess is that the on-field bandwidth limiting is triggering this behavior but I don't know enough about how either the bandwidth limiting works or how NT4 decides that a connection isn't viable (and therefore ends it) to be sure this is the reason. I don't have an adequate model that would explain how bandwidth-limiting would lead to this particular behavior of NT, but so far it is the only scenario I've managed to posit that matches the observations.
It would be helpful for CSAs to know if on-field bandwidth limiting is the cause of this behavior or if there is some other root cause that we should be looking for.