What is causing the 500 socket hang up errors on native with test app? #1415
For testing purposes we should focus on https://github.com/czyzm/ThaliTestApp since it is closest to behaving like an actual customer application. Please configure that test app to run in native mode and set it up to use my master_add_logging branch which will give much more insight into what Node is doing. Match that by updating our Java code to make it very clear what is going on at the TCP layer. |
Fresh app. Didn't press add data button |
Today we were talking with @chapko and found a great result. We have 2 peers. One peer received error Peer will say Another peer's So we should look into another peer's log. In this log we have |
We have a typical log here. Peer So we have a bug: something tried to send data for a peer that is not available on this socket. I will recompile jxcore tomorrow in order to provide a good error: |
I've tracked this packet. I've received the following results:
So we received some packet and tried to write it into the socket, but the socket had already been closed. Maybe our pipe is not being destroyed properly. |
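To illustrate the failure mode above (a write attempted on a socket that is already closed), here is a minimal sketch of a guard around `write`. `safeWrite` is a hypothetical helper, not something from Thali or jxcore, and `PassThrough` streams stand in for the real TCP sockets. It only shows how the condition can be detected; it is not the fix discussed in this thread, which concerns how the pipe is torn down.

```javascript
'use strict';
const { PassThrough } = require('stream');

// Hypothetical helper (an assumption for illustration, not Thali code).
// Returns false instead of writing when the stream is already torn down.
function safeWrite(socket, chunk) {
  if (!socket || socket.destroyed || !socket.writable) {
    return false; // caller can log which peer/socket the packet was meant for
  }
  socket.write(chunk);
  return true;
}

// Stand-ins for an open socket and one that has been destroyed.
const live = new PassThrough();
const closed = new PassThrough();
closed.destroy();

const wroteToLive = safeWrite(live, 'packet');
const wroteToClosed = safeWrite(closed, 'packet');
console.log({ wroteToLive, wroteToClosed });
```

Logging at the point where `safeWrite` returns false would show which side of the pipe is still forwarding data after its partner was destroyed.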
I've marked sockets with autoincrement ids. Socket that throws
So I am sure that this socket is |
Who destroyed our incoming? It is
But our pipe ( So I've found the following steps:
|
So I am sure that we have a problem in I've fixed this bug and I couldn't receive |
We get a segfault with SpiderMonkey on desktop 100% of the time. We will test the Thali test app on phones again. |
@andrew-aladev I noticed that you are outputting content from data events. I wonder if you are running into thaliproject/jxcore#71? If you don't touch the buffers you get on the streams do you still get the segfault? |
@yaronyg I have just added simple recreate logic for ECONNRESET and ECONNREFUSED errors and on desktop everything seems fine but on devices I'm getting errors:
You can find it in the 1.log below. I put my changes into |
Both of the final logs return with no-activity timeouts. That is correct. So what is the problem with the last logs you gave above? |
I was running "test 1" from the thali test app (Andrew's branch). This test, I believe, should run infinitely, but it failed because of the test timeout (not the "no activity" timeout) and exited with code 3. I'm not sure if this is a thali bug or a test bug. |
This may be connected with a libuv issue. |
I've collected all required information about this issue.
But we couldn't see Result:
|
We've found another problem that produces a 500 error for us. BTW we've found a proper way to destroy piped sockets in
|
It will create
So we have an infinite loop. We couldn't cancel our replication. Why? There is a commit where
This comment maybe means that We need to find a way to break this infinite loop without exceptions. |
I suspect that when we are the ones who are responsible for closing things down then we need a flag somewhere on the replication action that tells us that so that when the replication exits with a 500 we can check the flag and realize that it's a meaningless failure and in fact we exited as expected. This flag should probably be hooked to both kill as well as to replication time out. |
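A rough sketch of the flag idea above, assuming the replication error surfaces as an object with `status` and `message` fields (that shape is an assumption based on the "500 socket hang up" log line). `ReplicationAction`, `closedByUs` and `isExpectedFailure` are hypothetical names, not the real thaliReplicationPeerAction API.

```javascript
'use strict';

// Hypothetical stand-in for the replication action; names are illustrative.
class ReplicationAction {
  constructor() {
    // True once we are the ones who initiated the shutdown.
    this.closedByUs = false;
  }

  // Both shutdown paths set the same flag.
  kill() {
    this.closedByUs = true;
  }

  onReplicationTimeout() {
    this.closedByUs = true;
  }

  // A 500 "socket hang up" is only meaningless if we caused the close.
  isExpectedFailure(err) {
    const isHangUp = Boolean(err) &&
      err.status === 500 &&
      /socket hang up/.test(String(err.message));
    return this.closedByUs && isHangUp;
  }
}

const action = new ReplicationAction();
const hangUp = { status: 500, message: 'socket hang up' };

const beforeKill = action.isExpectedFailure(hangUp);
action.kill();
const afterKill = action.isExpectedFailure(hangUp);
console.log({ beforeKill, afterKill });
```

Keeping the check narrow (flag set and error is a hang up) means a 500 that arrives while we are not shutting down is still reported as a real failure.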
I've found a reliable way to reproduce and understand this issue.
Then we can see the issue:
We will try to find a way to ignore this error with some flag for now. In the future we need to upgrade our
|
I've finished debugging this issue for today. In I've checked that our We can add here:
This will produce:
So our peer connection was closed because the remote peer closed the connection. |
But why did the remote peer close the connection? Scenario: Why did Device B close the connection? There is literally no logic anywhere in Express-PouchDB that would close a connection on the server side. It only closes in response to the client closing a connection. So either the Bluetooth connection was lost (in which case you should see a nonTCPPeerAvailabilityChanged event in the logs with a null port not to mention error messages from the Android layer) or we have a bug in Express that is closing connections for some bizarre reason. |
Today we've found that jxcore's special timers don't work with our timeout:
We will receive:
This 5000 comes from We can work around this issue by using a 10-day timeout instead of a 10-year one. This number |
I am going to provide a workaround for the jxcore issue. |
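For context on the 10 days vs 10 years numbers: stock Node.js timers store the delay as a signed 32-bit integer, so any delay above 2147483647 ms (about 24.8 days) is out of range. Whether jxcore's special timers hit exactly the same limit is an assumption here, but the arithmetic below is at least consistent with 10 days working and 10 years not. `fitsTimerRange` and `clampTimerDelay` are hypothetical helpers.

```javascript
'use strict';

// Largest delay that fits in a signed 32-bit integer, in milliseconds.
const MAX_TIMER_DELAY_MS = Math.pow(2, 31) - 1;

const DAY_MS = 24 * 60 * 60 * 1000;
const TEN_DAYS_MS = 10 * DAY_MS;
const TEN_YEARS_MS = 10 * 365 * DAY_MS;

// Hypothetical helper: true if the delay can be passed to setTimeout safely.
function fitsTimerRange(delayMs) {
  return typeof delayMs === 'number' &&
    isFinite(delayMs) &&
    delayMs >= 1 &&
    delayMs <= MAX_TIMER_DELAY_MS;
}

// Hypothetical helper: clamp an arbitrary delay into the safe range.
function clampTimerDelay(delayMs) {
  return Math.min(Math.max(1, Math.floor(delayMs)), MAX_TIMER_DELAY_MS);
}

console.log('10 days fits:', fitsTimerRange(TEN_DAYS_MS));
console.log('10 years fits:', fitsTimerRange(TEN_YEARS_MS));
console.log('10 years clamped:', clampTimerDelay(TEN_YEARS_MS));
```

Clamping at the call site is one option; picking a constant that is known to be in range, as with the 10-day workaround, is the simpler one.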
In the native logs in #1356 which are:
We see a bunch of errors of the form:
10-24 15:07:21.567 8855 8909 I jxcore-log 2016-10-24 13:07:21 - DEBUG thaliReplicationPeerAction: 'Got error on replication - 500 socket hang up'
10-24 15:07:21.567 8855 8909 I jxcore-log
The 500 socket hang up on replication error means that the replication code opened up a remote connection to the other phone, the request was sent and the connection was closed without ever receiving a response from the other side. Unfortunately PouchDB treats this as a 500 series error and so the entire replication fails. That isn't a big deal by itself because we will automatically retry when we get a beacon. The logs show that we do retry, and that we keep failing with the same 500 series error.
I haven't seen this particular error pattern on Wifi or the Wifi based native mock. So this argues pretty strongly that the problem has to do with the interaction of the Android native layer and Node. But that doesn't actually mean that's true. What we need to do is to trace the failure that made us go out.
In other words we have to show that:
Basically what we need to know is - why are we getting the failed connection on replication? And why does this failure only happen on replication and not on beacon retrieval?
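To make the "socket hang up" condition described above concrete, here is a small stand-alone Node sketch, independent of Thali, PouchDB and the Android layer. A local HTTP server accepts a request and destroys the socket without replying, and the client then sees a connection-reset style error. `reproduceSocketHangUp` is a hypothetical name; this only reproduces the symptom and says nothing about which layer closes the connection in the real logs.

```javascript
'use strict';
const http = require('http');

// Resolves with the error the client sees when the server closes the
// connection after receiving the request but before sending a response.
function reproduceSocketHangUp() {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      // Drop the connection without writing any response.
      req.socket.destroy();
    });

    server.on('error', reject);

    // Port 0 asks the OS for any free port.
    server.listen(0, '127.0.0.1', () => {
      const port = server.address().port;
      const req = http.get({ host: '127.0.0.1', port: port, path: '/' }, (res) => {
        res.resume();
        server.close();
        reject(new Error('unexpected response ' + res.statusCode));
      });

      req.on('error', (err) => {
        server.close();
        resolve(err);
      });
    });
  });
}

reproduceSocketHangUp().then((err) => {
  console.log('client error:', err.code, err.message);
});
```

Running the same request against the real listener with extra logging on both ends would show whether the close originates in Node or below it.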