This repository has been archived by the owner on Jan 15, 2021. It is now read-only.

What is causing the 500 socket hang up errors on native with test app? #1415

Closed
yaronyg opened this issue Oct 25, 2016 · 31 comments

yaronyg commented Oct 25, 2016

In the native logs in #1356 which are:

S7 logs: https://gist.github.com/czyzm/5a1a916c25dfe9156005ce87e0cce1c8
Nexus logs: https://gist.github.com/czyzm/25b92d6ed662d5cb838e64eaedfdc2f9

We see a bunch of errors of the form:

10-24 15:07:21.567 8855 8909 I jxcore-log 2016-10-24 13:07:21 - DEBUG thaliReplicationPeerAction: 'Got error on replication - 500 socket hang up'
10-24 15:07:21.567 8855 8909 I jxcore-log

The 500 socket hang up replication error means that the replication code opened a connection to the other phone, the request was sent, and the connection was closed without a response ever arriving from the other side. Unfortunately PouchDB treats this as a 500 series error and so the entire replication fails. That isn't a big deal in itself, because we automatically retry when we get a beacon. The logs show that we do retry and keep failing with the same 500 series error.

I haven't seen this particular error pattern on Wifi or the Wifi based native mock. That argues pretty strongly that the problem has to do with the interaction of the Android native layer and Node, but it doesn't prove it. What we need to do is trace the failure from the moment we initiate the outgoing connection.

In other words we have to show that:

  1. Phone A receives a BLE announcement of Phone B
  2. Phone A receives a peerAvailabilityChanged event in thaliMobileNativeWrapper.js and since the peer is available that code will call createPeerListener which will then call connectToRemotePeer which will call the native Connect call. We need to see all of this in the logs.
  3. On the Android side we need to see that the connect call happened, we need to see that the TCP/IP listener is created.
  4. Once we get the connect callback we will then call handleForwardConnection which will call multiplexToNativeListener which sets up everything on the Node end.
  5. At this point we are back in Android and we need to see that Android got an incoming TCP/IP connection. We need to see when data is received on the input stream and sent on the output stream. If the stream is terminated for any reason we need to know why! Did node close the TCP/IP connection? Did Android terminate the TCP/IP connection? In the node code I have instrumented every single place we call .end or .destroy so we can always see who initiated the closure. And we need this information for both incoming and outgoing connections via Android. When we get a Bluetooth server socket we relay that to a native TCP client. We need to show when that client is created. We need to show that it created a socket successfully with node. We need to show that data is flowing across that socket. If the socket is torn down we need to prove who did it. Did node terminate the socket on the server end? Did Android terminate the client socket because of a Bluetooth failure? What happened?

Basically what we need to know is - why are we getting the failed connection on replication? And why does this failure only happen on replication and not on beacon retrieval?
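
As an illustration of the instrumentation described in step 5, a minimal sketch of a logging wrapper (a hypothetical helper, not the actual code in the branch) could look like this:

function instrumentSocket(socket, label, logger) {
  // Wrap end/destroy so that every closure is logged together with the
  // stack trace that initiated it.
  ['end', 'destroy'].forEach(function (method) {
    var original = socket[method];
    socket[method] = function () {
      logger.debug(label + ' - ' + method + ' called, stack: ' +
        new Error().stack);
      return original.apply(socket, arguments);
    };
  });
  return socket;
}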

yaronyg commented Oct 25, 2016

For testing purposes we should focus on https://github.com/czyzm/ThaliTestApp since it is closest to behaving like an actual customer application. Please configure that test app to run in native mode and set it up to use my master_add_logging branch which will give much more insight into what Node is doing. Match that by updating our Java code to make it very clear what is going on at the TCP layer.

evabishchevich commented Oct 27, 2016

@yaronyg yaronyg assigned yaronyg and unassigned evabishchevich Oct 28, 2016
@andrew-aladev

Today we were talking with @chapko and got a good result. We have 2 peers. One peer received the error Error: Channel destroyed. Why? We've found the place where it comes from.

A peer will say Channel destroyed when it receives the packet 0x04 0x00 0x06 0x00. 4 is just the packet length and 6 is the error code; the meaning of this packet is "remote error". The other peer sent this packet.

The other peer's multiplex will send error 6 only if that peer is the initiator of the destroy. So Channel destroyed means the other peer destroyed our channel.

So we should look into the other peer's log. In that log we have incoming (TCP) - Node TCP/IP client <-> Mux stream- ... - 0 - error: Error: This socket is closed.. This is the reason. We will continue investigating this error.
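
For reference, a minimal sketch of how such a packet could be decoded, assuming the length and the error code are 16-bit little-endian values as the byte sequence suggests (the exact framing is my assumption, not taken from the multiplex source):

// Sketch only: interpret 0x04 0x00 0x06 0x00 as <length LE16><code LE16>.
var packet = new Buffer([0x04, 0x00, 0x06, 0x00]);
var length = packet.readUInt16LE(0); // 4 - the packet length
var code = packet.readUInt16LE(2);   // 6 - the "remote error" code
console.log('mux packet: length %d, error code %d', length, code);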

andrew-aladev commented Nov 1, 2016

We have a typical log here.

Peer 40baf507-afa8-42b6-b9bd-be0adb7787af became available and then not available. Then it appears available again with a new id 12a3cece-daba-48d4-b7e6-33f483410c1b. Our This socket is closed log entry is related to the old socket. Some code is trying to write into a dead socket.

So we have a bug: someone tried to send data for a peer that is no longer available on this socket. I will recompile jxcore tomorrow in order to provide a better error: This socket is closed, someone tried to send data: %s to us.
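
A user-land sketch of the kind of diagnostic I mean (the real change will live in jxcore; the wrapper below is hypothetical):

// Hypothetical wrapper: report the data somebody tried to write into a
// socket that is already closed, before the usual error fires.
function wrapWriteWithDiagnostics(socket, logger) {
  var originalWrite = socket.write;
  socket.write = function (data) {
    if (socket.destroyed) {
      logger.debug('This socket is closed, someone tried to send data: ' +
        JSON.stringify(String(data)) + ' to us');
    }
    return originalWrite.apply(socket, arguments);
  };
  return socket;
}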

andrew-aladev commented Nov 2, 2016

I've tracked this packet. I've received the following results:

tcp onread, chunk: '^A]^V^C^C^@O^B^@^@K^C^CPk^�5��<U+B6625>#�&��\^\�ǹV�^NF2^R^U�dhi^@^@�^@^@#^@^@^@^@�^A^@^A^@^@#^@^@3t^@^R^Hhttp/1.http/1.0^V^C^C^@^D^N^@^@^@', stack: 'Error
    at TCP.onread (net.js:454:11)'
2016-11-02 12:24:40 - DEBUG createPeerListener: 'incoming (TCP) - Node TCP/IP client <-> Mux stream- urn:uuid:e5629472-127a-47de-bc3a-86fd189e4f81 - 0 - error: Error: This socket is closed, data: '^V^C^C^@O^B^@
^@K^C^CPk^�5��<U+B6625>#�&��\^\�ǹV�^NF2^R^U�dhi^@^@�^@^@#^@^@^@^@�^A^@^A^@^@#^@^@3t^@^R^Hhttp/1.http/1.0^V^C^C^@^D^N^@^@^@''
socket write, chunk: '^A]^V^C^C^@O^B^@^@K^C^Cu �^Y3q�^B#Z1Z<B�G�rm���M�]ESCz��f^XL^@^@�^@^@#^@^@^@^@�^A^@^A^@^@#^@^@3t^@^R^Hhttp/1.http/1.0^V^C^C^@^D^N^@^@^@', stack: 'Error
    at write (_stream_readable.js:559:13)
    at flow (_stream_readable.js:570:7)
    at Socket.pipeOnReadable (_stream_readable.js:600:5)
    at Socket.emit (events.js:79:17)
    at emitReadable_ (_stream_readable.js:390:10)
    at emitReadable (_stream_readable.js:386:5)
    at readableAddChunk (_stream_readable.js:144:31)
    at Socket.Readable.push (_stream_readable.js:108:10)
    at TCP.onread (net.js:487:18)

So we received a packet and tried to write it into a socket, but the socket had already been closed. Maybe our pipe isn't being destroyed properly.

andrew-aladev commented Nov 2, 2016

I've marked sockets with autoincrement ids. The socket that throws This socket is closed was created (in lib/net.js) by:

Server.prototype._listen2 = function(address, port, addressType, backlog, fd) {
  ...
  self._handle.onconnection = onconnection;
  ...
}

function onconnection(clientHandle) {
 ...
  var socket = new Socket({
    handle: clientHandle,
    allowHalfOpen: self.allowHalfOpen
  });
  ...
  self.emit('connection', socket);
}

So I am sure that this socket is incoming.

andrew-aladev commented Nov 2, 2016

Who destroyed our incoming socket? It is makeIntoCloseAllServer. So the typical log is:

2016-11-02 13:54:30 - DEBUG thaliWifiInfrastructure: 'Emiting wifiPeerAvailabilityChanged {"peerIdentifier":"urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89","portNumber":null,"hostAddress":null}'
2016-11-02 13:54:30 - DEBUG thaliMobileNativeWrapper: 'Received peerAvailabilityChanged event from native layer [{"peerIdentifier":"urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89","peerAvailable":false,"pleaseCon
nect":false}]'
2016-11-02 13:54:30 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89","peerAvailable":false,"pleaseConnect":false}
'
2016-11-02 13:54:30 - DEBUG thaliMobileNativeWrapper: 'handlePeerUnavailable - Emitting {"peerIdentifier":"urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89","portNumber":null}'
2016-11-02 13:54:30 - DEBUG thaliMobile: 'Emitting peerAvailabilityChanged from handlePeer {"peerIdentifier":"urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89","hostAddress":null,"portNumber":null,"connectionType":
"AndroidBluetooth","suggestedTCPTimeout":5000}'
2016-11-02 13:54:30 - DEBUG thaliPeerPoolOneAtATime: 'action returned successfully from start - Action ID: 1, Action Type: GetRequestBeacon, Connection Type: AndroidBluetooth, Peer Identifier: urn:uuid:35e4c552-
9814-4bff-8bef-eb720221ab89'
2016-11-02 13:54:30 - DEBUG thaliTcpServersManager: 'Terminate outgoing connection called on peerID urn:uuid:35e4c552-9814-4bff-8bef-eb720221ab89 with port 62320'
2016-11-02 13:54:30 - DEBUG makeIntoCloseAllServer: 'closeAll called on server'
2016-11-02 13:54:30 - DEBUG createPeerListener: 'ololo incoming destroyed, stack: 'Error
    at Socket.incoming._destroy (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/mux/createPeerListener.js:511:17)
    at Socket.destroy (net.js:448:8)
    at /Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/makeIntoCloseAllServer.js:69:18
    at Array.forEach (native)
    at Server.server.closeAll (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/makeIntoCloseAllServer.js:68:17)
    at Object.closeServer (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/mux/createPeerListener.js:25:10)
    at ThaliTcpServersManager.terminateOutgoingConnection (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/mux/thaliTcpServersManager.js:293:26)
    at /Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/NextGeneration/thaliPeerPool/thaliPeerPoolOneAtATime.js:198:12
    at Array.0 (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/node_modules/lie/lib/index.js:88:21)
    at nextTick (/Users/hiqouser/workspace/aaladev_1407/Thali_CordovaPlugin/thali/node_modules/immediate/lib/index.js:61:18)
    at process._tickCallback (node.js:924:13)''

But our pipe (incomingStream.pipe(incoming).pipe(incomingStream);) is still alive and we are trying to send data from incomingStream into the dead incoming.

So I've found the following sequence:

  1. The peer becomes unavailable.
  2. Our server is closed via the closeAll method.
  3. Our server socket is closed and reports that it was closed, OK.
  4. The server had a non-empty list of sockets. These are our incoming sockets. The server destroyed all of them.
  5. We didn't receive any event from any incoming socket. Our pipe is still alive, but it shouldn't be.
  6. We receive an error on incoming and only then do we destroy incomingStream.

andrew-aladev commented Nov 2, 2016

So I am sure that we have a problem in function onNewConnection(incoming) {. We need to subscribe to incoming.on('error'), 'finish', 'close', etc. immediately, without waiting. Otherwise we will lose these events on incoming. Right now we end up subscribing to an already destroyed socket.
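
A minimal sketch of the fix (the handler bodies are placeholders, not the actual createPeerListener code; logger stands for the module's usual logger):

function onNewConnection(incoming) {
  // Subscribe immediately - if we defer this, the socket may already have
  // been destroyed by closeAll and we will never see these events.
  incoming.on('error', function (err) {
    logger.debug('incoming error: ' + err);
  });
  incoming.on('close', function () {
    // Tear down the mux stream that was piped to this socket so that
    // nothing keeps writing into a dead socket.
  });
  incoming.on('finish', function () {
    logger.debug('incoming finished');
  });
  // ... only after that continue with the usual setup (creating the mux
  // stream, piping, etc.).
}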

I've fixed this bug and I no longer get This socket is closed. But I am still receiving socket hang up, so this bug is not the cause of the hang up. I will create a separate issue for it.

andrew-aladev commented Nov 2, 2016

We are getting a segfault with SpiderMonkey on desktop 100% of the time. We will test the Thali test app on phones again.

yaronyg commented Nov 2, 2016

@andrew-aladev I noticed that you are outputting content from data events. I wonder if you are running into thaliproject/jxcore#71? If you don't touch the buffers you get on the streams do you still get the segfault?

chapko commented Nov 3, 2016

@yaronyg I have just added simple recreate logic for ECONNRESET and ECONNREFUSED errors, and on desktop everything seems fine, but on devices I'm getting errors:

11-03 21:09:43.353 14928 14975 I jxcore-log: 2016-11-03 18:09:43 - WARN createPeerListener: 'callNative for BC:75:74:3A:BC:B7-19 failed with Connection to peer [<no peer name> BC:75:74:3A:BC:B7 18] failed: Failed to connect (tried 1 time(s)): read failed, socket might closed or timeout, read ret: -1'

You can find it in the 1.log below.

I put my changes into the master_yarong_1446-retries-fix branch.
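
For context, the recreate logic is roughly of this shape (a simplified sketch, not the exact code in the branch):

var net = require('net');

// Sketch: recreate the outgoing connection when it fails with a transient
// error instead of giving up immediately.
function connectWithRetry(port, host, retriesLeft, callback) {
  var socket = net.createConnection(port, host);
  socket.on('error', function (err) {
    if ((err.code === 'ECONNRESET' || err.code === 'ECONNREFUSED') &&
        retriesLeft > 0) {
      console.log('retrying after ' + err.code);
      return connectWithRetry(port, host, retriesLeft - 1, callback);
    }
    callback(err);
  });
  socket.on('connect', function () {
    callback(null, socket);
  });
}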

chapko commented Nov 3, 2016

I've added some fixes. Here are the logs (run on commit 06bee8b).

I'm going to investigate it tomorrow.

yaronyg commented Nov 3, 2016

Both of the final logs end with "no activity" timeouts. That is correct. So what is the problem with the last logs you gave above?

chapko commented Nov 3, 2016

I was running "test 1" from the thali test app (Andrew's branch). This test, I believe, should run indefinitely, but it failed because of the test timeout (not the "no activity" timeout) and exited with code 3. I'm not sure if this is a Thali bug or a test bug.

andrew-aladev commented Nov 8, 2016

This may possibly be connected with a libuv issue.

andrew-aladev commented Nov 9, 2016

I've collected all required information about this issue.

  1. We have a duplex pipe (for example incoming.pipe(incomingStream).pipe(incoming)).
  2. We destroy a socket from the pipe (for example incoming).
  3. We call incoming.destroy.
  4. incoming emits the finish event.
  5. We are waiting for the finish event.
  6. lib/net is waiting for the finish event here.
  7. We call incomingStream.destroy in our finish handler. But this doesn't matter.
  8. function onSocketFinish from lib/net is called here.
  9. onSocketFinish creates a shutdown request.
  10. Our stream will be destroyed after this shutdown request, so we will have a non-null stream->shutdown_req in uv__stream_destroy here.
  11. We will receive process._errno === ECANCELED in function onread here.
  12. _destroy will be called with process._errno as the first argument here.
  13. fireErrorCallbacks will be called anyway. It won't respect destroyed = true, so the order of destroy calls doesn't matter.
  14. So we should always see ECONNREFUSED.

But we don't see ECONNREFUSED with Wifi. Why? close from lib/net unbinds onread from our socket and we just ignore process._errno.

Result:

  1. If close is faster than onread we will just ignore process._errno and we won't see ECONNREFUSED.
  2. If onread is faster than close we will see ECONNREFUSED.

andrew-aladev commented Nov 9, 2016

We've found another problem that produces a 500 error for us.

BTW we've found a proper way to destroy piped sockets in lib/tls:

  pair.encrypted.on('close', function() {
    process.nextTick(function() {
      // Encrypted should be unpiped from socket to prevent possible
      // write after destroy.
      pair.encrypted.unpipe(socket);
      socket.destroySoon();
    });
  });

destroySoon is not required for us; we can use destroy instead. process.nextTick is not important either.
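
Applied to our incoming/incomingStream pair, the same pattern would look roughly like this (a sketch, assuming we adopt the lib/tls approach in createPeerListener):

// Unpipe before destroying so that nothing keeps writing into the dead
// socket, then destroy the other side of the pipe.
incoming.on('close', function () {
  incomingStream.unpipe(incoming);
  incomingStream.destroy();
});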

andrew-aladev commented Nov 10, 2016

Socket hang up is meaningless by itself.

function socketCloseListener() {
  ...
  } else if (!req.res && !req._hadError) {
    var error = createHangUpError(1);

It will create a hang up without any underlying error if it couldn't obtain a response. Why? Because request needs it.

  if (self.req && self.req._reusedSocket && error.code === 'ECONNRESET'
      && self.agent.addRequestNoreuse) {
    self.agent = { addRequest: self.agent.addRequestNoreuse.bind(self.agent) }
    self.start()
    self.req.end()
    return
  }

So we have an infinite loop. We can't cancel our replication. Why? pouchdb's cancel will try to abort the current request. request will try to destroy the lib/tls socket. That will throw a socket hang up, and this becomes the reason for request to recreate itself.

There is a commit where !req.res appears. The author of this commit is a team member of request.js.

// This socket error fired before we started to receive a response. The error needs to fire on the request.

This comment probably means that request.js needs this socket hang up, but for us it is meaningless.

We need to find a way to break this infinite loop without exceptions.

yaronyg commented Nov 14, 2016

I suspect that when we are the ones responsible for closing things down, we need a flag somewhere on the replication action that records that. Then, when the replication exits with a 500, we can check the flag and realize that it's a meaningless failure and that in fact we exited as expected. This flag should probably be hooked to both kill and the replication timeout.
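
A rough sketch of what such a flag might look like on the replication action (the property name and where it is checked are hypothetical):

// Hypothetical flag marking an expected shutdown of the replication action.
ThaliReplicationPeerAction.prototype.kill = function () {
  this._expectedTermination = true;
  // ... existing kill logic (cancel the PouchDB replication, etc.)
};

// Wherever the replication error is handled:
replication.on('error', function (err) {
  if (self._expectedTermination) {
    // We initiated the shutdown (kill or replication timeout), so a 500
    // "socket hang up" here is expected and not a real failure.
    return;
  }
  logger.debug('Got error on replication - ' + err);
});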

andrew-aladev commented Nov 16, 2016

I've found a reliable way to reproduce and understand this issue.

  1. A notification or replication action creates a request.
  2. request creates an https request.
  3. The https request creates a tls socket.
  4. The tls socket creates a raw underlying socket with the forever agent.
  5. This socket connects to the server.
  6. The server creates a new raw incoming socket.
  7. We have a pair: request socket <-> incoming.
  8. openssl is doing a handshake through the incoming socket.
  9. The tls socket is waiting for the secureConnection event.

Then we can see the issue:

  1. Sometimes openssl is not able to complete the handshake; it cannot send and receive all the required data.
  2. The tls socket is not able to emit the secureConnection event.
  3. We destroy the incoming socket and the server.
  4. lib/net will try to send a shutdown request to destroy the request socket.
  5. lib/http sees that the underlying request socket was closed and we have no error.
  6. !req.res && !req._hadError is true and lib/http produces a hang up.
  7. request receives the hang up and recreates itself.
  8. request is going to create another raw socket and connect to the destroyed server.
  9. This connection attempt returns ECONNREFUSED.

We will try to find a way to ignore this error with some flag for now.

In the future we need to upgrade our createPeerListener code. There we have function closeServer(self, server, failedConnectionErr, canRetry), and there we have a good reason to destroy incoming.

  1. We want to destroy incoming with failedConnectionErr and wait for the close event.
  2. We need to override the closeAll method of the server so that it destroys all connections with the provided error and then closes itself (see the sketch below).
  3. We need to reject any new connection to the server while it is in a stopping state.
  4. We need to verify that request received our error and that req._hadError is true, so that request doesn't recreate itself in the future.
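
A sketch of the closeAll override described in item 2 (the _connections list and the signature are hypothetical, not the current makeIntoCloseAllServer code):

// Close the server and destroy every tracked connection with the supplied
// error, so that clients see a real error instead of a bare hang up.
server.closeAll = function (error, callback) {
  this._stopping = true; // reject new connections while shutting down
  this._connections.forEach(function (connection) {
    connection.destroy(error);
  });
  this.close(callback);
};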

@yaronyg yaronyg added Node and removed Android labels Nov 17, 2016
@andrew-aladev

I've finished debugging this issue today.

In deps/uv/src/unix/kqueue.c we have void uv__io_poll_jx.

I've checked that our uv__read call was caused by w->cb(loop, w, revents);.

We can add here:

fprintf(
  stderr,
  "ololo new event from kqueue, vnode: %d, read: %d, write: %d, flags: %d, fflags: %d\n",
  ev->filter == EVFILT_VNODE, ev->filter == EVFILT_READ, ev->filter == EVFILT_WRITE, ev->flags, ev->fflags
);

This will produce:

new event from kqueue, vnode: 0, read: 1, write: 0, flags: 32769, fflags: 0
new event, in: 1, err: 0, hup: 0
uv__read, stream: 0x100d472e8
system read, nread: 0, errno: 35, stream: 0x100d472e8
onread, stream: 0x100d472e8, nread: -1, errno: 1
socket onread received EOF, id: 9

flags: 32769 means flags = EV_ADD | EV_EOF. So we received 0 bytes while POLLIN is active and we received the special kqueue flag EV_EOF.

So our peer connection was closed because the remote peer closed it.

yaronyg commented Nov 18, 2016

But why did the remote peer close the connection?

Scenario:
Time 0 - Device A opens a Bluetooth connection to Device B
Time 1 - Device A starts replicating with Device B
Time 2 - Device A fails replication with a 500 because Device B closed the connection

Why did Device B close the connection?

There is literally no logic anywhere in Express-PouchDB that would close a connection on the server side. It only closes in response to the client closing a connection.

So either the Bluetooth connection was lost (in which case you should see a nonTCPPeerAvailabilityChanged event in the logs with a null port, not to mention error messages from the Android layer) or we have a bug in Express that is closing connections for some bizarre reason.

andrew-aladev commented Nov 21, 2016

Today we've found that jxcore's special timers don't work with our timeout:

NON_TCP_PEER_UNAVAILABILITY_THRESHOLD: 1000 * 60 * 60 * 24 * 365 * 10
var time = new Date();
logger.debug('set socket timeout: %d', thaliConfig.NON_TCP_PEER_UNAVAILABILITY_THRESHOLD);
incoming.setTimeout(thaliConfig.NON_TCP_PEER_UNAVAILABILITY_THRESHOLD, function () {
  logger.debug('socket timeout, time: %d', new Date() - time);
});

We will receive:

set socket timeout: 315360000000
socket timeout, time: 5001

This 5000 comes from lib/jx/_jx_timers.js. We can see that this.enroll = function(item, tsecs) { was called with tsecs: 63072000. 63072000 is less than (1 << 31) - 1. So this will be an issue for jxcore. A 9 year timeout works fine; a 10 year timeout is broken.

We can work around this issue by using a 10 day timeout instead of 10 years. That number, 1000 * 60 * 60 * 24 * 10, is less than (1 << 31) - 1. It will keep everyone happy.
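
A sketch of a defensive clamp, in case we prefer to keep a symbolic "never" value while staying under the limit (the helper is hypothetical; the constant comes from thaliConfig):

// Largest timeout value the timer machinery handles reliably.
var MAX_TIMEOUT = Math.pow(2, 31) - 1;

// Hypothetical helper: clamp any requested socket timeout to the safe range.
function safeSetTimeout(socket, msecs, callback) {
  socket.setTimeout(Math.min(msecs, MAX_TIMEOUT), callback);
}

safeSetTimeout(incoming, thaliConfig.NON_TCP_PEER_UNAVAILABILITY_THRESHOLD,
  function () {
    logger.debug('socket timeout');
  });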

@andrew-aladev

I am going to provide a workaround for the jxcore issue.

@yaronyg yaronyg closed this as completed Nov 21, 2016