
Exception while using the Rapid Library #37

Closed
raviov opened this issue Jul 28, 2021 · 3 comments
@raviov
raviov commented Jul 28, 2021

Hi Lalith,

I'm trying to use Rapid for membership information. Trying to execute the StandaloneAgent example provided in the examples directory. Below are the steps I'm executing.

Step 1: On Terminal 1, I executed the following command:

$>java -cp standalone-agent.jar:. com.test.StandaloneAgent --listenAddress 127.0.0.1:1234 --seedAddress 127.0.0.1:1234

The output I'm seeing is as follows:

[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1

Step 2: I opened Terminal 2 and executed the following command:

$>java -cp standalone-agent.jar:. com.test.StandaloneAgent --listenAddress 127.0.0.1:1235 --seedAddress 127.0.0.1:1234

Output:

The output on Terminal 2 (the logs below are from node 127.0.0.1:1235, which runs there):

[main] INFO com.vrg.rapid.Cluster - 127.0.0.1:1235 is sending a join-p2 to 127.0.0.1:1234 for config 3713649891269931577
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2

and the output on Terminal 1 (the seed, 127.0.0.1:1234) is as follows:

[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Join at seed for {seed:127.0.0.1:1234, sender:127.0.0.1:1235, config:3713649891269931577, size:1}
[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Proposing membership change of size 1
[protocol-127.0.0.1:1234-0] INFO com.test.StandaloneAgent - The condition detector has outputted a proposal: ClusterStatusChange{configurationId=3713649891269931577, membership=[hostname: "127.0.0.1"
port: 1234
], delta=[127.0.0.1:1235:UP:]}
[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Decide view change called in current configuration 3713649891269931577 (1 nodes), for proposal [127.0.0.1:1235, ]
[protocol-127.0.0.1:1234-0] INFO com.test.StandaloneAgent - View change detected: ClusterStatusChange{configurationId=-4337195239393783641, membership=[hostname: "127.0.0.1"
port: 1235
, hostname: "127.0.0.1"
port: 1234
], delta=[127.0.0.1:1235:UP:]}
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 2

Step 3: I killed the process in Terminal 2 and moved back to Terminal 1, where I'm getting the following exception:

[bg-127.0.0.1:1234-0] ERROR com.vrg.rapid.messaging.impl.Retries - Retrying call to hostname: "127.0.0.1"
port: 1235
because of exception {}
io.grpc.StatusRuntimeException: UNAVAILABLE
at io.grpc.Status.asRuntimeException(Status.java:526)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:433)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:339)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:443)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:525)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:446)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:557)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:107)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:1235
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
[bg-127.0.0.1:1234-0] ERROR com.vrg.rapid.messaging.impl.Retries - Retrying call to hostname: "127.0.0.1"
port: 1235
because of exception {}

What I expected on Terminal 1 was a cluster size of 1. However, I'm getting the above exception.

Step 4: On Terminal 2, I tried to restart the process that I killed in Step 3 and got the following exception:

[main] ERROR com.vrg.rapid.Cluster - Join message to seed 127.0.0.1:1234 returned an exception: {}
com.vrg.rapid.Cluster$JoinPhaseTwoException
at com.vrg.rapid.Cluster$Builder.joinAttempt(Cluster.java:400)
at com.vrg.rapid.Cluster$Builder.join(Cluster.java:315)
at com.vrg.rapid.Cluster$Builder.join(Cluster.java:294)
at com.test.StandaloneAgent.startCluster(StandaloneAgent.java:44)
at com.test.StandaloneAgent.main(StandaloneAgent.java:100)
[main] INFO com.vrg.rapid.Cluster - 127.0.0.1:1235 is sending a join-p2 to 127.0.0.1:1234 for config -1

What I expected here is that, after restarting, the process would rejoin the cluster and the cluster size would be 2 again.

Is the behaviour in Steps 3 and 4 expected? Am I using the membership library correctly?
Can you please help me here?

Thanks
Ravi

@lalithsuresh
Owner

Rapid is a consensus-based membership service. Once you bootstrap a cluster (size > 1), you need a majority of processes to agree to each membership change. In this case, you had size == 2 and then dropped to size == 1, so there is no majority anymore.

Once you lose a majority, there is no safe way for Rapid to automatically recover. You will need some out-of-band mechanism (like an admin command) that invokes Cluster.start() and starts a new cluster.

For your particular exercise, try creating 5 nodes and then failing some of them.
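The majority rule above can be made concrete with some quick arithmetic, independent of Rapid's API: a strict majority of n nodes is floor(n/2) + 1, so a 2-node cluster cannot survive the loss of either node, while a 5-node cluster tolerates 2 failures. A minimal sketch in plain Java (the class and method names here are made up for illustration):

```java
public class QuorumMath {
    // Smallest strict majority of a cluster of size n: floor(n/2) + 1.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Maximum number of simultaneous failures a cluster of size n can
    // tolerate while still retaining a majority: n - majority(n).
    static int tolerableFailures(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 3, 5}) {
            System.out.println("size " + n
                    + ": majority = " + majority(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

For size 2 the majority is 2 and zero failures are tolerable, which is exactly the situation in Step 3; with 5 nodes (majority 3), killing one or two nodes still lets the remaining members agree on the view change.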

@raviov
Author

raviov commented Jul 29, 2021

Thank you, Lalith.
One more question: can we use nodes/machines that belong to multiple data centres to form a cluster?

@lalithsuresh
Owner

lalithsuresh commented Jul 29, 2021

Yes, you can (the membership structure is flat and independent of topology).

You may want to consider plugging in your own messaging and failure detectors though (see the examples/ folder and also look at IEdgeFailureDetectorFactory).
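To give a feel for the kind of component such a pluggable failure detector is, here is a toy heartbeat-based edge detector in plain Java. It does not implement Rapid's actual IEdgeFailureDetectorFactory interface (the class and method names below are hypothetical); it only sketches the general idea of marking an edge faulty after K consecutive missed heartbeats, which is the sort of policy you might tune for cross-data-centre links.

```java
// Toy heartbeat-based edge failure detector. This does NOT implement
// Rapid's IEdgeFailureDetectorFactory; the names here are made up,
// and it only illustrates the general shape of such a detector.
public class MissedHeartbeatDetector {
    private final int threshold;  // consecutive misses before declaring failure
    private int misses = 0;
    private boolean faulty = false;

    public MissedHeartbeatDetector(int threshold) {
        this.threshold = threshold;
    }

    // Called once per probe interval with whether a heartbeat arrived.
    public void onProbe(boolean heartbeatReceived) {
        if (heartbeatReceived) {
            misses = 0;           // any heartbeat resets the counter
        } else if (++misses >= threshold) {
            faulty = true;        // too many consecutive misses: edge is down
        }
    }

    public boolean isFaulty() {
        return faulty;
    }
}
```

For WAN links you would typically raise the threshold (or the probe interval) relative to a LAN deployment, so that transient cross-data-centre latency spikes are not misreported as node failures.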
