
HA setup #323

Open
ctosae opened this issue Apr 19, 2021 · 9 comments
@ctosae

ctosae commented Apr 19, 2021

Following previous conversations it seems that this is the best way to try to deploy an "HA" configuration:

  • one TMKMS (Active) connected to multiple validators on the same chain-id
  • keep a second TMKMS (Passive)

tendermint/tmkms#272

Is it safe?
Have any further developments or improvements been made?
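
(For illustration, a sketch of what the Active TMKMS's tmkms.toml might look like in this setup — the hostnames, chain id, and key path are placeholders:)

[[validator]]
chain_id = "cosmoshub-4"
addr = "tcp://validator-1:26658"
secret_key = "/path/to/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-4"
addr = "tcp://validator-2:26658"
secret_key = "/path/to/kms-identity.key"
protocol_version = "legacy"
reconnect = true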

@tony-iqlusion
Member

Is it safe?

It's received some degree of testing and some validators run in this configuration.

Have any further developments or improvements been made?

Not yet. We'll be switching to gRPC for validator <-> TMKMS connections soon (#73), after which TMKMS will track an "active" validator node and the others will be passive until the active validator fails.

After this migration is completed, we'll look into HA for TMKMS itself.

@ctosae
Author

ctosae commented May 18, 2021

I was also thinking about this solution; it seems safer to me, even in case of bugs.

TMKMS01+HSM --+> VALIDATOR +--> SENTRY1
TMKMS02+HSM --|            |--> SENTRY2
                           |--> SENTRY3

(TMKMS01 and TMKMS02 "state_file" are NOT in sync)
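
(As an illustration of the state_file note above — each TMKMS instance would keep its own [[chain]] block and its own local copy of the double-sign protection state; the paths and chain id are placeholders:)

[[chain]]
id = "cosmoshub-4"
key_format = { type = "bech32", account_key_prefix = "cosmospub", consensus_key_prefix = "cosmosvalconspub" }
state_file = "/var/lib/tmkms01/state/cosmoshub-4-consensus.json"

# On the second instance, the same block but pointing at its own state file:
# state_file = "/var/lib/tmkms02/state/cosmoshub-4-consensus.json"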

Since a VALIDATOR node accepts only one Tendermint connection (from an external PrivValidator process),
this could be a way to create redundancy for TMKMS.

In case of a VALIDATOR fault, disaster recovery could consist of connecting one of the two TMKMS instances to a SENTRY.

What do you think about it?

@tony-iqlusion
Member

The migration to gRPC reverses the direction of the connection, so validators connect to TMKMS rather than the other way around.

We'll likely want to deprecate/phase out the current "secret connection"-based approach.

@pratikbin

pratikbin commented Dec 7, 2022

It's received some degree of testing and some validators run in this configuration.

I've set up 2 validators and 1 tmkms, and from the logs I can see tmkms responding to both validators with the same key. So it looks like it is risky for now?!


testnet2  | 12:47PM INF committed state app_hash=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4 height=3413176 module=state num_txs=1
tmkms     | 2022-12-07T12:47:44.943746Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] received request: ShowPublicKey(PubKeyRequest)
tmkms     | 2022-12-07T12:47:44.943764Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] sending response: PublicKey(PubKeyResponse { pub_key_ed25519: [74, 21, 224, 140, 241, 58, 2, 66, 174, 235, 92, 12, 46, 136, 122, 138, 1, 185, 116, 106, 248, 39, 144, 141, 43, 121, 23, 2, 181, 84, 236, 248] })
testnet3  | 12:47PM INF commit synced commit=436F6D6D697449447B5B323234203437203320322031303120352032303820313320313836203139342038302034332032333120302032333620373320383520313931203235332039312032313920323035203337203634203238203135322031353420313831203136352031303220313830203138305D3A3334313442387D
testnet3  | 12:47PM INF committed state app_hash=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4 height=3413176 module=state num_txs=1
testnet2  | 12:47PM INF indexed block height=3413176 module=txindex
tmkms     | 2022-12-07T12:47:44.951079Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] received request: ShowPublicKey(PubKeyRequest)
tmkms     | 2022-12-07T12:47:44.951103Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] sending response: PublicKey(PubKeyResponse { pub_key_ed25519: [74, 21, 224, 140, 241, 58, 2, 66, 174, 235, 92, 12, 46, 136, 122, 138, 1, 185, 116, 106, 248, 39, 144, 141, 43, 121, 23, 2, 181, 84, 236, 248] })
testnet3  | 12:47PM INF indexed block height=3413176 module=txindex

tmkms     | 2022-12-07T12:47:48.277496Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] received request: ReplyPing(PingRequest)
tmkms     | 2022-12-07T12:47:48.277545Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] sending response: Ping(PingResponse)
tmkms     | 2022-12-07T12:47:48.284968Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] received request: ReplyPing(PingRequest)
tmkms     | 2022-12-07T12:47:48.285011Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] sending response: Ping(PingResponse)
testnet2  | 12:47PM INF Timed out dur=4912.38743 height=3413177 module=consensus round=0 step=1
testnet3  | 12:47PM INF Timed out dur=4917.999917 height=3413177 module=consensus round=0 step=1
testnet3  | 12:47PM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E","parts":{"hash":"8F16936C7F74ECFD600C14CDC7A2277812FF5CBEA4580A11F84E7C017757A60C","total":1}},"height":3413177,"pol_round":-1,"round":0,"signature":"NgrA5gmuOw812cIAGc2Ef4HGGmr8I4iZeZyRrft8HOl6DWgYa/SkSFnq+v6pp6j3196KdgLkHrScj7hd17M4BQ==","timestamp":"2022-12-07T12:47:49.860448507Z"}
testnet3  | 12:47PM INF received complete proposal block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus
testnet2  | 12:47PM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E","parts":{"hash":"8F16936C7F74ECFD600C14CDC7A2277812FF5CBEA4580A11F84E7C017757A60C","total":1}},"height":3413177,"pol_round":-1,"round":0,"signature":"NgrA5gmuOw812cIAGc2Ef4HGGmr8I4iZeZyRrft8HOl6DWgYa/SkSFnq+v6pp6j3196KdgLkHrScj7hd17M4BQ==","timestamp":"2022-12-07T12:47:49.860448507Z"}
testnet2  | 12:47PM INF received complete proposal block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus
testnet2  | 12:47PM INF finalizing commit of block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus num_txs=0 root=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4
testnet3  | 12:47PM INF finalizing commit of block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus num_txs=0 root=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4
testnet2  | 12:47PM INF minted coins from module account amount=72399106umntl from=mint module=x/bank
testnet2  | 12:47PM INF executed block height=3413177 module=state num_invalid_txs=0 num_valid_txs=0
testnet2  | 12:47PM INF commit synced commit=436F6D6D697449447B5B3237203231342031333120323232203239203137392031333020313036203133362039372038392031383820313833203638203131372031352031343420323033203133362031383220313139203133352034203138332032333320313934203539203133302031393920313434203735203234345D3A3334313442397D
testnet2  | 12:47PM INF committed state app_hash=1BD683DE1DB3826A886159BCB744750F90CB88B6778704B7E9C23B82C7904BF4 height=3413177 module=state num_txs=0
testnet3  | 12:47PM INF minted coins from module account amount=72399106umntl from=mint module=x/bank
testnet3  | 12:47PM INF executed block height=3413177 module=state num_invalid_txs=0 num_valid_txs=0

@tony-iqlusion
Member

tony-iqlusion commented Dec 7, 2022

@pratikbin it's intended and semi-supported to allow multiple concurrent validators. We don't recommend it, but it's been tested and no one has reported problems yet.

In that case they're signing the same commit hashes. It's deliberately supported to be able to resign the exact same hash at the exact same h/r/s for fault tolerance purposes. The signature process is deterministic and this will result in the same signature on the same proposal, which doesn't count as double signing.
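
(To illustrate the determinism point — a minimal sketch using the ed25519-dalek crate with placeholder key material and sign-bytes, not tmkms's actual signing path:)

use ed25519_dalek::{Signer, SigningKey};

fn main() {
    // Placeholder key material; tmkms would hold this in an HSM or softsign keyring.
    let signing_key = SigningKey::from_bytes(&[7u8; 32]);
    // Placeholder for the canonical sign-bytes of a given height/round/step.
    let sign_bytes = b"canonical vote bytes for h/r/s";

    // Ed25519 (RFC 8032) signing is deterministic: the same key over the same
    // message always yields byte-for-byte identical signatures.
    let sig1 = signing_key.sign(sign_bytes);
    let sig2 = signing_key.sign(sign_bytes);
    assert_eq!(sig1.to_bytes(), sig2.to_bytes());

    // So re-signing the exact same hash at the same h/r/s for a second
    // validator produces the identical signature, not a conflicting one.
}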

In the event multiple validators send conflicting proposals, the first validator will "win" and the other validator will receive a double signing error.

@albttx

albttx commented Feb 26, 2023

Hello,

I'm interested in running this kind of setup:

  • 1 tmkms run by an orchestrator for redundancy
  • multiple validator nodes connected to tmkms for HA.

But I have a question about the node_key.json for this architecture.

Should I set up 2 nodes with the same node_key

and have a config like:

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

or set different node_keys

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

Any ETA on an HA status update?

@pratikbin

pratikbin commented Feb 27, 2023

@albttx AFAIK it won't join the p2p network with the same node_key, since that's the Tendermint p2p key.
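
(For illustration — assuming each validator node generates its own node_key.json, the tmkms [[validator]] entries would each carry that node's distinct ID in the addr; the node IDs and IPs below are placeholders:)

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://<node-id-of-validator-1>@10.0.0.1:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://<node-id-of-validator-2>@10.0.0.2:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true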

@activenodes

activenodes commented Aug 18, 2023

@tony-iqlusion is there any news about HA?
Or could you review and support configurations like the previous ones (from @albttx)?
I'm testing Horcrux for the first time and it does that... TMKMS keeps closing the connection (to prevent double-signing).
Thanks

@tony-iqlusion
Member

We've largely been waiting for a migration to gRPC, which will reverse the client/server relationship between the KMS and validator nodes. Instead of having to explicitly configure several validators for the KMS to connect to, multiple validators can connect to the KMS.

That's tracked here: cometbft/cometbft#476
