Repair task failed after an hour with zero token nodes in multi dc configuration #4078
Comments
@kbr-scylla @patjed41, this bug is not related to Scylla directly (at least I didn't find any issue in the Scylla logs), but the Scylla Manager repair task failed, and it looks like it could be related to zero-token nodes.
Could be that support for zero-token nodes needs to be explicitly implemented in Scylla Manager. Maybe it assumes that every node has tokens.
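To illustrate the hypothesis above, here is a minimal sketch (the ring dictionary and node names are hypothetical, not Scylla Manager's actual data model) of why code that assumes every cluster member owns tokens breaks with zero-token nodes:

```python
# Hypothetical ring description: node -> list of owned tokens.
# Zero-token nodes are cluster members that own no token ranges.
ring = {
    "node1": [100, 400],  # data node
    "node2": [200, 500],  # data node
    "node3": [300, 600],  # data node
    "node7": [],          # zero-token node
    "node8": [],          # zero-token node
}

def token_owners(ring):
    """Nodes that actually own token ranges and can serve as repair replicas."""
    return {node for node, tokens in ring.items() if tokens}

# Any logic that iterates over all cluster members and expects each one to
# contribute a token range would misbehave for node7 and node8; filtering
# them out first avoids building replica sets that include token-less nodes.
print(sorted(token_owners(ring)))
```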
From SM logs we can see that it tries to repair 16 token ranges owned by a 6-node replica set:
The replica set (…). From my understanding, SM is behaving correctly here, but the problem is that Scylla hangs on the repair status API call. There are some Scylla error logs on the repair master (node2):
More Scylla logs from repair master (node2)
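The observed failure mode (SM waiting on a repair status call that never completes) matches a poll loop hitting a client-side deadline. A minimal, self-contained sketch of that pattern, with a stubbed status callable standing in for the actual Scylla repair status API:

```python
import time

def wait_for_repair(get_status, deadline_s, poll_interval_s=0.01):
    """Poll a status callable until it reports a terminal state or the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        status = get_status()
        if status in ("SUCCESSFUL", "FAILED"):
            return status
        time.sleep(poll_interval_s)
    # If the server side never reaches a terminal state (e.g. the repair
    # hangs), the caller only fails once the deadline expires -- which would
    # explain the task failing after an hour rather than immediately.
    raise TimeoutError("repair status never became terminal")

# A repair stuck forever in RUNNING trips the timeout:
try:
    wait_for_repair(lambda: "RUNNING", deadline_s=0.05)
except TimeoutError as exc:
    print(exc)
```

The deadline and status strings here are illustrative; the point is only that a hang on the server side surfaces as a delayed client-side task failure.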
Packages
Scylla version: 6.2.0-20241013.b8a9fd4e49e8 with build-id a61f658b0408ba10663812f7a3b4d6aea7714fac
Kernel Version:
6.8.0-1016-aws
Scylla Manager Agent 3.3.3-0.20240912.924034e0d
Issue description
The cluster is configured with zero-token nodes in a multi-DC configuration: DC "eu-west-1" has 3 data nodes, DC "eu-west-2" has 3 data nodes and 1 zero-token node, and DC "eu-north-1" has 1 zero-token node.
The 'disrupt_mgmt_corrupt_then_repair' nemesis failed. This nemesis stops Scylla, removes several sstables, starts Scylla, and then triggers a repair from Scylla Manager. The nemesis chose node4 (a data node) as the target node: it removed sstables while Scylla was stopped, and after Scylla was started it triggered the repair from Scylla Manager.
The repair task failed after an hour:
The following error was found in the Scylla Manager log in "monitor-set-2bc4de73.tar.gz":
This could be related to zero-token nodes in the configuration.
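The nemesis flow described above can be summarized as an ordered step plan (the step wording is illustrative, not the exact SCT implementation):

```python
def corrupt_then_repair_plan(target_node, sstables):
    """Build the ordered step list for a disrupt_mgmt_corrupt_then_repair-style flow."""
    steps = [f"stop scylla on {target_node}"]
    steps += [f"remove sstable {path}" for path in sstables]
    steps += [
        f"start scylla on {target_node}",
        "trigger repair task via Scylla Manager",
    ]
    return steps

for step in corrupt_then_repair_plan("node4", ["md-1-big-Data.db", "md-2-big-Data.db"]):
    print(step)
```

In this run, the final step is the one that failed: Scylla Manager reported the repair task as failed after an hour.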
Impact
The repair process triggered from Scylla Manager fails.
Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-01f5cd2cb7c8dbd6f ami-0a32db7034cf41d95 ami-0b2b4e9fba26c7618
(aws: undefined_region)
Test:
longevity-multi-dc-rack-aware-zero-token-dc
Test id:
2bc4de73-4328-4444-b601-6bd88060fa4d
Test name:
scylla-staging/abykov/longevity-multi-dc-rack-aware-zero-token-dc
Test method:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 2bc4de73-4328-4444-b601-6bd88060fa4d
$ hydra investigate show-logs 2bc4de73-4328-4444-b601-6bd88060fa4d
Logs:
Jenkins job URL
Argus