-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix to disable worker identity reuse for registration unless in k8s #18454
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we make this property a global one, i.e. K8S_ENV? So that other components, like master, may use it in the future without bringing confusion.
throw new AlreadyExistsException( | ||
"Some other member with same id registered on the ring, bail." | ||
+ "Different workers can't assume same worker identity in non-k8s env." | ||
+ "Clean local worker identity settings to continue."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include in the exception message the conflicting worker's addresses and the ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if (mConf.isSet(PropertyKey.WORKER_IN_K8S_ENV) | ||
&& mConf.getBoolean(PropertyKey.WORKER_IN_K8S_ENV)) { | ||
// In k8s this might be bcos worker pod restarting with the same worker identity | ||
// but certain fields such as hostname has been changed. Register to ring path anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is worker ID generated in a Kubernetes deployment? Is there a chance there are actually two live workers with the same ID but different hostnames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no special mechanism to generate IDs in k8s, so it's the random string, so theoretically that shouldn't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no special mechanism to generate IDs in k8s, so it's the random string
Do you mean that the ID is not explicitly set via any Alluxio configuration properties by the operator or the helm chart, but gets generated at the first time the worker instance is created?
If so, sounds good to me as people rely on the k8s operator or the helm chart to configure new worker instances, rathar than copying Alluxio configs between workers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, sounds good to me as people rely on the k8s operator or the helm chart to configure new worker instances, rathar than copying Alluxio configs between workers.
exactly, after ur worker id changes cluster deployment needs to rely on coordinated workflow. since theres no such thing for bare-metal, we have to safe-guard everything from code point of view.
@@ -93,16 +94,25 @@ public void join(WorkerInfo workerInfo) throws IOException { | |||
if (existingEntityBytes != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alluxio/dora/core/common/src/main/java/alluxio/membership/EtcdMembershipManager.java
Lines 94 to 95 in 87702fd
if (existingEntityBytes != null) { | |
// It's not me, or not the same me. |
alluxio/dora/core/common/src/main/java/alluxio/membership/EtcdMembershipManager.java
Lines 118 to 120 in 87702fd
} else { | |
// If haven't created myself onto the ring before, create now. | |
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity)); |
is there a time-of-check to time-of-use race between these lines? Two workers with conflicting IDs may be registering at the same time, and both can happen to see no conflicting records exist on etcd so proceed to register with the ID, accidentally overwriting each other. I assume etcd transaction is not used for registering a worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah u r right, wasn't using txn coz before ur workerid change, no two workers will have same workernetaddr, which was used as the unique identifier then, to begin with. Will change to txn semantics here, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbw9580 can u double check this logics to handle race for me plz?
done |
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
alluxio-bot, check this please |
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only have a few small questions. Thanks for the work!
ByteSequence keyToPut = ByteSequence.from(pathOnRing, StandardCharsets.UTF_8); | ||
ByteSequence valToPut = ByteSequence.from(serializedEntity); | ||
CompletableFuture<TxnResponse> txnResponseFut = txn | ||
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would appreciate a comment for etcd outsiders:
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L))) | |
// version of a key starts from 1; so version being 0 means | |
// the key does not exist (never existed or was deleted). | |
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between version == 0
and createRevision == 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'revision' refers to a global number put onto each op for all keys(create/delete), createRevision = revision of the last creation on the key, version = the version of this particular key, if deleted, the version is reset to 0
create /k1 -> /k1 version=1 global revision = 1
create /k2 -> /k2 ... global revision=2
delete /k2 -> /k2 ... global revision=3
modify /k1 -> /k1 version=2 global revision = 4 => /k1 version = 2 createRevision=1
delete /k1 -> /k1 version=0 global revision = 5
create /k1 -> /k1 version=1 global revision = 6 => /k1 version = 1 createRevision=6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dora/core/common/src/main/java/alluxio/membership/EtcdMembershipManager.java
Outdated
Show resolved
Hide resolved
@@ -1048,6 +1048,15 @@ public String toString() { | |||
.setConsistencyCheckLevel(ConsistencyCheckLevel.ENFORCE) | |||
.setScope(Scope.MASTER) | |||
.build(); | |||
public static final PropertyKey K8S_ENV_DEPLOYMENT = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have the feeling that whether we allow worker to reg with the same identity is relevant to whether it's k8s, but k8s is not 100% the decider. In other words, I feel we can have
PropertyKey ON_K8S
and
PropertyKey WORKER_SAME_IDENTITY_ACTION = ON_K8S ? OVERWRITE : REJECT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually identity should never be allowed to overwrite, from code side shouldnt even open this can of worm, it should be a deployment effort. k8s property can also be extended for other usage as suggested by @ssz1997
Automated checks report:
All checks passed! |
alluxio-bot, check this please |
Automated checks report:
All checks passed! |
…ipManager.java Co-authored-by: Bowen Ding <[email protected]>
83c847b
to
33fd329
Compare
alluxio-bot, merge this please. |
### What changes are proposed in this pull request? Put back the restriction of reuse worker identity for non-k8s env deployment. ### Why are the changes needed? Worker identity gets reused by wrong deployment behaviors such as copy conf/ over for new worker setup, as opposed to k8s deployment is thru operator / automation, bare metal deployment has no way of prevention, thus putting back the restriction for non-k8s deployment. ### Does this PR introduce any user facing changes? No. pr-link: Alluxio#18454 change-id: cid-295ea352895b16c2a5f0a23fa790c9f42a5e3881
What changes are proposed in this pull request?
Put back the restriction of reuse worker identity for non-k8s env deployment.
Why are the changes needed?
Worker identity gets reused by wrong deployment behaviors such as copy conf/ over for new worker setup, as opposed to k8s deployment is thru operator / automation, bare metal deployment has no way of prevention, thus putting back the restriction for non-k8s deployment.
Does this PR introduce any user facing changes?
No.