Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to disable worker identity reuse for registration unless in k8s #18454

Merged
merged 8 commits into from
Dec 9, 2023

Conversation

lucyge2022
Copy link
Contributor

What changes are proposed in this pull request?

Put back the restriction of reuse worker identity for non-k8s env deployment.

Why are the changes needed?

Worker identity gets reused by wrong deployment behaviors such as copy conf/ over for new worker setup, as opposed to k8s deployment is thru operator / automation, bare metal deployment has no way of prevention, thus putting back the restriction for non-k8s deployment.

Does this PR introduce any user facing changes?

No.

@lucyge2022 lucyge2022 requested review from ssz1997 and dbw9580 December 1, 2023 22:19
Copy link
Contributor

@ssz1997 ssz1997 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make this property a global one, i.e. K8S_ENV? So that other components, like master, may use it in the future without bringing confusion.

Comment on lines 111 to 114
throw new AlreadyExistsException(
"Some other member with same id registered on the ring, bail."
+ "Different workers can't assume same worker identity in non-k8s env."
+ "Clean local worker identity settings to continue.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include in the exception message the conflicting worker's addresses and the ID

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 97 to 100
if (mConf.isSet(PropertyKey.WORKER_IN_K8S_ENV)
&& mConf.getBoolean(PropertyKey.WORKER_IN_K8S_ENV)) {
// In k8s this might be bcos worker pod restarting with the same worker identity
// but certain fields such as hostname has been changed. Register to ring path anyway.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is worker ID generated in a Kubernetes deployment? Is there a chance there are actually two live workers with the same ID but different hostnames?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no special mechanism to generate IDs in k8s, so it's the random string, so theoretically that shouldn't happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no special mechanism to generate IDs in k8s, so it's the random string

Do you mean that the ID is not explicitly set via any Alluxio configuration properties by the operator or the helm chart, but gets generated at the first time the worker instance is created?

If so, sounds good to me as people rely on the k8s operator or the helm chart to configure new worker instances, rathar than copying Alluxio configs between workers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, sounds good to me as people rely on the k8s operator or the helm chart to configure new worker instances, rathar than copying Alluxio configs between workers.

exactly, after ur worker id changes cluster deployment needs to rely on coordinated workflow. since theres no such thing for bare-metal, we have to safe-guard everything from code point of view.

@@ -93,16 +94,25 @@ public void join(WorkerInfo workerInfo) throws IOException {
if (existingEntityBytes != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (existingEntityBytes != null) {
// It's not me, or not the same me.

} else {
// If haven't created myself onto the ring before, create now.
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity));

is there a time-of-check to time-of-use race between these lines? Two workers with conflicting IDs may be registering at the same time, and both can happen to see no conflicting records exist on etcd so proceed to register with the ID, accidentally overwriting each other. I assume etcd transaction is not used for registering a worker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah u r right, wasn't using txn coz before ur workerid change, no two workers will have same workernetaddr, which was used as the unique identifier then, to begin with. Will change to txn semantics here, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580 can u double check this logics to handle race for me plz?

@lucyge2022
Copy link
Contributor Author

Shall we make this property a global one, i.e. K8S_ENV? So that other components, like master, may use it in the future without bringing confusion.

done

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • First word must be capitalized
      • First word of title ("only") is not an imperative verb. Please use one of the valid words
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@lucyge2022 lucyge2022 changed the title only allow multiple workers assume same worker identity in k8s env Fix to only allow multiple workers assume same worker identity in k8s env Dec 5, 2023
@lucyge2022
Copy link
Contributor Author

alluxio-bot, check this please

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title is too long (73 characters). Must be at most 72 characters.
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only have a few small questions. Thanks for the work!

ByteSequence keyToPut = ByteSequence.from(pathOnRing, StandardCharsets.UTF_8);
ByteSequence valToPut = ByteSequence.from(serializedEntity);
CompletableFuture<TxnResponse> txnResponseFut = txn
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would appreciate a comment for etcd outsiders:

Suggested change
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L)))
// version of a key starts from 1; so version being 0 means
// the key does not exist (never existed or was deleted).
.If(new Cmp(keyToPut, Cmp.Op.EQUAL, CmpTarget.version(0L)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between version == 0 and createRevision == 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'revision' refers to a global number put onto each op for all keys(create/delete), createRevision = revision of the last creation on the key, version = the version of this particular key, if deleted, the version is reset to 0
create /k1 -> /k1 version=1 global revision = 1
create /k2 -> /k2 ... global revision=2
delete /k2 -> /k2 ... global revision=3
modify /k1 -> /k1 version=2 global revision = 4 => /k1 version = 2 createRevision=1
delete /k1 -> /k1 version=0 global revision = 5
create /k1 -> /k1 version=1 global revision = 6 => /k1 version = 1 createRevision=6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1048,6 +1048,15 @@ public String toString() {
.setConsistencyCheckLevel(ConsistencyCheckLevel.ENFORCE)
.setScope(Scope.MASTER)
.build();
public static final PropertyKey K8S_ENV_DEPLOYMENT =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have the feeling that whether we allow worker to reg with the same identity is relevant to whether it's k8s, but k8s is not 100% the decider. In other words, I feel we can have

PropertyKey ON_K8S

and 

PropertyKey WORKER_SAME_IDENTITY_ACTION = ON_K8S ? OVERWRITE : REJECT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually identity should never be allowed to overwrite, from code side shouldnt even open this can of worm, it should be a deployment effort. k8s property can also be extended for other usage as suggested by @ssz1997

@lucyge2022 lucyge2022 changed the title Fix to only allow multiple workers assume same worker identity in k8s env Fix to disable worker identity reuse for registration unless in k8s Dec 7, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@lucyge2022
Copy link
Contributor Author

alluxio-bot, check this please

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@lucyge2022 lucyge2022 force-pushed the disable-workerid-reuse branch from 83c847b to 33fd329 Compare December 9, 2023 05:42
@lucyge2022 lucyge2022 added the type-bug This issue is about a bug label Dec 9, 2023
@lucyge2022
Copy link
Contributor Author

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit 22af1b1 into Alluxio:main Dec 9, 2023
12 checks passed
ssz1997 pushed a commit to ssz1997/alluxio that referenced this pull request Dec 15, 2023
### What changes are proposed in this pull request?

Put back the restriction of reuse worker identity for non-k8s env deployment.

### Why are the changes needed?

Worker identity gets reused by wrong deployment behaviors such as copy conf/ over for new worker setup, as opposed to k8s deployment is thru operator / automation, bare metal deployment has no way of prevention, thus putting back the restriction for non-k8s deployment.

### Does this PR introduce any user facing changes?

No.

			pr-link: Alluxio#18454
			change-id: cid-295ea352895b16c2a5f0a23fa790c9f42a5e3881
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug This issue is about a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants