Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the naming conventions for Kubeflow Trainer #2415

Merged
merged 5 commits into from
Feb 6, 2025

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Feb 4, 2025

Part of: #2170
This PR updates the naming conventions for Kubeflow Trainer.
Also, it changes CRDs to v1alpha1.

Please give me your thoughts on the naming if that looks good.

Kubeflow Trainer Images:

docker.io/kubeflow/trainer-controller-manager:<SHA>
docker.io/kubeflow/model-initializer:<SHA>
docker.io/kubeflow/dataset-initializer:<SHA>

Kubeflow Trainer k8s resources:

$ k get deploy -n kubeflow-system

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
jobset-controller-manager             1/1     1            1           63m
kubeflow-trainer-controller-manager   1/1     1            1           63m

$ k get svc -n kubeflow-system

NAME                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE
jobset-controller-manager-metrics-service   ClusterIP   10.96.40.9      <none>        8443/TCP           64m
jobset-webhook-service                      ClusterIP   10.96.47.150    <none>        443/TCP            64m
kubeflow-trainer-controller-manager         ClusterIP   10.96.229.157   <none>        8080/TCP,443/TCP   64m

$ k get sa -n kubeflow-system 

NAME                                  SECRETS   AGE
jobset-controller-manager             0         65m
kubeflow-trainer-controller-manager   0         65m

$ k get ValidatingWebhookConfiguration
NAME                                      WEBHOOKS   AGE
jobset-validating-webhook-configuration   2          65m
validator.trainer.kubeflow.org            3          65m

TODOs:

  • Update manifests
  • Update GitHub actions
  • Update SDK Trainer client.

/assign @tenzen-y @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii @kannon92

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Comment on lines -141 to -158
" if local_rank == 0:\n",
" # Only download the dataset from local rank 0\n",
" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=True,\n",
" transform=transforms.Compose([transforms.ToTensor()]),\n",
" )\n",
" dist.barrier()\n",
" else:\n",
" # Wait for local rank 0 to complete downloading the dataset and load it\n",
" dist.barrier()\n",
" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=False,\n",
" transform=transforms.Compose([transforms.ToTensor()]),\n",
" )\n",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the dataset logic from this example for now. Since this Notebook demonstrates single proc per Node we should be good.

I think, we can find more elegant way to download data only on LOCAL_RANK=0 process.

WDYT @astefanutti ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I agree we can revisit this and find a better approach.

@andreyvelich
Copy link
Member Author

cc @akshaychitneni @shravan-achar

@andreyvelich andreyvelich force-pushed the rename-to-trainer branch 2 times, most recently from 786f92e to fbc6cb9 Compare February 5, 2025 12:55
@andreyvelich andreyvelich changed the title [WIP] Update the naming conventions for Kubeflow Trainer Update the naming conventions for Kubeflow Trainer Feb 5, 2025
@Electronic-Waste
Copy link
Member

@andreyvelich We may need to rebase this PR since #2413 changed the API Group.

@andreyvelich
Copy link
Member Author

It should be good now.
Please help with the review @Electronic-Waste @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo

Signed-off-by: Andrey Velichkevich <[email protected]>
@astefanutti
Copy link
Contributor

@andreyvelich Looks good to me.

Not strictly related to the scope of this PR but I see you've fixed the webhooks paths already :)

@andreyvelich
Copy link
Member Author

@tenzen-y @Electronic-Waste @astefanutti Any thoughts what I forgot to fix to resolve problem with the Webhook ?

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Copy link
Member Author

It looks like Webhook paths, should have the correct address.

@astefanutti
Copy link
Contributor

It looks correct to me, according to what controller-runtime generates: https://github.com/kubernetes-sigs/controller-runtime/blob/bbc9711d9b2db31ce828714e7d3c875c00b4c1f1/pkg/builder/webhook.go#L287

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just some minor comments

Makefile Outdated
@@ -102,7 +102,7 @@ endif
# Instructions to run tests.
.PHONY: test
test: ## Run Go unit test.
go test ./pkg/apis/trainer/v2alpha1/... ./pkg/controller.v2/... ./pkg/runtime.v2/... ./pkg/webhooks.v2/... ./pkg/util.v2/... -coverprofile cover.out
go test ./pkg/apis/trainer/v1alpha1/... ./pkg/controller/... ./pkg/runtime/... ./pkg/webhooks/... ./pkg/util/... -coverprofile cover.out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
go test ./pkg/apis/trainer/v1alpha1/... ./pkg/controller/... ./pkg/runtime/... ./pkg/webhooks/... ./pkg/util/... -coverprofile cover.out
go test $(shell go list ./... | grep -v '/test/') -coverprofile cover.out

Currently, all Go files should be tested. When I set up this, we have 2 version codes (v1 and v2). So, I specify the dedicated directory.

runtime "github.com/kubeflow/training-operator/pkg/runtime.v2"
runtimecore "github.com/kubeflow/training-operator/pkg/runtime.v2/core"
webhooksv2 "github.com/kubeflow/training-operator/pkg/webhooks.v2"
kubeflowv1 "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kubeflowv1 "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
kubeflow "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"

Now, we can just say "kubeflow".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Can we keep the kubeflowv1 for now since in the future, we might want to introduce another APIs: v1alpha2 and we should distinguish it in the controller ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we may need kubeflowv1alpha1 to distinguish versions like v1alpha2 from v1alpha1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind do it in the follow-up PR, given the number of changes this PR already has ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah,I would like to help this:)

Copy link
Member

@tenzen-y tenzen-y Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend using kubeflow since we should make the import name shorter as much as possible.
Typically, we use kubeflow or trainer for the latest internal API version.
After we graduate v1alpha1 to v1beta1, and then introduce new API with v1alpha1, we call the new alpha API as kubeflowv1alpha1 or trainerv1alpha1.

We can imagine the situation where Kube impl uses core as import name, but, external projects use corev1.

https://go.dev/blog/package-names

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use trainer to be more explicit here.

runtimecore "github.com/kubeflow/training-operator/pkg/runtime.v2/core"
webhooksv2 "github.com/kubeflow/training-operator/pkg/webhooks.v2"
kubeflowv1 "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
kubeflowcontroller "github.com/kubeflow/trainer/pkg/controller"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kubeflowcontroller "github.com/kubeflow/trainer/pkg/controller"
"github.com/kubeflow/trainer/pkg/controller"

Have you seen any conflicts with "controller"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we typically use ctrl "sigs.k8s.io/controller-runtime/pkg/controller".
In go, the package name should be shorter as much as possible. If the shorter name is popular, we can use the shorter name like controller -> ctrl , format -> fmt.

https://go.dev/blog/package-names

Signed-off-by: Andrey Velichkevich <[email protected]>
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Just a few comments for you @andreyvelich

"github.com/kubeflow/trainer/pkg/runtime"
runtimecore "github.com/kubeflow/trainer/pkg/runtime/core"
"github.com/kubeflow/trainer/pkg/util/cert"
webhooks "github.com/kubeflow/trainer/pkg/webhooks"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to distinguish different versions of webhooks just like APIs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtimecore "github.com/kubeflow/training-operator/pkg/runtime.v2/core"
webhooksv2 "github.com/kubeflow/training-operator/pkg/webhooks.v2"
kubeflowv1 "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
kubeflowcontroller "github.com/kubeflow/trainer/pkg/controller"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for the controller. Do we want to distinguish different versions of controllers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ "For example: [v1.7.0](https://github.com/kubeflow/training-operator/releases/tag/v1.7.0)",
+ "For example: [v1.7.0](https://github.com/kubeflow/trainer/releases/tag/v1.7.0)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we enabled url redirection? If so, we may not need to change this url since v1.7.0 is the release of training-operator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, once we rename the repo GitHub automatically add the redirects it.

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Copy link
Member Author

@tenzen-y @Electronic-Waste I renamed kubeflowv1 to trainer.

@tenzen-y
Copy link
Member

tenzen-y commented Feb 6, 2025

Great to see, thanks.
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 3060332 into kubeflow:master Feb 6, 2025
14 checks passed
@andreyvelich andreyvelich deleted the rename-to-trainer branch February 6, 2025 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants