Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sdk): Generate external Kubernetes and JobSet models #2466

Merged
merged 8 commits into from
Mar 5, 2025

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Mar 3, 2025

I upgraded the OpenAPI generator to the latest version using the Docker image, similar to JobSet: docker.io/openapitools/openapi-generator-cli:v7.11.0.

In this version, we can serialize and deserialize objects directly within the Model classes, eliminating the need to use FakeResponse for serialization. However, it requires all external dependencies to be present in the models. For example, JobsetV1alpha2JobSet uses V1ObjectMeta from Kubernetes OpenAPI spec.

Therefore, I suggest updating our swagger.json to reference external models from both Kubernetes and JobSet.

TODO:

Please let me know what do you think ?

/assign @kubeflow/wg-training-leads @astefanutti @kannon92 @saileshd1402 @seanlaii @Electronic-Waste

/hold for review

Comment on lines +44 to +49
for _, dep := range info.Deps {
if dep.Path == "k8s.io/api" {
k8sVersion = strings.Replace(dep.Version, "v0.", "v1.", -1)
} else if dep.Path == "sigs.k8s.io/jobset" {
jobSetVersion = dep.Version
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this look good to fetch Kubernetes and JobSet version ?

@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from 162056f to a3f732f Compare March 3, 2025 14:58
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Thanks for this. I left my initial comments for you.

Comment on lines 58 to 59
// TODO (andreyvelich): Use the JobSet source when PR is merged.
jobSetOpenAPISpec := "https://raw.githubusercontent.com/andreyvelich/jobset/refs/heads/fix-swagger-ref/hack/python-sdk/swagger.json"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use the JobSet source now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to create PR to update the JobSet swagger.
Will do it soon.

Comment on lines 33 to 40
docker run --rm \
-v "${TRAINER_ROOT}:/local" docker.io/openapitools/openapi-generator-cli:${OPENAPI_GENERATOR_VERSION} generate \
-g python \
-i "local/${SWAGGER_CODEGEN_FILE}" \
-c "local/${SWAGGER_CODEGEN_CONF}" \
-o "local/${SDK_OUTPUT_PATH}" \
-p=packageVersion="${SDK_VERSION}" \
--global-property apiTests=false,modelTests=false # TODO (andreyvelich): Discuss if we should use these test files.
--global-property models,modelTests=false,supportingFiles=__init__.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we use the container to generate Python SDK? Can we just use the jar file?

And also, Docker will put limit on the pull rate starting from 1st April. Is it good to download this image from Dockerhub?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, using docker container is preferable to make generation platform agnostic: kubernetes-sigs/jobset#190

And also, Docker will put limit on the pull rate starting from 1st April. Is it good to download this image from Dockerhub?

I think, that should be fine for now. If we see problems in our CI, maybe we can mirror this image to Kubeflow GHCR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliusvonkohout @Electronic-Waste Do we know if this image is available in ECR gallery ?
https://gallery.ecr.aws/

docker.io/openapitools/openapi-generator-cli

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"I think, that should be fine for now. If we see problems in our CI, maybe we can mirror this image to Kubeflow GHCR." Just make sure that you update it automatically and regularly to ghcr to fix CVEs.

@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from a3f732f to 0bd7876 Compare March 3, 2025 16:17
@andreyvelich andreyvelich changed the title feat(sdk): Generate external Kubernetes and JobSet models fix(sdk): Generate external Kubernetes and JobSet models Mar 3, 2025
@andreyvelich andreyvelich changed the title fix(sdk): Generate external Kubernetes and JobSet models feat(sdk): Generate external Kubernetes and JobSet models Mar 3, 2025
@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from 0d27e69 to d7a7abe Compare March 3, 2025 18:17
@andreyvelich
Copy link
Member Author

andreyvelich commented Mar 3, 2025

Also, I removed the Models docs from the OpenAPI generation for now.
Since we don't have plans to expose the Kubeflow Trainer CRDs to the end-user, this is not needed.
In the future, we should build our docs based on kubeflow.trainer.types

Does it look good ?

@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from d7a7abe to e0ae1dd Compare March 5, 2025 16:11
@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from e0ae1dd to f844687 Compare March 5, 2025 16:11
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@@ -39,7 +39,14 @@ cd manifests/overlays/manager
kustomize edit set image kubeflow/trainer-controller-manager=${CONTROLLER_MANAGER_CI_IMAGE}

echo "Create Kind cluster and load Kubeflow Trainer images"
${KIND} create cluster --image "${KIND_NODE_VERSION}"
cat <<EOF | ${KIND} create cluster --image "${KIND_NODE_VERSION}" --config=-
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this config, so we can add more Kind workers in the future to decrease kube-scheduler time for placing Pods.
I think, that should speedup our tests once we run more TrainJobs at the same time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you open a separate PR? Because this seems not to related to openapi-gen.
If we find any issue for Kind config adding, the dedicated PR allows us to easily revert and fix that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me open it.

Signed-off-by: Andrey Velichkevich <[email protected]>
@@ -36,11 +36,17 @@ PROJECT_DIR := $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST))))
# Tool Binaries
LOCALBIN ?= $(PROJECT_DIR)/bin

CONTROLLER_GEN ?= $(LOCALBIN)/controller-gen
GINKGO ?= $(LOCALBIN)/ginkgo
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I noticed a new error in our e2e:

Will run 1 of 1 specs
panic: test timed out after 10m0s
	running tests:
		TestAPIs (10m0s)

goroutine 2578 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:2373 +0x385
created by time.goFunc
	/opt/hostedtoolcache/go/1.23.0/x64/src/time/sleep.go:215 +0x2d

goroutine 1 [chan receive, 10 minutes]:
testing.(*T).Run(0xc0000f44e0, {0x1999495?, 0x0?}, 0x1aaae28)
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:1751 +0x3ab
testing.runTests.func1(0xc0000f44e0)
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:2168 +0x37
testing.tRunner(0xc0000f44e0, 0xc0000a9c70)
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:1690 +0xf4
testing.runTests(0xc00038e2b8, {0x29962d0, 0x1, 0x1}, {0x4[74](https://github.com/kubeflow/trainer/actions/runs/13681536090/job/38254906117?pr=2466#step:7:75)a25?, 0xc000687350?, 0x29de060?})
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:2166 +0x43d
testing.(*M).Run(0xc0002aae60)
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:2034 +0x64a
main.main()
	_testmain.go:45 +0x9b

goroutine 72 [select, 10 minutes]:
github.com/onsi/ginkgo/v2/internal.(*Suite).runNode(_, {0x6, 0x4, {0x19e625f, 0x35}, 0xc0006942a0, {{0x200d249, 0x59}, 0x2d, {0x0, ...}, ...}, ...}, ...)
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/suite.go:916 +0xfc5
github.com/onsi/ginkgo/v2/internal.(*group).attemptSpec(0xc000169810, 0x1, {{0xc0001bd208?, 0xc000251b90?, 0x2?}, 0x0?})
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/group.go:199 +0xc33
github.com/onsi/ginkgo/v2/internal.(*group).run(0xc000169810, {0xc000365280?, 0x0?, 0x0?})
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/group.go:349 +0x905
github.com/onsi/ginkgo/v2/internal.(*Suite).runSpecs(0xc0001d2a88, {0x19b143a, 0x1a}, {0x2a009c0, 0x0, 0x0}, {0xc000050064, 0x4d}, 0x0, {0xc0001460a0, ...})
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/suite.go:489 +0xa9b
github.com/onsi/ginkgo/v2/internal.(*Suite).Run(_, {_, _}, {_, _, _}, {_, _}, _, {0x1c58f30, ...}, ...)
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/suite.go:130 +0x405
github.com/onsi/ginkgo/v2.RunSpecs({0x1c363c0, 0xc00026[78](https://github.com/kubeflow/trainer/actions/runs/13681536090/job/38254906117?pr=2466#step:7:79)60}, {0x19b143a, 0x1a}, {0x0, 0x0, 0x0})
	/home/runner/work/trainer/trainer/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/core_dsl.go:300 +0x893
github.com/kubeflow/trainer/test/e2e.TestAPIs(0xc000267860)
	/home/runner/work/trainer/trainer/go/src/github.com/kubeflow/trainer/test/e2e/suite_test.go:39 +0x45
testing.tRunner(0xc000267[86](https://github.com/kubeflow/trainer/actions/runs/13681536090/job/38254906117?pr=2466#step:7:87)0, 0x1aaae28)
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:1690 +0xf4
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.23.0/x64/src/testing/testing.go:1743 +0x390

Since we use ginkgo to run our tests, can we use their binary to run tests similar to JobSet and Kueue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ginkgo binary would be more better

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the generate-sdk-docker branch from 93b4b7f to bbe2d4b Compare March 5, 2025 17:47
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Copy link
Member Author

The e2e seems to be working. I don't think flakiness relates to this PR.
/assign @tenzen-y @astefanutti @Electronic-Waste for review

@andreyvelich
Copy link
Member Author

/hold cancel

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 35a1c3b into kubeflow:master Mar 5, 2025
14 checks passed
@andreyvelich andreyvelich deleted the generate-sdk-docker branch March 5, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants