chore(test): Add E2E tests for Kubeflow Trainer #2470

andreyvelich · 2025-03-03T23:19:18Z

Fixes: #2213

This PR adds E2E tests for:

Simple TrainJob creation that references an existing Torch runtime
Test for MNIST Notebook

review-notebook-app · 2025-03-04T01:03:49Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2025-03-04T12:33:39Z

sdk/kubeflow/trainer/utils/utils.py

@@ -122,6 +122,25 @@ def get_resources_per_node(resources_per_node: dict) -> client.V1ResourceRequire
    return resources


+# TODO (andreyvelich): Change return type to IntOrString.
+def get_num_proc_per_node(resources_per_node: dict) -> object:


@astefanutti I added this util func to get numProcPerNode based on CPU and GPU configuration, as we discussed before.
WDYT ?

@andreyvelich ah that's interesting, intuitively I would have defined that logic backend side in the torch plugin.
Not all the runtimes may have that "auto" logic, and some may be cgroups friendly.

Also, the capping logic should take into account fractional CPU and millicore requests / limits, which may be easier to achieve controller-side.

Adding #2407 for reference.

Oh, that is a good point. Yeah, maybe we could move this part to the plugin itself.
Let's add a TODO statement to refactor it in follow-up PRs, so we can get the initial E2E tests working?

Sounds good, I agree with you, better integrate E2E asap, and we can work on #2407 separately.

SGTM, we can construct Torch environment variables based on input numProcPerNode as we do in the current torch plugin.

I think what we are trying to say here is that if the user doesn't explicitly set numProcPerNode in the TrainJob, we should construct this value based on the Trainer container resources (if they are configured).

And we should do it inside the torch plugin on the server side, not in the get_num_proc_per_node() function on the client side.

Does it make sense @tenzen-y ?

Sorry for the confusion. That is what I wanted to say.
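
To make the capping discussion in this thread concrete, here is a minimal Python sketch of how a CPU quantity such as "500m" or "2.5" might be turned into a process count. The helper names (`parse_cpu_quantity`, `cap_num_proc_per_node`), the `cpu` and `nvidia.com/gpu` dictionary keys, and the round-up policy are assumptions for illustration only; this is not the `get_num_proc_per_node()` code from this PR, nor the Torch plugin logic tracked in #2407.

```python
import math


def parse_cpu_quantity(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("2", "2.5", "500m") to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # Millicore notation: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cap_num_proc_per_node(resources_per_node: dict, requested=None) -> int:
    """Hypothetical rule: GPU count wins if set, else CPU cores rounded up (min 1)."""
    gpus = int(resources_per_node.get("nvidia.com/gpu", 0))
    if gpus > 0:
        limit = gpus
    else:
        cores = parse_cpu_quantity(str(resources_per_node.get("cpu", "1")))
        limit = max(1, math.ceil(cores))
    # With no explicit request, fall back to the computed limit.
    if requested is None:
        return limit
    return min(requested, limit)
```

Under these assumptions, a request of `{"cpu": "500m"}` still yields one process, and an explicit `requested` value is only ever lowered, never raised. Whether fractional cores should round up or down is exactly the kind of policy decision the thread proposes to settle on the server side.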

andreyvelich · 2025-03-04T13:59:07Z

This should be ready for review.
/assign @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @seanlaii @saileshd1402

andreyvelich · 2025-03-04T14:01:16Z

FYI, I was able to use large GitHub runners in our E2Es, after enabling it for kubeflow/trainer repo 🎉

runs-on:
  labels: ubuntu-latest-16-cores

I will create an issue to inform the Kubeflow community that we can use these runners.

andreyvelich · 2025-03-05T03:26:43Z

@tenzen-y Hopefully, I fixed all of the test flakiness. Please take a look when you can.

tenzen-y

Otherwise lgtm

tenzen-y · 2025-03-05T03:27:52Z

.github/workflows/test-e2e.yaml

+          echo "Install Kind"
+          go install sigs.k8s.io/[email protected]
+


Suggested change

-          echo "Install Kind"
-          go install sigs.k8s.io/[email protected]

You might have forgotten to remove this.

tenzen-y · 2025-03-05T03:28:34Z

test/e2e/suite_test.go

+	trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
+	"github.com/onsi/ginkgo/v2"
+	"github.com/onsi/gomega"
+	"k8s.io/client-go/kubernetes/scheme"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/controller-runtime/pkg/client/config"


Suggested change

-	trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
-	"github.com/onsi/ginkgo/v2"
-	"github.com/onsi/gomega"
-	"k8s.io/client-go/kubernetes/scheme"
-	"sigs.k8s.io/controller-runtime/pkg/client"
-	"sigs.k8s.io/controller-runtime/pkg/client/config"
+	"github.com/onsi/ginkgo/v2"
+	"github.com/onsi/gomega"
+	"k8s.io/client-go/kubernetes/scheme"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/controller-runtime/pkg/client/config"
+	trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"

Electronic-Waste

@andreyvelich Thanks for this! Just a few comments.

Electronic-Waste · 2025-03-05T03:12:14Z

.github/workflows/test-e2e.yaml

+        # Kubernetes versions for e2e tests on Kind cluster.
+        kubernetes-version: ["1.29.14", "1.30.0", "1.31.0"]


Shall we support 1.32 since we plan to upgrade the Kubernetes version: #2448 ? Do we need to drop support for v1.29 after that?

That is out of the scope of this PR. The contributor of #2448 should address this.

Electronic-Waste · 2025-03-05T03:18:54Z

sdk/pyproject.toml

+    # TODO (andreyvelich): Update JobSet to v0.8.0 once this PR is merged: https://github.com/kubeflow/trainer/pull/2466
+    # "pydantic>=2.10.0",
+    "jobset @ git+https://github.com/kubernetes-sigs/[email protected]#subdirectory=sdk/python",


It might be better to use v0.7.3 since we used this version before: #2445

Actually, let me pin it directly to the updated commit, once we merge this PR:
kubernetes-sigs/jobset#810

Electronic-Waste · 2025-03-05T03:24:38Z

Makefile

+# Input and output location for Notebooks executed with Papermill.
+NOTEBOOK_INPUT=$(PROJECT_DIR)/examples/pytorch/image-classification/mnist.ipynb
+NOTEBOOK_OUTPUT=$(PROJECT_DIR)/trainer_output.ipynb
+PAPERMILL_TIMEOUT=900
+.PHONY: test-e2e-notebook
+test-e2e-notebook: ## Run Jupyter Notebook with Papermill.
+	NOTEBOOK_INPUT=$(NOTEBOOK_INPUT) NOTEBOOK_OUTPUT=$(NOTEBOOK_OUTPUT) PAPERMILL_TIMEOUT=$(PAPERMILL_TIMEOUT) ./hack/e2e-run-notebook.sh


I would suggest that we use NOTEBOOK_INPUT_LIST and NOTEBOOK_OUTPUT_LIST for e2e test with Notebook. We may add more testcases in the future:)

It looks like you mentioned future work. So, we can keep the current approach for now.
There are no guarantees and plans for additional tests for now.

Yeah, I agree @Electronic-Waste, but I want to discuss in the future how we can parallelize these steps, so we can run multiple Notebooks at the same time.
As @tenzen-y suggested, let's discuss it as a followup.
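
As a starting point for that follow-up discussion, the sketch below shows one way several notebooks could be executed concurrently. It is an assumption-laden illustration, not the contents of `hack/e2e-run-notebook.sh`: the function names, the `<name>_output.ipynb` naming scheme, and the injectable `runner` are hypothetical, and it assumes the `papermill` CLI with its `--execution-timeout` option (a per-cell timeout in seconds).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def output_path(notebook: str, output_dir: str) -> str:
    """Map .../mnist.ipynb to <output_dir>/mnist_output.ipynb (assumed naming)."""
    return str(Path(output_dir) / f"{Path(notebook).stem}_output.ipynb")


def papermill_command(notebook: str, output: str, timeout: int) -> list:
    # --execution-timeout is papermill's per-cell timeout in seconds.
    return ["papermill", notebook, output, "--execution-timeout", str(timeout)]


def run_notebooks(notebooks, output_dir, timeout=900, runner=subprocess.run):
    """Run every notebook concurrently and return {notebook: return code}."""

    def run_one(notebook):
        cmd = papermill_command(notebook, output_path(notebook, output_dir), timeout)
        return notebook, runner(cmd).returncode

    # One worker per notebook; at least one so an empty list is still valid.
    with ThreadPoolExecutor(max_workers=max(1, len(notebooks))) as pool:
        return dict(pool.map(run_one, notebooks))
```

The `runner` parameter exists only so the scheduling logic can be exercised without a cluster or papermill installed. Notebooks that share one Kind cluster may compete for CPU, so parallel runs could need the larger runners mentioned earlier in the thread.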

Electronic-Waste · 2025-03-05T03:25:28Z

.gitignore

+# The default output for Notebook after Papermill execution.
+trainer_output.ipynb


How about changing it to *_output.ipynb?

As you mentioned in #2470 (comment), let's add artifacts to gitignore.

Electronic-Waste · 2025-03-05T03:34:33Z

.github/workflows/test-e2e.yaml

+      # TODO (andreyvelich): Discuss how we can upload artifacts for multiple Notebooks.
+      - name: Upload notebook
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: mnist_output_${{ matrix.kubernetes-version }}.ipynb
+          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/mnist_output_${{ matrix.kubernetes-version }}.ipynb
+          retention-days: 1


Option 1: Output files to a separate directory
We can output these *_output.ipynb files to a certain directory like ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output, and upload the whole dir like:

- name: Upload all artifacts
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: all-artifacts
    path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output/*
    retention-days: 1

Option 2: Regex match

- name: Upload all artifacts
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: all-artifacts
    path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/*_output_${{ matrix.kubernetes-version }}.ipynb
    retention-days: 1

Instead of output, it might be better to use artifacts.

Great suggestion! I didn't know that upload artifact supports regex.
Let me create PR.

@andreyvelich You could check this: https://github.com/actions/upload-artifact#upload-using-a-wildcard-pattern
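
Note that the linked documentation describes wildcard (glob) patterns rather than regular expressions. The following Python sketch approximates which files the Option 2 pattern would select; it uses the standard-library `fnmatch` module as a stand-in, so it only illustrates the simple `*` case and is not the matching code that `actions/upload-artifact` runs. The function name and sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def matching_artifacts(filenames, kubernetes_version):
    """Return the names matching *_output_<version>.ipynb, in input order."""
    # In a glob, "*" matches any run of characters and "." is a literal dot.
    pattern = f"*_output_{kubernetes_version}.ipynb"
    return [name for name in filenames if fnmatchcase(name, pattern)]


files = [
    "mnist_output_1.31.0.ipynb",
    "bert_output_1.31.0.ipynb",
    "mnist_output_1.30.0.ipynb",
    "mnist.ipynb",
]
selected = matching_artifacts(files, "1.31.0")
```

Here `selected` contains only the two notebooks produced for Kubernetes 1.31.0, which is the per-matrix-entry behaviour the workflow step relies on.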

tenzen-y · 2025-03-05T03:53:11Z

@Electronic-Waste Could you address the following on your side? If yes, could you say /hold cancel, then open a PR for those?

/approve
/lgtm
/hold

Electronic-Waste · 2025-03-05T03:56:09Z

@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:)

/hold cancel

tenzen-y · 2025-03-05T03:56:34Z

@Electronic-Waste Could you address the following on your side? If yes, could you say /hold cancel, then open a PR for those?

chore(test): Add E2E tests for Kubeflow Trainer #2470 (comment)

chore(test): Add E2E tests for Kubeflow Trainer #2470 (comment)

chore(test): Add E2E tests for Kubeflow Trainer #2470 (comment)

/approve /lgtm /hold

Otherwise, Andrey will address those in this PR.

tenzen-y · 2025-03-05T03:56:46Z

@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:)

/hold cancel

Thanks!

Electronic-Waste

Thanks for this!
/lgtm

google-oss-prow · 2025-03-05T03:57:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Electronic-Waste, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added the do-not-merge/work-in-progress label Mar 3, 2025

google-oss-prow bot requested review from jinchihe and kuizhiqing March 3, 2025 23:19

google-oss-prow bot added the size/L label Mar 3, 2025

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 7c96d36 to b20e562 Compare March 3, 2025 23:24

google-oss-prow bot added size/XL and removed size/L labels Mar 4, 2025

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 38ca1e6 to 957d8af Compare March 4, 2025 01:30

google-oss-prow bot added size/L and removed size/XL labels Mar 4, 2025

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from da2d9d8 to 5af466d Compare March 4, 2025 02:01

google-oss-prow bot added size/XL and removed size/L labels Mar 4, 2025

andreyvelich commented Mar 4, 2025

andreyvelich changed the title ~~[WIP] chore(test): Add E2E tests for Kubeflow Trainer~~ chore(test): Add E2E tests for Kubeflow Trainer Mar 4, 2025

google-oss-prow bot removed the do-not-merge/work-in-progress label Mar 4, 2025

andreyvelich force-pushed the issue-2213-e2e-notebooks branch 2 times, most recently from 0f60a1f to a036b42 Compare March 4, 2025 15:16

Add e2e tests for Kubeflow Trainer

7d2b51f

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from a036b42 to 7d2b51f Compare March 4, 2025 15:17

andreyvelich mentioned this pull request Mar 4, 2025

Using GitHub Large Runners for Kubeflow Infra kubeflow/community#829

Open

andreyvelich added 5 commits March 4, 2025 17:38

Add timeout for papermill

53b8a24

Signed-off-by: Andrey Velichkevich <[email protected]>

Add output as part of make command

7098143

Signed-off-by: Andrey Velichkevich <[email protected]>

Add k8s version to setup cluster

a309007

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix Kind k8s version

bba6e6b

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix 1.29 version

761a9fa

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from b577e24 to 761a9fa Compare March 4, 2025 18:18

andreyvelich added 4 commits March 4, 2025 22:17

Fix path for Kind package

59c9581

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix Go e2e

11e7dd1

Signed-off-by: Andrey Velichkevich <[email protected]>

Reduce number of CPUs

8e802ce

Export Notebook as artifact Signed-off-by: Andrey Velichkevich <[email protected]>

Print logs due to flaky test

657c36e

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 62f0611 to 657c36e Compare March 5, 2025 01:09

andreyvelich added 4 commits March 5, 2025 02:12

Fix artifact path

1f3eff9

Signed-off-by: Andrey Velichkevich <[email protected]>

docker pull image

c369368

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix path

f1533b0

Signed-off-by: Andrey Velichkevich <[email protected]>

Add k8s version to output name

188aeee

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 6835a6e to 188aeee Compare March 5, 2025 03:15

tenzen-y reviewed Mar 5, 2025

Remove install Kind cmd

13a8d06

Signed-off-by: Andrey Velichkevich <[email protected]>

Electronic-Waste reviewed Mar 5, 2025

google-oss-prow bot added the do-not-merge/hold label Mar 5, 2025

google-oss-prow bot assigned tenzen-y Mar 5, 2025

google-oss-prow bot added lgtm approved labels Mar 5, 2025

google-oss-prow bot removed the do-not-merge/hold label Mar 5, 2025

Electronic-Waste approved these changes Mar 5, 2025

google-oss-prow bot assigned Electronic-Waste Mar 5, 2025

google-oss-prow bot merged commit 9e78575 into kubeflow:master Mar 5, 2025
14 checks passed

andreyvelich deleted the issue-2213-e2e-notebooks branch March 5, 2025 12:16

This was referenced Mar 5, 2025

chore(test): Upload artifacts from dir #2473

Merged

Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob #2407

Open
