Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(test): Add E2E tests for Kubeflow Trainer #2470

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Mar 3, 2025

Fixes: #2213

This PR adds E2E tests for:

  1. Simple TrainJob creation that reference existing torch runtime
  2. Test for MNIST Notebook

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 4, 2025
@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 38ca1e6 to 957d8af Compare March 4, 2025 01:30
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Mar 4, 2025
@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch from da2d9d8 to 5af466d Compare March 4, 2025 02:01
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 4, 2025
@@ -122,6 +122,25 @@ def get_resources_per_node(resources_per_node: dict) -> client.V1ResourceRequire
return resources


# TODO (andreyvelich): Change return type to IntOrString.
def get_num_proc_per_node(resources_per_node: dict) -> object:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti I added this util func to get numProcPerNode based on CPU and GPU configuration, as we discussed before.
WDYT ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich ah that's interesting, intuitively I would have defined that logic backend side in the torch plugin.
Not all the runtimes may have that "auto" logic, and some may be cgroups friendly.

Also the capping logic should take into account fractional CPU and millicore request / limit, which may be easier to achieved controller-side.

Adding #2407 for reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is good point, yeah maybe we could move this part to the plugin itself.
Let's add the TODO statement to refactor it in the followup PRs, so we can integrate initial E2E tests working?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I agree with you, better integrate E2E asap, and we can work on #2407 separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, we can construct Torch environment variables based on input numProcPerNode as we do in the current torch plugin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, what we try to say here is if user doesn't explicitly set numProcPerNode in TrainJob, we should construct this value based on Trainer container resources (if they are configured).

And we should do it inside the torch plugin on the server side, not in the get_num_proc_per_node() function on the client side.

Does it make sense @tenzen-y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for your confusion. That is what I wanted to say.

@andreyvelich andreyvelich changed the title [WIP] chore(test): Add E2E tests for Kubeflow Trainer chore(test): Add E2E tests for Kubeflow Trainer Mar 4, 2025
@andreyvelich
Copy link
Member Author

This should be ready for review.
/assign @kubeflow/wg-training-leads @Electronic-Waste @astefanutti @seanlaii @saileshd1402

@andreyvelich
Copy link
Member Author

andreyvelich commented Mar 4, 2025

FYI, I was able to use large GitHub runners in our E2Es, after enabling it for kubeflow/trainer repo 🎉

runs-on:
  labels: ubuntu-latest-16-cores

I will create issue to inform Kubeflow community that we can use these runners.

@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch 2 times, most recently from 0f60a1f to a036b42 Compare March 4, 2025 15:16
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch from b577e24 to 761a9fa Compare March 4, 2025 18:18
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Export Notebook as artifact

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 62f0611 to 657c36e Compare March 5, 2025 01:09
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2213-e2e-notebooks branch from 6835a6e to 188aeee Compare March 5, 2025 03:15
@andreyvelich
Copy link
Member Author

@tenzen-y Hopefully, I fixed all of the test flakiness. Please take a look when you can.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

Comment on lines 42 to 44
echo "Install Kind"
go install sigs.k8s.io/[email protected]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo "Install Kind"
go install sigs.k8s.io/[email protected]

You might forget to remove this.

Comment on lines 23 to 28
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
"github.com/onsi/ginkgo/v2"
"github.com/onsi/gomega"
"k8s.io/client-go/kubernetes/scheme"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/config"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
"github.com/onsi/ginkgo/v2"
"github.com/onsi/gomega"
"k8s.io/client-go/kubernetes/scheme"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/config"
"github.com/onsi/ginkgo/v2"
"github.com/onsi/gomega"
"k8s.io/client-go/kubernetes/scheme"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/config"
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"

Signed-off-by: Andrey Velichkevich <[email protected]>
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Thanks for this! Just a few comments.

Comment on lines +21 to +22
# Kubernetes versions for e2e tests on Kind cluster.
kubernetes-version: ["1.29.14", "1.30.0", "1.31.0"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we support 1.32 since we plan to upgrade the Kubernetes version: #2448 ? Do we need to drop support for v1.29 after that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is out of the scope of this PR. #2448 contributor should address this.

Comment on lines +34 to +36
# TODO (andreyvelich): Update JobSet to v0.8.0 once this PR is merged: https://github.com/kubeflow/trainer/pull/2466
# "pydantic>=2.10.0",
"jobset @ git+https://github.com/kubernetes-sigs/[email protected]#subdirectory=sdk/python",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to use v0.7.3 since we used this version before: #2445

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let me ping it directly to updated commit, once we merge this PR:
kubernetes-sigs/jobset#810

Comment on lines +144 to +150
# Input and output location for Notebooks executed with Papermill.
NOTEBOOK_INPUT=$(PROJECT_DIR)/examples/pytorch/image-classification/mnist.ipynb
NOTEBOOK_OUTPUT=$(PROJECT_DIR)/trainer_output.ipynb
PAPERMILL_TIMEOUT=900
.PHONY: test-e2e-notebook
test-e2e-notebook: ## Run Jupyter Notebook with Papermill.
NOTEBOOK_INPUT=$(NOTEBOOK_INPUT) NOTEBOOK_OUTPUT=$(NOTEBOOK_OUTPUT) PAPERMILL_TIMEOUT=$(PAPERMILL_TIMEOUT) ./hack/e2e-run-notebook.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that we use NOTEBOOK_INPUT_LIST and NOTEBOOK_OUTPUT_LIST for e2e test with Notebook. We may add more testcases in the future:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you mentioned future work. So, we can keep current approach for now.
There are no guarantees and plans for additional tests for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree @Electronic-Waste, but I want to discuss in the future how we can parallelize these steps, so we can run multiple Notebooks at the same time.
As @tenzen-y suggested, let's discuss it as a followup.

Comment on lines +16 to +17
# The default output for Notebook after Papermill execution.
trainer_output.ipynb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing it to *_output.ipynb?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in #2470 (comment), let's add artifacts to gitignore.

Comment on lines +63 to +70
# TODO (andreyvelich): Discuss how we can upload artifacts for multiple Notebooks.
- name: Upload notebook
uses: actions/upload-artifact@v4
if: always()
with:
name: mnist_output_${{ matrix.kubernetes-version }}.ipynb
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/mnist_output_${{ matrix.kubernetes-version }}.ipynb
retention-days: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option1: Output files to a seperate directory
We can output these *_output.ipynb files to a certain directory like ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output, and upload the whole dir like:

- name: Upload all artifacts
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: all-artifacts
    path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output/*
    retention-days: 1

Option2: regex match

- name: Upload all artifacts
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: all-artifacts
    path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/*_output_${{ matrix.kubernetes-version }}.ipynb
    retention-days: 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of output, it might be better to use artifacts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! I didn't know that upload artifact supports regex.
Let me create PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 5, 2025

@Electronic-Waste Could you address in the following in your side? If yes, could you say /hold cancel, then open PR for those?

/approve
/lgtm
/hold

@Electronic-Waste
Copy link
Member

@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:)

/hold cancel

@tenzen-y
Copy link
Member

tenzen-y commented Mar 5, 2025

@Electronic-Waste Could you address in the following in your side? If yes, could you say /hold cancel, then open PR for those?

/approve /lgtm /hold

Otherwise, Andrey will address those in this PR.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 5, 2025

@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:)

/hold cancel

Thanks!

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!
/lgtm

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Electronic-Waste, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 9e78575 into kubeflow:master Mar 5, 2025
14 checks passed
@andreyvelich andreyvelich deleted the issue-2213-e2e-notebooks branch March 5, 2025 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Add E2E tests for Kubeflow Training V2
4 participants