-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(test): Add E2E tests for Kubeflow Trainer #2470
chore(test): Add E2E tests for Kubeflow Trainer #2470
Conversation
7c96d36
to
b20e562
Compare
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
38ca1e6
to
957d8af
Compare
da2d9d8
to
5af466d
Compare
@@ -122,6 +122,25 @@ def get_resources_per_node(resources_per_node: dict) -> client.V1ResourceRequire | |||
return resources | |||
|
|||
|
|||
# TODO (andreyvelich): Change return type to IntOrString. | |||
def get_num_proc_per_node(resources_per_node: dict) -> object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti I added this util func to get numProcPerNode based on CPU and GPU configuration, as we discussed before.
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich ah that's interesting, intuitively I would have defined that logic backend side in the torch plugin.
Not all the runtimes may have that "auto" logic, and some may be cgroups friendly.
Also the capping logic should take into account fractional CPU and millicore request / limit, which may be easier to achieved controller-side.
Adding #2407 for reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that is good point, yeah maybe we could move this part to the plugin itself.
Let's add the TODO statement to refactor it in the followup PRs, so we can integrate initial E2E tests working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I agree with you, better integrate E2E asap, and we can work on #2407 separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, we can construct Torch environment variables based on input numProcPerNode as we do in the current torch plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, what we try to say here is if user doesn't explicitly set numProcPerNode
in TrainJob, we should construct this value based on Trainer container resources (if they are configured).
And we should do it inside the torch
plugin on the server side, not in the get_num_proc_per_node()
function on the client side.
Does it make sense @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for your confusion. That is what I wanted to say.
This should be ready for review. |
FYI, I was able to use large GitHub runners in our E2Es, after enabling it for
I will create issue to inform Kubeflow community that we can use these runners. |
0f60a1f
to
a036b42
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
a036b42
to
7d2b51f
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
b577e24
to
761a9fa
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Export Notebook as artifact Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
62f0611
to
657c36e
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
6835a6e
to
188aeee
Compare
@tenzen-y Hopefully, I fixed all of the test flakiness. Please take a look when you can. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm
.github/workflows/test-e2e.yaml
Outdated
echo "Install Kind" | ||
go install sigs.k8s.io/[email protected] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
echo "Install Kind" | |
go install sigs.k8s.io/[email protected] |
You might forget to remove this.
test/e2e/suite_test.go
Outdated
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" | ||
"github.com/onsi/ginkgo/v2" | ||
"github.com/onsi/gomega" | ||
"k8s.io/client-go/kubernetes/scheme" | ||
"sigs.k8s.io/controller-runtime/pkg/client" | ||
"sigs.k8s.io/controller-runtime/pkg/client/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" | |
"github.com/onsi/ginkgo/v2" | |
"github.com/onsi/gomega" | |
"k8s.io/client-go/kubernetes/scheme" | |
"sigs.k8s.io/controller-runtime/pkg/client" | |
"sigs.k8s.io/controller-runtime/pkg/client/config" | |
"github.com/onsi/ginkgo/v2" | |
"github.com/onsi/gomega" | |
"k8s.io/client-go/kubernetes/scheme" | |
"sigs.k8s.io/controller-runtime/pkg/client" | |
"sigs.k8s.io/controller-runtime/pkg/client/config" | |
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" |
Signed-off-by: Andrey Velichkevich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks for this! Just a few comments.
# Kubernetes versions for e2e tests on Kind cluster. | ||
kubernetes-version: ["1.29.14", "1.30.0", "1.31.0"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we support 1.32 since we plan to upgrade the Kubernetes version: #2448 ? Do we need to drop support for v1.29 after that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is out of the scope of this PR. #2448 contributor should address this.
# TODO (andreyvelich): Update JobSet to v0.8.0 once this PR is merged: https://github.com/kubeflow/trainer/pull/2466 | ||
# "pydantic>=2.10.0", | ||
"jobset @ git+https://github.com/kubernetes-sigs/[email protected]#subdirectory=sdk/python", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to use v0.7.3 since we used this version before: #2445
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let me ping it directly to updated commit, once we merge this PR:
kubernetes-sigs/jobset#810
# Input and output location for Notebooks executed with Papermill. | ||
NOTEBOOK_INPUT=$(PROJECT_DIR)/examples/pytorch/image-classification/mnist.ipynb | ||
NOTEBOOK_OUTPUT=$(PROJECT_DIR)/trainer_output.ipynb | ||
PAPERMILL_TIMEOUT=900 | ||
.PHONY: test-e2e-notebook | ||
test-e2e-notebook: ## Run Jupyter Notebook with Papermill. | ||
NOTEBOOK_INPUT=$(NOTEBOOK_INPUT) NOTEBOOK_OUTPUT=$(NOTEBOOK_OUTPUT) PAPERMILL_TIMEOUT=$(PAPERMILL_TIMEOUT) ./hack/e2e-run-notebook.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that we use NOTEBOOK_INPUT_LIST
and NOTEBOOK_OUTPUT_LIST
for e2e test with Notebook. We may add more testcases in the future:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you mentioned future work. So, we can keep current approach for now.
There are no guarantees and plans for additional tests for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree @Electronic-Waste, but I want to discuss in the future how we can parallelize these steps, so we can run multiple Notebooks at the same time.
As @tenzen-y suggested, let's discuss it as a followup.
# The default output for Notebook after Papermill execution. | ||
trainer_output.ipynb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing it to *_output.ipynb
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned in #2470 (comment), let's add artifacts
to gitignore.
# TODO (andreyvelich): Discuss how we can upload artifacts for multiple Notebooks. | ||
- name: Upload notebook | ||
uses: actions/upload-artifact@v4 | ||
if: always() | ||
with: | ||
name: mnist_output_${{ matrix.kubernetes-version }}.ipynb | ||
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/mnist_output_${{ matrix.kubernetes-version }}.ipynb | ||
retention-days: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option1: Output files to a seperate directory
We can output these *_output.ipynb
files to a certain directory like ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output
, and upload the whole dir like:
- name: Upload all artifacts
uses: actions/upload-artifact@v4
if: always()
with:
name: all-artifacts
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output/*
retention-days: 1
Option2: regex match
- name: Upload all artifacts
uses: actions/upload-artifact@v4
if: always()
with:
name: all-artifacts
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/*_output_${{ matrix.kubernetes-version }}.ipynb
retention-days: 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of output
, it might be better to use artifacts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! I didn't know that upload artifact supports regex.
Let me create PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich You could check this: https://github.com/actions/upload-artifact#upload-using-a-wildcard-pattern
@Electronic-Waste Could you address in the following in your side? If yes, could you say
/approve |
@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:) /hold cancel |
Otherwise, Andrey will address those in this PR. |
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Electronic-Waste, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes: #2213
This PR adds E2E tests for: