Insights: kubeflow/trainer
Overview
10 Pull requests merged by 4 people
-
feat(sdk): Generate external Kubernetes and JobSet models
#2466 merged
Mar 5, 2025 -
chore(test): Upload artifacts from dir
#2473 merged
Mar 5, 2025 -
Make MPIMLPolicySource optional fields as a pointer
#2472 merged
Mar 5, 2025 -
Implement UTs for PlainML plugin
#2469 merged
Mar 5, 2025 -
chore(test): Add E2E tests for Kubeflow Trainer
#2470 merged
Mar 5, 2025 -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design
#2439 merged
Mar 3, 2025 -
fix: fix typos in script comments.
#2465 merged
Mar 3, 2025 -
fix(sdk): resolve errors in deserialization
#2457 merged
Mar 2, 2025 -
Bump JobSet to v0.8.0
#2463 merged
Mar 1, 2025 -
Replace Kueue PodRequests helper with core k/k one
#2461 merged
Feb 28, 2025
5 Pull requests opened by 3 people
-
[feature] migrate images to ghcr
#2455 opened
Feb 27, 2025 -
Add Initialized and ComponentsCreated conditions to TrainJob API
#2464 opened
Mar 1, 2025 -
WIP: Implement torch plugin UTs
#2471 opened
Mar 5, 2025 -
Add MPIMLPolicySource CRD defaulters
#2474 opened
Mar 5, 2025 -
Use large runner for building container image
#2475 opened
Mar 5, 2025
8 Issues closed by 1 person
-
KEP-2170: Add E2E tests for Kubeflow Training V2
#2213 closed
Mar 5, 2025 -
Upgrade jobset version to v0.8.0
#2447 closed
Mar 1, 2025 -
KEP-2170: Migrate the container resource calculation mechanism to k/k library
#2280 closed
Feb 28, 2025 -
Flaky test: [It] should delete redundant Pods
#1848 closed
Feb 28, 2025 -
Flaky test: [It] should update TFJob with desired status
#1820 closed
Feb 28, 2025 -
Flaky test: [It] should delete job when expired time is up
#1821 closed
Feb 28, 2025 -
Flaky test: [It] should delete designated Pod
#1844 closed
Feb 28, 2025 -
Flaky test: [It] should create missing Pods
#1838 closed
Feb 28, 2025
9 Issues opened by 4 people
-
Decouple UTs between Framework and Plugins packages
#2468 opened
Mar 3, 2025 -
Use JobSet DependsOn API for TrainJob
#2467 opened
Mar 3, 2025 -
Explore `uv` project manager for Kubeflow Python SDK
#2462 opened
Feb 28, 2025 -
KEP-2170: Revisit TrainJob Created condition status type
#2459 opened
Feb 28, 2025 -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Concept page to Documentation
#2458 opened
Feb 28, 2025 -
Distributed training with multiple pods, with multi-gpu in each pod
#2456 opened
Feb 28, 2025 -
Managing Pod Lifecycle in Distributed Training with TFJob
#2454 opened
Feb 27, 2025 -
Strategies for Deleting Successful Pods without Affecting Task Execution in TFJob
#2453 opened
Feb 27, 2025
26 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add Helm chart for kubeflow trainer
#2435 commented on
Mar 5, 2025 • 55 new comments -
KEP-2401: Kubeflow LLM Trainer V2
#2410 commented on
Mar 5, 2025 • 37 new comments -
KEP-2170: Adding validation webhook for v2 trainjob
#2307 commented on
Feb 28, 2025 • 17 new comments -
fix restart policy bug in mpi job UpdateJobConditions
#2344 commented on
Mar 5, 2025 • 5 new comments -
Upgrade: K8s 1.32
#2448 commented on
Mar 4, 2025 • 4 new comments -
Add internal-cert-controller disable flag
#2426 commented on
Mar 3, 2025 • 0 new comments -
Add e2e tests for runtimes v2
#2406 commented on
Mar 2, 2025 • 0 new comments -
KEP-2170: Add the manifests overlay for Kubeflow Training V2
#2382 commented on
Mar 1, 2025 • 0 new comments -
Improved the release process of training operator
#2359 commented on
Feb 26, 2025 • 0 new comments -
Cap `nproc_per_node` based on the CPU resources of the node for PyTorch TrainJob
#2407 commented on
Mar 5, 2025 • 0 new comments -
Don't overwrite PYTHONUNBUFFERED in PyTorchJob (and others)
#1921 commented on
Mar 5, 2025 • 0 new comments -
Add unit tests that cover the `pkg/apply` package
#2452 commented on
Mar 5, 2025 • 0 new comments -
KEP-2170: Kubeflow Trainer V2 API
#2170 commented on
Mar 5, 2025 • 0 new comments -
mpi job bug
#2334 commented on
Mar 5, 2025 • 0 new comments -
Add migration guide from Training Operator to Kubeflow Trainer V2
#2412 commented on
Mar 4, 2025 • 0 new comments -
Create Slurm runtime for model training using V2 APIs
#2249 commented on
Mar 4, 2025 • 0 new comments -
Export Models to Kubeflow Model Registry
#2438 commented on
Feb 28, 2025 • 0 new comments -
Support Volcano Scheduler in Kubeflow Trainer
#2437 commented on
Feb 28, 2025 • 0 new comments -
Enable GPU Testing for LLM Blueprints
#2432 commented on
Feb 28, 2025 • 0 new comments -
Support TensorFlow Runtime
#2443 commented on
Feb 28, 2025 • 0 new comments -
Support JAX Runtimes
#2442 commented on
Feb 28, 2025 • 0 new comments -
Support richer volcano scheduling
#2182 commented on
Feb 28, 2025 • 0 new comments -
pytorchjob didn't create worker pod, seems hang
#2327 commented on
Feb 28, 2025 • 0 new comments -
How can I change the default MASTER_ADDR in Pytorchjob?
#2331 commented on
Feb 28, 2025 • 0 new comments -
"zero-trust" security / networking for training jobs
#2341 commented on
Feb 27, 2025 • 0 new comments -
Migrate images in Dockerhub to GHCR
#2446 commented on
Feb 27, 2025 • 0 new comments