What happened?
My training-operator pod's /metrics endpoint is not properly reporting the Prometheus metrics mentioned here.
To be precise, training_operator_jobs_created_total is being incremented as expected; the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.
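(For context, the raw endpoint can be inspected with something like the sketch below; the namespace, deployment name, and metrics port 8080 are assumptions about a default training-operator install, so adjust them to match your deployment.)

# Minimal sketch for reading the controller metrics locally. Assumes a port-forward such as:
#   kubectl -n kubeflow port-forward deploy/training-operator 8080:8080
# (namespace, deployment name, and port are assumptions about a default install)
import urllib.request

def training_operator_job_metrics(url="http://localhost:8080/metrics"):
    """Return only the training_operator_jobs_* lines from the /metrics page."""
    body = urllib.request.urlopen(url).read().decode("utf-8")
    return [line for line in body.splitlines() if line.startswith("training_operator_jobs")]

for line in training_operator_job_metrics():
    print(line)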
I started a PyTorch training job using this code (closely based on the guide here):
def train_func():
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms
    import torch.distributed as dist
    import os

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="gloo")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
from kubeflow.training import TrainingClient
training_client = TrainingClient()
job_name = "pytorch-ddp"
# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
training_client.create_job(
    name=job_name,
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    base_image="***.azurecr.io/pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",
    # resources_per_worker={"gpu": "1"},
)
This job completed and was successful:
Code:
training_client.is_job_succeeded(name=job_name)
Output:
True
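(As an extra sanity check, the job's conditions can also be inspected directly to confirm the controller recorded a Succeeded condition, which is presumably what drives the success counter. A minimal sketch, assuming your SDK version exposes get_job_conditions:)

# Sanity check: print the PyTorchJob's conditions (e.g. Created, Running, Succeeded).
# get_job_conditions is assumed to be available in this SDK version.
for condition in training_client.get_job_conditions(name=job_name):
    print(condition.type, condition.status, condition.reason)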
However, when visiting the /metrics endpoint, I could only see training_operator_jobs_created_total at 1; neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (in fact, neither counter was present on the page at all).
So, I deleted this job to try again:
training_client.delete_job(job_name)
Even after that, training_operator_jobs_deleted_total was neither incremented nor visible on the /metrics endpoint.
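(The deletion itself can be double-checked from the SDK side; a rough sketch, assuming list_jobs is available in this SDK version:)

# Rough check that the PyTorchJob resource is really gone after delete_job().
remaining = [job.metadata.name for job in training_client.list_jobs()]
print(job_name in remaining)  # expected to print False once the deletion has gone through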
I created a job again with the same code and repeated the process. From this point on, training_operator_jobs_successful_total was incremented as expected; however, training_operator_jobs_deleted_total still failed to update.
I have not had any failed or restarted jobs, so I cannot speak to the behavior of the failure- and restart-related counters.
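(To rule out a simple refresh/scrape-timing issue, the endpoint can also be polled for a couple of minutes after the deletion; a self-contained sketch, reusing the port-forward assumption from above:)

import time
import urllib.request

# Poll /metrics for up to ~2 minutes after delete_job(), looking for the deleted counter.
# Assumes the same local port-forward of the training-operator metrics port as above.
for _ in range(24):
    body = urllib.request.urlopen("http://localhost:8080/metrics").read().decode("utf-8")
    deleted = [l for l in body.splitlines() if l.startswith("training_operator_jobs_deleted_total")]
    if deleted:
        print(deleted)
        break
    time.sleep(5)
else:
    print("training_operator_jobs_deleted_total never appeared")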
What did you expect to happen?
I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.
I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.30.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
***.azurecr.io/kubeflow/training-operator:v1-5a5f92d
Training Operator Python SDK version: