Some Prometheus metrics not being reported properly #2408

Open
ishaan-mehta opened this issue Jan 31, 2025 · 0 comments

What happened?

My training-operator pod's /metrics endpoint is not properly reporting the Prometheus metrics mentioned here.

To be precise, training_operator_jobs_created_total is being incremented as expected — the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.

I started a PyTorch training job using this code (closely based on the guide here):

def train_func():
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms
    import torch.distributed as dist
    import os

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="gloo")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

training_client = TrainingClient()
job_name = "pytorch-ddp"

# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
training_client.create_job(
    name=job_name,
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    base_image="***.azurecr.io/pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",
    # resources_per_worker={"gpu": "1"},
)

This job completed and was successful:
Code:

training_client.is_job_succeeded(name=job_name)

Output:

True
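
As a cross-check on is_job_succeeded, the job's status conditions can also be read directly from the API server. The sketch below uses the kubernetes Python client that the SDK already pulls in; the kubeflow.org/v1 pytorchjobs group/version/plural and the "default" namespace are assumptions based on my setup.

# Sketch: read the PyTorchJob's status conditions straight from the API
# server, independent of the SDK helpers. Assumes kubeflow.org/v1
# pytorchjobs and that the job lives in the "default" namespace.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

job = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    name="pytorch-ddp",
)
for cond in job.get("status", {}).get("conditions", []):
    print(cond["type"], cond["status"], cond.get("reason", ""))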

However, when visiting the /metrics endpoint, I could only see that training_operator_jobs_created_total was at 1 — neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (neither one was present on the page).
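
For reference, this is roughly how I am checking the endpoint. It is a minimal sketch (using the third-party requests package) that assumes the operator's metrics port has already been forwarded locally, e.g. via kubectl port-forward on the training-operator deployment in the kubeflow namespace; the local port 8080 is an assumption, not necessarily your setup's port.

# Sketch: scrape the forwarded /metrics endpoint and report the four
# job counters. Assumes http://localhost:8080/metrics is reachable.
import re

import requests

metrics_text = requests.get("http://localhost:8080/metrics", timeout=10).text

for name in (
    "training_operator_jobs_created_total",
    "training_operator_jobs_successful_total",
    "training_operator_jobs_failed_total",
    "training_operator_jobs_deleted_total",
):
    # Matches lines like: training_operator_jobs_created_total{...} 1
    matches = re.findall(rf"^{name}(?:{{[^}}]*}})?\s+(\S+)$", metrics_text, re.MULTILINE)
    print(name, matches if matches else "not present")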

So, I deleted this job to try again:

training_client.delete_job(job_name)

Still, on the /metrics endpoint, training_operator_jobs_deleted_total was not incremented/visible.
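
To rule out a simple scrape-timing issue, the counter can be polled for a while after the delete (same port-forward assumption as above); in my case it never appears.

# Sketch: poll the forwarded /metrics endpoint for a few minutes after
# delete_job to rule out timing. Same localhost:8080 assumption as above.
import re
import time

import requests

for _ in range(18):  # ~3 minutes at a 10 s interval
    text = requests.get("http://localhost:8080/metrics", timeout=10).text
    match = re.search(
        r"^training_operator_jobs_deleted_total(?:{[^}]*})?\s+(\S+)$",
        text,
        re.MULTILINE,
    )
    print("training_operator_jobs_deleted_total:", match.group(1) if match else "not present")
    time.sleep(10)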

I created a job again with the same code and repeated the process. From that point on, training_operator_jobs_successful_total has been incremented as expected; however, training_operator_jobs_deleted_total still fails to update.

I have not had any failed/restarted jobs, so I do not know about the behavior of those two metrics.

What did you expect to happen?

I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.

I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.
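
A quick way to check for deletion-related errors is to grep the operator logs with the kubernetes client; the sketch below takes the kubeflow namespace and the control-plane label from the environment details further down.

# Sketch: scan the training-operator pod logs for deletion-related errors.
# Namespace and label selector are taken from the environment section below.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="kubeflow",
    label_selector="control-plane=kubeflow-training-operator",
)
for pod in pods.items:
    logs = core.read_namespaced_pod_log(name=pod.metadata.name, namespace="kubeflow")
    for line in logs.splitlines():
        if "error" in line.lower() and "delet" in line.lower():
            print(pod.metadata.name, line)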

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.30.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
***.azurecr.io/kubeflow/training-operator:v1-5a5f92d

Training Operator Python SDK version:

$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.9.0rc0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍
