mnist freezes on test with ROCM #1292

Open
jlo62 opened this issue Oct 11, 2024 · 0 comments

jlo62 commented Oct 11, 2024

Context

  • PyTorch version: 2.3.1
  • Operating System and version: Arch Linux

Your Environment

  • Installed using source? [yes/no]: yes (via AUR)
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU / ROCm (Radeon RX 7800 XT)
  • Which example are you using: mnist
  • Link to code or data to repro [if any]: https://github.com/pytorch/examples/blob/main/mnist/main.py

Expected Behavior

After training, the model should be evaluated on the test set.

Current Behavior

When the test phase starts, the process hangs and spins on a single CPU thread instead of evaluating.
This happens here, in test(), lines 57-65:

    with torch.no_grad():
        for data, target in test_loader:
            print(1)  # debug print: reached, just before moving the batch to the device
            data, target = data.to(device), target.to(device)
            print(2)  # debug print: never reached; the hang is in the .to(device) call above
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

The hang occurs between print(1) and print(2), i.e. in the data.to(device) call.

I then kill the process with pkill pt_main_thread.
Setting the test batch size to a low value does not help.
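For reference, here is a minimal sketch (not from the example; device handling and shapes are my assumptions) that exercises only the suspect step, a plain host-to-device copy outside the DataLoader, to check whether the hang is in the ROCm copy itself:

    # Minimal sketch: reproduce just the suspected step (a host-to-device copy)
    # without the MNIST DataLoader. Assumes a PyTorch ROCm build, where the GPU
    # is still addressed as "cuda".
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("using device:", device)

    x = torch.randn(1000, 1, 28, 28)   # dummy batch shaped like the MNIST test data
    print("before .to(device)")
    x = x.to(device)                   # the call that appears to hang in test()
    if device.type == "cuda":
        torch.cuda.synchronize()       # force the copy to actually complete
    print("after .to(device)", tuple(x.shape))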

Possible Solution

Run it on the CPU instead, either by passing the --no-cuda flag or by setting ROCR_VISIBLE_DEVICES=2 so the GPU is not visible.
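For context, the CPU fallback works because the example selects its device from the --no-cuda flag. A rough paraphrase of that logic (simplified, not verbatim from main.py) is:

    # Rough paraphrase of the device selection in mnist/main.py: --no-cuda forces
    # the CPU path, so the ROCm host-to-device copy that hangs is never reached.
    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--no-cuda", action="store_true",
                        help="disables CUDA/ROCm training")
    args = parser.parse_args()

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    print("using device:", device)

Hiding the GPU via ROCR_VISIBLE_DEVICES (as in the report) has the same effect, since torch.cuda.is_available() then returns False and the CPU device is chosen.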

Failure Logs [if any]

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.279597
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.216242
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.935520
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.621186
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.459617
Train Epoch: 1 [3200/60000 (5%)]	Loss: 0.555883
Train Epoch: 1 [3840/60000 (6%)]	Loss: 0.248135
Train Epoch: 1 [4480/60000 (7%)]	Loss: 0.476440
Train Epoch: 1 [5120/60000 (9%)]	Loss: 0.286069
Train Epoch: 1 [5760/60000 (10%)]	Loss: 0.101378
Train Epoch: 1 [6400/60000 (11%)]	Loss: 0.317981
Train Epoch: 1 [7040/60000 (12%)]	Loss: 0.234222
Train Epoch: 1 [7680/60000 (13%)]	Loss: 0.310746
Train Epoch: 1 [8320/60000 (14%)]	Loss: 0.122714
Train Epoch: 1 [8960/60000 (15%)]	Loss: 0.456426
Train Epoch: 1 [9600/60000 (16%)]	Loss: 0.074296
Train Epoch: 1 [10240/60000 (17%)]	Loss: 0.261630
Train Epoch: 1 [10880/60000 (18%)]	Loss: 0.238516
Train Epoch: 1 [11520/60000 (19%)]	Loss: 0.173536
Train Epoch: 1 [12160/60000 (20%)]	Loss: 0.169779
Train Epoch: 1 [12800/60000 (21%)]	Loss: 0.045510
Train Epoch: 1 [13440/60000 (22%)]	Loss: 0.205859
Train Epoch: 1 [14080/60000 (23%)]	Loss: 0.195058
Train Epoch: 1 [14720/60000 (25%)]	Loss: 0.140971
Train Epoch: 1 [15360/60000 (26%)]	Loss: 0.262293
Train Epoch: 1 [16000/60000 (27%)]	Loss: 0.285171
Train Epoch: 1 [16640/60000 (28%)]	Loss: 0.098628
Train Epoch: 1 [17280/60000 (29%)]	Loss: 0.163876
Train Epoch: 1 [17920/60000 (30%)]	Loss: 0.131609
Train Epoch: 1 [18560/60000 (31%)]	Loss: 0.172449
Train Epoch: 1 [19200/60000 (32%)]	Loss: 0.131192
Train Epoch: 1 [19840/60000 (33%)]	Loss: 0.089265
Train Epoch: 1 [20480/60000 (34%)]	Loss: 0.200241
Train Epoch: 1 [21120/60000 (35%)]	Loss: 0.116003
Train Epoch: 1 [21760/60000 (36%)]	Loss: 0.337610
Train Epoch: 1 [22400/60000 (37%)]	Loss: 0.177359
Train Epoch: 1 [23040/60000 (38%)]	Loss: 0.181004
Train Epoch: 1 [23680/60000 (39%)]	Loss: 0.109945
Train Epoch: 1 [24320/60000 (41%)]	Loss: 0.126567
Train Epoch: 1 [24960/60000 (42%)]	Loss: 0.081637
Train Epoch: 1 [25600/60000 (43%)]	Loss: 0.118572
Train Epoch: 1 [26240/60000 (44%)]	Loss: 0.262203
Train Epoch: 1 [26880/60000 (45%)]	Loss: 0.266514
Train Epoch: 1 [27520/60000 (46%)]	Loss: 0.025646
Train Epoch: 1 [28160/60000 (47%)]	Loss: 0.238066
Train Epoch: 1 [28800/60000 (48%)]	Loss: 0.017015
Train Epoch: 1 [29440/60000 (49%)]	Loss: 0.128963
Train Epoch: 1 [30080/60000 (50%)]	Loss: 0.084565
Train Epoch: 1 [30720/60000 (51%)]	Loss: 0.141485
Train Epoch: 1 [31360/60000 (52%)]	Loss: 0.109501
Train Epoch: 1 [32000/60000 (53%)]	Loss: 0.228396
Train Epoch: 1 [32640/60000 (54%)]	Loss: 0.028802
Train Epoch: 1 [33280/60000 (55%)]	Loss: 0.093304
Train Epoch: 1 [33920/60000 (57%)]	Loss: 0.187867
Train Epoch: 1 [34560/60000 (58%)]	Loss: 0.078651
Train Epoch: 1 [35200/60000 (59%)]	Loss: 0.100239
Train Epoch: 1 [35840/60000 (60%)]	Loss: 0.065758
Train Epoch: 1 [36480/60000 (61%)]	Loss: 0.159857
Train Epoch: 1 [37120/60000 (62%)]	Loss: 0.068338
Train Epoch: 1 [37760/60000 (63%)]	Loss: 0.116931
Train Epoch: 1 [38400/60000 (64%)]	Loss: 0.108750
Train Epoch: 1 [39040/60000 (65%)]	Loss: 0.067337
Train Epoch: 1 [39680/60000 (66%)]	Loss: 0.514672
Train Epoch: 1 [40320/60000 (67%)]	Loss: 0.139609
Train Epoch: 1 [40960/60000 (68%)]	Loss: 0.125796
Train Epoch: 1 [41600/60000 (69%)]	Loss: 0.301703
Train Epoch: 1 [42240/60000 (70%)]	Loss: 0.078540
Train Epoch: 1 [42880/60000 (71%)]	Loss: 0.149661
Train Epoch: 1 [43520/60000 (72%)]	Loss: 0.038693
Train Epoch: 1 [44160/60000 (74%)]	Loss: 0.050987
Train Epoch: 1 [44800/60000 (75%)]	Loss: 0.065854
Train Epoch: 1 [45440/60000 (76%)]	Loss: 0.253564
Train Epoch: 1 [46080/60000 (77%)]	Loss: 0.044726
Train Epoch: 1 [46720/60000 (78%)]	Loss: 0.076648
Train Epoch: 1 [47360/60000 (79%)]	Loss: 0.166157
Train Epoch: 1 [48000/60000 (80%)]	Loss: 0.081918
Train Epoch: 1 [48640/60000 (81%)]	Loss: 0.243725
Train Epoch: 1 [49280/60000 (82%)]	Loss: 0.031923
Train Epoch: 1 [49920/60000 (83%)]	Loss: 0.099474
Train Epoch: 1 [50560/60000 (84%)]	Loss: 0.082273
Train Epoch: 1 [51200/60000 (85%)]	Loss: 0.081125
Train Epoch: 1 [51840/60000 (86%)]	Loss: 0.114273
Train Epoch: 1 [52480/60000 (87%)]	Loss: 0.197501
Train Epoch: 1 [53120/60000 (88%)]	Loss: 0.020628
Train Epoch: 1 [53760/60000 (90%)]	Loss: 0.080297
Train Epoch: 1 [54400/60000 (91%)]	Loss: 0.180997
Train Epoch: 1 [55040/60000 (92%)]	Loss: 0.324929
Train Epoch: 1 [55680/60000 (93%)]	Loss: 0.116702
Train Epoch: 1 [56320/60000 (94%)]	Loss: 0.189182
Train Epoch: 1 [56960/60000 (95%)]	Loss: 0.097195
Train Epoch: 1 [57600/60000 (96%)]	Loss: 0.022219
Train Epoch: 1 [58240/60000 (97%)]	Loss: 0.181135
Train Epoch: 1 [58880/60000 (98%)]	Loss: 0.042285
Train Epoch: 1 [59520/60000 (99%)]	Loss: 0.108003
1
zsh: terminated  ROCR_VISIBLE_DEVICES=1 python main.py