mnist freezes on test with ROCM
jlo62 commented
Context
- PyTorch version: 2.3.1
- Operating System and version: Arch Linux
Your Environment
- Installed using source? [yes/no]: yes (via AUR)
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU/ROCm (Radeon RX 7800 XT)
- Which example are you using: mnist
- Link to code or data to repro [if any]: https://github.com/pytorch/examples/blob/main/mnist/main.py
Expected Behavior
After each training epoch, the trained model should be evaluated on the test set.
Current Behavior
When the test phase should start, the process hangs instead and keeps a single CPU thread busy.
This happens in test(), lines 57-65:
with torch.no_grad():
    for data, target in test_loader:
        print(1)
        data, target = data.to(device), target.to(device)
        print(2)
        output = model(data)
        test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
        pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
        correct += pred.eq(target.view_as(pred)).sum().item()
The hang occurs between print(1) and print(2), that is, on the line that moves data and target to the device with .to(device).
I then kill it with pkill pt_main_thread
Lowering the test batch size does not help.
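To narrow down where inside the transfer the process is stuck, one option is a watchdog that dumps the Python tracebacks of all threads if a block runs too long. This is a minimal sketch using the standard library's faulthandler module; the helper name hang_watchdog and the 30 second timeout are assumptions for illustration and are not part of the mnist example.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump the tracebacks of all threads if the wrapped block takes longer
    than timeout_s seconds. The process is not terminated (exit=False)."""
    target = sys.stderr if file is None else file
    # faulthandler uses its own watchdog thread, so the dump is written
    # even while the main thread is blocked inside a native call.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the block has finished.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside test(), around the line that hangs:
#
#     with hang_watchdog(30):
#         data, target = data.to(device), target.to(device)
```

The dump only shows Python-level frames, so it confirms which Python line is blocked but not what the ROCm runtime is doing underneath.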
Possible Solution
As a workaround, pass the --no-cuda flag or set ROCR_VISIBLE_DEVICES=2 to run it on the CPU.
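The environment-variable workaround can also be applied per run, together with a timeout so that a hung run is killed automatically instead of needing pkill. This is a minimal sketch with a hypothetical helper named run_with_env; the 600 second timeout in the usage comment is an arbitrary assumption, and whether a process stuck in the GPU driver can actually be killed this way is not verified here.

```python
import os
import subprocess
import sys


def run_with_env(cmd, extra_env, timeout_s):
    """Run cmd with extra environment variables set.

    Returns (returncode, stdout). returncode is None if the process
    exceeded timeout_s; subprocess.run kills the child in that case.
    """
    env = dict(os.environ)
    env.update(extra_env)
    try:
        done = subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout_s
        )
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        if isinstance(out, bytes):
            # Partial output may arrive as bytes even with text=True.
            out = out.decode(errors="replace")
        return None, out


# Hypothetical usage from the mnist example directory:
#
#     code, out = run_with_env(
#         [sys.executable, "main.py"],
#         {"ROCR_VISIBLE_DEVICES": "2"},
#         timeout_s=600,
#     )
#     print("timed out" if code is None else f"exit code {code}")
```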
Failure Logs [if any]
Train Epoch: 1 [0/60000 (0%)] Loss: 2.279597
Train Epoch: 1 [640/60000 (1%)] Loss: 1.216242
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.935520
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.621186
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.459617
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.555883
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.248135
Train Epoch: 1 [4480/60000 (7%)] Loss: 0.476440
Train Epoch: 1 [5120/60000 (9%)] Loss: 0.286069
Train Epoch: 1 [5760/60000 (10%)] Loss: 0.101378
Train Epoch: 1 [6400/60000 (11%)] Loss: 0.317981
Train Epoch: 1 [7040/60000 (12%)] Loss: 0.234222
Train Epoch: 1 [7680/60000 (13%)] Loss: 0.310746
Train Epoch: 1 [8320/60000 (14%)] Loss: 0.122714
Train Epoch: 1 [8960/60000 (15%)] Loss: 0.456426
Train Epoch: 1 [9600/60000 (16%)] Loss: 0.074296
Train Epoch: 1 [10240/60000 (17%)] Loss: 0.261630
Train Epoch: 1 [10880/60000 (18%)] Loss: 0.238516
Train Epoch: 1 [11520/60000 (19%)] Loss: 0.173536
Train Epoch: 1 [12160/60000 (20%)] Loss: 0.169779
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.045510
Train Epoch: 1 [13440/60000 (22%)] Loss: 0.205859
Train Epoch: 1 [14080/60000 (23%)] Loss: 0.195058
Train Epoch: 1 [14720/60000 (25%)] Loss: 0.140971
Train Epoch: 1 [15360/60000 (26%)] Loss: 0.262293
Train Epoch: 1 [16000/60000 (27%)] Loss: 0.285171
Train Epoch: 1 [16640/60000 (28%)] Loss: 0.098628
Train Epoch: 1 [17280/60000 (29%)] Loss: 0.163876
Train Epoch: 1 [17920/60000 (30%)] Loss: 0.131609
Train Epoch: 1 [18560/60000 (31%)] Loss: 0.172449
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.131192
Train Epoch: 1 [19840/60000 (33%)] Loss: 0.089265
Train Epoch: 1 [20480/60000 (34%)] Loss: 0.200241
Train Epoch: 1 [21120/60000 (35%)] Loss: 0.116003
Train Epoch: 1 [21760/60000 (36%)] Loss: 0.337610
Train Epoch: 1 [22400/60000 (37%)] Loss: 0.177359
Train Epoch: 1 [23040/60000 (38%)] Loss: 0.181004
Train Epoch: 1 [23680/60000 (39%)] Loss: 0.109945
Train Epoch: 1 [24320/60000 (41%)] Loss: 0.126567
Train Epoch: 1 [24960/60000 (42%)] Loss: 0.081637
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.118572
Train Epoch: 1 [26240/60000 (44%)] Loss: 0.262203
Train Epoch: 1 [26880/60000 (45%)] Loss: 0.266514
Train Epoch: 1 [27520/60000 (46%)] Loss: 0.025646
Train Epoch: 1 [28160/60000 (47%)] Loss: 0.238066
Train Epoch: 1 [28800/60000 (48%)] Loss: 0.017015
Train Epoch: 1 [29440/60000 (49%)] Loss: 0.128963
Train Epoch: 1 [30080/60000 (50%)] Loss: 0.084565
Train Epoch: 1 [30720/60000 (51%)] Loss: 0.141485
Train Epoch: 1 [31360/60000 (52%)] Loss: 0.109501
Train Epoch: 1 [32000/60000 (53%)] Loss: 0.228396
Train Epoch: 1 [32640/60000 (54%)] Loss: 0.028802
Train Epoch: 1 [33280/60000 (55%)] Loss: 0.093304
Train Epoch: 1 [33920/60000 (57%)] Loss: 0.187867
Train Epoch: 1 [34560/60000 (58%)] Loss: 0.078651
Train Epoch: 1 [35200/60000 (59%)] Loss: 0.100239
Train Epoch: 1 [35840/60000 (60%)] Loss: 0.065758
Train Epoch: 1 [36480/60000 (61%)] Loss: 0.159857
Train Epoch: 1 [37120/60000 (62%)] Loss: 0.068338
Train Epoch: 1 [37760/60000 (63%)] Loss: 0.116931
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.108750
Train Epoch: 1 [39040/60000 (65%)] Loss: 0.067337
Train Epoch: 1 [39680/60000 (66%)] Loss: 0.514672
Train Epoch: 1 [40320/60000 (67%)] Loss: 0.139609
Train Epoch: 1 [40960/60000 (68%)] Loss: 0.125796
Train Epoch: 1 [41600/60000 (69%)] Loss: 0.301703
Train Epoch: 1 [42240/60000 (70%)] Loss: 0.078540
Train Epoch: 1 [42880/60000 (71%)] Loss: 0.149661
Train Epoch: 1 [43520/60000 (72%)] Loss: 0.038693
Train Epoch: 1 [44160/60000 (74%)] Loss: 0.050987
Train Epoch: 1 [44800/60000 (75%)] Loss: 0.065854
Train Epoch: 1 [45440/60000 (76%)] Loss: 0.253564
Train Epoch: 1 [46080/60000 (77%)] Loss: 0.044726
Train Epoch: 1 [46720/60000 (78%)] Loss: 0.076648
Train Epoch: 1 [47360/60000 (79%)] Loss: 0.166157
Train Epoch: 1 [48000/60000 (80%)] Loss: 0.081918
Train Epoch: 1 [48640/60000 (81%)] Loss: 0.243725
Train Epoch: 1 [49280/60000 (82%)] Loss: 0.031923
Train Epoch: 1 [49920/60000 (83%)] Loss: 0.099474
Train Epoch: 1 [50560/60000 (84%)] Loss: 0.082273
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.081125
Train Epoch: 1 [51840/60000 (86%)] Loss: 0.114273
Train Epoch: 1 [52480/60000 (87%)] Loss: 0.197501
Train Epoch: 1 [53120/60000 (88%)] Loss: 0.020628
Train Epoch: 1 [53760/60000 (90%)] Loss: 0.080297
Train Epoch: 1 [54400/60000 (91%)] Loss: 0.180997
Train Epoch: 1 [55040/60000 (92%)] Loss: 0.324929
Train Epoch: 1 [55680/60000 (93%)] Loss: 0.116702
Train Epoch: 1 [56320/60000 (94%)] Loss: 0.189182
Train Epoch: 1 [56960/60000 (95%)] Loss: 0.097195
Train Epoch: 1 [57600/60000 (96%)] Loss: 0.022219
Train Epoch: 1 [58240/60000 (97%)] Loss: 0.181135
Train Epoch: 1 [58880/60000 (98%)] Loss: 0.042285
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.108003
1
zsh: terminated ROCR_VISIBLE_DEVICES=1 python main.py