Training progress is 0: UI not showing progress
edenlightning opened this issue · 0 comments
Running this script: https://github.com/Lightning-AI/grid-tutorials/blob/main/Hello-Cifar-10/pl_cifar10.py
With the command:
lightning run sweep pl_cifar10.py --requirements=requirements.txt
I can see from the terminal output that my script is training:
NOTE: Using experimental fast data loading logic. To disable, pass
"--load_fast=false" and report issues on GitHub. More details:
https://github.com/tensorflow/tensorboard/issues/4784
TensorBoard 2.10.0 at http://127.0.0.1:61183/ (Press CTRL+C to quit)
Injecting Tensorboard
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /Users/edenafek/studio4/lightning-hpo/cifar-10-python.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 170498071/170498071 [00:03<00:00, 53990352.77it/s]
Extracting /Users/edenafek/studio4/lightning-hpo/cifar-10-python.tar.gz to /Users/edenafek/studio4/lightning-hpo
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
| Name | Type | Params
-----------------------------------
0 | layer_1 | Linear | 393 K
1 | layer_2 | Linear | 1.3 K
-----------------------------------
394 K Trainable params
0 Non-trainable params
394 K Total params
1.579 Total estimated model params size (MB)
/Users/edenafek/studio4/lightning/src/pytorch_lightning/trainer/connectors/data_connector.py:230: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/1563 [00:00<?, ?it/s]pl_cifar10.py:24: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
x = F.log_softmax(x)
Epoch 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1563/1563 [00:12<00:00, 125.38it/s, loss=1.57, v_num=, train_acc=0.312]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1563/1563 [00:12<00:00, 125.33it/s, loss=1.57, v_num=, train_acc=0.312]
INFO: Received SIGTERM signal. Gracefully terminating sweep_controller.r.edenafek-8500fd97.w_0.ws.0...
but the progress is not updated in the UI (it stays at 0).
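For reference, the progress value the UI would be expected to reach can be estimated from the progress-bar lines in the log above. This is only a rough sketch: `overall_progress` is a hypothetical helper, not a Lightning API, and it is not a claim about how the Lightning UI computes progress. It assumes the tqdm-style line format shown in the log and takes `max_epochs=10` from the `Trainer.fit` message.

```python
import re

# Hypothetical helper: not part of Lightning. It only parses tqdm-style
# progress-bar lines like the ones in the log above.
_BAR = re.compile(r"Epoch (\d+):\s+\d+%\|.*?\|\s*(\d+)/(\d+)")


def overall_progress(line: str, max_epochs: int) -> float:
    """Return overall training progress in percent for one progress-bar line."""
    match = _BAR.search(line)
    if match is None:
        raise ValueError("not a progress-bar line")
    epoch, step, total = (int(g) for g in match.groups())
    # Steps finished so far: all earlier epochs plus the current epoch's steps.
    done = epoch * total + step
    return 100.0 * done / (max_epochs * total)


# Shortened copies of the first and last progress lines from the log.
first = "Epoch 0: 0%| | 0/1563 [00:00<?, ?it/s]"
last = "Epoch 9: 100%|████| 1563/1563 [00:12<00:00, 125.33it/s, loss=1.57]"

print(overall_progress(first, max_epochs=10))  # 0.0
print(overall_progress(last, max_epochs=10))   # 100.0
```

By this estimate the run reached 100% before the SIGTERM, so a UI value of 0 does not reflect what the terminal shows.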