Lightning-Universe/Training-Studio_app

Training progress stuck at 0 in UI; progress not updating

edenlightning opened this issue · 0 comments

Running this script: https://github.com/Lightning-AI/grid-tutorials/blob/main/Hello-Cifar-10/pl_cifar10.py

With the command:
lightning run sweep pl_cifar10.py --requirements=requirements.txt
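The exact contents of requirements.txt are not shown in the issue; a plausible minimal version for this script (an assumption, based on the imports the log implies) would be:

```
# assumed contents of requirements.txt -- not included in the issue
pytorch-lightning
torch
torchvision
```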

I can see that my script is training:

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.10.0 at http://127.0.0.1:61183/ (Press CTRL+C to quit)
Injecting Tensorboard
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /Users/edenafek/studio4/lightning-hpo/cifar-10-python.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 170498071/170498071 [00:03<00:00, 53990352.77it/s]
Extracting /Users/edenafek/studio4/lightning-hpo/cifar-10-python.tar.gz to /Users/edenafek/studio4/lightning-hpo
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name    | Type   | Params
-----------------------------------
0 | layer_1 | Linear | 393 K 
1 | layer_2 | Linear | 1.3 K 
-----------------------------------
394 K     Trainable params
0         Non-trainable params
394 K     Total params
1.579     Total estimated model params size (MB)
/Users/edenafek/studio4/lightning/src/pytorch_lightning/trainer/connectors/data_connector.py:230: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|                                                                                                                                                                               | 0/1563 [00:00<?, ?it/s]pl_cifar10.py:24: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  x = F.log_softmax(x)
Epoch 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1563/1563 [00:12<00:00, 125.38it/s, loss=1.57, v_num=, train_acc=0.312]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1563/1563 [00:12<00:00, 125.33it/s, loss=1.57, v_num=, train_acc=0.312]
INFO: Received SIGTERM signal. Gracefully terminating sweep_controller.r.edenafek-8500fd97.w_0.ws.0...

but the progress is not updated in the UI.
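For reference, a minimal sketch of what pl_cifar10.py presumably does, reconstructed from the log above: the Linear layer shapes match the 393 K / 1.3 K parameter counts, the log_softmax call (with the dim=1 fix the deprecation warning asks for) matches the UserWarning, and train_acc appears in the progress bar. Exact names, the activation, and the optimizer are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn
import pytorch_lightning as pl


class LitCifar10(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # shapes inferred from the parameter counts in the log
        self.layer_1 = nn.Linear(3 * 32 * 32, 128)  # ~393 K params
        self.layer_2 = nn.Linear(128, 10)           # ~1.3 K params

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.layer_1(x))
        x = self.layer_2(x)
        # dim=1 silences the deprecation warning seen in the log
        return F.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        # the metrics logged here are what appear in the progress bar
        # (loss, train_acc) -- and what the Training Studio UI should surface
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("train_acc", acc, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Since train_acc does show up in the terminal progress bar, the metric is being logged by the script, which suggests the breakdown is on the Training Studio side rather than in the training code. (The num_workers warning in the log is unrelated; it could be addressed by passing num_workers to the DataLoader.)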

[Screenshot: Training Studio UI showing progress at 0]
