Trials hang when using a scheduler
dcfidalgo opened this issue · 0 comments
Hi there!
I first encountered this issue when running PBT on a multi-node DDP setup (4 GPUs per node, with each node being a population member), but I could not reproduce it there consistently.
I have now managed to reproduce the same behavior with an ASHA scheduler: as soon as ASHA terminates a trial, the subsequent trials simply hang in the RUNNING state and never finish.
== Status ==
Current time: 2023-03-17 10:12:33 (running for 00:00:41.50)
Memory usage on this node: 154.0/250.9 GiB
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: -1.25
Resources requested: 3.0/4 CPUs, 0/0 GPUs, 0.0/64.44 GiB heap, 0.0/31.61 GiB objects
Result logdir: /dcfidalgo/ray_results/train_func_2023-03-17_10-11-51
Number of trials: 3/3 (1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+------------+--------+------------------+------------+
| Trial name | status | loc | val_loss | iter | total time (s) | val_loss |
|------------------------+------------+---------------------+------------+--------+------------------+------------|
| train_func_c1436_00002 | RUNNING | 10.181.103.72:74356 | 3 | | | |
| train_func_c1436_00000 | TERMINATED | 10.181.103.72:74356 | 1 | 1 | 6.91809 | 1 |
| train_func_c1436_00001 | TERMINATED | 10.181.103.72:74356 | 2 | 1 | 6.20699 | 2 |
+------------------------+------------+---------------------+------------+--------+------------------+------------+
I could trace the issue back to a hanging ray.get call when it tries to fetch the self._master_addr here, but I simply cannot figure out what the underlying cause is ...
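As a side note, passing a timeout to ray.get makes this kind of hang observable instead of blocking forever. Below is a minimal, self-contained sketch of that idea (the never_reports task is made up purely for illustration; it is not ray_lightning code):

import time
import ray

ray.init(num_cpus=1)

@ray.remote
def never_reports():
    # Stand-in for a worker task whose result never comes back,
    # similar to how self._master_addr never arrives in my runs.
    time.sleep(10_000)

ref = never_reports.remote()
try:
    # With a timeout, ray.get raises instead of blocking indefinitely.
    ray.get(ref, timeout=5)
except ray.exceptions.GetTimeoutError:
    print("ray.get timed out instead of hanging forever")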
A minimal script to reproduce the issue:
import torch
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray_lightning import RayStrategy
from ray_lightning.tests.utils import BoringModel, get_trainer
from ray_lightning.tune import TuneReportCallback, get_tune_resources
class AnotherBoringModel(BoringModel):
    def __init__(self, val_loss: float):
        super().__init__()
        self._val_loss = torch.tensor(val_loss)

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._val_loss)
        return {"x": self._val_loss}


address_info = ray.init(num_cpus=4)
strategy = RayStrategy(num_workers=2, use_gpu=False)
callbacks = [TuneReportCallback(on="validation_end")]


def train_func(config):
    model = AnotherBoringModel(config["val_loss"])
    trainer = get_trainer(
        "./",
        callbacks=callbacks,
        strategy=strategy,
        checkpoint_callback=False,
        max_epochs=1)
    trainer.fit(model)


tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
    scheduler=AsyncHyperBandScheduler(metric="val_loss", mode="min")
)
If you remove the scheduler, the above script terminates without issues.
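For reference, the scheduler-free variant simply drops the scheduler argument from the tune.run call above (Tune then falls back to its default FIFO scheduler); everything else stays the same:

tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
)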
A corresponding conda env:
name: schedulerbug
channels:
  - pytorch
dependencies:
  - python=3.9
  - pytorch==1.11.0
  - cpuonly
  - pip
  - pip:
      - pytorch-lightning==1.6.4
      - ray[tune]==2.3.0
      - git+https://github.com/ray-project/ray_lightning.git@main
Is anyone else experiencing the same issue? Any kind of help would be very much appreciated! 😃
Have a great day!