HorovodRunner is stuck: training does not get past the first epoch after it starts.
camposwalacy commented
Hello, folks!
I am using HorovodRunner through sparkdl on Databricks runtime LTS 14.2 ML with TensorFlow 14.0. My data is in TFRecords format, and this issue started happening after 25th June. I migrated my workload to Unity Catalog. I am still checking on my side whether something else might have changed, but I haven't found a fix yet.