databricks/spark-deep-learning

HorovodRunner is stuck: training does not get past the first epoch after it starts.


Hello, folks!

I am using HorovodRunner through sparkdl on Databricks Runtime LTS 14.2 ML with TensorFlow 14.0. My data is in TFRecords format. The issue started happening after 25th June. I migrated my workload to Unity Catalog. I am checking on my side whether something else might have changed, but I haven't found a fix yet.