bghira/SimpleTuner

Flux LoRA training stuck at `Discovering cache objects..`

Opened this issue · 2 comments

Hey there,

I recently updated my SimpleTuner installation to the latest version of `main` at the time of writing (c6cdbb0).

Since then, I have been running into a strange issue whenever I try to start a LoRA training run.

The training completely freezes after the `Discovering cache objects..` and `Configured backend:` lines. My GPU and CPU usage are both near zero, so it doesn't look like there's any heavy step running in the background.

I've tried running this both on a local 4090 and on an H100 on RunPod, so it doesn't look like an OOM either.
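
If a stack trace of the hung process would help, I can grab one the next time it hangs; something like this should show which call it is blocked in (assuming `py-spy` can be installed in the container, with `<PID>` standing in for the training process):

```bash
pip install py-spy
# Print the Python stack of the hung trainer without stopping it
py-spy dump --pid <PID>
```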

Also: all runs happen in a Docker image based on `nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04`.

Is this a known issue?

Cheers

```
2024-12-23 16:52:24,246 [INFO] VAE Model: /runpod-volume/serverless/models/level_1
2024-12-23 16:52:24,246 [INFO] Default VAE Cache location: 
2024-12-23 16:52:24,246 [INFO] Text Cache location: cache
2024-12-23 16:52:24,246 [WARNING] Updating T5 XXL tokeniser max length to 512 for Flux.
2024-12-23 16:52:24,247 [WARNING] Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28.
2024-12-23 16:52:24,247 [INFO] Enabled NVIDIA TF32 for faster training on Ampere GPUs. Use --disable_tf32 if this causes any problems.
2024-12-23 16:52:25,532 [INFO] Load VAE: /runpod-volume/serverless/models/level_1
2024-12-23 16:52:25,675 [INFO] Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-12-23 16:52:25,860 [INFO] Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-12-23 16:52:26,032 [INFO] Loading OpenAI CLIP-L text encoder from /runpod-volume/serverless/models/level_1/text_encoder..
2024-12-23 16:52:26,077 [INFO] Loading T5 XXL v1.1 text encoder from /runpod-volume/serverless/models/level_1/text_encoder_2..
2024-12-23 16:52:28,270 [INFO] Moving text encoder to GPU.
2024-12-23 16:52:28,416 [INFO] Moving text encoder 2 to GPU.
2024-12-23 16:52:34,201 [INFO] Loading data backend config from /runpod-volume/serverless/loras/af6ab087-58d2-43da-9f8b-28cdd26f1a9d/multidatabackend.json
2024-12-23 16:52:34,206 [INFO] Configuring text embed backend: text-embeds
2024-12-23 16:52:34,210 [INFO] Directory created: /opt/cache/text/af6ab087-58d2-43da-9f8b-28cdd26f1a9d
2024-12-23 16:52:34,211 [INFO] (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-12-23 16:52:34,213 [INFO] Pre-computing null embedding
2024-12-23 16:52:39,674 [WARNING] Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption_dropout_probability=0.1 as a recommended value.
2024-12-23 16:52:39,674 [INFO] Completed loading text embed services.
2024-12-23 16:52:39,674 [INFO] Configuring data backend: debug                                                               
2024-12-23 16:52:39,674 [INFO] (id=debug) Loading bucket manager.
2024-12-23 16:52:39,686 [INFO] (id=debug) Refreshing aspect buckets on main process.
2024-12-23 16:52:39,686 [INFO] Discovering new files...
2024-12-23 16:52:39,687 [INFO] Compressed 7 existing files from 1.
2024-12-23 16:52:39,687 [INFO] No new files discovered. Doing nothing.
2024-12-23 16:52:39,687 [INFO] Statistics: {'total_processed': 0, 'skipped': {'already_exists': 7, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-12-23 16:52:39,694 [WARNING] Key crop_aspect_buckets not found in the current backend config, using the existing value 'None'.
2024-12-23 16:52:39,694 [WARNING] Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-12-23 16:52:39,694 [WARNING] Key config_version not found in the current backend config, using the existing value '2'.
2024-12-23 16:52:39,694 [WARNING] Key hash_filenames not found in the current backend config, using the existing value 'True'.
2024-12-23 16:52:39,694 [INFO] Configured backend: {'id': 'debug', 'config': {'crop': 'true', 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'textfile', 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <src.training.vendor.simpletuner.helpers.data_backend.local.LocalDataBackend object at 0x7fddc345d270>, 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'metadata_backend': <src.training.vendor.simpletuner.helpers.metadata.backends.discovery.DiscoveryMetadataBackend object at 0x7fde58d1fd00>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 1.0        | 7           
2024-12-23 16:52:39,695 [INFO] (id=debug) Collecting captions.
2024-12-23 16:52:39,696 [INFO] (id=debug) Initialise text embed pre-computation using the textfile caption strategy. We have 7 captions to process.
2024-12-23 16:52:40,550 [INFO] (id=debug) Completed processing 7 captions.                                                   
2024-12-23 16:52:40,550 [INFO] (id=debug) Creating VAE latent cache.
2024-12-23 16:52:40,551 [INFO] Directory created: /opt/cache/vae/af6ab087-58d2-43da-9f8b-28cdd26f1a9d                        
2024-12-23 16:52:40,551 [INFO] (id=debug) Discovering cache objects..
(id=debug) Bucket 1.0 caching results: {'not_local': 0, 'already_cached': 0, 'cached': 0, 'total': 7}
2024-12-23 16:52:41,784 [INFO] Configured backend: {'id': 'debug', 'config': {'crop': 'true', 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'textfile', 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <src.training.vendor.simpletuner.helpers.data_backend.local.LocalDataBackend object at 0x7fddc345d270>, 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'metadata_backend': <src.training.vendor.simpletuner.helpers.metadata.backends.discovery.DiscoveryMetadataBackend object at 0x7fde58d1fd00>, 'train_dataset': <src.training.vendor.simpletuner.helpers.multiaspect.dataset.MultiAspectDataset object at 0x7fddad558160>, 'sampler': <src.training.vendor.simpletuner.helpers.multiaspect.sampler.MultiAspectSampler object at 0x7fddad558190>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7fddad558ac0>, 'text_embed_cache': <src.training.vendor.simpletuner.helpers.caching.text_embeds.TextEmbeddingCache object at 0x7fddc35ead40>, 'vaecache': <src.training.vendor.simpletuner.helpers.caching.vae.VAECache object at 0x7fddad5580a0>}
```
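
For reference, the dataset config behind the `debug` backend above would look roughly like the following `multidatabackend.json` (reconstructed from the logged values; key names follow the SimpleTuner dataloader docs as I understand them, and the text-embeds entry in particular is an assumption, so my actual file may differ slightly):

```json
[
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "/opt/cache/text/af6ab087-58d2-43da-9f8b-28cdd26f1a9d"
  },
  {
    "id": "debug",
    "type": "local",
    "instance_data_dir": "/runpod-volume/datasets/belgian_blonde",
    "caption_strategy": "textfile",
    "crop": "true",
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1.0,
    "resolution_type": "area",
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "cache_dir_vae": "/opt/cache/vae/af6ab087-58d2-43da-9f8b-28cdd26f1a9d"
  }
]
```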
bghira commented

i'm pretty sure the problem is the use of the runpod volume. these are a pretty low-quality storage medium.

would agree in general, but this also happens locally, and i never had this issue before the update