bghira/SimpleTuner

Flux LoRA training stuck at `Discovering cache objects..`

Opened this issue · 2 comments

Hey there,

I recently updated my SimpleTuner installation to the latest version of `main` at the time of writing (c6cdbb0).

Since then, I have been running into a strange issue whenever I try to start a LoRA training run.

The training completely freezes after the `Discovering cache objects..` and `Configured backend:` lines. My GPU and CPU usage are both near zero, so it doesn't look like there's any heavy step running in the background.

I've tried running this both on a local 4090 and on an H100 on RunPod, so it doesn't look like an OOM either.
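
If a stack trace of the hung process would help, I can grab one the next time it hangs; something like this should show which call it is blocked in (assuming `py-spy` can be installed in the container, with `<PID>` standing in for the training process):

```bash
pip install py-spy
# Print the Python stack of the hung trainer without stopping it
py-spy dump --pid <PID>
```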

Also: all runs happen in a Docker image based on `nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04`.

Is this a known issue?

Cheers

```
2024-12-23 16:52:24,246 [INFO] VAE Model: /runpod-volume/serverless/models/level_1
2024-12-23 16:52:24,246 [INFO] Default VAE Cache location: 
2024-12-23 16:52:24,246 [INFO] Text Cache location: cache
2024-12-23 16:52:24,246 [WARNING] Updating T5 XXL tokeniser max length to 512 for Flux.
2024-12-23 16:52:24,247 [WARNING] Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28.
2024-12-23 16:52:24,247 [INFO] Enabled NVIDIA TF32 for faster training on Ampere GPUs. Use --disable_tf32 if this causes any problems.
2024-12-23 16:52:25,532 [INFO] Load VAE: /runpod-volume/serverless/models/level_1
2024-12-23 16:52:25,675 [INFO] Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-12-23 16:52:25,860 [INFO] Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-12-23 16:52:26,032 [INFO] Loading OpenAI CLIP-L text encoder from /runpod-volume/serverless/models/level_1/text_encoder..
2024-12-23 16:52:26,077 [INFO] Loading T5 XXL v1.1 text encoder from /runpod-volume/serverless/models/level_1/text_encoder_2..
2024-12-23 16:52:28,270 [INFO] Moving text encoder to GPU.
2024-12-23 16:52:28,416 [INFO] Moving text encoder 2 to GPU.
2024-12-23 16:52:34,201 [INFO] Loading data backend config from /runpod-volume/serverless/loras/af6ab087-58d2-43da-9f8b-28cdd26f1a9d/multidatabackend.json
2024-12-23 16:52:34,206 [INFO] Configuring text embed backend: text-embeds
2024-12-23 16:52:34,210 [INFO] Directory created: /opt/cache/text/af6ab087-58d2-43da-9f8b-28cdd26f1a9d
2024-12-23 16:52:34,211 [INFO] (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-12-23 16:52:34,213 [INFO] Pre-computing null embedding
2024-12-23 16:52:39,674 [WARNING] Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption_dropout_probability=0.1 as a recommended value.
2024-12-23 16:52:39,674 [INFO] Completed loading text embed services.
2024-12-23 16:52:39,674 [INFO] Configuring data backend: debug                                                               
2024-12-23 16:52:39,674 [INFO] (id=debug) Loading bucket manager.
2024-12-23 16:52:39,686 [INFO] (id=debug) Refreshing aspect buckets on main process.
2024-12-23 16:52:39,686 [INFO] Discovering new files...
2024-12-23 16:52:39,687 [INFO] Compressed 7 existing files from 1.
2024-12-23 16:52:39,687 [INFO] No new files discovered. Doing nothing.
2024-12-23 16:52:39,687 [INFO] Statistics: {'total_processed': 0, 'skipped': {'already_exists': 7, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-12-23 16:52:39,694 [WARNING] Key crop_aspect_buckets not found in the current backend config, using the existing value 'None'.
2024-12-23 16:52:39,694 [WARNING] Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-12-23 16:52:39,694 [WARNING] Key config_version not found in the current backend config, using the existing value '2'.
2024-12-23 16:52:39,694 [WARNING] Key hash_filenames not found in the current backend config, using the existing value 'True'.
2024-12-23 16:52:39,694 [INFO] Configured backend: {'id': 'debug', 'config': {'crop': 'true', 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'textfile', 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <src.training.vendor.simpletuner.helpers.data_backend.local.LocalDataBackend object at 0x7fddc345d270>, 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'metadata_backend': <src.training.vendor.simpletuner.helpers.metadata.backends.discovery.DiscoveryMetadataBackend object at 0x7fde58d1fd00>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 1.0        | 7           
2024-12-23 16:52:39,695 [INFO] (id=debug) Collecting captions.
2024-12-23 16:52:39,696 [INFO] (id=debug) Initialise text embed pre-computation using the textfile caption strategy. We have 7 captions to process.
2024-12-23 16:52:40,550 [INFO] (id=debug) Completed processing 7 captions.                                                   
2024-12-23 16:52:40,550 [INFO] (id=debug) Creating VAE latent cache.
2024-12-23 16:52:40,551 [INFO] Directory created: /opt/cache/vae/af6ab087-58d2-43da-9f8b-28cdd26f1a9d                        
2024-12-23 16:52:40,551 [INFO] (id=debug) Discovering cache objects..
(id=debug) Bucket 1.0 caching results: {'not_local': 0, 'already_cached': 0, 'cached': 0, 'total': 7}
2024-12-23 16:52:41,784 [INFO] Configured backend: {'id': 'debug', 'config': {'crop': 'true', 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'textfile', 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <src.training.vendor.simpletuner.helpers.data_backend.local.LocalDataBackend object at 0x7fddc345d270>, 'instance_data_dir': '/runpod-volume/datasets/belgian_blonde', 'metadata_backend': <src.training.vendor.simpletuner.helpers.metadata.backends.discovery.DiscoveryMetadataBackend object at 0x7fde58d1fd00>, 'train_dataset': <src.training.vendor.simpletuner.helpers.multiaspect.dataset.MultiAspectDataset object at 0x7fddad558160>, 'sampler': <src.training.vendor.simpletuner.helpers.multiaspect.sampler.MultiAspectSampler object at 0x7fddad558190>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7fddad558ac0>, 'text_embed_cache': <src.training.vendor.simpletuner.helpers.caching.text_embeds.TextEmbeddingCache object at 0x7fddc35ead40>, 'vaecache': <src.training.vendor.simpletuner.helpers.caching.vae.VAECache object at 0x7fddad5580a0>}
```
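
For reference, the dataset config behind the `debug` backend above would look roughly like the following `multidatabackend.json` (reconstructed from the logged values; key names follow the SimpleTuner dataloader docs as I understand them, and the text-embeds entry in particular is an assumption, so my actual file may differ slightly):

```json
[
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "/opt/cache/text/af6ab087-58d2-43da-9f8b-28cdd26f1a9d"
  },
  {
    "id": "debug",
    "type": "local",
    "instance_data_dir": "/runpod-volume/datasets/belgian_blonde",
    "caption_strategy": "textfile",
    "crop": "true",
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1.0,
    "resolution_type": "area",
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "cache_dir_vae": "/opt/cache/vae/af6ab087-58d2-43da-9f8b-28cdd26f1a9d"
  }
]
```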
bghira commented

i'm pretty sure the problem is the use of the runpod volume. these are a pretty low-quality storage medium.

would agree in general, but this also happens locally, and i never had this issue before the update