data loading problem with 89M pairs
youngwanLEE opened this issue · 9 comments
Hi, thanks to your excellent work, I have conducted many experiments.
When I trained on a subset of LAION-Aesthetics 5+ (about 89M pairs), my training process was killed without a specific error message :(
Maybe it occurred at `load_dataset`.
I guess the training set is too big, but I'm not sure.
I think this problem may be caused by Hugging Face's datasets library.
Have you ever faced this problem? And have you tried to train your model on a much bigger training set?
Thanks in advance :)
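For context, I load the data roughly like below, following the imagefolder + metadata.csv convention (the directory path is illustrative, not my exact one):

```python
# Rough sketch of the loading step I suspect is being killed (path is illustrative).
from datasets import load_dataset

# ~89M rows listed in metadata.csv under this directory.
dataset = load_dataset(
    "imagefolder",
    data_dir="/path/to/laion_aesthetics_5plus_89M",  # contains the images and metadata.csv
    split="train",
)
print(dataset)  # the process seems to die before reaching this point, with no traceback
```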
Hello, thanks for utilizing our work :)
I have a few questions to better understand your issue:
- Did you try to train with a smaller dataset, to make sure that the issue is caused by the dataset size?
- Could you try to finetune the original Stable Diffusion UNet with this same dataset, using the Hugging Face train_text_to_image.py script, and let us know if you encounter the same issue? This would help identify where the issue comes from.
- How many GPUs are you training with? If more than one, could you try again with a single GPU? Our models were trained with a single GPU, so if you are using more, the issue may be related to this.
edit: sorry for misunderstanding the situation, I've checked your discussions [1] [2]. We will get back to you soon.
@youngwanLEE Thanks for sharing your update. Happy to know you are working with large-scale data :)
We haven't worked with a dataset as large as the one you're considering (for clarity, we used 0.22M or 2.3M pairs from LAION-Aesthetics V2).
We haven't encountered the errors you mentioned, i.e., processes being suddenly killed during data loading.
- We faced some preprocessing issues from problematic image-text pairs (e.g., empty text files or PIL-unreadable images), but these always resulted in error messages.
Sorry for being unable to provide a clear opinion, because we haven't experimented with such large data using multi-GPU training; however, your point ("may be caused by the huggingface's dataset library") seems reasonable, and the issue may be due to multi-GPU loading of a huge dataset [1] [2-korean].
One suggestion would be to report this issue at https://github.com/huggingface/datasets. It would be much appreciated if you could share any update and/or solution to this issue.
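To narrow it down, a rough idea (just a sketch; we have not verified this at your scale) is to run `load_dataset` alone in a single standalone process and watch memory, e.g., comparing streaming vs. non-streaming loading to see whether materializing the full index is what gets the process killed (likely by the OOM killer):

```python
# Isolation sketch (single process, no training): does load_dataset alone survive?
from datasets import load_dataset

data_dir = "/path/to/your/89M_subset"  # illustrative path

# 1) Streaming mode avoids building the full in-memory index up front.
streamed = load_dataset("imagefolder", data_dir=data_dir, split="train", streaming=True)
for i, example in enumerate(streamed):
    if i >= 5:
        break
    print(i, list(example.keys()))

# 2) Regular (non-streaming) load, to check whether this step alone triggers the silent kill.
full = load_dataset("imagefolder", data_dir=data_dir, split="train")
print(len(full))
```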
@bokyeong1015 Thanks for reply :)
BTW, I wonder when the 2M dataset release (#15) will be finished.
@youngwanLEE Thank you for your inquiry :)
The 2.3M dataset is now downloadable; please check this link if you are interested!
@bokyeong1015 Thanks!
When I tried to download the data, an error occurred:
--2023-09-03 08:36:39-- https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.5plus/preprocessed_2256k.tar.gz
Resolving netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)... 52.219.94.146, 52.219.108.82, 52.219.100.216, ...
Connecting to netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)|52.219.94.146|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-09-03 08:36:40 ERROR 403: Forbidden.
It may be caused by the URL using the same address as that of the 11K or 212K datasets.
@youngwanLEE thanks for reaching out.
Based on the log message and as you correctly analyzed ("the same address as that of 11K or 212K"),
the URL should be
S3_URL="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"
(`_6.5plus` is wrong; `_6.25plus` is correct.)
Could you kindly try out the above link?
FYI: the dataset details can be found in MODEL_CARD.md
- BK-SDM: 212,776 image-text pairs (i.e., 0.22M pairs) from LAION-Aesthetics V2 6.5+.
- BK-SDM-2M: 2,256,472 image-text pairs (i.e., 2.3M pairs) from LAION-Aesthetics V2 6.25+.
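If wget remains inconvenient, here is a rough Python alternative for fetching and extracting the archive with the corrected URL (the local paths are just examples):

```python
# Download-and-extract sketch for the corrected 2.3M archive (local paths are examples).
import tarfile
import urllib.request

S3_URL = (
    "https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/"
    "data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"
)
archive_path = "preprocessed_2256k.tar.gz"

urllib.request.urlretrieve(S3_URL, archive_path)  # large file; may take a while
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall("./data")  # destination directory is an example
```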
@bokyeong1015 Thanks!! It worked :)
I have another question.
I split the LAION-Aesthetics V2 5+ dataset into several subsets (e.g., 5M, 10M, 89M, etc.) and made a metadata.csv for each subset.
Then, when I tried to train with multiple GPUs on one of these subsets, I faced the error below (a sanity-check sketch follows after the traceback).
I guess that the problem was caused by the data itself.
FYI, I didn't pre-process the data except for resizing to 512x512 resolution when downloading.
Did you also face this problem?
Or did you conduct any pre-processing of the LAION data?
Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
    main()
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63
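For reference, below is the kind of per-subset sanity check I can run before training; `check_subset` is just a hypothetical helper (not from the repo), and the `file_name`/`text` column names are assumptions about my metadata.csv. It flags rows whose image is missing or PIL-unreadable, or whose caption is empty, since a mismatch between metadata.csv and the actual files seems like one possible cause of an out-of-range index.

```python
# Hypothetical sanity check for one subset: metadata.csv rows vs. actual image files.
import csv
import os

from PIL import Image


def check_subset(subset_dir, image_column="file_name", text_column="text"):
    """Report rows whose image is missing/unreadable or whose caption is empty."""
    bad_rows = 0
    with open(os.path.join(subset_dir, "metadata.csv"), newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            image_path = os.path.join(subset_dir, row[image_column])
            caption = (row.get(text_column) or "").strip()
            try:
                with Image.open(image_path) as img:
                    img.verify()  # cheap integrity check without a full decode
                ok = bool(caption)
            except (FileNotFoundError, OSError):
                ok = False
            if not ok:
                bad_rows += 1
                print("problematic row:", row[image_column])
    print(f"{bad_rows} problematic rows in {subset_dir}")


check_subset("/path/to/subset_5M")  # illustrative path
```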
@youngwanLEE We would like to handle this as a separate discussion, since it is a different topic, and to make it easier for other people to find in the future. Could you kindly continue the discussion at that link?
I resolved this issue (refer to #32).