LoRA Training error - KeyError: 'Field "text_zh" does not exist in schema'
Closed this issue · 2 comments
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
I'm trying to train a Hunyuan LoRA, but training fails with this error:
KeyError: 'Field "text_zh" does not exist in schema'
Reproduction
I launch training with a shell script that calls hydit/train_deepspeed.py.
I'm using my own dataset.
My dataset.json has this structure:
{
"file_name": "/home/user/mnt/hunyuan/dataset/0002.png",
"text_en": "PianoStyle android",
"text_zh": "钢琴风格 安卓"
},
{
"file_name": "/home/user/mnt/hunyuan/dataset/0003.png",
"text_en": "PianoStyle castle",
"text_zh": "钢琴风格 城堡"
},
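A quick sanity check of the records before converting them to Arrow can rule out problems in the JSON itself. The following is a minimal sketch, not part of HunyuanDiT: it assumes dataset.json is a JSON array of objects shaped like the ones above, and `check_records` and `check_dataset_file` are hypothetical helper names.

```python
import json
import os

# Keys taken from the dataset.json structure shown in this issue.
REQUIRED_KEYS = ("file_name", "text_en", "text_zh")


def check_records(records, check_files=True):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            value = record.get(key)
            # A key counts as missing if it is absent, not a string, or blank.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty {key!r}")
        path = record.get("file_name")
        if check_files and isinstance(path, str) and not os.path.isfile(path):
            problems.append(f"record {i}: image not found: {path}")
    return problems


def check_dataset_file(json_path, check_files=True):
    """Load a JSON array of records (assumed layout) and check each one."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    return check_records(records, check_files=check_files)
```

Calling `check_dataset_file("dataset.json")` returns an empty list when every record has all three keys and its image file exists. It only validates the JSON, so it would not have caught the wrong .arrow file described at the end of this issue.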
Here's the training script I run:
#!/bin/bash
model='DiT-g/2' # model type
task_flag="pianostylehunyuan" # task flag
resume_module_root=/home/user/mnt/hunyuandiffusers/ckpt/t2i/model/pytorch_model_module.pt
index_file=/home/user/mnt/hunyuan/dataset-output/index_file.json
results_dir=/home/user/mnt/hunyuan/model/
batch_size=1 # training batch size
image_size=1024 # training image resolution
grad_accu_steps=4 # gradient accumulation steps
warmup_num_steps=0 # warm-up steps
lr=0.0002 # learning rate
ckpt_every=120 # save a checkpoint every this many steps
ckpt_latest_every=5000 # save a checkpoint named `latest.pt` every this many steps
rank=32 # rank of lora
max_training_steps=3600 # Maximum training iteration steps
#export world_size=1
export CUDA_VISIBLE_DEVICES=0
PYTHONPATH=./ deepspeed hydit/train_deepspeed.py \
--task-flag ${task_flag} \
--model ${model} \
--training-parts lora \
--rank ${rank} \
--resume \
--resume-module-root ${resume_module_root} \
--lr ${lr} \
--noise-schedule scaled_linear --beta-start 0.00085 --beta-end 0.018 \
--predict-type v_prediction \
--uncond-p 0 \
--uncond-p-t5 0 \
--index-file ${index_file} \
--random-flip \
--batch-size ${batch_size} \
--image-size ${image_size} \
--global-seed 999 \
--grad-accu-steps ${grad_accu_steps} \
--warmup-num-steps ${warmup_num_steps} \
--use-flash-attn \
--use-fp16 \
--ema-dtype none \
--results-dir ${results_dir} \
--ckpt-every ${ckpt_every} \
--max-training-steps ${max_training_steps} \
--ckpt-latest-every ${ckpt_latest_every} \
--log-every 10 \
--deepspeed \
--deepspeed-optimizer \
--use-zero-stage 2 \
--qk-norm \
--rope-img base512 \
--rope-real \
"$@"
# --cpu-offloading \
# --gradient-checkpointing \
# --use-fp16 \
# --ema-dtype fp32 \
Environment
(venv) user@MNeMiC-PC:~/mnt/hunyuan/HunyuanDiT/utils$ python collect_env.py
/home/user/mnt/hunyuan/HunyuanDiT/utils/collect_env.py:84: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
from distutils import errors
sys.platform: linux
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.20
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 90.1 (built against CUDA 12.4)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
My installation is a mess. This is my first time tinkering with WSL/Linux, so my venv is likely set up incorrectly.
Error traceback
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/train_deepspeed.py", line 529, in <module>
[rank0]: main(get_args())
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/train_deepspeed.py", line 458, in main
[rank0]: for batch in loader:
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank0]: return self._process_data(data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]: data.reraise()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 706, in reraise
[rank0]: raise exception
[rank0]: KeyError: Caught KeyError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 231, in __getitem__
[rank0]: description = self.get_text(ind)
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 221, in get_text
[rank0]: text = self.get_original_text(ind)
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 216, in get_original_text
[rank0]: text = self.index_manager.get_attribute(ind, 'text_zh' if self.enable_CN else 'text_en')
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/IndexKits/index_kits/indexer.py", line 427, in get_attribute
[rank0]: return self.get_attribute_by_index(index, column, shadow=shadow)
[rank0]: File "/home/user/mnt/hunyuan/HunyuanDiT/IndexKits/index_kits/indexer.py", line 407, in get_attribute_by_index
[rank0]: return table[column][index - index_bias].as_py()
[rank0]: File "pyarrow/table.pxi", line 1646, in pyarrow.lib._Tabular.__getitem__
[rank0]: File "pyarrow/table.pxi", line 1732, in pyarrow.lib._Tabular.column
[rank0]: File "pyarrow/table.pxi", line 1668, in pyarrow.lib._Tabular._ensure_integer_index
[rank0]: KeyError: 'Field "text_zh" does not exist in schema'
[2024-08-04 01:21:06,205] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6249
[2024-08-04 01:21:06,205] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'hydit/train_deepspeed.py', '--local_rank=0', '--task-flag', 'pianostylehunyuan', '--model', 'DiT-g/2', '--training-parts', 'lora', '--rank', '32', '--resume', '--resume-module-root', '/home/user/mnt/hunyuandiffusers/ckpt/t2i/model/pytorch_model_module.pt', '--lr', '0.0002', '--noise-schedule', 'scaled_linear', '--beta-start', '0.00085', '--beta-end', '0.018', '--predict-type', 'v_prediction', '--uncond-p', '0', '--uncond-p-t5', '0', '--index-file', '/home/user/mnt/hunyuan/dataset-output/index_file.json', '--random-flip', '--batch-size', '1', '--image-size', '1024', '--global-seed', '999', '--grad-accu-steps', '4', '--warmup-num-steps', '0', '--use-flash-attn', '--use-fp16', '--ema-dtype', 'none', '--results-dir', '/home/user/mnt/hunyuan/model/', '--ckpt-every', '120', '--max-training-steps', '3600', '--ckpt-latest-every', '5000', '--log-every', '10', '--deepspeed', '--deepspeed-optimizer', '--use-zero-stage', '2', '--qk-norm', '--rope-img', 'base512', '--rope-real'] exits with return code = 1
Below is the full log of the training command used:
full_log.txt
I tried setting enable_CN=False in arrow_load_stream.py, and now I get the following when I start training:
[2024-08-04 01:42:08] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:09] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:10] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:12] (step=0000010) (update_step=0000002) Train Loss: 0.0116, Lr: 0.0002, Steps/Sec: 0.22, Samples/Sec: 0
[2024-08-04 01:42:12] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:13] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:15] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:16] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:18] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:19] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:21] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:23] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:25] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:27] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:29] (step=0000020) (update_step=0000005) Train Loss: 0.0004, Lr: 0.0002, Steps/Sec: 0.57, Samples/Sec: 0
[2024-08-04 01:42:29] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:31] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:32] arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
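In the log above, the step lines keep appearing between the schema errors, so the failure is easy to miss in a long run. A small sketch for summarizing such a log is shown below; it assumes the line formats seen in this issue, and `summarize_log` is a hypothetical helper, not part of HunyuanDiT.

```python
import re
from collections import Counter

# Patterns match the log lines quoted in this issue (assumed stable format).
FIELD_ERROR = re.compile(r'Field "([^"]+)" does not exist in schema')
STEP_LINE = re.compile(r"\(step=(\d+)\)")


def summarize_log(lines):
    """Count missing-field errors per field and report the last step seen."""
    missing = Counter()
    last_step = None
    for line in lines:
        error = FIELD_ERROR.search(line)
        if error:
            missing[error.group(1)] += 1
            continue
        step = STEP_LINE.search(line)
        if step:
            last_step = int(step.group(1))
    return missing, last_step
```

For example, `summarize_log(open("full_log.txt", encoding="utf-8"))` returns the per-field error counts and the last logged step. A non-empty counter while the step number keeps advancing suggests the run is not training on the intended images.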
Both errors were solved by generating a correct .arrow file.
I had used the wrong arrow generation script, so the file the trainer loaded did not contain the fields the data loader asked for (text_zh, then binary). In effect there was no usable training data when training started, which is why the schema lookups failed.
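To catch this before launching a run, the schema of the generated file can be inspected directly. The sketch below is an assumption-laden example rather than the project's own tooling: it assumes the file is in Arrow IPC file format and readable with `pyarrow.ipc.open_file`, the required column names are simply the fields named in the errors above, and `missing_columns`, `read_schema_names`, and `check_arrow_file` are hypothetical helpers.

```python
# Column names below are the fields named in the errors in this issue;
# the names your HunyuanDiT version expects may differ (assumption).
REQUIRED_COLUMNS = ("text_zh", "binary")


def missing_columns(schema_names, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the schema."""
    present = set(schema_names)
    return [name for name in required if name not in present]


def read_schema_names(arrow_path):
    """Read column names from an Arrow IPC file (assumed file format)."""
    import pyarrow as pa  # imported lazily so the pure check has no dependency

    with pa.memory_map(arrow_path, "r") as source:
        return pa.ipc.open_file(source).schema.names


def check_arrow_file(arrow_path):
    """Raise a readable error up front instead of a KeyError in a worker."""
    missing = missing_columns(read_schema_names(arrow_path))
    if missing:
        raise ValueError(f"{arrow_path} is missing columns: {missing}")
```

Running `check_arrow_file` on each .arrow file referenced by the index file, before starting deepspeed, should report a missing `text_zh` or `binary` column immediately.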