Tencent/HunyuanDiT

LoRA Training error - KeyError: 'Field "text_zh" does not exist in schema'

Closed this issue · 2 comments

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
I'm trying to train a Hunyuan LoRA, but training fails with this error:
KeyError: 'Field "text_zh" does not exist in schema'

Reproduction

I'm training with a shell script that calls train_deepspeed.py.
I'm using my own dataset.

My dataset.json has this structure:
{
  "file_name": "/home/user/mnt/hunyuan/dataset/0002.png",
  "text_en": "PianoStyle android",
  "text_zh": "钢琴风格 安卓"
},
{
  "file_name": "/home/user/mnt/hunyuan/dataset/0003.png",
  "text_en": "PianoStyle castle",
  "text_zh": "钢琴风格 城堡"
},
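
For reference, a minimal pre-check sketch (not part of HunyuanDiT) that could be run over a metadata file with this layout before converting it to arrow. It assumes the file is a JSON array of objects carrying file_name, text_en and text_zh keys; the path below is only an example.

import json
from pathlib import Path

# Hypothetical sanity check: every entry should point to an existing image
# and carry both caption fields before the arrow file is generated.
REQUIRED_KEYS = {"file_name", "text_en", "text_zh"}

entries = json.loads(Path("/home/user/mnt/hunyuan/dataset/dataset.json").read_text(encoding="utf-8"))
for i, entry in enumerate(entries):
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        print(f"entry {i}: missing keys {sorted(missing)}")
    elif not Path(entry["file_name"]).is_file():
        print(f"entry {i}: image not found: {entry['file_name']}")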

Here's the training script I run:

#!/bin/bash
model='DiT-g/2'                                                   # model type
task_flag="pianostylehunyuan"                                       # task flag
resume_module_root=/home/user/mnt/hunyuandiffusers/ckpt/t2i/model/pytorch_model_module.pt
index_file=/home/user/mnt/hunyuan/dataset-output/index_file.json
results_dir=/home/user/mnt/hunyuan/model/
batch_size=1                                                      # training batch size
image_size=1024                                                   # training image resolution
grad_accu_steps=4                                                 # gradient accumulation steps
warmup_num_steps=0                                                # warm-up steps
lr=0.0002                                                         # learning rate
ckpt_every=120                                                    # create a ckpt every few steps.
ckpt_latest_every=5000                                            # create a ckpt named `latest.pt` every few steps.
rank=32                                                           # rank of lora
max_training_steps=3600                                           # Maximum training iteration steps

#export world_size=1
export CUDA_VISIBLE_DEVICES=0

PYTHONPATH=./ deepspeed hydit/train_deepspeed.py \
    --task-flag ${task_flag} \
    --model ${model} \
    --training-parts lora \
    --rank ${rank} \
    --resume \
    --resume-module-root ${resume_module_root} \
    --lr ${lr} \
    --noise-schedule scaled_linear --beta-start 0.00085 --beta-end 0.018 \
    --predict-type v_prediction \
    --uncond-p 0 \
    --uncond-p-t5 0 \
    --index-file ${index_file} \
    --random-flip \
    --batch-size ${batch_size} \
    --image-size ${image_size} \
    --global-seed 999 \
    --grad-accu-steps ${grad_accu_steps} \
    --warmup-num-steps ${warmup_num_steps} \
    --use-flash-attn \
    --use-fp16 \
    --ema-dtype none \
    --results-dir ${results_dir} \
    --ckpt-every ${ckpt_every} \
    --max-training-steps ${max_training_steps} \
    --ckpt-latest-every ${ckpt_latest_every} \
    --log-every 10 \
    --deepspeed \
    --deepspeed-optimizer \
    --use-zero-stage 2 \
    --qk-norm \
    --rope-img base512 \
    --rope-real \
    "$@"

 #   --cpu-offloading \
 #   --gradient-checkpointing \
 #   --use-fp16 \
 #   --ema-dtype fp32 \

Environment

(venv) user@MNeMiC-PC:~/mnt/hunyuan/HunyuanDiT/utils$ python collect_env.py
/home/user/mnt/hunyuan/HunyuanDiT/utils/collect_env.py:84: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  from distutils import errors
sys.platform: linux
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.20
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1  (built against CUDA 12.4)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

Regarding my installation: it's a mess. This is my first time tinkering with WSL/Linux, so my venv is likely set up incorrectly.

Error traceback
If applicable, paste the error traceback here.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/train_deepspeed.py", line 529, in <module>
[rank0]:     main(get_args())
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/train_deepspeed.py", line 458, in main
[rank0]:     for batch in loader:
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: KeyError: Caught KeyError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 231, in __getitem__
[rank0]:     description = self.get_text(ind)
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 221, in get_text
[rank0]:     text =  self.get_original_text(ind)
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/hydit/data_loader/arrow_load_stream.py", line 216, in get_original_text
[rank0]:     text = self.index_manager.get_attribute(ind, 'text_zh' if self.enable_CN else 'text_en')
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/IndexKits/index_kits/indexer.py", line 427, in get_attribute
[rank0]:     return self.get_attribute_by_index(index, column, shadow=shadow)
[rank0]:   File "/home/user/mnt/hunyuan/HunyuanDiT/IndexKits/index_kits/indexer.py", line 407, in get_attribute_by_index
[rank0]:     return table[column][index - index_bias].as_py()
[rank0]:   File "pyarrow/table.pxi", line 1646, in pyarrow.lib._Tabular.__getitem__
[rank0]:   File "pyarrow/table.pxi", line 1732, in pyarrow.lib._Tabular.column
[rank0]:   File "pyarrow/table.pxi", line 1668, in pyarrow.lib._Tabular._ensure_integer_index
[rank0]: KeyError: 'Field "text_zh" does not exist in schema'

[2024-08-04 01:21:06,205] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6249
[2024-08-04 01:21:06,205] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'hydit/train_deepspeed.py', '--local_rank=0', '--task-flag', 'pianostylehunyuan', '--model', 'DiT-g/2', '--training-parts', 'lora', '--rank', '32', '--resume', '--resume-module-root', '/home/user/mnt/hunyuandiffusers/ckpt/t2i/model/pytorch_model_module.pt', '--lr', '0.0002', '--noise-schedule', 'scaled_linear', '--beta-start', '0.00085', '--beta-end', '0.018', '--predict-type', 'v_prediction', '--uncond-p', '0', '--uncond-p-t5', '0', '--index-file', '/home/user/mnt/hunyuan/dataset-output/index_file.json', '--random-flip', '--batch-size', '1', '--image-size', '1024', '--global-seed', '999', '--grad-accu-steps', '4', '--warmup-num-steps', '0', '--use-flash-attn', '--use-fp16', '--ema-dtype', 'none', '--results-dir', '/home/user/mnt/hunyuan/model/', '--ckpt-every', '120', '--max-training-steps', '3600', '--ckpt-latest-every', '5000', '--log-every', '10', '--deepspeed', '--deepspeed-optimizer', '--use-zero-stage', '2', '--qk-norm', '--rope-img', 'base512', '--rope-real'] exits with return code = 1

Below is the full log of the training command used:
full_log.txt

I tried setting enable_CN=False in arrow_load_stream.py, and now I'm getting the following when I try to start training:

[2024-08-04 01:42:08]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:09]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:10]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:12] (step=0000010) (update_step=0000002) Train Loss: 0.0116, Lr: 0.0002, Steps/Sec: 0.22, Samples/Sec: 0
[2024-08-04 01:42:12]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:13]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:15]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:16]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:18]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:19]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:21]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:23]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:25]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:27]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:29] (step=0000020) (update_step=0000005) Train Loss: 0.0004, Lr: 0.0002, Steps/Sec: 0.57, Samples/Sec: 0
[2024-08-04 01:42:29]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:31]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'
[2024-08-04 01:42:32]     arrow_load_stream | get_raw_image | Error: 'Field "binary" does not exist in schema'

Both errors were resolved by generating a correct .arrow file.
I had used an incorrect arrow generation script, so when training started there was no actual training data to train on, and the loader could not find the required schema fields.
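
For anyone hitting the same errors: a quick way to confirm that a generated .arrow file actually carries the fields the data loader reads (text_zh/text_en in get_original_text, binary in get_raw_image) is to print its schema with pyarrow. This is only a sketch; the file path is an example, and the exact column set depends on the arrow generation script used.

import pyarrow as pa

arrow_path = "/home/user/mnt/hunyuan/dataset-output/00000.arrow"  # example path, adjust to your output

# Open the Arrow IPC file and load it as a table to inspect its schema.
with pa.memory_map(arrow_path, "r") as source:
    table = pa.ipc.open_file(source).read_all()

print(table.schema)
for field in ("text_zh", "text_en", "binary"):
    print(field, "ok" if field in table.schema.names else "MISSING")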