kohya-ss/sd-scripts

Best Code for Full SDXL finetuning?

Opened this issue · 2 comments

I am currently working on full SDXL fine-tuning but have had trouble finding the best code because of conflicting information from various sources. Previously, I used the kohya-trainer GitHub repository, which worked well for many cases. However, it has a limitation: as I learned from running the SDXL pipeline in ComfyUI, SDXL has two text encoders, G and L. I may be mistaken, but that is what ComfyUI suggests, and I also observed that the SDXL 1.0 files on Hugging Face support two prompts.
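For what it's worth, the diffusers SDXL pipeline on Hugging Face does expose both prompts separately, which matches what ComfyUI shows. A minimal sketch (the prompts below are just placeholders, not training code):

import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base pipeline; it contains text_encoder / text_encoder_2 internally.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    prompt="a photo of a cat",                      # routed to text_encoder (the "L" encoder, CLIP ViT-L)
    prompt_2="soft light, shallow depth of field",  # routed to text_encoder_2 (the "G" encoder, OpenCLIP ViT-bigG)
).images[0]
image.save("example.png")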

The kohya-trainer code only allows for a single prompt as a tag or caption, which poses a challenge for my use case. To address this, I found another full fine-tuning code in the sd-scripts repository that accommodates two different prompts. However, I'm uncertain whether to trust this code.

To illustrate the differences, here are examples of the metadata files from both repositories:

  1. https://github.com/qaneel/kohya-trainer
    {
      "filename": {
        "tags": "tag for text encoder",
        "train_resolution": [
          896,
          1152
        ]
      }
    }

  2. https://github.com/mio-nyan/sd-scripts/tree/dev
    {
      "filename": {
        "captionG": "caption for (G) text_encoder2",
        "captionL": "caption for (L) text_encoder1",
        "train_resolution": [
          896,
          1152
        ]
      }
    }
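To make the difference concrete, here is a rough sketch of how the two captions could be tokenized for SDXL's two text encoders. This is not the actual sd-scripts dataloader; the meta.json path is hypothetical and the keys simply follow example 2 above:

import json
from transformers import CLIPTokenizer

# Both SDXL tokenizers live in the base model repo as "tokenizer" and "tokenizer_2".
tokenizer_l = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer")    # pairs with text_encoder1 (L)
tokenizer_g = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer_2")  # pairs with text_encoder2 (G)

with open("meta.json") as f:  # hypothetical metadata file in the format of example 2
    meta = json.load(f)

for filename, entry in meta.items():
    ids_l = tokenizer_l(entry["captionL"], padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt").input_ids
    ids_g = tokenizer_g(entry["captionG"], padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt").input_ids
    # ids_l feeds text_encoder1 (CLIP ViT-L), ids_g feeds text_encoder2 (OpenCLIP ViT-bigG);
    # their hidden states are concatenated to form the UNet's text conditioning.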

So, based on this, can anyone suggest which GitHub repo I should choose for full fine-tuning of SDXL using both text encoders?

The second problem, which I have seen mentioned in multiple places, is related to the token length of CLIP models. The base CLIP has a token length of 77, but with the release of LongCLIP (https://github.com/beichenzbc/Long-CLIP/tree/main) the authors were able to increase it from 77 to 248. The reason they gave is that the effective length of CLIP is only about 20 tokens, which is far less than even 77. However, I still haven't found any code for fine-tuning SDXL with this LongCLIP model, and I hope the SDXL generative AI community can help me solve this.
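For reference, here is a minimal sketch of loading Long-CLIP roughly as described in its README; the checkpoint path is a placeholder and the import layout is taken from that repo, so it may differ in your setup:

import torch
from model import longclip  # package layout from the Long-CLIP repository

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-L.pt", device=device)

# Long-CLIP's tokenizer supports the extended 248-token context instead of the usual 77.
text = longclip.tokenize(["an unusually long and detailed caption ..."]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
print(text.shape, text_features.shape)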

Finally, if anyone wants to use their fine-tuned SDXL model with ComfyUI, they can look at this SeaArtLab repo: https://github.com/SeaArtLab/ComfyUI-Long-CLIP. And if you still can't run LongCLIP with ComfyUI or any SDXL pipeline, feel free to ask me.

0: 640x448 1 face, 15.9ms
Speed: 2.8ms preprocess, 15.9ms inference, 1.4ms postprocess per image at shape (1, 3, 640, 448)
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load SDLongClipModel
Loading 1 new model
loaded completely 0.0 450.3515625 True
!!! Exception during processing !!! 'visual.layer1.0.conv1.weight'
Traceback (most recent call last):
  File "/home/ks/comfui/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/ks/comfui/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/custom_nodes/ComfyUI-Long-CLIP/long_clip.py", line 469, in do
    clip = CLIP(clip_target, embedding_directory=embedding_directory)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/comfy/sd.py", line 82, in __init__
    self.cond_stage_model = clip(**(params))
                            ^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/custom_nodes/ComfyUI-Long-CLIP/long_clip.py", line 26, in __init__
    self.transformer, _ = longclip.load(version, device=device)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/custom_nodes/ComfyUI-Long-CLIP/long_clip_model/longclip.py", line 78, in load
    model = build_model(state_dict or model.state_dict(), load_from_clip=False).to(device)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ks/comfui/custom_nodes/ComfyUI-Long-CLIP/long_clip_model/model_longclip.py", line 468, in build_model
    vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
                   ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'visual.layer1.0.conv1.weight'

Prompt executed in 16.47 seconds

No matter how I install it, via the custom node manager or by git cloning and installing all the requirements, I keep getting this error when trying to load longclip-L via the SeaArtLab node. I can't find anyone else who has this issue...

Can you please show me an image of your ComfyUI workflow, if possible?