TencentARC/BrushNet

Finetune SDXL

dydxdt opened this issue · 15 comments

dydxdt commented

Thanks for your good work. I used the provided SDXL weights to finetune on my own data, but the loss doesn't seem to converge, and I wonder whether the provided weights were trained at 1024 resolution. I tested the finetuned model and it cannot learn the style of the training data. Do you have any advice? Thanks!

I also encountered a similar problem: training the SDXL model at 1024 resolution, the loss does not seem to converge.

The model training configuration is as follows:

```bash
accelerate launch examples/brushnet/train_brushnet_sdxl.py \
--pretrained_model_name_or_path /disk1/BrushNet/data/ckpt/anything-xl \
--brushnet_model_name_or_path /disk1/BrushNet/data/ckpt/random_mask_brushnet_ckpt_sdxl_v0 \
--output_dir runs/logs/selfdata_brushnetsdxl_1024 \
--train_data_dir /disk1/data/self_developed_animate_data \
--resolution 1024 \
--max_train_steps 100000 \
--learning_rate 1e-5 \
--train_batch_size 1 \
--gradient_accumulation_steps 4 \
--tracker_project_name brushnet \
--report_to tensorboard \
--resume_from_checkpoint latest \
--validation_steps 1000 \
--checkpointing_steps 1000 \
--random_mask
```

The training log shows the following:

[training log screenshot]

How did you solve the problem later, @dydxdt?

Haven't figured it out yet. It sucks. Hoping for helpful advice -_- @huiyang865

Thanks for your reply.

Does the BrushNet branch of SDXL have the same structure as the SD1.5 one? Looking at the code, I find that BrushNet features are not injected into the refiner module of SDXL, is that right?

Could these factors limit the convergence of the XL version? Looking forward to your further reply. Thank you very much. @dydxdt

@huiyang865 @dydxdt Hi, I have trained the SDXL version on my custom data (around 35K images) from scratch, and the outputs are pretty good! I trained for 30K iterations at 1024 resolution. Now I am planning to train on 3M images.

(1) You can remove (--random_mask).
(2) Check that your dataset was converted correctly (most important; a quick sanity check is sketched at the end of this comment):
i. Get better mask images (you can use SAM, Grounding DINO, or OneFormer).
ii. Compute their RLE (run-length encoding) correctly.
(3) Try to increase the batch size if you have 40GB+ VRAM, although I used just 1 as well :)

If you have any questions, feel free to ask!
Good luck~
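
For point (2), a quick, hypothetical sanity check before converting to RLE: verify that every mask file is readable, binary, and matches its source image size. The `images/` and `masks/` folder layout and file naming below are assumptions for illustration, not the BrushData layout.

```python
import os
import cv2
import numpy as np

def check_pair(image_path, mask_path):
    """Return a short problem description, or None if the (image, mask) pair looks fine."""
    image = cv2.imread(image_path)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    if image is None or mask is None:
        return "unreadable file"
    if image.shape[:2] != mask.shape[:2]:
        return f"size mismatch: image {image.shape[:2]} vs mask {mask.shape[:2]}"
    values = set(np.unique(mask).tolist())
    if not values <= {0, 255}:
        return f"mask is not strictly binary (values: {sorted(values)[:10]})"
    return None

# Hypothetical layout: images/ and masks/ hold files with matching base names.
for name in os.listdir("images"):
    mask_name = os.path.splitext(name)[0] + ".png"
    problem = check_pair(os.path.join("images", name), os.path.join("masks", mask_name))
    if problem:
        print(f"{name}: {problem}")
```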

@Shuvo001 Thanks for your reply :) I have some questions:

  1. You mean you didn't load the offered pretrained SDXL BrushNet model, and just trained from scratch? (I think training from scratch is harder to converge, so I haven't tried it.)
  2. Can you explain RLE (run-length encoding) specifically?

@dydxdt

  1. Yes, I trained from scratch without using the pretrained BrushNet model.
  2. You can look at this part:
     `def rle2mask(self, mask_rle, shape):  # height, width`

The output of the rle2mask function is a binary mask image, where ones represent foreground pixels and zeros represent background pixels. The function converts RLE representations (often used for storage or transmission) back into the original mask image format.
So, as I understand it, we need to convert all masks into RLE values as the mask input, and the BrushNet pipeline will convert them back into mask images by itself.
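
For reference, a minimal standalone sketch of that decoding step, consistent with the mask2rle encoder below (column-major flattening, 1-based start positions); this is an illustration, not the exact code in the repo:

```python
import numpy as np

def rle2mask(mask_rle, shape):  # shape = (height, width)
    """Decode [start, length, start, length, ...] RLE back into a binary mask.
    Assumes 1-based starts over a column-major (Fortran-order) flattened mask,
    matching the mask2rle encoder shown below."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(mask_rle[0::2], mask_rle[1::2]):
        flat[start - 1:start - 1 + length] = 1
    return flat.reshape(shape, order='F')
```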

You can use my code to convert your masks into RLE:

```python
import numpy as np
import json
import cv2
import os


def mask2rle(mask):
    """
    Convert a binary mask to RLE.
    mask: binary mask of shape (height, width)
    Returns RLE as a list of start positions and lengths.
    """
    pixels = mask.flatten(order='F')  # Flatten in column-major order
    pixels = np.concatenate([[0], pixels, [0]])  # Pad with zero at both ends
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1  # Start positions
    runs[1::2] -= runs[::2]  # Run lengths
    return runs.tolist()


# Define the path to the "data" folder containing multiple subfolders
data_folder = 'BrushData/caption/mask_outputs/'

# Define the path to the output folder where JSON files will be saved
output_folder = 'BrushData/segmentation/'

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Iterate through the subfolders in the "data" folder
for folder in os.listdir(data_folder):
    folder_path = os.path.join(data_folder, folder)
    if os.path.isdir(folder_path):
        print(f"Processing folder: {folder}")
        folder_rle_masks = []
        folder_bounding_boxes = []
        folder_pred_phrases = []

        # Load the JSON file for this folder
        json_file = os.path.join(folder_path, folder + '_label.json')
        if not os.path.isfile(json_file):
            print(f"JSON file not found: {json_file}. Skipping folder.")
            continue

        with open(json_file, 'r') as f:
            json_data = json.load(f)

        # Extract the "mask" value from the JSON data
        mask_data = json_data.get("mask", [])

        # Extract the "label" and "logit" values from each mask entry
        for entry in mask_data:
            label = entry.get("label", None)
            logit = entry.get("logit", None)
            box = entry.get("box", None)

            # Add the "label" and "logit" values to the "folder_pred_phrases" list
            if label is not None and logit is not None and box is not None:
                folder_pred_phrases.append(f"{label}({logit})")
                folder_bounding_boxes.append(box)

        # Iterate through the files in the subfolder
        for filename in os.listdir(folder_path):
            if filename.endswith(".png") or filename.endswith(".jpg"):
                # Load the image
                image_path = os.path.join(folder_path, filename)
                image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

                # Check if the image is loaded correctly
                if image is None:
                    print(f"Failed to load image: {image_path}. Skipping.")
                    continue

                # Threshold the image to obtain a binary mask
                _, binary_mask = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

                # Convert the binary mask to RLE format
                rle = mask2rle(binary_mask)
                folder_rle_masks.append(rle)

        # Define the output dictionary for this folder
        output = {
            "mask": folder_rle_masks,
            "pred_phrases": folder_pred_phrases,
            "bbox": folder_bounding_boxes
        }

        # Save the output to a JSON file with the folder name
        output_filename = folder + '.segmentation'
        output_path = os.path.join(output_folder, output_filename)
        with open(output_path, 'w') as f:
            json.dump(output, f)

        print(f"Processed {folder} folder. Output saved to {output_filename}.")

print("All folders processed.")
```

I have tried to follow the BrushData format almost exactly.
You may need to adjust the indentation of my code, because I couldn't keep all of it formatted correctly here.
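
As a quick follow-up check on the script's output, you can load one of the generated `.segmentation` JSON files and confirm the lists line up; the file name below is a hypothetical example.

```python
import json

# Hypothetical path: any file produced by the script above.
with open('BrushData/segmentation/example_folder.segmentation', 'r') as f:
    seg = json.load(f)

print("masks:", len(seg["mask"]),
      "phrases:", len(seg["pred_phrases"]),
      "boxes:", len(seg["bbox"]))
# Phrases and boxes are appended together in the script, so they should pair up.
assert len(seg["pred_phrases"]) == len(seg["bbox"])
```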

@Shuvo001
Hi, I also have a question.
Could you share your training parameters please? (learning rate, LR scheduler, optimizer, number of GPUs, gradient accumulation steps, noise scheduler)

It would be very helpful. Thanks!

Thanks for your reply.

I have checked these three points carefully.

For the first point, there are two training patterns: "random_mask" and "pre_segmented_mask". I think you trained with the "pre_segmented_mask" pattern and succeeded, congratulations!

Could you try the "random_mask" pattern? It randomly generates masks in real time while training the model. The problem appears with this training pattern; could it be more difficult for SDXL training?

Looking forward to your reply. @Shuvo001
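
For readers unfamiliar with the "random_mask" pattern, here is a minimal sketch of what real-time random mask generation can look like (random filled rectangles plus brush-like strokes). The shape counts and thickness ranges are assumptions for illustration; this is not the repo's actual implementation.

```python
import numpy as np
import cv2

def random_mask(height, width, max_rects=3, max_strokes=3):
    mask = np.zeros((height, width), dtype=np.uint8)
    # Random filled rectangles
    for _ in range(np.random.randint(1, max_rects + 1)):
        x1, y1 = int(np.random.randint(0, width)), int(np.random.randint(0, height))
        x2, y2 = int(np.random.randint(0, width)), int(np.random.randint(0, height))
        cv2.rectangle(mask, (min(x1, x2), min(y1, y2)), (max(x1, x2), max(y1, y2)), 1, -1)
    # Random brush-like strokes (thick open polylines)
    for _ in range(np.random.randint(1, max_strokes + 1)):
        pts = np.random.randint(0, [width, height], size=(np.random.randint(2, 6), 2)).astype(np.int32)
        thickness = int(np.random.randint(5, max(6, height // 8)))
        cv2.polylines(mask, [pts.reshape(-1, 1, 2)], isClosed=False, color=1, thickness=thickness)
    return mask  # 1 = region to inpaint, 0 = region to keep
```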

Hi @yuanhangio I saw you started a new PR to update the model.
Thanks a lot for your effort.

Could you share your training recipe in detail please? (FP32 or FP16, batch size, scheduler, learning rate, etc.)

I'm struggling to finetune it on my own data. Thanks!

Hi, can you share the data preprocessing code please, thanks!

I have encountered similar problems: the loss doesn't converge and the outputs are full of noise. Have you solved it? I think it's highly related to fp16 training.

After I replaced the VAE with the fp16-fix VAE, training seems to proceed normally, although I don't know why.
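
For anyone trying the same fix, a minimal sketch of swapping in an fp16-stable VAE with diffusers; the assumption here is that "fp16-fix VAE" refers to the widely used madebyollin/sdxl-vae-fp16-fix checkpoint on the Hugging Face Hub, and how you wire it into the BrushNet pipeline or training script depends on your setup.

```python
import torch
from diffusers import AutoencoderKL

# SDXL VAE finetuned to stay numerically stable in fp16.
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16,
)
# Pass this VAE to the pipeline / training script in place of the default SDXL VAE.
```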

Is the loss fluctuating, too?

Sorry to bother everyone. The loss non-convergence problem confuses me a lot. Has anyone solved it?
@huiyang865 @windson87 @RunpuWei @dydxdt @onewangqianqian

Is it possible that the resolution of the training data is very low? I found that BrushData contains some very low-resolution data, while SDXL generates at 1024x1024.
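
A minimal sketch for flagging low-resolution training images (the data path reuses the one from the training command above; the 1024px threshold matches the SDXL training resolution):

```python
import os
import cv2

data_dir = "/disk1/data/self_developed_animate_data"  # path from the training command above
min_side = 1024  # SDXL training resolution

for root, _, files in os.walk(data_dir):
    for name in files:
        if name.lower().endswith((".png", ".jpg", ".jpeg")):
            img = cv2.imread(os.path.join(root, name))
            if img is not None and min(img.shape[:2]) < min_side:
                print(f"low-res ({img.shape[1]}x{img.shape[0]}): {os.path.join(root, name)}")
```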