felixrosberg/FaceDancer

How to fine-tune a model?

Mocujin933 opened this issue · 15 comments

Hello! @felixrosberg I'd like to try to fine-tune the FaceDancer_config_c_HQ.h5 model on my own dataset. Please tell me how this can be done. Could you please share the full command for training config_c_HQ and, if possible, share FaceDancer_config_c_HQ not exported to .h5 (the dis, gen, state and config folders)? I will be very grateful for your help. Thank you!

Hi @Mocujin933!

Answering on the phone, so bear with me.

If you want to fine-tune, I recommend replacing this:

G = get_generator(up_types=opt.up_types,
                      mapping_depth=opt.mapping_depth,
                      mapping_size=opt.mapping_size)

With:

G = load_model(opt.facedancer_path, compile=False,
                   custom_objects={"AdaIN": AdaIN,
                                   "AdaptiveAttention": AdaptiveAttention,
                                   "InstanceNormalization": InstanceNormalization})

Then you should just run train.py as is. (You may have to import some things.)
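The imports for that call might look like the sketch below. The module paths here (networks.layers, tensorflow_addons) are assumptions, not verified against the repo, so adjust them to your checkout:

    # Minimal sketch of the imports needed for the load_model call above.
    # NOTE: the module paths are assumptions - verify them in your checkout.
    from tensorflow.keras.models import load_model
    from tensorflow_addons.layers import InstanceNormalization

    from networks.layers import AdaIN, AdaptiveAttention  # assumed path

    G = load_model(opt.facedancer_path, compile=False,
                   custom_objects={"AdaIN": AdaIN,
                                   "AdaptiveAttention": AdaptiveAttention,
                                   "InstanceNormalization": InstanceNormalization})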

The training code should save checkpoints automatically; however, these will be .h5 files as well, containing only the weights and not the full model.
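To resume from such a weights-only checkpoint, you would rebuild the model and load the weights into it, roughly like this (a sketch; the gen_0.h5 filename and folder layout follow this thread and may differ on your machine):

    # Sketch: resume from a weights-only checkpoint. The filename and
    # folder layout (checkpoints/facedancer/gen/gen_0.h5) are assumptions.
    G = get_generator(up_types=opt.up_types,
                      mapping_depth=opt.mapping_depth,
                      mapping_size=opt.mapping_size)
    G.load_weights("checkpoints/facedancer/gen/gen_0.h5")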

I can try to share the command later; I am on my phone right now and just need to make sure it runs correctly.

Thank you @felixrosberg! I have one more question for you. I'm a little confused by the paths marked with "../" in the code and by how folders are created in the FaceDancer directory when training starts. If I understand correctly, should the created directories look like this? I've listed only the folders (not the files that are generated inside them). And is the eval dataset the same dataset used for training the model?

FaceDancer
├── checkpoints
│   └── facedancer
│       ├── dis
│       ├── gen
│       └── state
├── config
│   └── facedancer
├── exports
│   └── facedancer
├── logs
│   └── runs
│       └── facedancer
└── results

Yes so "../" means that you "go out one step", so if you have your script inside folder_0 like this: "folder_0/script.py" and you have a folder_1 that contains a file like this "folder_1/file.txt" and you want to access that file when you run script.py inside folder_0 you can specify "../folder_1/file.txt". This is relative so if you run the script from the parent folder like this: "python folder_0/script.py", you don't have to use "../" anymore.

Your folder structure looks correct. The validation data is not the same, I used the validation set of VGGFace2 for this.

Yes so "../" means that you "go out one step", so if you have your script inside folder_0 like this: "folder_0/script.py" and you have a folder_1 that contains a file like this "folder_1/file.txt" and you want to access that file when you run script.py inside folder_0 you can specify "../folder_1/file.txt". This is relative so if you run the script from the parent folder like this: "python folder_0/script.py", you don't have to use "../" anymore.

Your folder structure looks correct. The validation data is not the same, I used the validation set of VGGFace2 for this.

And if there is no validation dataset, since I made the dataset myself (54,784 images; the file structure is the same as in VGGFace2), what is the best way to proceed?

I made a dataset, did crop-and-align, and sharded the cropped and aligned data with the provided scripts. Everything should be done correctly, with the standard values from the README. I tried to train the model, but I get the error "need at least one array to concatenate" and nothing is saved - neither the model nor the test samples in the ./results folder. What could it be?

I ran the command like this:
python train/train.py --data_dir "./test_2_aligned_shards/train/test_dataset_train_*-of-*.records" --eval_dir "./test_2_aligned_shards/train/test_dataset_train_*-of-*.records" --batch_size 4

and it caused "need at least one array to concatenate".
When I changed batch_size to 6, I got this error: "all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1280 and the array at index 4 has size 256"

Hi @felixrosberg! In general, if you start training the model from scratch, this error prevents the dis.json, dis_0.h5, gen.json, gen_0.h5 and 0.json files from being saved. If you comment out log_image in train.py, these files are created correctly. But if you uncomment log_image and continue training with the --load 0 parameter, after a while the error appears again (I stopped the training after 100 iterations and resumed with the line uncommented; training counts from 0 again, and when it reaches 100 the error starts again). I also noticed that you don't use ArcFacePerceptual-Res50.h5 anywhere in the code, even though it is listed as a required element in the README. Also, the --shift, --scale, --z_id_size, --shuffle and --result_dir parameters are not used anywhere in train.py. I would be very grateful for your feedback and help in resolving these errors, thanks!

And if there is no validation dataset, since I made the dataset myself (54,784 images; the file structure is the same as in VGGFace2), what is the best way to proceed?

See the data processing step. If there are no subfolders, only images, select a folder that contains the folder with the images.

I made a dataset, did crop-and-align, and sharded the cropped and aligned data with the provided scripts. Everything should be done correctly, with the standard values from the README. I tried to train the model, but I get the error "need at least one array to concatenate" and nothing is saved - neither the model nor the test samples in the ./results folder. What could it be?

I think I need more information to be of any help here. Where did the error occur? What were the input arguments? etc.
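Separately, it is worth verifying that the glob pattern you pass actually matches your shard files - an empty match leaves the dataset empty. A quick check might look like this (a sketch; the pattern is the one from your command):

    import tensorflow as tf

    # Sanity check: does the --data_dir glob actually match any shards?
    pattern = "./test_2_aligned_shards/train/test_dataset_train_*-of-*.records"
    files = tf.io.gfile.glob(pattern)
    print(len(files), "shard(s) found")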

I ran the command like this: python train/train.py --data_dir "./test_2_aligned_shards/train/test_dataset_train_*-of-*.records" --eval_dir "./test_2_aligned_shards/train/test_dataset_train_*-of-*.records" --batch_size 4

and it caused "need at least one array to concatenate". When I changed batch_size to 6, I got this error: "all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1280 and the array at index 4 has size 256"

Same as above, I need more context. Could it be that the images are of different sizes for some reason? They should all be 256x256. I guess you can run 128x128 or other multiples of 2, but you may have to make adjustments in the code.
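If you want to rule out mismatched sizes, a quick scan over the aligned images could look like this (a sketch using Pillow; the folder name is an example, substitute your own):

    from pathlib import Path
    from PIL import Image

    # Report any aligned image that is not exactly 256x256.
    # "test_dataset_train_aligned" is an example path - use your own.
    for p in Path("test_dataset_train_aligned").rglob("*.jpg"):
        with Image.open(p) as im:
            if im.size != (256, 256):
                print(p, im.size)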

Hi @felixrosberg! In general, if you start training the model from scratch, this error prevents the dis.json, dis_0.h5, gen.json, gen_0.h5 and 0.json files from being saved. If you comment out log_image in train.py, these files are created correctly. But if you uncomment log_image and continue training with the --load 0 parameter, after a while the error appears again (I stopped the training after 100 iterations and resumed with the line uncommented; training counts from 0 again, and when it reaches 100 the error starts again). I also noticed that you don't use ArcFacePerceptual-Res50.h5 anywhere in the code, even though it is listed as a required element in the README. Also, the --shift, --scale, --z_id_size, --shuffle and --result_dir parameters are not used anywhere in train.py. I would be very grateful for your feedback and help in resolving these errors, thanks!

Considering you partially fixed this by disabling log_image, that supports the idea that your images are different sizes. Make sure they are correctly aligned and resized accordingly if that is the case. You are correct, I seem to initialize the perceptual ArcFace using the regular ArcFace. I will adjust the README.

All images in the dataset are cropped and aligned to 256x256 using the scripts from the repository. The file structure of the dataset is exactly the same as in VGGFace2. Could this error be caused by using the same dataset for eval as for training?

@felixrosberg please share the command, with arguments, that you used to train the FaceDancer_config_c_HQ.h5 model.

@felixrosberg I did an experiment: from my dataset (cropped and aligned, 256x256) I moved about one fifth of the data to another folder to create eval data (the file structure of both datasets is the same as in VGGFace2). I sharded the main dataset and the eval dataset (eval with the --data_type val parameter). As a result, I have two sharded datasets with different data. But I still get an error: "all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1280 and the array at index 4 has size 768"

I really don't understand what the problem is.

And one more thing: if you change batch_size, the error also changes.
batch_size = 4:
"need at least one array to concatenate"

batch_size = 6:
"all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1280 and the array at index 4 has size 256"

batch_size = 8:
"all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1280 and the array at index 4 has size 768"

As an example, this is the file structure of my datasets:

test_dataset_train_tfrecords
└── train
    ├── test_dataset_train_00000-of-00036.records
    ├── test_dataset_train_00001-of-00036.records
    ├── test_dataset_train_00002-of-00036.records
    ├── test_dataset_train_00003-of-00036.records
    ├── test_dataset_train_00004-of-00036.records
    ├── test_dataset_train_00005-of-00036.records
    ├── test_dataset_train_00006-of-00036.records
    ├── test_dataset_train_00007-of-00036.records
    ├── test_dataset_train_00008-of-00036.records
    ├── test_dataset_train_00009-of-00036.records
    ├── test_dataset_train_00010-of-00036.records
    ├── test_dataset_train_00011-of-00036.records
    ├── test_dataset_train_00012-of-00036.records
    ├── test_dataset_train_00013-of-00036.records
    ├── test_dataset_train_00014-of-00036.records
    ├── test_dataset_train_00015-of-00036.records
    ├── test_dataset_train_00016-of-00036.records
    ├── test_dataset_train_00017-of-00036.records
    ├── test_dataset_train_00018-of-00036.records
    ├── test_dataset_train_00019-of-00036.records
    ├── test_dataset_train_00020-of-00036.records
    ├── test_dataset_train_00021-of-00036.records
    ├── test_dataset_train_00022-of-00036.records
    ├── test_dataset_train_00023-of-00036.records
    ├── test_dataset_train_00024-of-00036.records
    ├── test_dataset_train_00025-of-00036.records
    ├── test_dataset_train_00026-of-00036.records
    ├── test_dataset_train_00027-of-00036.records
    ├── test_dataset_train_00028-of-00036.records
    ├── test_dataset_train_00029-of-00036.records
    ├── test_dataset_train_00030-of-00036.records
    ├── test_dataset_train_00031-of-00036.records
    ├── test_dataset_train_00032-of-00036.records
    ├── test_dataset_train_00033-of-00036.records
    ├── test_dataset_train_00034-of-00036.records
    ├── test_dataset_train_00035-of-00036.records
    └── test_dataset_train_00036-of-00036.records

1 directory, 37 files


test_dataset_val_tfrecords
└── val
    ├── test_dataset_val_00000-of-00005.records
    ├── test_dataset_val_00001-of-00005.records
    ├── test_dataset_val_00002-of-00005.records
    ├── test_dataset_val_00003-of-00005.records
    ├── test_dataset_val_00004-of-00005.records
    └── test_dataset_val_00005-of-00005.records

1 directory, 6 files

(For these datasets, I only show two folders each, as the full listing would be very long.)
test_dataset_train_aligned
├── n000003
│   ├── 0000_01.jpg
│   ├── 0001_01.jpg
│   ├── 0002_01.jpg
│   ├── 0003_01.jpg
│   ├── 0004_01.jpg
│   ├── 0005_01.jpg
│   ├── 0006_01.jpg
│   ├── 0007_01.jpg
│   ├── 0008_01.jpg
│   ├── 0009_01.jpg
│   ├── 0010_01.jpg
│   ├── 0011_01.jpg
│   ├── 0012_01.jpg
│   ├── 0013_01.jpg
│   ├── 0014_01.jpg
│   ├── 0015_01.jpg
│   ├── 0016_01.jpg
│   ├── 0017_01.jpg
│   ├── 0018_01.jpg
│   ├── 0019_01.jpg
│   ├── 0020_01.jpg
│   ├── 0021_01.jpg
│   ├── 0022_01.jpg
│   ├── 0023_01.jpg
│   ├── 0024_01.jpg
│   ├── 0025_01.jpg
│   ├── 0026_01.jpg
│   ├── 0027_01.jpg
│   ├── 0028_01.jpg
│   ├── 0029_01.jpg
│   ├── 0030_01.jpg
│   ├── 0031_01.jpg
│   ├── 0032_01.jpg
│   ├── 0033_01.jpg
│   ├── 0034_01.jpg
│   ├── 0035_01.jpg
│   ├── 0036_01.jpg
│   ├── 0037_01.jpg
│   ├── 0038_01.jpg
│   ├── 0039_01.jpg
│   ├── 0040_01.jpg
│   ├── 0041_01.jpg
│   ├── 0042_01.jpg
│   ├── 0043_01.jpg
│   ├── 0044_01.jpg
│   ├── 0045_01.jpg
│   ├── 0046_01.jpg
│   ├── 0047_01.jpg
│   ├── 0048_01.jpg
│   └── 0049_01.jpg
├── n000004
│   ├── 0000_01.jpg
│   ├── 0001_01.jpg
│   ├── 0002_01.jpg
│   ├── 0003_01.jpg
│   ├── 0004_01.jpg
│   ├── 0005_01.jpg
│   ├── 0006_01.jpg
│   ├── 0007_01.jpg
│   ├── 0008_01.jpg
│   ├── 0009_01.jpg
│   ├── 0010_01.jpg
│   ├── 0011_01.jpg
│   ├── 0012_01.jpg
│   ├── 0013_01.jpg
│   ├── 0014_01.jpg
│   ├── 0015_01.jpg
│   ├── 0016_01.jpg
│   ├── 0017_01.jpg
│   ├── 0018_01.jpg
│   ├── 0019_01.jpg
│   ├── 0020_01.jpg
│   ├── 0021_01.jpg
│   ├── 0022_01.jpg
│   ├── 0023_01.jpg
│   ├── 0024_01.jpg
│   ├── 0025_01.jpg
│   ├── 0026_01.jpg
│   ├── 0027_01.jpg
│   ├── 0028_01.jpg
│   ├── 0029_01.jpg
│   ├── 0030_01.jpg
│   ├── 0031_01.jpg
│   ├── 0032_01.jpg
│   ├── 0033_01.jpg
│   ├── 0034_01.jpg
│   ├── 0035_01.jpg
│   ├── 0036_01.jpg
│   ├── 0037_01.jpg
│   ├── 0038_01.jpg
│   ├── 0039_01.jpg
│   ├── 0040_01.jpg
│   ├── 0041_01.jpg
│   ├── 0042_01.jpg
│   ├── 0043_01.jpg
│   ├── 0044_01.jpg
│   ├── 0045_01.jpg
│   ├── 0046_01.jpg
│   ├── 0047_01.jpg
│   ├── 0048_01.jpg
│   └── 0049_01.jpg


test_dataset_val_aligned
├── n000001
│   ├── 0000_01.jpg
│   ├── 0001_01.jpg
│   ├── 0002_01.jpg
│   ├── 0003_01.jpg
│   ├── 0004_01.jpg
│   ├── 0005_01.jpg
│   ├── 0006_01.jpg
│   ├── 0007_01.jpg
│   ├── 0008_01.jpg
│   ├── 0009_01.jpg
│   ├── 0010_01.jpg
│   ├── 0011_01.jpg
│   ├── 0012_01.jpg
│   ├── 0013_01.jpg
│   ├── 0014_01.jpg
│   ├── 0015_01.jpg
│   ├── 0016_01.jpg
│   ├── 0017_01.jpg
│   ├── 0018_01.jpg
│   ├── 0019_01.jpg
│   ├── 0020_01.jpg
│   ├── 0021_01.jpg
│   ├── 0022_01.jpg
│   ├── 0023_01.jpg
│   ├── 0024_01.jpg
│   ├── 0025_01.jpg
│   ├── 0026_01.jpg
│   ├── 0027_01.jpg
│   ├── 0028_01.jpg
│   ├── 0029_01.jpg
│   ├── 0030_01.jpg
│   ├── 0031_01.jpg
│   ├── 0032_01.jpg
│   ├── 0033_01.jpg
│   ├── 0034_01.jpg
│   ├── 0035_01.jpg
│   ├── 0036_01.jpg
│   ├── 0037_01.jpg
│   ├── 0038_01.jpg
│   ├── 0039_01.jpg
│   ├── 0040_01.jpg
│   ├── 0041_01.jpg
│   ├── 0042_01.jpg
│   ├── 0043_01.jpg
│   ├── 0044_01.jpg
│   ├── 0045_01.jpg
│   ├── 0046_01.jpg
│   ├── 0047_01.jpg
│   ├── 0048_01.jpg
│   └── 0049_01.jpg
├── n000002
│   ├── 0000_01.jpg
│   ├── 0001_01.jpg
│   ├── 0002_01.jpg
│   ├── 0003_01.jpg
│   ├── 0004_01.jpg
│   ├── 0005_01.jpg
│   ├── 0006_01.jpg
│   ├── 0007_01.jpg
│   ├── 0008_01.jpg
│   ├── 0009_01.jpg
│   ├── 0010_01.jpg
│   ├── 0011_01.jpg
│   ├── 0012_01.jpg
│   ├── 0013_01.jpg
│   ├── 0014_01.jpg
│   ├── 0015_01.jpg
│   ├── 0016_01.jpg
│   ├── 0017_01.jpg
│   ├── 0018_01.jpg
│   ├── 0019_01.jpg
│   ├── 0020_01.jpg
│   ├── 0021_01.jpg
│   ├── 0022_01.jpg
│   ├── 0023_01.jpg
│   ├── 0024_01.jpg
│   ├── 0025_01.jpg
│   ├── 0026_01.jpg
│   ├── 0027_01.jpg
│   ├── 0028_01.jpg
│   ├── 0029_01.jpg
│   ├── 0030_01.jpg
│   ├── 0031_01.jpg
│   ├── 0032_01.jpg
│   ├── 0033_01.jpg
│   ├── 0034_01.jpg
│   ├── 0035_01.jpg
│   ├── 0036_01.jpg
│   ├── 0037_01.jpg
│   ├── 0038_01.jpg
│   ├── 0039_01.jpg
│   ├── 0040_01.jpg
│   ├── 0041_01.jpg
│   ├── 0042_01.jpg
│   ├── 0043_01.jpg
│   ├── 0044_01.jpg
│   ├── 0045_01.jpg
│   ├── 0046_01.jpg
│   ├── 0047_01.jpg
│   ├── 0048_01.jpg
│   └── 0049_01.jpg

random sample from train dataset:
0031_01

random sample from val dataset:
0000_01

Yes, I think I know the fault now.

Inside log_image(...) you can see it loops and adapts its index for a batch size of 10, which means a different batch size is not going to work without adjusting it. That also explains the numbers in your errors: the images are 256 pixels wide, so a full chunk of 5 is 1280 wide, while a batch of 6 leaves a leftover chunk of one image (256) and a batch of 8 leaves a leftover chunk of three (768). Either try batch size 10 for now, or the quick fix is to change the 10 to your batch size (e.g. 6) and the 5s to batch size / 2 (e.g. 3). I will fix this when I find the time.
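For reference, a batch-size-agnostic version of that stitching loop might look like the sketch below ("row" is an assumed helper name; it expects an even batch size):

    import numpy as np

    # Sketch of a batch-size-agnostic stitching loop; with
    # opt.batch_size == 10 this reduces to the original behaviour.
    row = opt.batch_size // 2  # images per grid row (integer division)
    r = []
    for i in range(0, opt.batch_size, row):
        r.append(np.concatenate(change[i:i + row], axis=1))
        r.append(np.concatenate(change_s[i:i + row], axis=1))
        r.append(np.concatenate(target[i:i + row], axis=1))
        r.append(np.concatenate(source[i:i + row], axis=1))
    c1 = np.concatenate(r, axis=0)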

I changed it per your recommendation, but still get an error. I can't check with batch_size 10 because it results in OOM.

    def log_image(sw, target, source, iteration, category='validation/'):

        # extract id information
        source_z = ArcFace(tf.image.resize((source + 1) / 2, [112, 112]))
        target_z = ArcFace(tf.image.resize((target + 1) / 2, [112, 112]))

        # generate face swap and reconstruction
        change = (G([target, source_z]) + 1) / 2
        change_s = (G([target, target_z]) + 1) / 2

        target = (target + 1) / 2
        source = (source + 1) / 2

        # stitch images
        r = []
        change = change.numpy()
        change_s = change_s.numpy()
        for i in range(0, opt.batch_size, 5):
            r.append(np.concatenate(change[i:i + 5], axis=1))
            r.append(np.concatenate(change_s[i:i + 5], axis=1))
            r.append(np.concatenate(target[i:i + 5], axis=1))
            r.append(np.concatenate(source[i:i + 5], axis=1))

        c1 = np.concatenate(r, axis=0)
        c1 = np.clip(c1, 0.0, 1.0)

        # log images to tensorboard
        with sw.as_default():
            tf.summary.image(category + 'samples', np.expand_dims(c1, axis=0), step=iteration, max_outputs=10)

I changed it to this, and it seems to work!

        for i in range(0, opt.batch_size, opt.batch_size // 2):  # integer division, so range() gets ints
            r.append(np.concatenate(change[i:i + opt.batch_size // 2], axis=1))
            r.append(np.concatenate(change_s[i:i + opt.batch_size // 2], axis=1))
            r.append(np.concatenate(target[i:i + opt.batch_size // 2], axis=1))
            r.append(np.concatenate(source[i:i + opt.batch_size // 2], axis=1))

But it also works with this:

        for i in range(0, opt.batch_size, opt.batch_size // 2):  # integer division here too
            r.append(np.concatenate(change[i:i + 5], axis=1))
            r.append(np.concatenate(change_s[i:i + 5], axis=1))
            r.append(np.concatenate(target[i:i + 5], axis=1))
            r.append(np.concatenate(source[i:i + 5], axis=1))

@felixrosberg thank you very much for taking the time to resolve this issue! I would like to ask two last questions and then close this topic as solved.

    • Which of the two versions above is correct, given that both work? Do any other values in the code tied to the batch size of 10 need changing for training to be correct?
      For example, in tf.summary.image(category + 'samples', np.expand_dims(c1, axis=0), step=iteration, max_outputs=10)
      or eval_dataset = iter(get_tf_dataset(opt.eval_dir, opt.image_size, 10, repeat=True))
    • Please write the command for training the FaceDancer_config_c_HQ.h5 model!

Hey, no problem mate!

  1. I actually have to run a sanity check and test this to make sure which is correct. The log_image function is supposed to stitch the results into a single image grid. The dimensions should be related to the batch size.
  2. The command that should match the hyperparameters and architecture of FaceDancer_config_c_HQ.h5 is:
    ```shell
    python train.py --mapping_size=256 --data_dir="C:/path/to/tfrecords/train/vgg_ls3dx4_train_*-of-*.records" --eval_dir="C:/path/to/tfrecords/validation/vgg_ls3dx4_validation_*-of-*.records"
    ```

Of course, you have to change the data_dir and eval_dir paths to your own. The rest of the parameters should keep their default values. This command also assumes the names and locations of ArcFace and the ExpressionEmbedder are as described in the README.