hukkelas/DeepPrivacy

Train model from scratch with different facial keypoints

ravi97 opened this issue · 10 comments

Thanks for your code! I am trying to train a model from scratch using my own facial keypoint detection network (rather than using the standard pre-trained mask RCNN provided in this repository). My keypoint network provides 63 facial points as compared to the 7 points used by your code. Is it possible to train the network with 63 keypoints? Where should I make the changes to your code? Thanks in advance!

Should be no problem!

You need to change the following:

  1. The config needs to specify the new landmarks. I suggest that you inherit from the original FDF config, for example by creating a new config at configs/fdf/more_landmarks.py (also see this issue about creating a new dataset):
import os

_base_config_ = "../base.py" # Inherit all settings from the FDF dataset.

models = dict(
    pose_size=63*2 # Define number of keypoints, 63*2 for (X,Y)
)
dataset_type = "FDFDataset" # You might want to change this to a new dataset type.
data_root = os.path.join("data", "fdf")
data_train = dict(
    dataset=dict(
        type=dataset_type,
        dirpath=os.path.join(data_root, "train"),
        percentage=1.0
    ),
    transforms=[
        dict(type="RandomFlip", flip_ratio=0.5),
        dict(type="FlattenLandmark")
    ],
)
data_val = dict(
    dataset=dict(
        type=dataset_type,
        dirpath=os.path.join(data_root, "val"),
        percentage=.2
    ),
    transforms=[
        dict(type="FlattenLandmark")
    ],
)

landmarks = [ # Change this
    "Nose",
    "Left Eye",
    "Right Eye",
    "Left Ear",
    "Right Ear",
    "Left Shoulder",
    "Right Shoulder"
]
anonymizer = dict(
    detector_cfg=dict(
        type="RCNNDetector",
        simple_expand=False,
        rcnn_batch_size=8,
        face_detector_cfg=dict(
            name="RetinaNetResNet50", # You might want to implement your own detector
        )
    )
)
  2. Extend the FDFDataset (see here for the implementation) to load your new landmarks (assuming that you are using the FDF dataset). We did something similar with landmarks from RetinaNet; see the implementation here for an example, and the rough sketch after this list.
  3. Retrain the model.
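For reference, here is a rough sketch of what such a subclass could look like. The import path, method name, and file layout below are hypothetical (check the actual FDFDataset implementation linked above); it only illustrates the idea of swapping in your own 63-point landmarks:

import pathlib
import numpy as np
from deep_privacy.dataset import FDFDataset  # hypothetical import path


class MoreLandmarksDataset(FDFDataset):
    """FDF variant that loads 63 custom facial keypoints per image."""

    def load_landmarks(self):
        # Hypothetical: read a (num_images, 63, 2) array produced by your own
        # keypoint detector instead of the original 7-keypoint metadata.
        landmark_path = pathlib.Path(self.dirpath, "custom_landmarks.npy")
        assert landmark_path.is_file(), \
            f"Did not find landmark file: {landmark_path}"
        landmarks = np.load(landmark_path).astype(np.float32)  # normalized to [0, 1]
        assert landmarks.shape[1:] == (63, 2)
        return landmarks

You would then point dataset_type in the config above at this class (how the class is registered depends on how the dataset builder resolves type).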

Hope this answered your questions.

Hello! Thank you so much for your explanation! How do you retrain the model? I tried python train.py configs/fdf/more_landmarks.py but it throws the error NameError: name 'FusedAdam' is not defined. How do I resolve this?

You have to install NVIDIA Apex with CUDA and C++ extensions to enable mixed-precision training. It should be straightforward to install, and it speeds up training significantly on GPUs with tensor cores.
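Roughly, the install follows NVIDIA's apex README (check the README for the current command; the flags below are the ones documented at the time of writing):

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./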

Sorry about that. I just saw that you had already mentioned Apex in the documentation. I installed Apex and the train script seems to work now. But I am running into this issue while trying to access the dataset: AssertionError: Did not find file. Looked at: data/fdf/train/bounding_box/128.torch. As instructed here (https://github.com/hukkelas/FDF), I created a folder called train inside the data/fdf folder and ran python download.py --target_directory data/fdf --download_images as instructed in another issue. Should I change the filename in the metadata you provided?

Oh, yeah, I changed the storage format somewhat between v1 and v2.
The keypoints in the metadata are the same, but you can download the missing files here:
https://drive.google.com/file/d/1JLa8a5wzfVn3BOjNACkGhRL9kK8gTmb4/view?usp=sharing

If you're missing any more files, please tell me!

Hello! Thanks for the keypoints! I made the changes you mentioned, but I am getting the following error:

File "DeepPrivacy/deep_privacy/modeling/loss/optimizer.py", line 164, in step_D
fake_scores = np.array(real_scores).reshape(-1)
File "/home/user/.local/lib/python3.8/site-packages/torch/tensor.py", line 630, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

But real_scores is a list, not a torch tensor. I tried using real_scores.cpu() instead of real_scores, but it didn't work. Is this because of a version mismatch in a supporting library?

I took a look at it now, and it was working on my computer. However, this operation was flawed. I have updated the repository to handle logging of discriminator scores more robustly, and this should fix your problem:
See: 266eb4e
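For context: np.array() on a list of CUDA tensors implicitly calls Tensor.__array__() on each element, which raises exactly that TypeError. A minimal sketch of the kind of fix (an illustration of the idea, not the exact change in that commit) is to detach the scores and move them to the CPU before converting:

import torch

def scores_to_numpy(scores):
    # scores: list of CUDA tensors returned by the discriminator.
    # Detach from the graph and copy to host memory before going to numpy.
    return torch.cat([s.detach().cpu().reshape(-1) for s in scores]).numpy()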

Hello! Thanks for making the change! I am facing a similar error.
self.discriminator.forward_fake(**batch, with_pose=True, fake_img=fake_data[idx]) at line 147 of deep_privacy/modeling/loss/optimizer.py is returning a list instead of a tensor. I suspect this might be due to changing the batch size: I had to reduce it since my GPU could not handle a batch size of 32. I changed the batch size by modifying the config file (configs/fdf/512.py) in the following manner:
trainer = dict(
    progressive=dict(enabled=False),
    batch_size_schedule={128: 16, 256: 32},
    optimizer=dict(learning_rate=0.0015)
)

Is this the right way to change the batch size?

I was able to resolve the error by making the following change to the file deep_privacy/modeling/loss/optimizer.py:

Line 166: log[f"fake_score{i}"] = fake_score[0].mean().detach()
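I also considered a more general variant (untested, and assuming fake_score is a list with one score tensor per resolution) that logs every entry instead of only the first:

for i, score in enumerate(fake_score):
    log[f"fake_score{i}"] = score.mean().detach()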

Now the training is working, but I am getting the following output for a long time:

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 3 reducing loss scale to 1024.0

Is this because of low GPU memory? I am using a batch size of 1 (for both the GAN and the RCNN detector) and an NVIDIA RTX 2060 GPU with 6 GB of memory.

I learned that this is normal behavior with Apex. I will close the issue now. Thanks for all your help!
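For anyone else who runs into this: the messages come from Apex's dynamic loss scaler, which scales the loss up so small FP16 gradients do not underflow, and halves the scale (skipping the optimizer step) whenever an overflow is detected, so bursts of these lines early in training are expected. A rough conceptual sketch of that mechanism (not Apex's actual implementation):

import torch

loss_scale = 2 ** 15  # initial scale; starts high and backs off on overflow

def scaled_step(optimizer, loss, parameters):
    global loss_scale
    (loss * loss_scale).backward()   # scale the loss so small FP16 gradients survive
    grads = [p.grad for p in parameters if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        loss_scale /= 2              # overflow: skip this step and reduce the scale
        optimizer.zero_grad()
        return
    for g in grads:
        g /= loss_scale              # unscale gradients before the real update
    optimizer.step()
    optimizer.zero_grad()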