Jiyao06/GenPose

how to train?

fuzhao123232 opened this issue · 13 comments

Dear author, thanks for your SOTA work. For train_score.sh, with a single card (4090) and batch size 192, the loss decreased to 0.35-0.45 by epoch 10, and then from epochs 10 to 800 the training loss kept jumping back and forth between 0.35 and 0.45, so it feels like it has converged. I then evaluated and compared against the author's paper: the average 5°2cm metric was close to 10 points lower, and the values of the other metrics were also low. In other words, I reached a local optimum and the loss could not decrease further. How did the author train it?
image
my eval result:
image
eval from the author's checkpoint:
image
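For context, the 5°2cm metric counts a predicted pose as correct when the rotation error is below 5 degrees and the translation error is below 2 cm (symmetric categories are usually handled specially). A minimal sketch of the standard error computation, with function names of my own rather than the GenPose codebase's:

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    # Geodesic distance between two 3x3 rotation matrices, in degrees.
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_cm(t_pred, t_gt):
    # Euclidean distance between translation vectors, assuming inputs in metres.
    return np.linalg.norm(t_pred - t_gt) * 100.0

def is_correct_5deg_2cm(R_pred, t_pred, R_gt, t_gt):
    # A pose counts toward 5°2cm only if both thresholds are satisfied.
    return (rotation_error_deg(R_pred, R_gt) < 5.0
            and translation_error_cm(t_pred, t_gt) < 2.0)
```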

I tried to train the score network using 4 cards (A10 × 4) with batch size 192 × 4. At epoch 172, the loss is as follows:
image
At epoch 327:
image
It seems to have converged, but the loss is larger than with a single card (0.35).
I really don't know how to train the score network to reproduce the author's results.

The difference between my code and the author's code is the dataloader:
to speed up CPU IO, I pre-save the point clouds and load them in __getitem__.

image
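The screenshot is hard to read in the thread, so here is only a minimal sketch of the kind of change described (point clouds pre-saved offline as .npy files and read back in `__getitem__`); the class name, paths, and dictionary keys are hypothetical, not the repo's actual code:

```python
import numpy as np
from torch.utils.data import Dataset

class CachedPointCloudDataset(Dataset):
    """Loads point clouds that were pre-generated offline and cached as .npy files."""

    def __init__(self, npy_paths, labels):
        self.npy_paths = npy_paths  # list of paths to cached (N, 3) point clouds
        self.labels = labels        # matching list of (rotation, translation) labels

    def __len__(self):
        return len(self.npy_paths)

    def __getitem__(self, idx):
        # Reading a cached array is much cheaper than re-projecting the depth map
        # and masking the object on every access, which is the point of the cache.
        pts = np.load(self.npy_paths[idx]).astype(np.float32)
        rotation, translation = self.labels[idx]
        return {'pts': pts, 'gt_rotation': rotation, 'gt_translation': translation}
```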

The point cloud generation script is as follows:
image
image
image
image
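The generation script itself is only visible in the screenshots, so the following is just a generic sketch of how a per-object point cloud is typically built from a depth map, an instance mask, and the camera intrinsics (variable names and the depth scale are assumptions):

```python
import numpy as np

def depth_to_pointcloud(depth, mask, K, depth_scale=1000.0, num_points=1024):
    # depth: (H, W) depth map in millimetres; mask: (H, W) boolean object mask;
    # K: 3x3 camera intrinsics. Returns a (num_points, 3) cloud in metres.
    ys, xs = np.nonzero(mask & (depth > 0))
    z = depth[ys, xs] / depth_scale
    x = (xs - K[0, 2]) * z / K[0, 0]
    y = (ys - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Resample to a fixed size so samples can be stacked into a batch.
    idx = np.random.choice(len(pts), num_points, replace=len(pts) < num_points)
    return pts[idx].astype(np.float32)
```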

In fact, I have visualized the results to check, and I found that the generated point clouds and the RGB images correspond one-to-one.
image
image
Therefore, I think there is no problem with my data processing, but the training is still difficult.

image
May I ask whether the results in the author's paper used a teacher model? And roughly what value did the final training loss drop to?

  1. To determine if the training process has converged, you can assess this by visualizing the training curve.
  2. Regarding your question about the higher loss when training with multiple GPUs, this could be due to the increased batch size without appropriately adjusting the learning rate (see the sketch after this list).
  3. We did not use a teacher model during our training process.
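Regarding point 2, a common heuristic when enlarging the batch size is the linear scaling rule: scale the learning rate by the same factor as the batch size. The base values below are purely illustrative, not GenPose's actual config:

```python
# Linear learning-rate scaling sketch: if the single-card run used base_lr at
# base_batch, a 4-card run with 4x the batch size would use roughly 4x the lr.
base_lr, base_batch = 1e-4, 192          # illustrative values only
num_gpus = 4
batch_size = base_batch * num_gpus       # 768
scaled_lr = base_lr * (batch_size / base_batch)  # -> 4e-4 in this example
```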

Hello!
I wonder if you succeeded in reproducing the results of the paper.

[image]
These are the results after training for 320 epochs. Maybe when the epoch count reaches 1900 the results can be reproduced. I need more time.
image
Hello, friend. There was a problem with my earlier data processing, so I was training with only part of the Real data and could not reproduce the results. I have now fixed this problem and have already trained for 300+ epochs. The image shows my evaluation results and the loss is still dropping, so I think that if I keep training to 1900 epochs I still have a chance of reproducing the author's results.


Thanks!

Eval at epoch 1664:
image
Compared with the paper:
image

I think the training may need more than 2000 epochs, it really .....
Why is the convergence so slow? It makes me want to cry.

Did you also train the EnergyNet? Or did you use the author's pretrained checkpoint?
I'm also suffering from slow convergence... Moreover, I trained for 1032 epochs and the evaluation score is much lower than yours, @fuzhao123232.

image

I used the author's pretrained checkpoint for the EnergyNet and my own ScoreNet weights. I think the training results might be related to the random initial weights. And I think you need to check whether your dataset is really OK.
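If you suspect the outcome depends on the random initial weights, fixing the seeds at least makes runs comparable; a minimal PyTorch sketch, not taken from the repo:

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    # Fix all relevant RNGs so weight initialization and shuffling are repeatable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```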

Thank you! I think I have to check the dataloader.
If it's fine, could you show me the loss curve recorded in TensorBoard?

@fuzhao123232 It seems that my dataloader is fine.

Also, in the comments above, when you trained for 10 epochs, the loss dropped to 0.45. Did it always stay in the same range whether you used a single GPU or multiple GPUs? In my case, when I trained for 100 epochs with a single GPU (or 2 GPUs), the loss was 0.6-0.7. Did you make any changes to the default config settings?
(I'm referring to your code for pre-processing and loading the pcd data as npy files. When I visualized the dataloader, it seemed to be working fine.)
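For reference, one quick way to sanity-check a point-cloud dataloader beyond a single sample is to scatter-plot a few clouds from an actual batch; a minimal sketch assuming the batch is a dict with a 'pts' tensor of shape (B, N, 3), which may differ from the real loader:

```python
import matplotlib.pyplot as plt

def show_batch_clouds(batch, n=4):
    # Scatter-plot the first n point clouds of a batch for a quick visual check.
    fig = plt.figure(figsize=(3 * n, 3))
    for i in range(n):
        pts = batch['pts'][i].cpu().numpy()  # assumed key; adjust to your loader
        ax = fig.add_subplot(1, n, i + 1, projection='3d')
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=1)
    plt.show()
```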

(24.03.14) There was a problem with the ground-truth label pkl. I think this will be solved once I handle that issue.
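For anyone debugging similar label problems, a quick way to inspect a ground-truth pkl is to load it and print its structure; the path below is a placeholder and the dict layout is an assumption:

```python
import pickle

# Placeholder path; point this at the label file you want to inspect.
with open('path/to/ground_truth_labels.pkl', 'rb') as f:
    gts = pickle.load(f)

# Assuming the pkl stores a dict, print each field's type and shape (if any).
for key, value in gts.items():
    print(key, type(value).__name__, getattr(value, 'shape', None))
```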