Jiyao06/GenPose

how to train?

fuzhao123232 opened this issue · 13 comments

Dear author, thanks for your SOTA work. For train_score.sh, with a single card (4090) and batch size 192, the loss decreased to 0.35-0.45 by epoch 10, and then from epochs 10 to 800 the training loss kept jumping back and forth between 0.35 and 0.45, so it feels like it has converged. I then evaluated and compared against the author's paper: the average 5°2cm metric was close to 10 points lower, and the values of the other metrics were also low. In other words, I reached a local optimum and the loss could not decrease further. How did the author train it?
image
my eval result:
image
eval from the author's checkpoint:
image
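For context, the 5°2cm metric counts a predicted pose as correct when the rotation error is below 5 degrees and the translation error is below 2 cm (symmetric categories are usually handled specially). A minimal sketch of the standard error computation, with function names of my own rather than the GenPose codebase's:

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    # Geodesic distance between two 3x3 rotation matrices, in degrees.
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_cm(t_pred, t_gt):
    # Euclidean distance between translation vectors, assuming inputs in metres.
    return np.linalg.norm(t_pred - t_gt) * 100.0

def is_correct_5deg_2cm(R_pred, t_pred, R_gt, t_gt):
    # A pose counts toward 5°2cm only if both thresholds are satisfied.
    return (rotation_error_deg(R_pred, R_gt) < 5.0
            and translation_error_cm(t_pred, t_gt) < 2.0)
```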

I tried to train the score network using 4 cards (A10 × 4) with batch size 192 × 4. At epoch 172, the loss is as follows:
image
At epoch 327:
image
It seems to have converged, but the loss is larger than with a single card (0.35).
I really don't know how to train the score network to reproduce the author's results.

The difference between my code and the author's code is the dataloader:
to speed up CPU IO, I pre-save the point clouds and load them in __getitem__.

image
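The screenshot is hard to read in the thread, so here is only a minimal sketch of the kind of change described (point clouds pre-saved offline as .npy files and read back in `__getitem__`); the class name, paths, and dictionary keys are hypothetical, not the repo's actual code:

```python
import numpy as np
from torch.utils.data import Dataset

class CachedPointCloudDataset(Dataset):
    """Loads point clouds that were pre-generated offline and cached as .npy files."""

    def __init__(self, npy_paths, labels):
        self.npy_paths = npy_paths  # list of paths to cached (N, 3) point clouds
        self.labels = labels        # matching list of (rotation, translation) labels

    def __len__(self):
        return len(self.npy_paths)

    def __getitem__(self, idx):
        # Reading a cached array is much cheaper than re-projecting the depth map
        # and masking the object on every access, which is the point of the cache.
        pts = np.load(self.npy_paths[idx]).astype(np.float32)
        rotation, translation = self.labels[idx]
        return {'pts': pts, 'gt_rotation': rotation, 'gt_translation': translation}
```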

The point cloud generation script is as follows:
image
image
image
image
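The generation script itself is only visible in the screenshots, so the following is just a generic sketch of how a per-object point cloud is typically built from a depth map, an instance mask, and the camera intrinsics (variable names and the depth scale are assumptions):

```python
import numpy as np

def depth_to_pointcloud(depth, mask, K, depth_scale=1000.0, num_points=1024):
    # depth: (H, W) depth map in millimetres; mask: (H, W) boolean object mask;
    # K: 3x3 camera intrinsics. Returns a (num_points, 3) cloud in metres.
    ys, xs = np.nonzero(mask & (depth > 0))
    z = depth[ys, xs] / depth_scale
    x = (xs - K[0, 2]) * z / K[0, 0]
    y = (ys - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Resample to a fixed size so samples can be stacked into a batch.
    idx = np.random.choice(len(pts), num_points, replace=len(pts) < num_points)
    return pts[idx].astype(np.float32)
```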

In fact, I have visualized the results to check, and I found that the generated point clouds and the RGB images correspond one-to-one.
image
image
Therefore, I think there is no problem with my data processing, but the training is still difficult.

image
May I ask whether the results in the author's paper used a teacher model? And roughly what value did the final training loss drop to?

  1. To determine if the training process has converged, you can assess this by visualizing the training curve.
  2. Regarding your question about the higher loss when training with multiple GPUs, this could be due to the increased batch size without appropriately adjusting the learning rate (see the sketch after this list).
  3. We did not use a teacher model during our training process.
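Regarding point 2, a common heuristic when enlarging the batch size is the linear scaling rule: scale the learning rate by the same factor as the batch size. The base values below are purely illustrative, not GenPose's actual config:

```python
# Linear learning-rate scaling sketch: if the single-card run used base_lr at
# base_batch, a 4-card run with 4x the batch size would use roughly 4x the lr.
base_lr, base_batch = 1e-4, 192          # illustrative values only
num_gpus = 4
batch_size = base_batch * num_gpus       # 768
scaled_lr = base_lr * (batch_size / base_batch)  # -> 4e-4 in this example
```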

Hello!
I wonder if you succeeded in reproducing the results of the paper.

[image]
These are the results after training for 320 epochs. Maybe when the epoch count reaches 1900 the results can be reproduced. I need more time.
image
Hello, friend. There was a problem with my earlier data processing, so I was training with only part of the Real data and could not reproduce the results. I have now fixed this problem and have already trained for 300+ epochs. The image shows my evaluation results and the loss is still dropping, so I think that if I keep training to 1900 epochs I still have a chance of reproducing the author's results.


Thanks!

Eval at epoch 1664:
image
Compared with the paper:
image

I think the training may need more than 2000 epochs, it really .....
Why is the convergence so slow? It makes me want to cry.

Did you also train the EnergyNet? Or did you use the author's pretrained checkpoint?
I'm also suffering from slow convergence... Moreover, I trained for 1032 epochs and the evaluation score is much lower than yours, @fuzhao123232.

image

I used the author's pretrained checkpoint for the EnergyNet and my own ScoreNet weights. I think the training results might be related to the random initial weights. And I think you need to check whether your dataset is really OK.
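If you suspect the outcome depends on the random initial weights, fixing the seeds at least makes runs comparable; a minimal PyTorch sketch, not taken from the repo:

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    # Fix all relevant RNGs so weight initialization and shuffling are repeatable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```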

Thank you! I think I have to check the dataloader.
If it's fine, could you show me the loss curve recorded in TensorBoard?

@fuzhao123232 It seems that my dataloader is fine.

Also, in the comments above, when you trained for 10 epochs, the loss dropped to 0.45. Did it always stay in the same range whether you used a single GPU or multiple GPUs? In my case, when I trained for 100 epochs with a single GPU (or 2 GPUs), the loss was 0.6-0.7. Did you make any changes to the default config settings?
(I'm referring to your code for pre-processing and loading the pcd data as npy files. When I visualized the dataloader, it seemed to be working fine.)
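For reference, one quick way to sanity-check a point-cloud dataloader beyond a single sample is to scatter-plot a few clouds from an actual batch; a minimal sketch assuming the batch is a dict with a 'pts' tensor of shape (B, N, 3), which may differ from the real loader:

```python
import matplotlib.pyplot as plt

def show_batch_clouds(batch, n=4):
    # Scatter-plot the first n point clouds of a batch for a quick visual check.
    fig = plt.figure(figsize=(3 * n, 3))
    for i in range(n):
        pts = batch['pts'][i].cpu().numpy()  # assumed key; adjust to your loader
        ax = fig.add_subplot(1, n, i + 1, projection='3d')
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=1)
    plt.show()
```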

(24.03.14) There was a problem with the ground-truth label pkl. I think this will be solved once I handle that issue.
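For anyone debugging similar label problems, a quick way to inspect a ground-truth pkl is to load it and print its structure; the path below is a placeholder and the dict layout is an assumption:

```python
import pickle

# Placeholder path; point this at the label file you want to inspect.
with open('path/to/ground_truth_labels.pkl', 'rb') as f:
    gts = pickle.load(f)

# Assuming the pkl stores a dict, print each field's type and shape (if any).
for key, value in gts.items():
    print(key, type(value).__name__, getattr(value, 'shape', None))
```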