how to train?
fuzhao123232 opened this issue · 13 comments
Dear author, thanks for your SOTA work. For train_score.sh, on a single card (4090) with batch size 192, the loss dropped to 0.35-0.45 within 10 epochs, and then from epoch 10 to 800 the training loss kept oscillating between 0.35 and 0.45, so it seemed to have converged. I then evaluated and compared against the paper: my average 5°2cm metric was about 10 points off from the paper's, and the other metrics were also low. In other words, I seem to be stuck in a local optimum and cannot improve further. How did you train it?
My eval results:
Eval results from the author's checkpoint:
The only difference between my code and the author's code is the dataloader:
To speed up CPU I/O, I pre-save the point clouds that __getitem__ loads.
The point cloud generation script works as follows:
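A minimal sketch of this kind of depth-to-point-cloud caching (the paths, file names, and intrinsic values below are illustrative assumptions, not the exact script):

```python
# Hypothetical pre-processing sketch: back-project a depth map into a
# camera-frame point cloud and cache it as .npy so __getitem__ only loads the array.
import numpy as np
import cv2

def depth_to_pointcloud(depth_path, intrinsics, depth_scale=1000.0):
    """Back-project a depth image (stored in millimeters) into XYZ points in meters."""
    depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED).astype(np.float32) / depth_scale
    h, w = depth.shape
    fx, fy, cx, cy = intrinsics
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

if __name__ == "__main__":
    # Intrinsics similar to the NOCS REAL275 camera (assumed; check your dataset).
    K = (591.0125, 590.16775, 322.525, 244.11084)
    pts = depth_to_pointcloud("scene_1/0000_depth.png", K)
    np.save("scene_1/0000_pcd.npy", pts.astype(np.float32))
```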
In fact, I visualized the results to check, and I found that the generated point cloud and the RGB image correspond one-to-one (a sketch of this check is shown below).
Therefore, I think there is no problem with my data processing, but training is still difficult.
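A minimal sketch of such a correspondence check, projecting the cached points back onto the RGB image (again with assumed paths and intrinsics):

```python
# Hypothetical sanity check: reproject the cached point cloud onto the RGB image
# and confirm the projected pixels land on the object.
import numpy as np
import cv2

def overlay_pointcloud(rgb_path, pcd_path, intrinsics, out_path="overlay.png"):
    rgb = cv2.imread(rgb_path)
    pts = np.load(pcd_path)                       # (N, 3) camera-frame points, z > 0
    fx, fy, cx, cy = intrinsics
    u = (pts[:, 0] * fx / pts[:, 2] + cx).round().astype(int)
    v = (pts[:, 1] * fy / pts[:, 2] + cy).round().astype(int)
    valid = (u >= 0) & (u < rgb.shape[1]) & (v >= 0) & (v < rgb.shape[0])
    rgb[v[valid], u[valid]] = (0, 255, 0)         # mark reprojected points in green
    cv2.imwrite(out_path, rgb)

overlay_pointcloud("scene_1/0000_color.png", "scene_1/0000_pcd.npy",
                   (591.0125, 590.16775, 322.525, 244.11084))
```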
- To determine whether the training process has converged, you can check by visualizing the training curve.
- Regarding your question about the higher loss when training with multiple GPUs, this could be due to the increased effective batch size without appropriately adjusting the learning rate (a scaling sketch follows this list).
- We did not use a teacher model during our training process.
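For reference, one common heuristic (a general rule of thumb, not a setting confirmed by the authors here) is the linear scaling rule: grow the learning rate in proportion to the effective batch size relative to the batch size the base learning rate was tuned for. A minimal sketch with placeholder values:

```python
# Linear learning-rate scaling sketch (Goyal et al., 2017).
# base_lr and batch sizes below are placeholders, not values from this repo.
base_lr = 1e-3            # learning rate tuned for the single-GPU run
base_batch_size = 192     # batch size that base_lr was tuned for

num_gpus = 4
per_gpu_batch_size = 192
effective_batch_size = num_gpus * per_gpu_batch_size

scaled_lr = base_lr * effective_batch_size / base_batch_size
print(f"effective batch {effective_batch_size}, scaled lr {scaled_lr:.2e}")
```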
Hello!
I wonder if you succeeded in reproducing the results of the paper.
[image]
These are the results after training for 320 epochs. Maybe when the epoch count reaches 1900, the results can be reproduced. I need more time.
Hello, friend. There was a problem with my earlier data processing, so I was only training on part of the Real data and could not reproduce the results. I have now fixed this problem and have already trained for 300+ epochs. The figure shows my evaluation results, and the loss is still dropping, so I think that once I finish training the full 1900 epochs there is still a chance to reproduce the author's results.
Thanks!
Did you also train EnergyNet? Or did you use the author's pretrained checkpoint?
I'm also suffering from slow convergence... Moreover, I trained for 1032 epochs and the evaluation score is much lower than yours, @fuzhao123232.
I used the author's pretrained checkpoint for EnergyNet and my own ScoreNet weights. I think the training results might depend on the random initial weights. And I think you need to check whether your dataset is really OK.
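If sensitivity to random initialization is a concern, fixing the seeds is one generic way to make runs comparable; this is a standard PyTorch snippet, not something taken from this repository:

```python
# Fix the common sources of randomness so repeated runs start from the same weights.
import random
import numpy as np
import torch

def set_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```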
Thank you! I think I have to check the dataloader.
If it's fine, could you show me the loss curve recorded in TensorBoard?
@fuzhao123232 It seems that my dataloader is fine.
Also, in the comments above, you mentioned that when you trained for 10 epochs the loss dropped to 0.45. Was it always in the same range whether you used a single GPU or multiple GPUs? In my case, when I trained for 100 epochs with a single GPU (or 2 GPUs), the loss was 0.6-0.7. Did you make any changes to the default config settings?
(I'm referring to your code for pre-processing and loading the pcd data as npy files. When I visualized the dataloader, it seemed to be working fine.)
(24.03.14) There was a problem with the ground-truth label pkl. I think it will be solved once I handle this issue.