exiawsh/StreamPETR

training errors and settings

Terencedu opened this issue · 2 comments

Hi,

Thanks for your open source work. I am really interested in it. Please allow me to ask some questions:

  1. Model size: I trained the R50_nui model on the mini dataset. Why is my trained model 452M while yours is 150M, and how can I reduce the size?
  2. Training error: Why is there no "data_time" error when epoch=6/8/12, but there is one when epoch=10? So should I set epoch=6x?
  3. Settings: I will use four 4090s (4×24GB) to train the R50_nui model on the full dataset. How should I set the batch size and learning rate: num_gpus=4, bs=2, lr=2e-4 or num_gpus=4, bs=4, lr=4e-4?
  4. Focal head: If I set use_hybrid_tokens = True in focal_head during training, it speeds up training (fewer training features), but the test FPS stays the same (since focal_head is removed at test time) while the test accuracy drops a bit?
  5. Settings: I want to use R50_nui to detect nearby objects while the ego vehicle is not moving. Are there any tips for setting memory_len, topk_proposals, num_query, and num_propagated? For example, topk_proposals=300 and memory_len=600 to fuse 2 frames?
  6. How to calculate "resize_lim" based on "final_dim"?

That's a lot of questions; thanks for your patience.

Sorry for the late response:

  1. Your checkpoint contains parameters for both the model and the optimizer, while the checkpoint I uploaded contains only the model parameters. You can load your checkpoint and print its keys() to see the details.
  2. Sorry, I haven't observed this before; I will try to reproduce it when I have time. lol
  3. Both settings are valid, but I think num_gpus=4, bs=4, lr=4e-4 will give a better result.
  4. You are right, so I recommend setting use_hybrid_tokens = False.
  5. Yes
  6. First calculate a base resize ratio, e.g. 704/1600 = 0.44, then widen it by 10%–20% in each direction, e.g. 0.44 × 0.8 = 0.352 and 0.44 × 1.2 = 0.528. But you need to ensure that 900 (the source image height) × the lower bound (0.352 in this example) > 256 (the final_dim height), because PIL will crop the images.
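The rule in point 6 can be sketched as a small helper. The function name and defaults below are illustrative (not from the repo); they assume nuScenes-sized source images of 1600×900 and a final_dim of (256, 704):

```python
def resize_lim(src_w=1600, src_h=900, final_h=256, final_w=704, margin=0.2):
    """Derive a (lower, upper) resize range from the target final_dim."""
    base = final_w / src_w                    # 704 / 1600 = 0.44
    lo, hi = base * (1 - margin), base * (1 + margin)
    # The resized image must still be tall enough for the height crop,
    # otherwise the crop window would fall outside the image.
    assert src_h * lo > final_h, "lower bound too small for the height crop"
    return (round(lo, 3), round(hi, 3))

print(resize_lim())  # -> (0.352, 0.528)
```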

Thank you for your detailed reply. I learned a lot!

  1. I saved only the 'state_dict' and the model size is now 154.8M.
  2. OK, it is not a big problem; I will train for 60 epochs.
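For anyone hitting the same size gap: the slimming step amounts to dropping everything except 'state_dict' from the checkpoint dict. A minimal sketch (the helper name is mine, and the 'optimizer'/'meta' keys are the usual mmdetection-style layout, so check your own keys() first):

```python
def strip_optimizer(ckpt):
    # Keep only the model weights. The 'optimizer' entry stores per-parameter
    # buffers (e.g. AdamW keeps two extra tensors per weight), which is why a
    # training checkpoint is roughly 3x the size of the weights alone.
    return {"state_dict": ckpt["state_dict"]}

# Typical usage with PyTorch (paths are illustrative):
#   ckpt = torch.load("latest.pth", map_location="cpu")
#   torch.save(strip_optimizer(ckpt), "latest_slim.pth")
```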