Exact train config
Hey @ymingxie !
Thank you for sharing your work!
As described in the README.md, you divide the training into 3 phases, split at epochs 20, 25, and 50.
- Will creating 3 different config files, with the proposed changes to the variables MODEL.FUSION.FUSION_ON and MODEL.TRACKING, serve the purpose?
- The current config/train.yaml has epoch set to 499. Is this value correct?
- It would be great if you could share the detailed steps for training!
Regards,
Nitin Bansal
Hi Nitin,
Sure, I will create 3 different config files (maybe today or tomorrow). 499 is not the number of epochs I actually ran. I usually get the final checkpoint before 70 epochs.
Best,
Yiming
Sure Yiming! That would be great.
Meanwhile, I split my config into three parts, according to the phases you specified.
During phase 2 training, I get the following missing-key error. It might be due to setting GRU_FUSION to True during phase 2, which was not enabled during phase 1.
File "main.py", line 192, in train
    model.load_state_dict(state_dict['model'])
File "/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
    Missing key(s) in state_dict:
    "module.fragment_net.gru_fusion.fusion_nets.0.convz.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.0.convz.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.0.convz.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.0.convr.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.0.convr.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.0.convr.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.0.convq.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.0.convq.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.0.convq.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.1.convz.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.1.convz.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.1.convz.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.1.convr.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.1.convr.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.1.convr.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.1.convq.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.1.convq.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.1.convq.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.2.convz.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.2.convz.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.2.convz.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.2.convr.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.2.convr.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.2.convr.point_transforms.0.bias",
    "module.fragment_net.gru_fusion.fusion_nets.2.convq.net.kernel",
    "module.fragment_net.gru_fusion.fusion_nets.2.convq.point_transforms.0.weight",
    "module.fragment_net.gru_fusion.fusion_nets.2.convq.point_transforms.0.bias".
Nitin
Hi Nitin,
Check here:
https://github.com/neu-vi/PlanarRecon/blob/main/main.py#L182
"RESUME" is set to True by default, and "strict" is set to False when loading the checkpoint, so it should ignore the missing keys.
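To check which keys a non-strict load would skip, one option is to compare the key sets of the model and the checkpoint before loading. The sketch below is only an illustration, not code from the repository: the helper diff_state_dict_keys and the backbone key names are hypothetical, and plain lists stand in for model.state_dict().keys() and state_dict['model'].keys(). In PyTorch itself, load_state_dict(..., strict=False) reports the same two groups as missing_keys and unexpected_keys in its return value.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not use
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical phase 1 checkpoint keys (the backbone names are made up).
phase1_checkpoint_keys = [
    "module.backbone.weight",
    "module.backbone.bias",
]

# Phase 2 model: same keys plus one GRU fusion key taken from the traceback.
phase2_model_keys = phase1_checkpoint_keys + [
    "module.fragment_net.gru_fusion.fusion_nets.0.convz.net.kernel",
]

missing, unexpected = diff_state_dict_keys(phase2_model_keys, phase1_checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

If the missing list contains only gru_fusion keys, those layers simply start from their fresh initialization in phase 2, which is what a non-strict load is expected to do.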