akashsengupta1997/STRAPS-3DHumanShapePose

Couldn't restart training

Closed this issue · 15 comments

Hi @akashsengupta1997,
I tried training from scratch on the male body model, but training is very slow, so I only got up to epoch 30 in 2 days over the weekend.
But now I'm getting an error while resuming training from the current epoch, i.e. 30.
```
python run_train.py

Device: cuda:0
GPU: 0

ResNet in channels: 18
ResNet layers: 18
IEF Num iters: 3

Batch size: 70
LR: 0.0001
Image width/height: 256

Losses on: ['verts', 'shape_params', 'pose_params', 'joints2D', 'joints3D']
Loss weights: {'verts': 1.0, 'joints2D': 0.1, 'pose_params': 0.1, 'shape_params': 0.1, 'joints3D': 1.0}

Metrics: ['pves', 'pves_sc', 'pves_pa', 'pve-ts', 'pve-ts_sc', 'mpjpes', 'mpjpes_sc', 'mpjpes_pa', 'shape_mses', 'pose_mses', 'joints2D_l2es']
Save val metrics: ['pves_pa', 'mpjpes_pa']

Train path: data/amass_up3d_3dpw_train.npz
Val path: data/up3d_3dpw_val.npz
Model save path: ./checkpoints/model_training/straps_model_checkpoint_exp001
Log save path: ./logs/straps_model_logs_exp001.pkl
Training examples found: 347962
Validation examples found: 11836

Regressor model Loaded. 11909789 trainable parameters.
WARNING: You are using a SMPL model, with only 10 shape coefficients.

SMPL augment params:
{'augment_shape': True, 'delta_betas_distribution': 'normal', 'delta_betas_std_vector': tensor([1.5000, 1.5000, 1.5000, 1.5000, 1.5000, 1.5000, 1.5000, 1.5000, 1.5000,
1.5000], device='cuda:0'), 'delta_betas_range': [-3.0, 3.0]}
Cam augment params:
{'xy_std': 0.05, 'delta_z_range': [-5, 5]}
Crop input: True
BBox augment params
{'crop_input': True, 'mean_scale_factor': 1.2, 'delta_scale_range': [-0.2, 0.2], 'delta_centre_range': [-5, 5]}
Proxy rep augment params
{'remove_appendages': True, 'deviate_joints2D': True, 'deviate_verts2D': True, 'occlude_seg': True, 'remove_appendages_classes': [1, 2, 3, 4, 5, 6], 'remove_appendages_probabilities': [0.1, 0.1, 0.1, 0.1, 0.05, 0.05], 'delta_j2d_dev_range': [-8, 8], 'delta_j2d_hip_dev_range': [-8, 8], 'delta_verts2d_dev_range': [-0.01, 0.01], 'occlude_probability': 0.5, 'occlude_box_dim': 48}
Resuming from: ./checkpoints/model_training/straps_model_checkpoint_exp001_epoch30.tar

Training information loaded from checkpoint.
Current epoch: 31
Best epoch val metrics from last training run: {'pves_pa': 0.045244997332339984, 'mpjpes_pa': 0.03596175746229043} - achieved in epoch: 28
Traceback (most recent call last):
  File "run_train.py", line 234, in <module>
    pin_memory=pin_memory)
  File "/home/ujjawal/my_work/Fashion/3d-body-measurements/STRAPS-3DHumanShapePose-master/train/train_synthetic_otf_rendering.py", line 93, in train_synthetic_otf_rendering
    current_epoch=current_epoch)
  File "/home/ujjawal/my_work/Fashion/3d-body-measurements/STRAPS-3DHumanShapePose-master/metrics/train_loss_and_metrics_tracker.py", line 41, in __init__
    self.history = self.load_history(log_path, current_epoch)
  File "/home/ujjawal/my_work/Fashion/3d-body-measurements/STRAPS-3DHumanShapePose-master/metrics/train_loss_and_metrics_tracker.py", line 87, in load_history
    str(current_epoch))
AssertionError: 0 elements in train_pve-ts_pa list when current epoch is 31
```

Any idea how to solve this error?
Also, is there anything we can do to speed up training, other than upgrading the PC configuration?

Hi Ujjwal,

  1. Training is slow mostly because we are using the neural mesh renderer to render synthetic inputs on-the-fly during training. I've found that using pytorch3d instead can speed it up a decent amount (a sketch follows after this list).

  2. Hmm, that is a weird error. For now you could just comment out the assert statements (lines 82-87) in train_loss_and_metrics_tracker.py (see the first sketch just below).
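If you'd rather not delete the checks outright, you could downgrade them to warnings. This is only a rough sketch; the loop and variable names are guessed from the assertion message, not taken from the actual contents of `load_history`, so adapt it to the surrounding code:

```python
# Hypothetical sketch: downgrade the per-metric length asserts in
# load_history to warnings. 'history' is assumed to be a dict mapping
# metric names to per-epoch lists, based on the assertion message.
for metric, values in history.items():
    if len(values) != current_epoch - 1:
        print('WARNING: {} elements in {} list when current epoch is {}'.format(
            len(values), metric, current_epoch))
```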
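Re pytorch3d: a minimal silhouette renderer along the lines of the official pytorch3d tutorials could stand in for the neural mesh renderer. This is a sketch only; the camera and rasterisation settings below are tutorial defaults, not the values this repo uses, and an ico-sphere stands in for the SMPL mesh:

```python
import numpy as np
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.renderer import (
    FoVPerspectiveCameras, look_at_view_transform, RasterizationSettings,
    MeshRasterizer, MeshRenderer, SoftSilhouetteShader, BlendParams,
)

device = torch.device('cuda')

# Camera placed 2.7 units from the origin, looking at the mesh.
R, T = look_at_view_transform(dist=2.7, elev=0, azim=0, device=device)
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)

# Soft rasterisation settings from the pytorch3d silhouette tutorial.
blend_params = BlendParams(sigma=1e-4, gamma=1e-4)
raster_settings = RasterizationSettings(
    image_size=256,
    blur_radius=np.log(1.0 / 1e-4 - 1.0) * blend_params.sigma,
    faces_per_pixel=50,
)

silhouette_renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(blend_params=blend_params),
)

mesh = ico_sphere(level=3, device=device)       # stand-in for an SMPL mesh
silhouette = silhouette_renderer(mesh)[..., 3]  # alpha channel = soft silhouette
```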

Hi @akashsengupta1997,
Thanks for clarifying the issue. Training resumed successfully.
Your default number of epochs here is 1000.
Did you get accurate results at 1000 epochs, or at a lower number of epochs?
I have trained the MALE model up to epoch 130 for now, but the result is not very accurate. It's not correctly fitting the person in the image.
I'm sharing some reference images for that...
[Attached example renders: rend_20210204_150425_20210204_150720751, rend_20210204_150431_20210204_150720970, rend_20210204_150451_20210204_150721174, rend_20210204_150457_20210204_150721395]

I'll check again after training it more.
Any suggestions from your side are welcome.

Hmmm, that is odd; it should be working OK-ish after 130 epochs... I need to add testing code that tightly crops bounding boxes around the person in the image; maybe that is causing the problems (as was the case in some of the other GitHub issues).

Yup, it's odd. I'll check it with cropped images.
How can I get the actual body height in the real world? When I measure the height of the generated body, it doesn't match the person's realistic height.
Are there any transformations I should use to convert the camera parameters?
Is there any need for an smpl_mean_male_parameters.npz for the male body, like the smpl_mean_neutral_parameters.npz you used for the neutral body?
I'm actually using this for body measurements, so I need your suggestions.
Thanks.

Oh also, are you using the male model for both training and testing? (Just double-checking that you aren't using the neutral model for testing.)

You can't predict absolute measurements from a single image - the measurements will all be relative. At best, you can use a real known measurement to normalise the predicted measurements (e.g. normalise using known height).
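For example, a minimal sketch of normalising by a known height; the measurement values here are hypothetical, just to illustrate the scaling:

```python
# All measurements taken on the predicted mesh share one unknown scale,
# so a single known real-world measurement fixes the scale for the rest.
known_height_cm = 178.0      # known real height of the subject
predicted_height = 1.72      # height measured on the predicted mesh (model units)
predicted_waist = 0.84       # waist measured on the predicted mesh (model units)

scale = known_height_cm / predicted_height
waist_cm = predicted_waist * scale
print('Estimated waist: {:.1f} cm'.format(waist_cm))
```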

Hmm, what you could do is make your own male_smpl_mean_params_6dpose.npz from the provided file neutral_smpl_mean_params_6dpose.npz by just setting the mean shape params to 0 (this is fine since SMPL's shape parameters are 0-mean anyway). The pose mean is the same regardless of gender.
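Something along these lines should work. Rough sketch only: the 'shape' key name inside the npz is an assumption, so print the archive's keys first to confirm:

```python
import numpy as np

# Copy the provided neutral mean-params file and zero out the shape params.
data = dict(np.load('neutral_smpl_mean_params_6dpose.npz'))
print(list(data.keys()))      # confirm the actual key names first
data['shape'] = np.zeros(10)  # SMPL betas are zero-mean, so zeros are a valid mean
np.savez('male_smpl_mean_params_6dpose.npz', **data)
```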

Thank you so much @akashsengupta1997,
I'll definitely double-check it. I have changed the SMPL model path to male in run_train.py (training) and predict_3D.py (testing). And yes, I'm doing the same thing you suggested (normalising using a known height).
I created my own male_smpl_mean_params_6dpose.npz by changing the shape params to np.zeros(10).
I'll post here after trying your suggested tips.
Thanks again.

Hey @akashsengupta1997,
Now I'm getting a comparatively nice body fit for the male body. Hopefully the same tips will work for the female body as well.
Can we also use the distance between the camera and the person, so that we can get more accurate results?
Once again thanks for your kind guidance.

Glad to hear it, good luck!

Re camera: yes, you can try to inject ground-truth camera parameters into the regressor by setting the initial camera estimate equal to your ground truth. However, note that the camera estimated by the regressor is weak-perspective, so you will have to convert your ground-truth camera parameters into weak-perspective form.
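As a rough sketch of that conversion, assume a pinhole camera with focal length f in pixels, the person at depth tz, a square input of size img_wh, and a weak-perspective camera of the form [scale, tx, ty]. This follows the convention tz = 2 * f / (img_wh * scale) used by similar SMPL regressors, which may need adapting to this repo:

```python
import numpy as np

def perspective_to_weak_perspective(translation, focal_length, img_wh):
    # translation: person translation [tx, ty, tz] in the camera frame.
    tx, ty, tz = translation
    scale = 2.0 * focal_length / (img_wh * tz)  # projection magnification at depth tz
    return np.array([scale, tx, ty])

# e.g. person 2.5 m from a camera with f = 1000 px and a 256 x 256 input:
init_cam = perspective_to_weak_perspective([0.0, 0.2, 2.5], 1000.0, 256)
```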

Thanks @akashsengupta1997,
I will try it.

Hi @ujjawalcse, how did you get through 30 epochs in 2 days? I have a GTX 1070 and my training speed is 6 hours/epoch. Is that normal, or am I doing something wrong?

Hi @Melih1996,
I think it's normal. I was using an RTX 2070 Super.

Hi @ujjawalcse, I started training on an AWS V100 but it shows 1:45 hours/epoch. Did you use pytorch3d to speed up training? If so, how?

Hi @Melih1996,
I didn't use pytorch3d. I trained with the same code, using the neural mesh renderer.
What is the GPU configuration of your AWS instance? It may not match my PC configuration.

Hi @ujjawalcse,
I started using a p3.2xlarge, which has a single Tesla V100 with 16 GB of GPU RAM. I expected to get at least 1 epoch per hour, but I couldn't. I'm also using the same code without modifying any part. By the way, I'm trying to train both genders separately and didn't modify anything related to the mean shape params. Is it OK to go with the default params, or should I reset them to np.zeros(10)?

Hi @ujjawalcse,
I would like to obtain your male_smpl_mean_params_6dpose.npz file. Could you provide it to me? My email is 2573147077@qq.com