Query about Training Time
yashkant opened this issue · 19 comments
Hi,
Thanks for the code. I was trying to run training on the CelebA dataset with 4 RTX 6000 GPUs, and I see the log below:
[Experiment: celebAOutputDir] [GPU: 0,1,2,3] [Epoch: 0/3000] [D loss: 341.0094299316406] [G loss: 342.2994079589844] [Step: 10] [Alpha: 0.00] [Img Size: 64] [Batch Size: 54] [TopK: 9] [Scale: 256.0]
Total progress: 0%| | 1/3000 [02:28<123:51:11, 148.67s/it]
Progress to next stage: 0%| | 14/200000 [03:01<620:16:24, 11.17s/it]
I notice that the training time is very high; could you confirm whether training took a similar amount of time on your end?
Appreciate your help!
-- Yash
same
Same
same
Same
same
Hi!
Maybe we can try using this config (I used a 2080Ti and it didn't go OOM):
0: {'batch_size': 40, 'num_steps': 12, 'img_size': 32, 'batch_split': 2, 'gen_lr': 6e-5, 'disc_lr': 2e-4},
int(10000): {'batch_size': 20, 'num_steps': 12, 'img_size': 64, 'batch_split': 4, 'gen_lr': 3e-5, 'disc_lr': 1e-4},
int(50000): {'batch_size': 8, 'num_steps': 12, 'img_size': 128, 'batch_split': 8, 'gen_lr': 1e-5, 'disc_lr': 5e-5},
int(200000): {},
for CelebA.
I don't think we have to go through all 200,000 training steps, which is too time-consuming. When I trained the model for only about 1 hour, it could already generate reasonably good images.
PS: This config means we train for 10,000 steps, 40,000 steps, and 150,000 steps at 32x32, 64x64, and 128x128 respectively. With this config it takes about 9 hours to finish the first two stages (32x32 and 64x64) on a 2080Ti.
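For reference, here is a minimal sketch (my own illustration, not the repo's exact helper functions) of how the integer keys in this config are interpreted: each key is the global step at which that stage's settings take effect, and the empty dict at 200000 marks the end of training.

def stage_metadata(curriculum, step):
    # Integer keys are stage boundaries; use the settings of the latest boundary <= step.
    boundaries = sorted(k for k in curriculum if isinstance(k, int))
    current = max(b for b in boundaries if b <= step)
    return curriculum[current]

celeba = {
    0: {'batch_size': 40, 'num_steps': 12, 'img_size': 32, 'batch_split': 2, 'gen_lr': 6e-5, 'disc_lr': 2e-4},
    10000: {'batch_size': 20, 'num_steps': 12, 'img_size': 64, 'batch_split': 4, 'gen_lr': 3e-5, 'disc_lr': 1e-4},
    50000: {'batch_size': 8, 'num_steps': 12, 'img_size': 128, 'batch_split': 8, 'gen_lr': 1e-5, 'disc_lr': 5e-5},
    200000: {},  # no settings left -> the 128x128 stage ends here
}

print(stage_metadata(celeba, 12000)['img_size'])  # 64: steps 10000-49999 run at 64x64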
@silence-tang You said that a model trained for 1 hour could generate reasonably good images, but I'm curious what interval you set for saving model weights. It takes me at least 3 hours to train one step your way, and there are 10,000 steps in total.
I didn't change the interval for saving model weights. I used a single 2080Ti to train the model from scratch, and it took about 1.2 seconds to process one step, so finishing 3000 steps didn't take much time. When the 3000 steps were done, the generated images already looked reasonably good (in my opinion).
@silence-tang I'm glad you answered me. Do you mean the 3000 steps correspond to the 3000 epochs in the source code? The image shows how I'm training. How can I change it to train faster like you did? Thanks very much.
@silence-tang Since each of my steps has 3000 epochs to train, does the 1.2 seconds per step you mentioned refer to training one epoch? Thank you for the explanation; it is really important to me!
Hi, here's the thing: the 3000 epochs written in the author's source code are just a nominal number; in actual training you don't need to run all 3000 epochs. Start training with the curriculums configuration and stop once the FID no longer decreases.
The 1.2 seconds per step I mentioned means processing one batch, not one epoch. Note: one epoch means a full pass over the entire training set (200,000 face images).
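(A rough back-of-the-envelope check using the numbers above: at the first-stage batch_size of 40, one epoch over ~200,000 images is about 5,000 batches, i.e. roughly 100 minutes at 1.2 s per batch, so one step (one batch) and one epoch (one pass over the dataset) differ by a factor of several thousand.)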
@silence-tang I'll train following this approach. Thank you very much for your guidance!
Hello, I have also been studying this project recently. I ran into trouble using the pretrained model; may I ask you about it (in case you ran into it as well)? When loading ema.pth / ema2.pth from the pretrained model, I get the following error:
Traceback (most recent call last):
File "D:\Projects\pi-GAN_Reappearance\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
fn(i, *args)
File "D:\Projects\pi-GAN_Reappearance\train.py", line 89, in train
ema.load_state_dict(torch.load(os.path.join(opt.load_dir, 'ema.pth'), map_location=device))
File "D:\Projects\pi-GAN_Reappearance\venv\lib\site-packages\torch_ema\ema.py", line 257, in load_state_dict
self.decay = state_dict["decay"]
TypeError: 'ExponentialMovingAverage' object is not subscriptable
It seems to mean that the EMA object cannot be indexed like a dict. I'm not sure where this comes from; could it be that the torch-ema function for loading the EMA is being used incorrectly?
Thanks for your help!
Hello!
- Check whether you are using the latest version of torch_ema; try
pip install -U git+https://github.com/fadel/pytorch_ema
to install the latest GitHub version.
- Loading ema.pth roughly works like this:
ema = ExponentialMovingAverage(generator.parameters(), decay=0.999)
ema.load_state_dict(torch.load(os.path.join(opt.load_dir, "ema.pth"), map_location=device))
Following these steps should work.
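In case it helps, here is a minimal, self-contained sketch of that loading flow plus copying the averaged weights into the generator for inference. It assumes the torch_ema package (https://github.com/fadel/pytorch_ema); load_dir and the tiny stand-in module are placeholders, and in the repo you would build the pi-GAN generator exactly as in train.py.

import os
import torch
import torch.nn as nn
from torch_ema import ExponentialMovingAverage

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
load_dir = 'path/to/checkpoints'          # placeholder checkpoint directory

generator = nn.Linear(4, 4).to(device)    # stand-in; use the real pi-GAN generator here

# Same decay as in the training script, then restore the saved EMA state.
ema = ExponentialMovingAverage(generator.parameters(), decay=0.999)
ema.load_state_dict(torch.load(os.path.join(load_dir, 'ema.pth'), map_location=device))

# Copy the averaged weights into the generator before rendering / evaluation.
ema.copy_to(generator.parameters())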
Hello, I changed the number of epochs from 3000 to 100 and trained with your parameters, but the program crashed at epoch 12, and those 12 epochs took nearly 48 hours on an RTX 4090. Did you run into this problem, and how should I solve it? Looking forward to your reply, thank you!
/train.py", line 192, in train
if dataloader.batch_size != metadata['batch_size']: break
KeyError: 'batch_size'
Hello. This is expected, because all 200,000 steps have already been trained (you can see the progress bar at 150000/150000, which means the third stage's 150,000 steps have finished). In practice, you can check the FID of the outputs every so often and stop training once the FID no longer decreases.
Also, this configuration of mine was set up for a 2080Ti (11 GB of VRAM); a 4090 should have 24 GB, so you can raise batch_size accordingly to reduce training time. pi-GAN is already a fairly old 3D-aware image synthesis baseline; you could also try some of the more recent papers, which offer better speed and generation quality.
Thank you so much, understood!!! I'd also like to ask: for networks like pi-GAN that generate novel views from a single image, the training dataset doesn't need pose supervision, but the dataset's pose distribution has to be known; for example, the pi-GAN paper says CelebA follows a Gaussian distribution. How can I obtain a dataset's pose distribution, and if I want to train pi-GAN on my own dataset, what are the requirements for the dataset? Looking forward to your reply, thank you!
- The dataset's actual camera pose distribution is not used during training. I recommend reading the sample_camera_positions() function in volumetric_rendering.py in the source code: it simply samples camera poses (e.g. theta, phi) from a predefined Gaussian distribution, which are then used for the subsequent coordinate transformations and so on. If you don't care about the implementation details, you can safely ignore this part; see the sketch after this list.
- As for "training pi-GAN on your own dataset", there shouldn't be any special requirements; swapping CelebA for FFHQ, for instance, should also work.
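For intuition, here is a minimal illustration (my own sketch, not the repo's exact code) of what sample_camera_positions() does: draw the horizontal and vertical angles from a predefined Gaussian and place the camera on a sphere around the origin. The means and standard deviations below are placeholder values in the spirit of a face curriculum, not values taken from the repo.

import math
import torch

def sample_camera_positions(n, r=1.0, h_mean=math.pi / 2, v_mean=math.pi / 2,
                            h_stddev=0.3, v_stddev=0.15, device='cpu'):
    theta = torch.randn(n, 1, device=device) * h_stddev + h_mean   # horizontal (yaw) angle
    phi = torch.randn(n, 1, device=device) * v_stddev + v_mean     # vertical (pitch) angle
    phi = torch.clamp(phi, 1e-5, math.pi - 1e-5)                   # keep the camera off the poles

    # Spherical -> Cartesian (y-up): camera position on a sphere of radius r
    x = r * torch.sin(phi) * torch.cos(theta)
    y = r * torch.cos(phi)
    z = r * torch.sin(phi) * torch.sin(theta)
    return torch.cat([x, y, z], dim=-1), theta, phi

positions, theta, phi = sample_camera_positions(4)
print(positions.shape)  # torch.Size([4, 3])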
@silence-tang Understood! Thank you very much for your reply!!!