Training on Google colab

Question

Closed this issue 5 years ago · 18 comments

I am new to machine learning.
I am trying to train the model on Google Colab.
A Google Colab session is only available for 12 hours at a time.
How do I resume training after that?
For example, if the model was saved at 8000 iterations, how do I resume training from that point onwards?

Answer 1 · 2019-05-15T07:09:51.000Z

For example, if you want to resume from the checkpoint at 8000 iterations, you can run:
python ELEGANT.py -a Bangs -m train -g 0 -r 8000
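
A note for Colab users: the runtime's local disk is wiped when the session ends, so the checkpoints need to be copied somewhere persistent (for example a mounted Google Drive folder) before the iteration number for -r can be looked up again. Below is a minimal sketch for finding the latest saved iteration. It assumes checkpoint file names end in `_<iteration>.pth`; that naming pattern and the helper names are assumptions for illustration, not taken from ELEGANT.py.

```python
import os
import re

# Assumed (hypothetical) naming scheme: "<anything>_<iteration>.pth",
# e.g. "Enc_iter_008000.pth". Adjust the pattern to the real file names.
CKPT_PATTERN = re.compile(r"_(\d+)\.pth$")


def latest_iteration(filenames):
    """Return the highest iteration number found in the names, or None."""
    iterations = []
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match:
            iterations.append(int(match.group(1)))
    return max(iterations, default=None)


def latest_iteration_in_dir(ckpt_dir):
    """Same as latest_iteration, but scans a checkpoint directory."""
    return latest_iteration(os.listdir(ckpt_dir))
```

The returned number can then be passed as the -r value in the command above.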

Answer 2 · 2019-05-15T08:39:44.000Z

Thank you so much for your quick reply. I will try it.

Answer 3 · 2019-05-15T17:09:49.000Z

After the 100th iteration I got the following error:
File "ELEGANT.py", line 261, in save_scalar_log
'loss_D': self.loss_D.data.cpu().numpy()[0]
IndexError: too many indices for array

After searching on Google, I replaced the above code with
'loss_D': self.loss_D.data.cpu().item()
'loss_G': self.loss_G.data.cpu().item()

and
scalar_info['G_loss/' + key] = value.item()
scalar_info['D_loss/' + key] = value.item()

After this, I was able to train the model up to 10,000 iterations.

After that, when I tried to resume training from 10,000 iterations, it only printed the message
Finished training.

But max_iter in dataset.py is 200000, so what is wrong?

I am not using all the images in the CelebA dataset.
I am using the first 10,000 images, and I changed the attribute and landmark files accordingly.
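
For anyone else training on a subset, here is a minimal sketch of cutting an annotation file down to its first N images. It assumes the standard CelebA layout (line 1 is the image count, line 2 the column names, then one row per image). The function names are hypothetical, and whether dataset.py actually reads the count line is not confirmed here.

```python
def keep_first_n(lines, n):
    """Keep the header and the first n image rows of a CelebA-style file.

    `lines` is the whole file as a list of strings without newlines:
    lines[0] is the image count, lines[1] the column names.
    """
    if len(lines) < 2:
        raise ValueError("expected a count line and a header line")
    header = lines[1]
    rows = lines[2:2 + n]
    # Rewrite the count so it matches the number of rows actually kept.
    return [str(len(rows)), header] + rows


def truncate_annotation_file(src_path, dst_path, n):
    """Read src_path, keep the first n rows, write the result to dst_path."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dst_path, "w") as dst:
        dst.write("\n".join(keep_first_n(lines, n)) + "\n")
```

The same helper would apply to both the attribute file and the landmark file, as long as both follow that layout.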

Answer 4 · 2019-05-15T17:15:39.000Z

What command did you type?

Answer 5 · 2019-05-16T05:59:58.000Z

I used the following command, as you suggested:
!python ELEGANT.py -m train -a bangs Smiling -g 0 -r 10000

Answer 6 · 2019-05-18T08:29:18.000Z

Hello,
You mentioned that the image size should be 409*687, but the images in img_align_celeba are 178*218.
So I downloaded img_align.7z. I tried to manually unzip img_celeba.7z, but with no luck: the file is corrupted.
I tried to open it with the 7z tool, but again I get the message that it can't open the archive.

Can you provide another link to download the CelebA dataset?

Answer 7 · 2019-05-19T01:37:39.000Z

The cropped and aligned images are generated by running preprocess.py. You should download the raw images and preprocess all images using that script.

Answer 8 · 2019-05-19T03:50:19.000Z

The raw image folder is corrupted.
Is there another way?
Can I use the already cropped and aligned images of size 178*218?
In that case preprocessing is not required, right?

Answer 9 · 2019-05-19T04:38:55.000Z

No. You have to process raw images using that file.

Answer 10 · 2019-05-19T04:46:39.000Z

OK, thank you.

Answer 11 · 2019-05-29T05:14:54.000Z

Hello,
As you said, I used the raw images and preprocessed them, but training is again stuck at 10000 iterations.
When I tried to resume training from 10000 iterations, this message was printed:
Finished training.

I used the following command, as you said:
python ELEGANT.py -a Bangs -m train -g 0 -r 10000

Answer 12 · 2019-05-30T07:23:24.000Z

The problem was due to some memory issue.
I resolved it.

Answer 13 · 2019-05-30T07:25:06.000Z

I just want to ask you:
When testing the model for the first time, there is no restore point.
So I should leave the -r argument as None, right?

Answer 14 · 2019-05-30T18:00:39.000Z

Testing the model requires restoring a checkpoint, since you have already trained your model.

Answer 15 · 2019-05-31T12:28:16.000Z

OK, I got it.
Thank you very much.

Answer 16 · 2019-06-01T09:14:26.000Z

I trained the model on a single attribute, Smiling.
When testing the model, I get the following warning:

UserWarning:
volatile was removed and now has no effect.
Use 'with torch.no_grad():' instead.
var = torch.autograd.Variable(tensor, volatile=volatile)

So I replaced the volatile flag with
with torch.no_grad():
    var = torch.autograd.Variable(tensor)

Then I get a ValueError:

self.B, self.A = self.tensor2var(self.transform(Image.open(self.args.input), Image.open(self.args.target[0])))
ValueError: not enough values to unpack (expected 2, got 0)
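
For readers who hit the same message: "expected 2, got 0" means the right-hand side of the assignment produced an empty sequence, so the place to look is whatever the call returned rather than the unpacking itself. A small framework-free illustration follows; the helper name `try_unpack_pair` is hypothetical and unrelated to ELEGANT.py.

```python
def try_unpack_pair(values):
    """Try to unpack `values` into two names and report what happened."""
    try:
        first, second = values
    except ValueError as err:
        # Raised when `values` holds fewer or more than two items.
        return str(err)
    return "ok"


# An empty result reproduces the message from the traceback.
empty_result = try_unpack_pair([])
# A two-item result unpacks cleanly.
pair_result = try_unpack_pair(["input_image", "target_image"])
```

Printing or checking the length of the value returned by the call before unpacking it is a quick way to confirm this.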

Answer 17 · 2019-06-05T10:11:16.000Z

Hello,

I have resolved that problem.

I changed the code like this:
def tensor2var(self, tensors, requires_grad=True):
....
....
with torch.no_grad():
    var = torch.autograd.Variable(tensor)

And where this function is called, I replaced volatile with requires_grad=True.

Also, since I am testing this model in the cloud, I wasn't using the -g argument, and that is why it was throwing the ValueError.

So I used the -g argument.

Answer 18 · 2019-06-05T10:12:50.000Z

I want to ask you: can we deploy this model in an Android app?

If yes, can you throw some light on it?