podgorskiy/ALAE

Trying to train on Google Colab: what should I expect after "Transition ended" is thrown?

udithhaputhanthri opened this issue · 8 comments

Thank you for the great work, @podgorskiy.

This issue was posted here earlier by @ayulockin, but it was closed without an answer. I have run into the same problem when trying to run ALAE on Colab. I have no idea what to expect here: the run stops without throwing any specific error. I also tried running with a higher number of epochs, but that did not help. Could you tell me how I can solve this issue or make progress past this point?

[screenshot: Colab output after "Transition ended"]

Update:
The kernel crashes when executing line 263 (shown below) in the train() function in train_alae.py.

[screenshot: line 263 in train() in train_alae.py]
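One thing that may help pin down a silent kernel crash (my own suggestion, not something from the ALAE code) is enabling Python's faulthandler before calling train(), so a hard crash inside a native extension still prints the Python-level stack to stderr:

```python
import faulthandler

# On a fatal signal (e.g. SIGSEGV in a C/C++ extension), dump the Python
# traceback to stderr instead of dying completely silently.
faulthandler.enable()
```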

Going through the make_dataloader function in dataloader.py, it seems the issue occurs at line 129, when calling db.data_loader.

[screenshot: make_dataloader in dataloader.py, line 129]

This issue has also come up in #67.

Update:

I have gone through dareblopy/data_loader.py. It seems the issue occurs when calling worker.start() at line 95 (shown below).

[screenshot: worker.start() at line 95 of dareblopy/data_loader.py]

Update:

I tried running _worker(State()) and that led to the kernel crash. I then went through the _worker() function in dareblopy/data_loader.py. It seems the crash happens when calling next(yielder) (line 74 below).

[screenshot: _worker() in dareblopy/data_loader.py, next(yielder) at line 74]

This yielder is a `<dareblopy.TFRecordsDatasetIterator.ParsedTFRecordsDatasetIterator at 0x7fd9300f80f0>` object, and next(yielder) is supposed to return a list, as explained in the documentation.
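For reference, this is the isolation test I would try next (a sketch only; `yielder` stands for the same iterator object created in dareblopy/data_loader.py, obtained the same way the library does it):

```python
# Pull a single batch directly from the iterator, outside the worker thread.
# If the kernel dies on this line, the crash is inside DareBlopy's native code
# rather than in the Python-side worker logic.
batch = next(yielder)           # expected to return a list of decoded records
print(type(batch), len(batch))
```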

Still, I could not figure out the issue. Your help is really appreciated.

@podgorskiy

Are there any easy replacements for those classes that I could make in order to get the training working?

It would be great if you could let me know whether there is a point where I can switch to the general/standard PyTorch data-loading pipeline (or the easiest point at which to make that transition), even if it means giving up the efficiency you achieved with the DareBlopy library.

Update:

I have implemented a custom dataset object as follows.

import glob

import torch
import matplotlib.pyplot as plt
from PIL import Image, ImageOps
from torchvision import transforms

class get_dataset(torch.utils.data.Dataset):
    def __init__(self, dir_='dataset_samples/faces/realign1024x1024', channels=3):
        self.dir_ = dir_
        self.img_list = glob.glob(f'{dir_}/*.png')
        self.channels = channels
        self.img_size = 512
    def transform_func(self):
        # Resize to the current resolution and convert to a [0, 1] tensor.
        return transforms.Compose([transforms.Resize((self.img_size, self.img_size)), transforms.ToTensor()])
    def __len__(self):
        return len(self.img_list)
    def __getitem__(self, idx):
        # Read the PNG, convert to uint8, optionally collapse to grayscale, then to a tensor.
        img = (plt.imread(self.img_list[idx]) * 255).astype('uint8')
        pil_img = Image.fromarray(img)
        if self.channels == 1:
            pil_img = ImageOps.grayscale(pil_img)
        img = self.transform_func()(pil_img)
        return img
    def reset(self, lod, batch_size):
        # Called per level of detail; resolution is 2 ** lod.
        self.img_size = 2 ** lod
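For context, here is a minimal sketch of how I plug this into a standard loader (the batch size, worker count, and LOD value are arbitrary placeholders; the per-LOD reset mirrors how the original pipeline rebuilds its loader for each resolution):

```python
import torch

dataset = get_dataset(dir_='dataset_samples/faces/realign1024x1024', channels=3)

lod = 6            # placeholder level of detail -> 2 ** 6 = 64x64 images
batch_size = 16    # placeholder; pick according to GPU memory
dataset.reset(lod, batch_size)

loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                     shuffle=True, num_workers=2, drop_last=True)

for x in loader:
    print(x.shape)  # e.g. torch.Size([16, 3, 64, 64])
    break
```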

With a standard torch DataLoader, I was able to run train_alae.py on Colab with the bedroom.yaml configuration file. I am not exactly sure whether training is happening as expected. Attached is sample_35_0.jpg, taken about 10 minutes after the start of training.

[image: sample_35_0.jpg after ~10 minutes of training with bedroom.yaml]

But there are some issues with other configuration files. I will post updates on those here.

Update:

Training starts without error with bedroom.yaml and celeba.yaml.

But for mnist_fc.yaml and mnist.yaml, there were issues with the number of channels. I will try to figure this out.
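If it helps, my current guess (an assumption, not something I have verified) is that the MNIST configs expect single-channel tensors, which the get_dataset class above can already produce through its channels argument:

```python
# Hypothetical: point the dataset at grayscale PNGs and request one channel,
# so __getitem__ returns 1 x H x W tensors instead of 3 x H x W.
mnist_dataset = get_dataset(dir_='path/to/mnist_pngs', channels=1)  # path is a placeholder
```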

The celeba-hq256.yaml and ffhq.yaml configs give CUDA out-of-memory errors. I have no idea how to solve this on Colab. Please let me know if there is a way to get around it.
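For what it's worth, the first thing I would check (plain PyTorch, nothing ALAE-specific) is how much memory the GPU assigned by Colab actually has; the higher-resolution configs were presumably tuned for much larger GPUs, so the per-LOD batch sizes in the TRAIN section of those yaml files would likely need to be reduced to fit:

```python
import torch

# Report which GPU Colab assigned and how much memory it has.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024 ** 3:.1f} GiB total")
    print(f"{torch.cuda.max_memory_allocated(0) / 1024 ** 3:.1f} GiB peak allocated so far")
else:
    print("No CUDA device visible")
```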

Update:

The training does not seem to be working properly, so I went through the DareBlopy library again to check whether there are any points I missed when implementing the custom get_dataset class.

I was not able to dive into the next(dataset.iterator.record_yielder) call because I think it is implemented in C++, and I do not have much experience with C++.

Hey @udithhaputhanthri, this is exactly where I gave up. The final hurdle was debugging the C++ code, which wasn't something I had the bandwidth to get into.