podgorskiy/ALAE

Trying to train on Google Colab: what should I expect after "Transition ended" is thrown?

udithhaputhanthri opened this issue · 8 comments

Thank you for the great work, @podgorskiy.

This issue was posted here earlier by @ayulockin, but it was closed without an answer. I have run into the same problem when trying to run ALAE on Colab. I have no idea what to expect here: the run stops without throwing any specific error. I also tried running with a higher number of epochs, but that did not help. Could you tell me how I can solve this issue or make progress past this point?

[screenshot: Colab output after "Transition ended"]

Update:
The kernel crashes when executing line 263 (shown below) in the train() function in train_alae.py.

[screenshot: line 263 in train() in train_alae.py]
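One thing that may help pin down a silent kernel crash (my own suggestion, not something from the ALAE code) is enabling Python's faulthandler before calling train(), so a hard crash inside a native extension still prints the Python-level stack to stderr:

```python
import faulthandler

# On a fatal signal (e.g. SIGSEGV in a C/C++ extension), dump the Python
# traceback to stderr instead of dying completely silently.
faulthandler.enable()
```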

Going through the make_dataloader function in dataloader.py, it seems the issue occurs at line 129, when calling db.data_loader.

[screenshot: make_dataloader in dataloader.py, line 129]

This issue has also come up in #67.

Update:

I have gone through dareblopy/data_loader.py. It seems the issue occurs when calling worker.start() at line 95 (shown below).

[screenshot: worker.start() at line 95 of dareblopy/data_loader.py]

Update:

I tried running _worker(State()) and that led to the kernel crash. I then went through the _worker() function in dareblopy/data_loader.py. It seems the crash happens when calling next(yielder) (line 74 below).

[screenshot: _worker() in dareblopy/data_loader.py, next(yielder) at line 74]

This yielder is a `<dareblopy.TFRecordsDatasetIterator.ParsedTFRecordsDatasetIterator at 0x7fd9300f80f0>` object, and next(yielder) is supposed to return a list, as explained in the documentation.
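For reference, this is the isolation test I would try next (a sketch only; `yielder` stands for the same iterator object created in dareblopy/data_loader.py, obtained the same way the library does it):

```python
# Pull a single batch directly from the iterator, outside the worker thread.
# If the kernel dies on this line, the crash is inside DareBlopy's native code
# rather than in the Python-side worker logic.
batch = next(yielder)           # expected to return a list of decoded records
print(type(batch), len(batch))
```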

Still, I could not figure out the issue. Your help is really appreciated.

@podgorskiy

Are there any easy replacements for those classes that I could make in order to get the training working?

It would be great if you could let me know whether there is a point where I can switch to the general/standard PyTorch data-loading pipeline (or the easiest point at which to make that transition), even if it means giving up the efficiency you achieved with the DareBlopy library.

Update:

I have implemented a custom dataset object as follows.

import glob

import torch
import matplotlib.pyplot as plt
from PIL import Image, ImageOps
from torchvision import transforms

class get_dataset(torch.utils.data.Dataset):
    def __init__(self, dir_='dataset_samples/faces/realign1024x1024', channels=3):
        self.dir_ = dir_
        self.img_list = glob.glob(f'{dir_}/*.png')
        self.channels = channels
        self.img_size = 512
    def transform_func(self):
        # Resize to the current resolution and convert to a [0, 1] tensor.
        return transforms.Compose([transforms.Resize((self.img_size, self.img_size)), transforms.ToTensor()])
    def __len__(self):
        return len(self.img_list)
    def __getitem__(self, idx):
        # Read the PNG, convert to uint8, optionally collapse to grayscale, then to a tensor.
        img = (plt.imread(self.img_list[idx]) * 255).astype('uint8')
        pil_img = Image.fromarray(img)
        if self.channels == 1:
            pil_img = ImageOps.grayscale(pil_img)
        img = self.transform_func()(pil_img)
        return img
    def reset(self, lod, batch_size):
        # Called per level of detail; resolution is 2 ** lod.
        self.img_size = 2 ** lod
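For context, here is a minimal sketch of how I plug this into a standard loader (the batch size, worker count, and LOD value are arbitrary placeholders; the per-LOD reset mirrors how the original pipeline rebuilds its loader for each resolution):

```python
import torch

dataset = get_dataset(dir_='dataset_samples/faces/realign1024x1024', channels=3)

lod = 6            # placeholder level of detail -> 2 ** 6 = 64x64 images
batch_size = 16    # placeholder; pick according to GPU memory
dataset.reset(lod, batch_size)

loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                     shuffle=True, num_workers=2, drop_last=True)

for x in loader:
    print(x.shape)  # e.g. torch.Size([16, 3, 64, 64])
    break
```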

With a standard torch DataLoader, I was able to run train_alae.py on Colab with the bedroom.yaml configuration file. I am not exactly sure whether training is happening as expected. Attached is sample_35_0.jpg, taken about 10 minutes after the start of training.

[image: sample_35_0.jpg after ~10 minutes of training with bedroom.yaml]

But there are some issues with other configuration files. I will post updates on those here.

Update:

Training starts without error with bedroom.yaml and celeba.yaml.

But for mnist_fc.yaml and mnist.yaml, there were issues with the number of channels. I will try to figure this out.
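If it helps, my current guess (an assumption, not something I have verified) is that the MNIST configs expect single-channel tensors, which the get_dataset class above can already produce through its channels argument:

```python
# Hypothetical: point the dataset at grayscale PNGs and request one channel,
# so __getitem__ returns 1 x H x W tensors instead of 3 x H x W.
mnist_dataset = get_dataset(dir_='path/to/mnist_pngs', channels=1)  # path is a placeholder
```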

The celeba-hq256.yaml and ffhq.yaml configs give CUDA out-of-memory errors. I have no idea how to solve this on Colab. Please let me know if there is a way to get around it.
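For what it's worth, the first thing I would check (plain PyTorch, nothing ALAE-specific) is how much memory the GPU assigned by Colab actually has; the higher-resolution configs were presumably tuned for much larger GPUs, so the per-LOD batch sizes in the TRAIN section of those yaml files would likely need to be reduced to fit:

```python
import torch

# Report which GPU Colab assigned and how much memory it has.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024 ** 3:.1f} GiB total")
    print(f"{torch.cuda.max_memory_allocated(0) / 1024 ** 3:.1f} GiB peak allocated so far")
else:
    print("No CUDA device visible")
```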

Update:

The training does not seem to be working properly, so I went through the DareBlopy library again to check whether there are any points I missed when implementing the custom get_dataset class.

I was not able to dive into the next(dataset.iterator.record_yielder) call because I think it is implemented in C++, and I do not have much experience with C++.

Hey @udithhaputhanthri, this is exactly where I gave up. The final hurdle was debugging the C++ code, which wasn't something I had the bandwidth to get into.