Trying to train on Google Colab: what should I expect after "Transition ended" is printed?
udithhaputhanthri opened this issue · 8 comments
Thank you for the great work, @podgorskiy.
This issue was posted here before by @ayulockin, but it was closed without an answer. I have run into the same problem when trying to run ALAE on Colab, and I have no idea what to expect at this point: the run simply stops without throwing any specific error. I also tried running with a higher number of epochs, but that did not help. Can you tell me how to solve this issue, or how to make progress past this point?
Update:
The kernel crashes when executing line 263 in the train() function of train_alae.py.
Going through the make_dataloader function in dataloader.py, it seems that the issue occurs at line 129, when db.data_loader is called.
This issue was also raised in #67.
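To narrow down whether the crash is inside DareBlopy itself or in how ALAE drives it, a minimal read outside train_alae.py could be tried. This is only a sketch based on my reading of the DareBlopy README; the constructor arguments, the feature shape, and the .tfrecords path below are assumptions/placeholders:

```python
import dareblopy as db

# Placeholder path and shape: point these at the tfrecords and resolution you prepared.
filenames = ['/content/data/datasets/bedroom/tfrecords/bedroom-r07.tfrecords.000']
features = {'data': db.FixedLenFeature([3, 128, 128], db.uint8)}

# Argument order (filenames, features, batch_size, buffer_size) assumed from the README.
iterator = db.ParsedTFRecordsDatasetIterator(filenames, features, 32, 64)
batch = next(iterator)   # if the kernel dies here, the crash is inside DareBlopy itself
print(type(batch), len(batch))
```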
Update:
I tried to run _worker(State()) directly, and that led to the kernel crash. Then I went through the _worker() function in dareblopy/data_loader.py. It seems the crash happens when next(yielder) is called (line 74).
This yielder is a <dareblopy.TFRecordsDatasetIterator.ParsedTFRecordsDatasetIterator at 0x7fd9300f80f0> object, and next(yielder) is supposed to return a list, as explained in the documentation.
Still, I could not pin down the issue. Your help is really appreciated.
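A generic debugging step that might at least show where the crash enters native code (this is just the standard Python faulthandler module, nothing specific to this repo or DareBlopy):

```python
import faulthandler
import sys

# Dump the Python traceback of every thread if the process dies inside a
# C/C++ extension, instead of the kernel silently restarting.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Then trigger the same call that kills the kernel, e.g. _worker(State()) or
# next(yielder), and check the Colab runtime logs for the dumped frames.
```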
Are there any easy replacements for those classes that I could make in order to get the training working?
It would also be really great if you could point out where I can switch to the general/standard PyTorch data loading pipeline (or the easiest point to make that transition), even at the cost of the efficiency you achieved with the DareBlopy library.
Update:
I have implemented a custom Dataset class as follows.
```python
import glob

import matplotlib.pyplot as plt
import torch
from PIL import Image, ImageOps
from torchvision import transforms


class get_dataset(torch.utils.data.Dataset):
    def __init__(self, dir_='dataset_samples/faces/realign1024x1024', channels=3):
        self.dir_ = dir_
        self.img_list = glob.glob(f'{dir_}/*.png')
        self.channels = channels
        self.img_size = 512

    def transform_func(self):
        # Resize to the resolution of the current LOD and convert to a tensor.
        return transforms.Compose([transforms.Resize((self.img_size, self.img_size)),
                                   transforms.ToTensor()])

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        # plt.imread returns floats in [0, 1] for PNGs, so scale back to uint8.
        img = (plt.imread(self.img_list[idx]) * 255).astype('uint8')
        pil_img = Image.fromarray(img)
        if self.channels == 1:
            pil_img = ImageOps.grayscale(pil_img)
        return self.transform_func()(pil_img)

    def reset(self, lod, batch_size):
        # Same reset(lod, batch_size) hook as the original dataloader;
        # only the image size depends on the LOD here.
        self.img_size = 2 ** lod
```
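For reference, this is roughly how it is wired into a standard PyTorch DataLoader; the path, LOD, and batch size below are placeholders, not the values used by the configs:

```python
from torch.utils.data import DataLoader

dataset = get_dataset(dir_='dataset_samples/faces/realign1024x1024', channels=3)
dataset.reset(lod=6, batch_size=32)   # 2**6 -> 64x64 images at this LOD
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

images = next(iter(loader))           # tensor of shape [32, 3, 64, 64]
```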
With the standard torch DataLoader, I was able to run train_alae.py on Colab with the bedroom.yaml configuration file. I am not entirely sure whether training is progressing as expected; I have attached sample_35_0.jpg, generated about 10 minutes after the start of training.
But there are some issues with the other configuration files. I will update this issue as I work through them.
Update:
Training starts without error with bedroom.yaml and celeba.yaml.
But for mnist_fc.yaml and mnist.yaml, there are issues with the number of channels. I will try to figure this out.
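A guess at the channel problem (not verified): for the MNIST configs the custom dataset probably has to be constructed with a single channel so that the grayscale branch in __getitem__ is taken; the path here is a placeholder:

```python
# channels=1 makes ImageOps.grayscale() run, so batches come out as [N, 1, H, W]
# instead of [N, 3, H, W]; the directory is a placeholder.
mnist_dataset = get_dataset(dir_='dataset_samples/mnist', channels=1)
```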
The celeba-hq256.yaml and ffhq.yaml configs give CUDA out-of-memory errors. I have no idea how to solve this on Colab; please let me know if there is a way.
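For the out-of-memory errors, it is worth first checking how much memory the Colab GPU actually provides, since the high-resolution configs were presumably tuned for much larger multi-GPU machines. This is just a standard PyTorch check, nothing from this repo:

```python
import torch

# Print what the Colab runtime allocated; a ~12-16 GB K80/T4 is far smaller than
# the multi-GPU setups the ffhq / celeba-hq256 configs appear to target.
props = torch.cuda.get_device_properties(0)
print(f'{props.name}: {props.total_memory / 1024 ** 3:.1f} GB total')
print(f'currently allocated: {torch.cuda.memory_allocated(0) / 1024 ** 3:.2f} GB')
```

If the GPU really is that small, lowering the per-LOD batch sizes in the config is probably the only option, but I have not checked which keys need to change.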
Update:
The training does not seem to be working properly, so I went through the DareBlopy library again to check whether there is anything I missed when implementing the custom get_dataset class.
I was not able to dive into the next(dataset.iterator.record_yielder) call because I think it is implemented in C++, and I do not have much experience with C++.
Hey @udithhaputhanthri, this is exactly where I gave up. The final hurdle was debugging C++ code, which wasn't something I had the bandwidth to get into.