janboone/applied-economics

Large datasets

Closed this issue · 6 comments

211sk commented

Greetings,
I am having issues with large datasets. I was working on my code on the university JupyterLab server, where the dataset (around 350 MB) would not load and would often crash the kernel. I then switched to Google Colab and mounted my Google Drive in order to work with it. That initially worked fine, but now another dataset I added, although it is only 53 MB, is giving me the same issue.
How do you use large datasets in JupyterLab or Notebook, and is there a way to avoid these errors?

[image: screenshot of the error message]

This is a bit strange, as 400 MB of data is indeed not that big. For really large datasets, chunking can be an option (e.g. https://towardsdatascience.com/loading-large-datasets-in-pandas-11bdddd36f7b), but that should not be necessary here.
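To illustrate the chunking idea: a rough sketch, using an in-memory CSV as a stand-in for the large file on disk (the file name and columns below are just placeholders, not the actual V-Dem data):

```python
import io

import pandas as pd

# Hypothetical small CSV standing in for a large file on disk.
csv_data = io.StringIO(
    "country_name,year,e_wb_pop\n"
    "A,2000,10\n"
    "A,2001,11\n"
    "B,2000,20\n"
    "B,2001,21\n"
)

# Read the file in pieces and keep only the columns we need;
# peak memory then depends on the chunk size, not the file size.
chunks = pd.read_csv(csv_data, usecols=["country_name", "year"], chunksize=2)
vdem = pd.concat(chunks, ignore_index=True)
print(len(vdem))  # 4 rows, read two at a time
```

In practice you would pass a file path instead of the `StringIO` object, and pick a `chunksize` of, say, 100,000 rows.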

Have you installed Anaconda on your own computer? Does that give problems as well when you load the data?

Alternatively, have you looked at the data itself to see if there are some errors in there which can create problems when reading in the data?

211sk commented

So I have used the data in R; I was doing a sensitivity analysis, and in R the data works perfectly fine. I have Anaconda installed, but even there, when I tried to use the data, the Jupyter Notebook did not load it. The code ran but nothing happened.

Can you copy-paste the Python code here where you read in the data?

211sk commented

Thank you so much for your help. As it turns out, I hadn't tried it in conda; I had only tried it in JupyterLab. It just worked in conda. I'll still attach the code from JupyterLab, because I believe it should have worked there too.

# adding data

import pandas as pd

vdemcols = ['country_name', 'country_id', 'country_text_id', 'year', 'v2x_polyarchy', 'v2x_libdem',
            'v2cvresp', 'v2x_regime', 'v2x_regime_amb', 'e_migdppc', 'e_migdpgrolns', 'e_migdppcln',
            'e_wb_pop', 'e_miurbani', 'e_miurbpop', 'e_pelifeex']
vdem = pd.read_csv("AEA Project/V-Dem-CY-Full+Others-v11.1.csv", low_memory=False, usecols=vdemcols)

This is the code. The path is what I copied via right click, Copy Path. I also tried just typing the name of the file, since they are in the same folder, but neither worked. Once (I don't remember exactly how) I got around this, but then when I tried calling the dataset, it said "vdem doesn't exist".

Just to be sure: your problems are solved when working in conda on your own computer?

The code looks fine, although I would not use '+' or '-' in a file name.

I have no idea why this would not work on the university server or on Colab.
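One way to rule out a path problem (a common cause of "nothing happened" in notebooks, since the working directory may differ from the folder you see in the file browser): wrap the read in a check that fails loudly. This is only a sketch; `load_csv_checked` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

import pandas as pd

def load_csv_checked(path_str, **kwargs):
    """Read a CSV, but raise a clear error if the path does not exist.

    Printing the resolved path and the current working directory makes
    it obvious when the notebook is running from a different folder
    than expected.
    """
    path = Path(path_str)
    if not path.exists():
        raise FileNotFoundError(
            f"{path.resolve()} does not exist (working directory: {Path.cwd()})"
        )
    return pd.read_csv(path, **kwargs)
```

Used in place of the original call, e.g. `vdem = load_csv_checked("AEA Project/V-Dem-CY-Full+Others-v11.1.csv", low_memory=False, usecols=vdemcols)`, a wrong path then produces an explicit error instead of a silent failure.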

211sk commented

The only place this code is not working, for some reason, is JupyterLab. I googled it, and as it turns out, Colab has a default size limit of 10 MB for files, so I switched to uploading the file instead, and in conda it worked normally.
But thank you so much for the help. Maybe I'll try to execute the same code on the JupyterLab server after the submission!