holoviz/datashader

"Killed" with demo example (but too much points in entry?)

darrepac opened this issue · 4 comments

Hi

I ran the demo code from the datashader homepage without issue:

import datashader as ds, pandas as pd, colorcet
from datashader.utils import export_image
df  = pd.read_csv('data/stored.csv')
cvs = ds.Canvas(plot_width=850, plot_height=500)
agg = cvs.points(df, 'X', 'Y')
img = ds.tf.shade(agg, cmap=colorcet.fire, how='log')
export_image(img, "out")

Now I moved to another dataset and I got:

# python3 map.py
Killed

Here is the number of lat/lon points in my file:

# cat data/stored.csv | wc -l
185341302

System: Ubuntu 22.04 running in Docker (Synology DS220+ with 2 GB of RAM)
The error message is not really helpful... a RAM problem? Any hints on how to overcome this?

This is very likely due to insufficient RAM. If a single column of your DataFrame is stored as float64 then it needs 185341302 * 8 bytes / 1024**3 ≈ 1.4 GB, so two columns need about 2.8 GB. Hence reading your CSV file into pandas consumes all of your RAM before any datashader code has even run.
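For reference, a quick way to estimate that footprint without triggering the kill is to read a small sample and scale up. This is just a sketch, assuming the same file and column names as in the script above:

import pandas as pd

# Read only the first 100,000 rows to measure the per-row memory cost
sample = pd.read_csv('data/stored.csv', nrows=100_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"Estimated full-file footprint: {bytes_per_row * 185_341_302 / 1024**3:.1f} GB")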

You could try using a dtype of np.float32 so that your 2 columns only need 1.4 GB, but I suspect you will still run out of RAM when you try to do something useful with the data. You could switch to using a dask.DataFrame instead of a pandas.DataFrame, or of course use a machine with more RAM.
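A minimal sketch of both suggestions, assuming the CSV has columns named X and Y as in the script above (usecols and dtype are standard read_csv arguments that dask.dataframe.read_csv passes through to pandas):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Option 1: read the coordinates as float32 to halve the in-memory footprint
df = pd.read_csv('data/stored.csv', usecols=['X', 'Y'],
                 dtype={'X': np.float32, 'Y': np.float32})

# Option 2: let dask read the file lazily in partitions instead of all at once
ddf = dd.read_csv('data/stored.csv', usecols=['X', 'Y'],
                  dtype={'X': np.float32, 'Y': np.float32})

Canvas.points accepts a dask DataFrame as well, so the rest of the script (cvs.points, tf.shade, export_image) can stay the same.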

So it means the whole dataset is loaded into RAM... so indeed it will be tricky, knowing that this was a "small" dataset just for testing ;)
Dask, why not, but I first have to learn what it is and how to use it ;)

It is not a massive data set, but 2 GB RAM is very small. Most mobile phones have more RAM than this, and I wouldn't attempt any serious calculations on a mobile phone.

Right, I'd try using a bigger machine. You can totally make it work on a smaller machine, e.g. by converting your dataset to Parquet and using Dask with persist=False to work "out of core", paging in chunks as you work with them. But it will be tricky to get that running while under the constraint of not being able to load the full data even for converting it, and out-of-core work is vastly slower, so since I value my own time I'd switch to a machine suited to the job.
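A rough sketch of that out-of-core workflow, assuming dask and a Parquet engine (pyarrow or fastparquet) are installed; the blocksize and Parquet path are illustrative, and the dask DataFrame is deliberately never .persist()ed so the data stays on disk:

import dask.dataframe as dd
import datashader as ds, colorcet
from datashader.utils import export_image

# One-time conversion: dask streams the CSV in modest blocks rather than loading it whole
ddf = dd.read_csv('data/stored.csv', usecols=['X', 'Y'],
                  dtype={'X': 'float32', 'Y': 'float32'},
                  blocksize='64MB')
ddf.to_parquet('data/stored.parquet')

# Later runs read the much more compact, columnar Parquet copy instead of the CSV
ddf = dd.read_parquet('data/stored.parquet')

cvs = ds.Canvas(plot_width=850, plot_height=500)
agg = cvs.points(ddf, 'X', 'Y')   # aggregation proceeds partition by partition
img = ds.tf.shade(agg, cmap=colorcet.fire, how='log')
export_image(img, "out")

Whether this fits in 2 GB still depends on partition sizes and the overhead of the rest of the stack, which is why a machine with more RAM is the easier route.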