pangeo-data/pangeo-stacks

need parquet support

Opened this issue · 3 comments

I tried to load a dask dataframe from parquet and was told I needed to install either fastparquet or pyarrow for this to work.
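For anyone who wants to check a given image, here is a minimal sketch that reports which of the two engines is importable. It uses only the standard library; the helper names and the preference order (pyarrow first) are my own assumptions, not anything dask prescribes.

```python
from importlib.util import find_spec

# Engine module names as discussed in this issue.
# The order here is an assumed preference, not an official default.
CANDIDATE_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=CANDIDATE_ENGINES):
    """Return the candidate modules that can be imported in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


def pick_parquet_engine(candidates=CANDIDATE_ENGINES):
    """Return the first importable candidate, or None if none is installed."""
    found = available_parquet_engines(candidates)
    return found[0] if found else None


if __name__ == "__main__":
    engine = pick_parquet_engine()
    if engine is None:
        print("No parquet engine found; install pyarrow or fastparquet.")
    else:
        print(f"Parquet engine available: {engine}")
```

The returned name is the kind of string that would typically be passed as the `engine=` argument when reading parquet with dask dataframe.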

I think one of these should be in our notebook base image. But which one?

cc @martindurant.

Pyarrow definitely has more development time going into it than fastparquet these days, and I think perhaps it's caught up to fastparquet feature-wise.

The other thing to consider here is probably image size. According to https://anaconda.org/conda-forge/fastparquet/files, fastparquet is ~250 KB, while pyarrow is ~3 MB. That said, fastparquet also requires llvmlite / numba (not sure whether those are already in the image), so its real footprint depends on what is installed already.
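To compare the footprint inside an actual image rather than by download size, something like the sketch below could be run in the container. It sums the files each installed distribution records via `importlib.metadata`; the function name is hypothetical, and installed size will differ from the compressed package sizes listed on anaconda.org.

```python
from importlib import metadata


def installed_size_bytes(dist_name):
    """Sum the on-disk size of the files recorded for an installed distribution.

    Returns None if the distribution is not installed or records no files.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if not files:
        return None
    total = 0
    for f in files:
        path = f.locate()
        # Skip entries that are listed but not present on disk.
        if path.is_file():
            total += path.stat().st_size
    return total


if __name__ == "__main__":
    # The packages mentioned in this thread, including fastparquet's dependencies.
    for name in ("fastparquet", "pyarrow", "numba", "llvmlite"):
        size = installed_size_bytes(name)
        if size is None:
            print(f"{name}: not installed")
        else:
            print(f"{name}: {size / 1e6:.1f} MB")
```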

I've been using fastparquet in the ESIP notebook container for the past year or so, and it meets my needs, which is pretty much just to create/read some parquet files using dask dataframe.

There are probably still a few things that only fastparquet can do, and it gives more pythonic access to the internal data structures, which can sometimes be helpful. However, pyarrow is more common, more standard and (these days) probably faster in many applications. Since I myself am not pushing fastparquet development, any additional features that might be required would need someone else's effort. I can't say how quickly the arrow community might respond to requests for features.