dask/fastparquet

Status and roadmap

martindurant opened this issue · 9 comments

Features to be implemented.
An asterisk shows the next item(s) on the list.
A question mark shows something that might (almost) work, but isn't tested.

  • python2 compatibility

Reading

  • Types of encoding (https://github.com/Parquet/parquet-format/blob/master/Encodings.md)
    • plain
    • bitpacked/RLE hybrid
    • dictionary
      • decode to values
      • make into categoricals
    • delta (needs test data)
    • delta-length byte-array (needs test data)
  • compression algorithms (gzip, lzo, snappy, brotli)
  • nulls
  • repeated/list values (*)
  • map, key-value types
  • multi-file (hive-like; see the reading sketch after this list)
    • understand partition tree structure
      • filtering by partitions
    • parallelized for dask
  • filtering by statistics
  • converted/logical types
  • alternative file-systems
  • index handling
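
For orientation, here is a minimal sketch of how several of the reading features listed above are exposed; the dataset path, column names, and partition value are made up, and the exact keyword arguments may differ between versions.

    import fastparquet

    # Open a hive-like multi-file dataset via its metadata file
    # (hypothetical path and columns).
    pf = fastparquet.ParquetFile('mydata/_metadata')

    # Materialise selected columns as a pandas DataFrame, decoding a
    # dictionary-encoded column into a categorical and skipping row-groups /
    # partitions that cannot satisfy the filter.
    df = pf.to_pandas(columns=['id', 'value', 'year'],
                      categories=['id'],
                      filters=[('year', '==', 2016)])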

Writing

  • primitive types
  • converted/logical types
  • encodings (selected by user)
    • plain (default)
    • dictionary encoding (default for categoricals)
    • delta-length byte array (should be much faster for variable-length strings)
      • delta encoding (depends on reading delta encoding)
  • nulls encoding (for dtypes that don't accept NaN)
  • choice of compression
    • per column
  • multi-file (see the writing sketch after this list)
    • partitions on categoricals
    • parallelize for dask
      • partitions and division for dask
  • append
    • single-file
    • multi-file
    • consolidate files into logical data-set
  • alternative file-systems
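
A corresponding sketch for the writing side, again with made-up paths and column names, showing a hive-style multi-file data-set partitioned on a column, a chosen compression codec, and append; exact keyword arguments may differ between versions.

    import pandas as pd
    import fastparquet

    df = pd.DataFrame({'id': [1, 2, 3],
                       'value': [0.1, 0.2, 0.3],
                       'year': [2015, 2016, 2016]})

    # Write as a hive-style multi-file data-set, partitioned on a column and
    # compressed with gzip (snappy, lzo or brotli need the extra libraries).
    fastparquet.write('outdir', df, file_scheme='hive',
                      partition_on=['year'], compression='GZIP')

    # New data can later be appended to the same logical data-set.
    fastparquet.write('outdir', df, file_scheme='hive',
                      partition_on=['year'], append=True)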

Admin

  • packaging
    • pypi, conda
  • README
  • documentation
    • RTD
    • API documentation and doc-strings
    • Developer documentation (everything you need to run tests)
    • List of parquet features not yet supported to establish expectations
  • Announcement blogpost with example

Features not to be attempted

  • nested schemas (maybe can find a way to flatten or encode as dicts)
  • choice of encoding on write? (keep it simple)
  • schema evolution

Can I request an additional section for administrative topics like packaging, documentation, etc.?

What do we need for Dask.dataframe integration? Presumably we're depending on dask.bytes.open_files?

Yes, passing a file-like object that can be resolved in each worker would do: core.read_col currently takes an open file object, or a string that can be opened within the function. It should probably take a function that creates a file object given a path (a parquet metadata file will reference other files with relative paths).
The only places where reading actually happens are core.read_thrift (where the size of the thrift structure is not known) and core._read_page (where the size in bytes is known). The former is small and would fit within a read-ahead buffer; the latter can be expressed in terms of dask's read_bytes.
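
A rough sketch of the interface described above, not the code as it stands: the reader would receive a callable that maps a (possibly relative) path to a file-like object, so the same logic works on local disk or, via dask.bytes, on a remote store. The names local_open and read_column_sketch are hypothetical.

    def local_open(path, mode='rb'):
        # For local files the path-to-file-object callable is just the builtin
        # open; a dask version would return whatever dask.bytes resolves to.
        return open(path, mode)

    def read_column_sketch(path, open_with=local_open):
        # Each worker opens the data file it needs itself.  The thrift footer
        # is small enough for a read-ahead buffer, while data pages are read
        # as byte ranges of known size (and so map onto dask's read_bytes).
        with open_with(path) as f:
            header = f.read(4)  # placeholder for the real thrift/page parsing
        return header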

Hi there,

Are you guys aware of the ongoing PyArrow development? It is already on conda-forge and has pandas <-> parquet read/write (through Arrow), although I don't think it supports multi-file yet.

@lomereiter Yes, we're very aware. We've been waiting for comprehensive Parquet read-write functionality from Arrow for a long while. Hopefully fastparquet is just a stopgap measure until PyArrow matures as a comprehensive solution.

teh commented

Hi, amazing work. Two things I noticed:

  • pytest is required at runtime (it is imported in utils.py), which is a bit unusual
  • if column names are not strings, saving fails (e.g. AttributeError: 'int' object has no attribute 'encode'); a workaround is sketched below
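
For the second point, a simple workaround until non-string column names are handled is to convert the labels before writing; the frame below is only an illustration.

    import pandas as pd
    import fastparquet

    df = pd.DataFrame({0: [1, 2, 3], 1: ['a', 'b', 'c']})

    # Parquet column names are strings, so stringify the labels first to avoid
    # the AttributeError mentioned above.
    df.columns = df.columns.astype(str)
    fastparquet.write('out.parquet', df)
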
frol commented

Since @lomereiter mentioned PyArrow, I will just leave this link here: Extreme IO performance with parallel Apache Parquet in Python

Thanks @frol. It is a good thing that there are multiple projects pushing on Parquet for Python. You should also have linked to the earlier post, python-parquet-update (Wes's work, not mine), which shows that fastparquet and Arrow have very similar performance in many cases.

Note also that fastparquet is designed to run in parallel using dask, allowing distributed data access and reading from remote stores such as S3.
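
As an illustration, the parallel path goes through dask.dataframe's read_parquet; the bucket path and column names below are made up, and reading from S3 needs s3fs installed.

    import dask.dataframe as dd

    # Read a (hypothetical) partitioned parquet data-set straight from S3;
    # each worker loads its own row-groups from the remote store in parallel.
    ddf = dd.read_parquet('s3://my-bucket/census/', columns=['age', 'state'])
    result = ddf.groupby('state').age.mean().compute()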

frol commented

@martindurant Thank you! I was actually looking for some sort of benchmarks for fastparquet, as I am going to use it with Dask. It would be very helpful to have some benchmark information in the documentation, since the "fast" prefix in the project name implies a focus on speed, but I could not find anything on this until you pointed me to that article.

There are some raw benchmarks in https://github.com/dask/fastparquet/blob/master/fastparquet/benchmarks/columns.py

My colleagues on the datashader project did some benchmarking on census data back when we were focusing on performance. Their numbers cover both loading the data and performing aggregations on it.