pandas-dev/pandas

DOC: Document the filters argument in read_parquet

MrPowers opened this issue · 5 comments

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

Documentation problem

The filters argument is massively important when reading Parquet files, but it's currently undocumented. It is documented in the Dask documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html

I actually didn't even think pandas supported this argument, but apparently, it's supported & undocumented.

Suggested fix for documentation

I think the Dask documentation can carry over pretty well to the pandas documentation, but we should use language that's easier to understand. I don't think we should use the "disjunctive normal form (DNF)" terminology - that's just unnecessary.
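For what it's worth, here is a rough pure-Python sketch of what the "DNF" wording boils down to, which might help when drafting plainer language. It is not how either engine implements filtering; `OPS`, `row_matches`, and the sample rows are made up for illustration. The idea is that `filters` is a list of groups, each group is a list of `(column, op, value)` tuples, a row must pass every condition in a group, and it is kept if it passes at least one group.

```python
import operator

# Operators covered by this sketch only; real engines accept more (e.g. "not in").
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) passes at least one group of conditions.

    `filters` is a list of groups. Conditions inside a group are ANDed,
    and the groups themselves are ORed.
    """
    return any(
        all(OPS[op](row[column], value) for column, op, value in group)
        for group in filters
    )


# Hypothetical rows standing in for the contents of a Parquet file.
rows = [
    {"year": 2020, "country": "US"},
    {"year": 2022, "country": "US"},
    {"year": 2023, "country": "MX"},
    {"year": 2021, "country": "CA"},
]

# (year >= 2021 AND country == "US") OR (country == "MX")
filters = [
    [("year", ">=", 2021), ("country", "==", "US")],
    [("country", "==", "MX")],
]

kept = [row for row in rows if row_matches(row, filters)]
```

So "a list of AND-groups, any of which may match" might be all the docs need to say instead of "disjunctive normal form".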

Let me know if the community supports this fix and I'd be happy to draft some language.

The difficulty in documenting these arguments is that the fastparquet and pyarrow engines accept different extra arguments, which are documented as being passed in via **kwargs. To generalize, it might be good to link to each engine's documentation.

@mroeschke - Yeah, I think fastparquet supports a subset of the pyarrow functionality. pyarrow got 66.5 million downloads and fastparquet only got 2.8 million downloads, so I think pyarrow is a lot more important. I do think this feature is critical because it lets pandas users read in less data. pandas users often face memory issues, and this option could be the difference between their analysis running and not running at all. It can also be a huge performance gain. I honestly didn't think it was supported, so I think the community would really benefit from documentation. Thank you!