ENH: xorbits's read_parquet compatible with pandas on pyarrow engine
luweizheng opened this issue · 0 comments
Xorbits integrates the pyarrow backend. See this blog post for more info. And we also introduce use_arrow_dtype
in read_parquet
. If we install the pyarrow backend, Xorbits will detect it, marking use_arrow_dtype
to True
in the configuration and it will read parquet with arrow dtype. Some dtypes of pyarrow and pandas are different, for example, timestamp. Suppose time
is a timestamp column. If time
is a pandas dtype we can do it like this: df["time"].dt
. But pyarrow does not have dt
attribute.
If arrow is installed, xorbits use arrow and use_arrow_dtype
of the configuration is set as true. So here we read data in pyarrow format: https://github.com/xorbitsai/xorbits/blob/b1f1107af931e9101b22e4f1e000add3820297b5/python/xorbits/_mars/dataframe/datasource/read_parquet.py#L181C1-L201C18
We may include this in our document or change the default behavior of the ArrowEngine
when reading parquet files.