ContinuumIO/anaconda-package-data

Recent versions of Pandas raise PyArrow errors when reading anaconda-package-data through `intake` and `condastats`

Thank you for making this data and the documented methods available - fantastic stuff!

I noticed that the intake methods from the README.md fail with Pandas PyArrow errors on recent versions of Pandas (>=v2.0.0). This also appears to affect condastats, though possibly through different means. I suspect, but don't know, that this is a Pandas or Dask DataFrame issue at its core. I also wondered about data type management within the Parquet files in this repo (for example, are there incompatible types that users should be made aware of?). Even if the fix turns out to be external, this issue could help prompt updated documentation here.

Specifically, the errors I most often saw were:

ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type

There may also have been errors regarding "Pandas categorical types".
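
In case it helps with triage, here is a minimal sketch for collecting the installed versions of the packages that seem to be involved. The package list and the helper name `installed_versions` are my own assumptions for illustration, not something this repo provides; it uses only the standard library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed list of relevant packages; adjust to match the environment.
PACKAGES = ("pandas", "pyarrow", "dask", "intake", "condastats")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version or 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


# Print one "name: version" line per package for pasting into a report.
for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```

Including this output alongside the traceback should make it easier to tell which combination of versions triggers the error.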

I worked around the issue by looking at the last modified date of the README.md (around January 2020) and installing a version of Pandas from around that time (v1.3.5 worked for me).
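
A rough sketch of how a script could warn before hitting the error, assuming (based only on what I observed above) that Pandas >=v2.0.0 is affected and that v1.3.5 works. The function name `pandas_may_be_affected` is hypothetical, and the version parsing is deliberately simple.

```python
from importlib.metadata import PackageNotFoundError, version


def pandas_may_be_affected(version_string: str) -> bool:
    """Return True when the major version is 2 or higher.

    Assumption from this report: Pandas >=v2.0.0 shows the errors,
    while v1.3.5 did not.
    """
    major = version_string.lstrip("v").split(".")[0]
    return int(major) >= 2


try:
    installed = version("pandas")
except PackageNotFoundError:
    print("pandas is not installed")
else:
    if pandas_may_be_affected(installed):
        print(f"pandas {installed}: may hit the PyArrow errors described here")
    else:
        print(f"pandas {installed}: in the range that worked for me")
```

Pinning the older release (for example `pip install "pandas==1.3.5"`) is the workaround I used; it is not a fix for the underlying incompatibility.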