openml/OpenML

Question: availability of parquet files

glemaitre opened this issue · 8 comments

In scikit-learn, we were about to merge a simple new ARFF parser based on pandas.read_csv. In short, it skips the header, reads the dataset, and casts the nominal columns (we don't really care about the datetime format). It is 4x-10x faster and uses 2x less memory.
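For context, a minimal sketch of that approach (a hypothetical helper, not the actual scikit-learn implementation): scan the header for `@attribute` lines, hand the `@data` section to `pandas.read_csv`, and cast nominal columns to `category`. Quoted values, sparse ARFF, and date attributes are ignored here.

```python
import io

import pandas as pd


def read_arff_with_pandas(arff_text):
    """Parse a dense ARFF string via pandas.read_csv (simplified sketch)."""
    names, nominal = [], []
    lines = arff_text.splitlines()
    data_start = len(lines)
    for i, line in enumerate(lines):
        token = line.strip()
        low = token.lower()
        if low.startswith("@attribute"):
            _, name, kind = token.split(None, 2)
            names.append(name)
            if kind.strip().startswith("{"):  # nominal attribute, e.g. {a,b}
                nominal.append(name)
        elif low.startswith("@data"):
            data_start = i + 1
            break
    # The @data section is plain CSV; '?' marks missing values in ARFF.
    data = "\n".join(lines[data_start:])
    df = pd.read_csv(io.StringIO(data), names=names, na_values=["?"])
    for col in nominal:
        df[col] = df[col].astype("category")
    return df


arff = """@relation demo
@attribute sepal_length numeric
@attribute class {setosa,versicolor}
@data
5.1,setosa
6.4,versicolor
"""
df = read_arff_with_pandas(arff)
```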

However, we now wonder whether we should integrate this parser at all, since it could soon become obsolete. Basically, it would depend on the timing of making the datasets available in parquet format through the OpenML site. I saw in a previous issue that this could happen soon.

Do you have an estimate (even rough) of the timeline for the feature to land?

Hi Guillaume,

I estimate it will be around February 2022. We've converted most of the datasets to parquet, but some take longer (e.g. sparse datasets). Including @prabhant to follow up on this.

This also needs to be merged first: #1097

@mfeurer @joaquinvanschoren I wanted to know if there is any news regarding the parquet format. We noticed that the ARFF file is defaulting to old.openml.org, and I can also see that there are some .pq links in the XML.

I was wondering whether it would be safe on our side to rely on this info to load parquet datasets?
Is there some case in which only the ARFF file will be available and not the parquet file?
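To illustrate what we would do: pull the parquet link out of the data-description XML and treat its absence as "ARFF only". This is a sketch under assumptions; the `oml:minio_url` element name follows the field discussed below (it may be renamed to `parquet_url`), and the namespace URI and URLs are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for this sketch.
OML_NS = {"oml": "http://openml.org/openml"}


def parquet_url_from_description(xml_text):
    """Return the parquet (minio) URL from a data-description XML.

    Returns None when only the ARFF file is listed.
    """
    root = ET.fromstring(xml_text)
    node = root.find("oml:minio_url", OML_NS)
    return node.text if node is not None else None


# Illustrative description with both an ARFF and a parquet link.
description = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:url>https://old.openml.org/data/download/61/dataset_61_iris.arff</oml:url>
  <oml:minio_url>https://example.org/datasets/0000/0061/dataset_61.pq</oml:minio_url>
</oml:data_set_description>"""
```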

Hi Guillaume,

The only edge case not fully covered yet is sparse datasets. @prabhant, can you please give an update?
Also, are we renaming 'minio_url' to 'parquet_url'?

Otherwise, it is safe to start using it.
Please note that old.openml.org will stay for a good while, also in production. This is to simplify development of an entirely new backend in python.

The only edge case not fully covered yet is sparse datasets.

Cool. At least we can detect this case by looking at the tag and raise a proper error message telling our users to switch parquet on or off.

Yes, or more generally you can also easily detect when the parquet URL is not available.

Hi, right now it's safe to use the parquet URL for the datasets available there (you'll get an error or a 403 if it's not available). We are done converting the sparse datasets as well, so after that only very few edge cases will be left (mostly broken datasets that can't even be loaded in pandas).
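So a client can simply try parquet first and fall back to ARFF when the server answers with an HTTP error. A minimal sketch (the `fetch` hook is a hypothetical injection point for testing; real code would download and parse the file, and the default fetch just issues a HEAD request):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def load_with_fallback(parquet_url, arff_url, fetch=None):
    """Try the parquet URL first; fall back to ARFF on an HTTP error.

    The server returns an error such as 403 when no parquet file
    exists yet, which is what we catch here.
    """
    if fetch is None:
        def fetch(url):
            # Probe availability only; a real loader would read the body.
            urlopen(Request(url, method="HEAD"))
            return url
    try:
        return fetch(parquet_url), "parquet"
    except HTTPError:
        return fetch(arff_url), "arff"
```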

Note that the first priority is uploading datasets with the 'active' label. After that we will start uploading the inactive ones.

The current estimate for uploading the sparse datasets is the first week of June.

Hi Guillaume, we are renaming minio_url to parquet_url in the API.
We are returning both. Please let me know when you are no longer using minio_url.
When there are no more dependencies, we'll remove minio_url.
Thanks!

@joaquinvanschoren We have not implemented the feature yet, so we can use parquet_url directly.