openml/OpenML

Parquet, categoricals and data-types

sebffischer opened this issue · 0 comments

I have noticed some differences between the Parquet and the ARFF files: for example, a column's class (integer vs. double) can differ between the two formats. Furthermore, the arrow reader uses non-standard metadata to encode the categoricals (see this issue: duckdb/duckdb#3309 (comment)).
However, the arrow library is really unusable in R (multiple people have reported this), and I am not sure how it fares in Julia or Java.
Also, the "features" metadata currently does not provide enough information to convert the columns so that the parsed ARFF files and the parsed Parquet files are guaranteed to be identical.
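
To make the kind of mismatch concrete, here is a minimal sketch of how the column types from the two parsers could be compared. The column names, the type labels, and the helper `diff_column_types` are all made up for illustration; they are not taken from an actual OpenML dataset or API.

```python
from typing import Dict, List, Tuple


def diff_column_types(
    arff_types: Dict[str, str], parquet_types: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (column, arff_type, parquet_type) for every column whose type differs.

    Hypothetical helper: columns present in only one of the two mappings
    are reported with the placeholder "<missing>".
    """
    mismatches = []
    for col in sorted(set(arff_types) | set(parquet_types)):
        a = arff_types.get(col, "<missing>")
        p = parquet_types.get(col, "<missing>")
        if a != p:
            mismatches.append((col, a, p))
    return mismatches


# Assumed example: the same dataset parsed once from ARFF and once from Parquet.
arff_types = {"age": "double", "n_children": "double", "class": "categorical"}
parquet_types = {"age": "double", "n_children": "integer", "class": "categorical"}

mismatches = diff_column_types(arff_types, parquet_types)
for col, a, p in mismatches:
    print(f"{col}: arff={a}, parquet={p}")
```

With pandas, the two dictionaries could for instance be built from `df.dtypes` of each parsed data frame. A check like this only detects the differences; deciding which type is the correct one would still need more information than the "features" metadata currently provides.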