openml/OpenML

Task splits as parquet files

sebffischer opened this issue · 1 comment

Are there plans to also provide the task splits as parquet files in the future?
This would allow us to drop the ARFF dependency (once all the datasets have been successfully migrated).

As an example regarding storage size, here are the file sizes of the task splits of the NYC taxi dataset in parquet and ARFF.

library(mlr3oml)
library(duckdb)
#> Loading required package: DBI

# download the task (NYC taxi, task id 359943) and its resampling splits
otask = OMLTask$new(359943)
task_splits = otask$task_splits
#> INFO  [12:21:06.213] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/359943`, authenticated: `TRUE`}
#> INFO  [12:21:06.955] Retrieving ARFF {url: `https://api.openml.org//api_splits/get/359943/Task_359943_splits.arff`, authenticated: `TRUE`}

file_arff = tempfile(fileext = ".arff")
file_parquet = tempfile(fileext = ".parquet")

# write the splits to a zstd-compressed parquet file via duckdb
con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "tbl", task_splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY tbl TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD')", file_parquet))
#> [1] 5818350
DBI::dbDisconnect(con, shutdown = TRUE)

# write the same splits as ARFF for comparison
mlr3oml::write_arff(task_splits, file_arff)

# parquet file size as a fraction of the ARFF file size
file.size(file_parquet) / file.size(file_arff)
#> [1] 0.1619774

Created on 2022-08-30 by the reprex package (v2.0.1)
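
For completeness: once the splits are shipped as parquet, consuming them would not need an ARFF reader at all. Below is a minimal sketch (not part of the reprex above) that reads the splits back via duckdb's read_parquet table function; it assumes file_parquet still points at the file written earlier.

library(DBI)
library(duckdb)

# assumption: file_parquet is the path written in the reprex above
con = DBI::dbConnect(duckdb::duckdb())
splits = DBI::dbGetQuery(con, sprintf("SELECT * FROM read_parquet('%s')", file_parquet))
head(splits)
DBI::dbDisconnect(con, shutdown = TRUE)

The same file could also be read with arrow::read_parquet(), so either backend would remove the ARFF dependency on the consumer side.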

Yes, that is the plan :).