openml/OpenML

Task splits as parquet files

sebffischer opened this issue · 1 comment

Are there plans to also provide the task splits as parquet files in the future?
This would allow us to drop the ARFF dependency (once all the datasets have been successfully migrated).

As an example regarding storage size, here are the file sizes of the task splits of the NYC taxi dataset in parquet and ARFF.

library(mlr3oml)
library(duckdb)
#> Loading required package: DBI

# download the task (NYC taxi, task id 359943) and its resampling splits
otask = OMLTask$new(359943)
task_splits = otask$task_splits
#> INFO  [12:21:06.213] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/359943`, authenticated: `TRUE`}
#> INFO  [12:21:06.955] Retrieving ARFF {url: `https://api.openml.org//api_splits/get/359943/Task_359943_splits.arff`, authenticated: `TRUE`}

file_arff = tempfile(fileext = ".arff")
file_parquet = tempfile(fileext = ".parquet")

# write the splits to a zstd-compressed parquet file via duckdb
con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "tbl", task_splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY tbl TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD')", file_parquet))
#> [1] 5818350
DBI::dbDisconnect(con, shutdown = TRUE)

# write the same splits as ARFF for comparison
mlr3oml::write_arff(task_splits, file_arff)

# parquet file size as a fraction of the ARFF file size
file.size(file_parquet) / file.size(file_arff)
#> [1] 0.1619774

Created on 2022-08-30 by the reprex package (v2.0.1)
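
For completeness: once the splits are shipped as parquet, consuming them would not need an ARFF reader at all. Below is a minimal sketch (not part of the reprex above) that reads the splits back via duckdb's read_parquet table function; it assumes file_parquet still points at the file written earlier.

library(DBI)
library(duckdb)

# assumption: file_parquet is the path written in the reprex above
con = DBI::dbConnect(duckdb::duckdb())
splits = DBI::dbGetQuery(con, sprintf("SELECT * FROM read_parquet('%s')", file_parquet))
head(splits)
DBI::dbDisconnect(con, shutdown = TRUE)

The same file could also be read with arrow::read_parquet(), so either backend would remove the ARFF dependency on the consumer side.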

Yes, that is the plan :).