Task splits as parquet files
sebffischer opened this issue · 1 comments
sebffischer commented
Are there plans to also provide the task splits as parquet files in the future?
This would allow us to drop the ARFF dependency (once all the datasets have been successfully migrated).
As an example of the difference in storage size, here is a comparison of the file sizes of the task splits for the NYC taxi dataset in parquet and ARFF:
library(mlr3oml)
library(duckdb)
#> Loading required package: DBI
otask = OMLTask$new(359943)
task_splits = otask$task_splits
#> INFO [12:21:06.213] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/359943`, authenticated: `TRUE`}
#> INFO [12:21:06.955] Retrieving ARFF {url: `https://api.openml.org//api_splits/get/359943/Task_359943_splits.arff`, authenticated: `TRUE`}
file_arff = tempfile(fileext = ".arff")
file_parquet = tempfile(fileext = ".parquet")
con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "tbl", task_splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY tbl TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD') ", file_parquet))
#> [1] 5818350
mlr3oml::write_arff(task_splits, file_arff)
file.size(file_parquet) / file.size(file_arff)
#> [1] 0.1619774
Created on 2022-08-30 by the reprex package (v2.0.1)
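For completeness, a minimal sketch of how the splits could be read back from the parquet file via duckdb (assuming `file_parquet` is the path written in the reprex above; the column layout is whatever `task_splits` contained):

```r
library(duckdb)

# Open an in-memory duckdb connection and query the parquet file directly;
# read_parquet() lets duckdb scan the file without importing it first.
con = DBI::dbConnect(duckdb::duckdb())
splits = DBI::dbGetQuery(
  con,
  sprintf("SELECT * FROM read_parquet('%s')", file_parquet)
)
DBI::dbDisconnect(con, shutdown = TRUE)
```

This round-trip is what would let mlr3oml consume server-provided parquet splits with duckdb alone, with no ARFF parser involved.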
joaquinvanschoren commented
Yes, that is the plan :).