Sparse tibble support
This issue will serve as the main hub for issues across the tidymodels ecosystem regarding the implementation of sparse data in tibbles.
Right now we are still in the exploratory phase, with work happening in https://github.com/EmilHvitfeldt/sparsevctrs to implement sparse vector classes that can be used within a tibble.
Another thing we can do with this framework is allow sparse data as input to functions such as vfold_cv(), fit(), predict(), etc., turning the sparse data into sparse tibbles.
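To make the idea of a sparse vector concrete, here is a minimal base R sketch that stores only the non-zero values and their positions. The helpers to_sparse() and to_dense() are hypothetical names made up for illustration; they are not the sparsevctrs API, which may represent things differently.

```r
# Hypothetical helpers for illustration only; NOT the sparsevctrs API.

# Keep just the non-zero values, where they sit, and the full length.
to_sparse <- function(x) {
  pos <- which(x != 0)
  list(values = x[pos], positions = pos, length = length(x))
}

# Rebuild the dense vector: start from all zeros, fill in stored values.
to_dense <- function(s) {
  out <- numeric(s$length)
  out[s$positions] <- s$values
  out
}

x <- c(0, 0, 3, 0, 0, 0, 1.5, 0)
s <- to_sparse(x)

s$positions               # 3 7
s$values                  # 3.0 1.5
identical(to_dense(s), x) # TRUE
```

The saving comes from storing two entries instead of eight; for a column that is 99% zeros the same layout stores roughly 1% of the entries.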
- Sparse vector classes
- use sparse vector classes internally in recipes
- use sparse vector to matrix conversion in appropriate places
  - recipes
  - workflows
  - parsnip (for xgboost)
  - rsample
- make sure workflow/recipes interface is good to enable sparse data
- possibly having sparse data by default if conversion isn't a big downside
- make sure that end-to-end workflows use and pass sparse data correctly
- consider improved printing to allow the user to better use sparsity
- a user might use 90% sparse-compliant methods; let them know what they would need to change to fully use sparsity (step_normalize() would be one example of a step that destroys sparsity)
- have vignettes in appropriate places describing what sparsity is and which steps/models benefit from it
- document which steps and models work with sparsity and which don't
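As a small illustration of why a step like step_normalize() destroys sparsity, the base R sketch below compares scaling with centering on a mostly-zero vector. It works on a plain numeric vector rather than calling the recipes steps, and the sparsity() helper is a made-up name for this example.

```r
set.seed(1)

# A vector of length 1000 with 50 non-zero entries (95% zeros).
x <- numeric(1000)
idx <- sample(1000, 50)
x[idx] <- runif(50, min = 1, max = 10)

# Hypothetical helper: proportion of entries that are exactly zero.
sparsity <- function(v) mean(v == 0)

# Scaling divides by a constant, so zeros stay zero.
scaled <- x / sd(x)

# Centering subtracts the (non-zero) mean, so every zero becomes -mean(x).
centered <- x - mean(x)

sparsity(x)        # 0.95
sparsity(scaled)   # 0.95
sparsity(centered) # 0
```

Normalizing combines centering and scaling, so it inherits the problem from the centering part.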
Steps in {recipes} according to whether they can use sparse vectors
Produce sparsity
- step_count()
  - Doesn't need special function
- step_date()
  - Doesn't need special function
- step_dummy()
  - sparse_dummy()
- step_dummy_extract()
  - sparse_dummy_extract()
- step_dummy_multi_choice()
  - sparse_dummy_multi_choice()
- step_holiday()
- step_indicate_na() (maybe)
- step_ordinalscore() (maybe)
- step_regex()
- step_time() (the am output is technically sparse, but since it only has 2 levels it isn't worth it)
- step_intercept()
- step_interact()
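To show why steps such as step_dummy() produce sparsity, here is a rough sketch that dummy-encodes a factor with base R and with the Matrix package. It is a stand-in for the idea only: it does not use recipes or the sparse_dummy() function mentioned above, and the toy factor is made up.

```r
library(Matrix)

# Toy factor with 5 levels, each appearing twice.
dat <- data.frame(f = factor(c("a", "b", "c", "d", "e",
                               "a", "b", "c", "d", "e")))

# Dense one-hot encoding: one column per level, no intercept.
dense <- model.matrix(~ f - 1, data = dat)

# Same encoding, stored as a sparse matrix.
sp <- sparse.model.matrix(~ f - 1, data = dat)

dim(dense)       # 10 5
mean(dense == 0) # 0.8, each row has a single 1 among 5 columns
nnzero(sp)       # 10, only the ones are stored
```

With k levels the share of zeros is (k - 1) / k, so high-cardinality factors are where a sparse representation pays off most.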
Modify sparsity
For sure
- step_impute_lower()
- step_impute_mean()
- step_impute_mode()
- step_impute_roll()
- step_sqrt() (if)
- step_scale()
- step_geodist()
- step_corrr()
- step_filter_missing()
- step_lincomb()
- step_nzv()
- step_rm()
- step_zv()
- step_lag()
- step_naomit()
- step_impute_roll() (harder)
- step_bin2factor()
- step_num2factor() (maybe)
Might work out of the box
- step_select()
- step_sample()
- step_shuffle()
- step_slice()
- step_arrange()
- step_filter()
- step_rename()
- step_rename_at()
I think
- step_classdist()
- step_classdist_shrunken()
- step_window()
Unaffected steps
- step_impute_bag()
- step_impute_knn()
- step_impute_linear()
- step_unknown()
- step_BoxCox()
- step_bs()
- step_harmonic()
- step_hyperbolic()
- step_inverse()
- step_invlogit()
- step_log()
- step_logit()
- step_mutate() (because of the way it works)
- step_ns()
- step_poly()
- step_poly_bernstein()
- step_relu()
- step_spline_b()
- step_spline_convex()
- step_spline_monotone()
- step_spline_natural()
- step_spline_nonnegative()
- step_YeoJohnson()
- step_discretize()
- step_cut()
- step_factor2string()
- step_integer()
- step_novel()
- step_other()
- step_percentile()
- step_relevel()
- step_string2factor()
- step_unknown()
- step_unorder()
- step_center()
- step_normalize()
- step_range()
- step_classdist()
- step_ica()
- step_isomap()
- step_kpca()
- step_kpca_poly()
- step_kpca_rbf()
- step_mutate_at() (because of the way it works)
- step_nnmf()
- step_nnmf_sparse()
- step_pca()
- step_pls()
- step_ratio()
- step_spatialsign()
- step_profile()
- step_rename()
- step_rename_at()
{themis} doesn't have any methods that apply.
{embed} only has step_feature_hash(), but it is soft-deprecated, so I don't think it is worth it.
{textrecipes} has the following steps that produce sparsity
- step_texthash()
- step_tf()
- step_tfidf()
- step_dummy_hash()
The remaining steps are unaffected.
Closed in favor of linked issues.