sfirke/janitor

Should tabyls always be tibbles?

Opened this issue · 10 comments

Feature requests

I think it would be nice to print tabyl frames as tibbles! Especially since "janitor is a #tidyverse-oriented package."

It's a small thing, but consistency just makes the whole process smoother. In the example below, you can see how the tibble object makes it clear that color is an ordinal variable and removes non-significant digits from percent.

library(janitor) # '1.2.0'
library(ggplot2) # '3.2.1'
library(tibble)  # '2.1.3'

t <- tabyl(diamonds, color)
print(t)
#>  color     n    percent
#>      D  6775 0.12560252
#>      E  9797 0.18162773
#>      F  9542 0.17690026
#>      G 11292 0.20934372
#>      H  8304 0.15394883
#>      I  5422 0.10051910
#>      J  2808 0.05205784
as_tibble(t)
#> # A tibble: 7 x 3
#>   color     n percent
#>   <ord> <dbl>   <dbl>
#> 1 D      6775  0.126 
#> 2 E      9797  0.182 
#> 3 F      9542  0.177 
#> 4 G     11292  0.209 
#> 5 H      8304  0.154 
#> 6 I      5422  0.101 
#> 7 J      2808  0.0521

Created on 2019-09-30 by the reprex package (v0.3.0)

Edit: Use reprex::reprex() for example

I prefer tibbles, too, in general. Looking back at #44, the hassle of tibbles not printing all their rows was the deciding factor in moving to data.frame.

Now I suppose the print.tabyl() method could pass a large value to n so that, say, the first 100 rows print. Then would tibbles be preferable? I like the truncating of the digits, I think, and the labeling of the column types. What do others think?
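A minimal sketch of that idea, assuming the conversion happens only at print time. The helper name print_tabyl_as_tibble() is hypothetical and is not part of janitor; it stands in for what a modified print.tabyl() might do, and mtcars is used so the example needs no extra packages.

```r
library(janitor)
library(tibble)

# Hypothetical helper, NOT janitor's actual print.tabyl():
# convert to a tibble only for display and ask for up to n rows.
print_tabyl_as_tibble <- function(x, n = 100, ...) {
  # as_tibble() drops the tabyl class, so the tibble print method is used
  print(tibble::as_tibble(x), n = n, ...)
  # return the original object unchanged, so it is still a tabyl
  invisible(x)
}

t <- tabyl(mtcars, cyl)
print_tabyl_as_tibble(t)
```

The object itself is untouched here; only the display changes, which is exactly the "looks like a tibble but isn't one" trade-off discussed further down.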

There might be implementation wrinkles I'm not thinking of b/c all of the tabyl class and metadata info would have to be attached to a tbl, but I think it could be done.

(Also a small note: dplyr::arrange() strips the tabyl class from the result, and tabyl has its own print method that does not show row numbers. The example might be better without the arrange(). I'm only pointing that out because we are comparing the output of print methods, and that's not actually print.tabyl().)

ggplot2::diamonds %>%
  janitor::tabyl(color)
 color     n    percent
     D  6775 0.12560252
     E  9797 0.18162773
     F  9542 0.17690026
     G 11292 0.20934372
     H  8304 0.15394883
     I  5422 0.10051910
     J  2808 0.05205784

I think passing a larger number into print(n = ) makes a lot of sense. I often do that myself when exploring an object. 100 rows is already much more manageable than 1000, especially when the overflow columns are hidden.

Well, now I'm on board with prioritizing this and saying tabyl should always return a tibble. This just burned me. I couldn't figure out why a line like x == "foo" was not matching in a case_when recoding that I was feeding into tabyl. Turns out the value had leading whitespace, " foo". Adding as_tibble() made that immediately evident.
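A small illustration of the whitespace problem, using a made-up two-row data frame rather than the original data. As far as I know, the default data.frame print right-aligns character columns, which hides a leading space, while tibble printing wraps values with leading or trailing whitespace in quotes; treat the exact display as version-dependent.

```r
library(tibble)

# Made-up data: the second value has a leading space
df <- data.frame(x = c("foo", " foo"), stringsAsFactors = FALSE)

# Base printing: strings are right-aligned, so the extra space is easy to miss
print(df)

# Tibble printing: the value with leading whitespace is shown in quotes
print(as_tibble(df))

# The comparison that silently fails to match
df$x == "foo"
```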

I agree that this is a good idea. I'm often converting tabyls to tibbles to work with them further if it's not for immediate interactive checking or for printing in a report.

I would also hope that it's a rare case for someone to be creating a tabyl with more than 50 rows; at that point it seems unlikely that they have a var that is actually a meaningful categorical var. So I think 50 makes sense. The case where it would be more would be 3-way+, in which case perhaps it would make sense to start reducing the number of printed rows per tabyl to limit the overall to 50 or 100.

I think the only drawback here is what happens if I create a basic tabyl object earlier, and then I want to adorn things to it later on in my code? Without it being a tabyl object, would these functions still work without lots of recoding?

Edit: Just realized we're talking about "printing". Would the object remain a tabyl, but only be printed as a tibble? If so my above drawback is moot.

Well something simple like tibble::as_tibble() changes the class and removes the tabyl part. But you can have a data frame with both classes that prints like a tibble but should (?) keep all the aspects of a tabyl object.

The lightest change would be for print.tabyl to convert the tabyl to a tibble at print time. That's adding just a single line of code. But I wonder if that would confuse people, because the tabyl would look like a tibble to the user but not actually be one.

I could try rewriting the tabyl class to also be a tibble, as well as changing print.tabyl. Funnily, right now mtcars %>% count(gear) %>% as_tabyl() %>% class gets you an object that is also a tibble but prints like it's not a tibble, kind of backwards of the first option here.
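A quick way to see the mixed-class object described above, sketched with a hand-typed tibble of the mtcars gear counts in place of the count() call (an assumption made so the example does not depend on which class count() returns in a given dplyr version).

```r
library(janitor)
library(tibble)

# Hand-typed stand-in for the counts of mtcars$gear
counts <- tibble(gear = c(3, 4, 5), n = c(15, 12, 5))

tb <- as_tabyl(counts)

class(tb)               # "tabyl" is added in front of the tibble classes
inherits(tb, "tabyl")   # still usable as a tabyl
inherits(tb, "tbl_df")  # and still a tibble underneath
```

Because "tabyl" comes first in the class vector, print() dispatches to print.tabyl() and the object prints like a plain data frame.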

This rewriting might take more work, and could surface problems I'm not thinking of right now... but right now I feel like if a tabyl is going to look like a tibble when it prints, it should be a tibble.

Yes, I see your points. I agree that printing something that looks different from what the object actually is is not a good idea and would be confusing.

I'm still on board with making tabyls always be tibbles, but I'm not including this in v2.2 because it involves updating many, many tests.

This issue has been around for more than two years.

I do not expect tabyl to always output tibbles. However, the output should retain the class of the input data frame. Stripping the tibble class from an input data frame is highly unexpected.