ddsjoberg/gtsummary

issue when by variable has many values

Closed this issue · 14 comments

I was implementing the proposal https://stackoverflow.com/questions/79238011/report-results-to-4-decimal-places-when-using-add-ci-function-in-gtsummary-packa

The reproducible example is:

library(reprex)
#> Warning: package 'reprex' was built under R version 4.3.3
library(pacman)
p_load(devtools, readxl, srvyr, survey, dplyr, gtsummary, tidyr, stringr)

df_reconstructed <- source_gist("7fbffa47a0786e8223528719738504e2")
#> ℹ Sourcing gist "7fbffa47a0786e8223528719738504e2"
#> ℹ SHA-1 hash of file is "b3b039d93f7fcc9f4962aa43b10487b47c8de71f"
df <- df_reconstructed$value

my_style <- label_style_number(digits = 4)
varlist = c("bcg", "dpt1", "dpt2", "dpt3", "mr1")

results <- df |>
 as_survey_design(strata = v022, weights = v005) |>
 gtsummary::tbl_svysummary(
   by = v024, # Grouping variable
   type = where(is.numeric) ~ "continuous",
   statistic = list(varlist ~ "{mean} {N_obs_unweighted}"),
   missing = "no",
   digits = list(varlist ~ c(4, 0)),
   include = varlist
 ) |>
 add_ci(pattern = "{stat} (95% CI {ci})",
        style_fun = list(varlist ~ my_style)) |>
 as.data.frame()
#> Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'rowSums': Can't convert `x` <haven_labelled> to <character>.

Created on 2024-12-01 with reprex v2.1.1

Please note that this syntax worked well before I updated gtsummary to version 2.0.3.
Thanks for the help

Thank you for your post.

I prefer not to download zip files onto my machine. Please update your post with a reproducible example, aka a reprex using the reprex R package. A reprex includes both data and code I can run on my machine that replicates your finding. The reprex also runs in a fresh environment, to minimize the scope of the issue you are reporting. Make the example as short as possible: the minimal amount of code to reproduce your finding and nothing more.

Take just a few minutes to review reprex.tidyverse.org for detailed instructions on how to create a reprex.

Thanks Daniel. I think my case is unusual since I am working with a dataset from a two-stage sampling design. I tried sampling the dataset down to about 2K observations but got an error caused by the design effects. I am not sure how to share a reproducible example in this case.

There are multiple ways, and here's one: You can try using dput() on your data frame, or serialize() on more complicated objects. You can then store the result in a GitHub Gist and use devtools::source_gist() to read the data into your reprex. But it'll be important to keep your data set as minimal as possible (e.g. only keep needed columns) and keep the code as minimal as possible. Hope that helps.
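As a rough illustration of the dput() route, the sketch below uses base R only. The toy data frame, its values, and the file name are made up for the example; base source() on a local file stands in for devtools::source_gist(), and like source_gist() in this thread it returns a list whose value element holds the object.

```r
# Toy stand-in for the real survey data (hypothetical columns and values).
df <- data.frame(
  v022 = c(1L, 1L, 2L, 2L),
  v005 = c(1.2, 0.8, 1.5, 0.5),
  bcg  = c(1, 0, 1, 1)
)

# Keep only the columns the example needs, then write the object out as R code.
needed <- df[, c("v022", "v005", "bcg")]
gist_file <- file.path(tempdir(), "df_reconstructed.R")
dput(needed, file = gist_file)

# Sourcing the file evaluates that R code; the rebuilt object is in $value.
reloaded <- source(gist_file)
df_reconstructed <- reloaded$value

# Check that the round trip preserved the data.
all.equal(df_reconstructed, needed)
```

The contents of the written file are what would be pasted into a Gist.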

Thanks! I had used dput and sink to store the data in an R script. The dataset has the necessary columns only. Let me explore GitHub Gist.

I have updated the original post. Let me know if it's fine.

This is great, thanks. Can you update the example to run with reprex::reprex(), since that runs it in a fresh environment? Also, can you remove all unnecessary code: include only one variable in the summary, drop every pre-processing step that isn't needed to replicate the issue, check whether you need a by variable at all, and so on. Thanks for updating.

Also, I know that all these steps may sound like a bit much. But in my experience more than 50% of issues are solved by the OP while preparing their reproducible examples, so it significantly cuts the time I am investigating issues. Thanks again

Thanks so much for the guidance. I sincerely appreciate it; this has been a great learning experience.

Hi @sokiya, thanks for updating. It looks like your error appears before you get to the gtsummary code:

library(pacman)
p_load(devtools, readxl, srvyr, survey, dplyr, gtsummary, tidyr, stringr)

df_reconstructed <- source_gist("7fbffa47a0786e8223528719738504e2")
#> ℹ Sourcing gist "7fbffa47a0786e8223528719738504e2"
#> ℹ SHA-1 hash of file is "b3b039d93f7fcc9f4962aa43b10487b47c8de71f"
df <- df_reconstructed$value

df |>
  as_survey_design(strata = v022, weights = v005)
#> Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'rowSums': Can't convert `x` <haven_labelled> to <character>.

Created on 2024-12-04 with reprex v2.1.1

@ddsjoberg thanks for letting me know. I have corrected that in the syntax below, which demonstrates the issue I was facing: the code runs endlessly, so I can't use the reprex output.

reprex({
  
  library(devtools)
  library(reprex)
  library(dplyr)    # mutate(), across()
  library(survey)
  library(srvyr)
  library(gtsummary)
  library(haven)    # is.labelled(), as_factor()
  
  df_reconstructed <- source_gist("7fbffa47a0786e8223528719738504e2")
  df <- df_reconstructed$value
  
  my_style <- label_style_number(digits = 4)
  varlist <- c("bcg", "dpt1", "dpt2", "dpt3", "mr1")
  
  results <- df |>
    # convert haven_labelled columns to factors before building the survey design
    mutate(across(where(is.labelled), as_factor)) |>
    as_survey_design(strata = v022, weights = v005) |>
    gtsummary::tbl_svysummary(
      by = v024, # Grouping variable
      type = where(is.numeric) ~ "continuous",
      statistic = list(varlist ~ "{mean} {N_obs_unweighted}"),
      missing = "no",
      digits = list(varlist ~ c(4, 0)),
      include = varlist
    )
  
})

Ah, I see, you're running out of memory on your machine. You can try increasing the memory? It's probably not something I'll actively try to remedy from the gtsummary side, since we focus on tables for publication and I can't say I've seen a published table with nearly 50 columns of by-variable levels. If you wish to delve into the details of the code and have suggestions for improvements, we could certainly discuss. Happy Programming!
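To see how wide a table will get before building it, one option is to count the levels of the by variable first. The sketch below is base R on made-up data: the 47 region codes and the batch size of 10 are assumptions for illustration. Splitting the levels into batches, with one smaller table per batch, is only a possible workaround; nobody in this thread proposed it or tested it against the memory problem.

```r
# Hypothetical stand-in for the grouping column: 47 made-up region codes.
set.seed(1)
region_codes <- sprintf("region_%02d", 1:47)
v024 <- factor(sample(region_codes, size = 500, replace = TRUE),
               levels = region_codes)

# Each level of the by variable becomes its own column in the summary table.
n_levels <- nlevels(v024)

# Split the levels into batches of at most 10, one batch per smaller table.
batch_size <- 10
batches <- split(levels(v024), ceiling(seq_len(n_levels) / batch_size))
lengths(batches)
```

Each element of batches could then be used to filter the data to a subset of groups before summarising.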

Thanks @ddsjoberg. I don't think it is a memory issue but something that came up in later versions of gtsummary. I have used the package for several analyses and everything worked well, even in this instance, until I wanted the results to 4 decimal places. I then followed the guidance at https://stackoverflow.com/questions/79238011/report-results-to-4-decimal-places-when-using-add-ci-function-in-gtsummary-packa which required upgrading the package and resulted in the current behavior.
My analysis is a replication of the Demographic and Health Survey analysis at admin one level for Kenya which is common even for other countries. Motivation for fixing the bug :)

You can downgrade the package and still report results to 4 decimal places. Also, I did check, and it is indeed a memory issue. In the 2.0 release, there was a reorganization of the internals that included a rewrite of the svy summary functions, which could be less memory efficient than the previous iteration. If you wish to delve into the details of the code and have suggestions for improvements, we could certainly discuss.
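For anyone who wants to check memory use themselves, a minimal sketch with base R's gc() follows. The helper name peak_memory_mb is hypothetical, and the matrix call is a toy workload standing in for the real table code. The figure is the peak for the whole R session since the counters were reset, not only the memory used by the expression.

```r
# Hypothetical helper: peak memory (in Mb) R reports while evaluating `expr`.
peak_memory_mb <- function(expr) {
  gc(reset = TRUE)            # reset the "max used" counters
  force(expr)                 # evaluate the lazily passed expression
  usage <- gc()               # rows: Ncells, Vcells
  sum(usage[, ncol(usage)])   # last column: max used since the reset, in Mb
}

# Toy workload standing in for the real table call.
peak_memory_mb(matrix(rnorm(1e6), ncol = 100))
```

Running the same table call through this helper under two installed versions of a package would give a rough comparison of their peak memory use.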

Thanks! Which version should I downgrade to? Let me know how I can review the details of the code in order to provide suggestions.