Codebook is slow
While running the codebook function on the L2C data, I realized how slow it is. In some ways, this may not be a huge issue because we probably won't need to recreate codebooks often. Having said that, it would still be nice to find ways to speed up the code.
https://www.r-bloggers.com/2021/04/code-performance-in-r-which-part-of-the-code-is-slow/
http://adv-r.had.co.nz/Performance.html
Using HTML instead of Word (#5) might be a good way to speed it up.
- flextable can create HTML tables
- Need to figure out how to stitch them together into a basic HTML document (a rough sketch of one possible approach is below)
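One possible approach, sketched here with made-up table contents and an illustrative file name (not codebookr code): convert each flextable to an HTML tag object with flextable::htmltools_value() and combine the pieces with htmltools before writing a single HTML file.
library(flextable)
library(htmltools)
# Two toy tables standing in for codebook sections
ft1 <- flextable(head(mtcars))
ft2 <- flextable(head(iris))
# Stitch the tables (plus headings) into one HTML page
page <- tagList(
  tags$h1("Codebook"),
  tags$h2("mtcars"),
  htmltools_value(ft1),
  tags$h2("iris"),
  htmltools_value(ft2)
)
save_html(page, file = "codebook_test.html")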
Solution
The solution for this problem came from: https://ardata-fr.github.io/officeverse/officer-for-word.html#external-documents
Inserting a document of course allows you to integrate a previously-created Word document into another document. This can be useful when certain parts of a document need to be written manually but automatically integrated into a final document. The document to be inserted must be in docx format. This can be done by using function body_add_docx(). This can be advantageous when you are generating huge documents and the generation is getting slower and slower. It is necessary to generate smaller documents and to design a main script that inserts the different documents into a main Word document.
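A minimal sketch of that idea, using illustrative file names and toy per-variable tables (not the actual codebook internals): write each variable's section to its own small docx, then insert the pieces into a main document with officer::body_add_docx().
library(officer)
library(flextable)
# Write one small document per variable (toy content)
for (v in names(mtcars)[1:3]) {
  part <- read_docx()
  part <- body_add_par(part, v, style = "heading 2")
  part <- body_add_flextable(part, flextable(data.frame(stat = "mean", value = mean(mtcars[[v]]))))
  print(part, target = paste0("part_", v, ".docx"))
}
# Stitch the parts into a main document
main <- read_docx()
for (v in names(mtcars)[1:3]) {
  main <- body_add_docx(main, src = paste0("part_", v, ".docx"))
}
print(main, target = "codebook_main.docx")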
- Clean up codebook2 code
- Move codebook2 code over to codebook and delete codebook2
- Change version number
- Document
- Check
- Commit
Working on issue #17. Codebook is slow.
library(dplyr)
library(codebookr)
library(microbenchmark)
library(profvis)
data(study)
data_stata <- haven::read_dta("inst/extdata/study.dta")
How long does it take to run on regular data?
microbenchmark(
codebook(study),
times = 10L
) # 2-3 seconds each run.
How long does it take to run on Stata data?
microbenchmark(
codebook(data_stata),
times = 10L
) # 2-3 seconds each run
So, that doesn't seem to make a huge difference.
What are the slow parts?
profvis(codebook(study))
The flextable stuff is the slowest part. I'm not sure if I can speed that up or not.
profvis(codebook(data_stata))
Flextable stuff for this one too.
Can I do the flextable stuff all at once, outside of the loop? Will that make any difference?
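As a rough experiment (an assumed restructuring with toy per-variable summaries, not how codebook() currently works), one could build all of the flextables up front and then add them to the document in a second pass, and benchmark that against the current one-table-at-a-time approach.
library(officer)
library(flextable)
# Build all flextables first
tables <- lapply(names(mtcars), function(v) {
  flextable(data.frame(variable = v, mean = mean(mtcars[[v]])))
})
# Then add them to the docx in a second pass
doc <- read_docx()
for (ft in tables) {
  doc <- body_add_flextable(doc, value = ft)
  doc <- body_add_par(doc, "")
}
print(doc, target = "batched_test.docx")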
Do more rows slow it down?
df_short <- tibble(x = rnorm(100)) # 100 rows
df_medium <- tibble(x = rnorm(10000)) # 10,000 rows
df_long <- tibble(x = rnorm(10000000)) # 10,000,000 rows
microbenchmark(
codebook(df_short), # Mean = 347 milliseconds
codebook(df_medium), # Mean = 1589 milliseconds
codebook(df_long), # Mean = 4212 milliseconds
times = 10L
)
So, adding more observations slows it down.
- 100 to 10,000 rows ≈ 4.6 times as long
- 100 to 10,000,000 rows ≈ 12 times as long
Do more columns slow it down?
# Keep only the first 100 rows of df_medium
df_medium <- df_medium[1:100,]
# Make (up to) 100 unique column names from combinations of letters
set.seed(123)
cols <- unique(paste0(sample(letters, 100, TRUE), sample(letters, 100, TRUE), sample(letters, 100, TRUE)))
for (col in cols) {
df_medium[[col]] <- rnorm(100)
}
microbenchmark(
codebook(df_short), # Mean = 300 milliseconds
codebook(df_medium), # Mean = 52776 milliseconds (52 seconds)
times = 1L
)
So, adding more columns slows it down A LOT!
- 1 column to 100 columns ≈ 175 times as long!
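To probe whether that growth is roughly linear in the number of columns, a follow-up benchmark along these lines could be run (a sketch using a hypothetical make_df() helper; timings not recorded here).
# Hypothetical helper: a tibble with n_cols numeric columns and n_rows rows
make_df <- function(n_cols, n_rows = 100) {
  as_tibble(as.data.frame(replicate(n_cols, rnorm(n_rows))))
}
microbenchmark(
  codebook(make_df(1)),
  codebook(make_df(10)),
  codebook(make_df(50)),
  times = 1L
)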
What parts of the code take the longest to run?
profvis(codebook(df_short))
The flextable parts take the longest (i.e., body_add_flextable and regular_table).
profvis(codebook(df_medium))
The flextable parts take the longest (i.e., body_add_flextable, body_add_par, and regular_table).
profvis(codebook(df_long))
unique.default and cb_add_summary_stats take the longest.
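That fits the row-count results above, since a step like unique() has to scan every row. A quick, rough check of unique() on its own (vectors created here just for the comparison):
x_small <- rnorm(1e2)
x_big   <- rnorm(1e7)
microbenchmark(
  unique(x_small),
  unique(x_big),
  times = 5L
)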
There isn't a way for me to change the internals of the flextable functions, but I do wonder whether applying them in a different way would speed things up.