Create a complete README example of using codebook
Create a simple example of using codebook from start to finish to go on README. For now, I think that's all we'll need. We can add a vignette later if we really need to.
Note: Because this was my first attempt at using codebook since it was in bfuncs (and the code was somewhat outdated), a lot of changes came up in the process of creating a simple example for README. Rather than create a separate issue for each of them, I document them below.
I have all of the codebook R scripts moved over to the project files. I haven't used them in a long time, though. I can simultaneously relearn how to use the functions and create a complete example directly on the README.
Working on README and got kind of off track. Here's what's going on.
I decided to change all the codebook prefixes to cb (5e3092f and 9071bdb)
- Then I ran `document()` to update the documentation
  - That raised a warning: `Topic 'cb_summary_stats_to_ft': no parameters to inherit with @inheritParams`. I wasn't sure how to fix it or how big of a deal it was.
    - So, I ran `check()`, which raised another warning: `'::' or ':::' imports not declared from: ‘bfuncs’`. But, bfuncs is not supposed to be in codebook anymore. So I searched for the file(s) using bfuncs. It was just one: `cb_summary_stats_few_cats()`
      - Then, I tried to replace `bfuncs::freq_table()` with `freqtables::freq_table()`. But I thought, "If I'm going to do that, I should go ahead and create a unit test to help future proof it."
That's where I'm at now. I need to work my way back up through this list and finish README.
- Remove bfuncs
- Clean up `check()` warnings
- Fix inherits
- Run document
Categorical stats and factors
While running `check()` I came across a problem with `cb_summary_stats_few_cats`. After changing `sex` from a character vector to a factor vector in the example study data, this chunk of code no longer worked:
# Change the category label for missing values from NA to "Missing"
dplyr::mutate(cat = tidyr::replace_na(cat, "Missing"))
It doesn't work because if `tidyr::replace_na` were to change NA to "Missing" it would essentially be adding a new factor level, which is beyond its scope. So, we have to first change the variable to a character vector and then change NA to "Missing". This does NOT change the variable from a factor vector to a character vector in the main data frame -- only in the frequency table data frame that will be inserted into the codebook.
Here is the new code:
dplyr::mutate(
cat = as.character(cat),
cat = tidyr::replace_na(cat, "Missing")
)
We also made a similar change to `cb_summary_stats_many_cats`.
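A minimal sketch of the fix described above, assuming dplyr and tidyr are installed. The `freq` tibble is a made-up stand-in for the frequency table, not the package's actual intermediate object.

```r
library(dplyr)
library(tidyr)

# Hypothetical frequency table where the category column is a factor with an NA
freq <- tibble(
  cat = factor(c("Female", "Male", NA)),
  n   = c(10L, 9L, 1L)
)

# Convert to character first, then relabel NA. "Missing" is not an existing
# factor level, so it cannot be filled in while cat is still a factor.
freq_fixed <- freq %>%
  mutate(
    cat = as.character(cat),
    cat = replace_na(cat, "Missing")
  )
```

Only `freq_fixed` changes type here; the original `freq$cat` is still a factor, which mirrors the point that the main data frame is left alone.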
After making this change and running `check()` again, there was an error raised by the following code from `cb_summary_stats_many_cats`:
lowest <- df %>%
dplyr::group_by({{ .x }}) %>%
dplyr::summarise(n = n()) %>%
Essentially, the error wanted me to add `dplyr::` to `n()`. Instead, I decided to replace this code with
lowest <- df %>%
dplyr::count({{ .x }}) %>%
(The same applies to the code that calculates `highest`)
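To illustrate why the swap is safe, here is a small sketch comparing the two forms on toy data. The function names `count_old()` and `count_new()` are hypothetical and exist only for this comparison.

```r
library(dplyr)

# Original pattern, with n() namespaced so check() does not complain
count_old <- function(df, .x) {
  df %>%
    dplyr::group_by({{ .x }}) %>%
    dplyr::summarise(n = dplyr::n())
}

# Replacement pattern: count() groups and tallies in one step
count_new <- function(df, .x) {
  df %>%
    dplyr::count({{ .x }})
}

toy <- tibble(sex = c("F", "M", "F"))
old <- count_old(toy, sex)
new <- count_new(toy, sex)
```

Both calls should return one row per category with the same counts in `n`.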
- Fix unit test for cb_summary_stats_few_cats
- Make sure `cb_summary_stats_many_cats` works with factors
- Create a unit test for `cb_summary_stats_many_cats`
- Fix build checks
Update to rlang 0.4.0
In the process of trying to make this change, it turns out that `rlang::sym(.x)` is no longer valid code. It returns the following error: `Error in rlang::sym() at codebookr/R/cb_summary_stats_many_cats.R:19:2: Can't convert a function to a symbol.` As of rlang 0.4.0, this line of code is unnecessary. I'm removing it from the code and replacing all of the `!!x`s with `{{ .x }}` -- the new preferred tidy evaluation syntax.
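For reference, a minimal sketch of the two styles side by side, assuming dplyr and rlang >= 0.4.0. The helpers `mean_old()` and `mean_new()` are hypothetical and are not functions in this package.

```r
library(dplyr)

# Older style: capture the argument with enquo(), then unquote it with !!
mean_old <- function(df, x) {
  x <- rlang::enquo(x)
  df %>% dplyr::summarise(mean = mean(!!x))
}

# rlang 0.4.0 style: curly-curly captures and unquotes in one step
mean_new <- function(df, x) {
  df %>% dplyr::summarise(mean = mean({{ x }}))
}

res_old <- mean_old(mtcars, mpg)
res_new <- mean_new(mtcars, mpg)
```

Both versions expect a bare column name (for example `mpg`), not a string.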
- Update all of the tidy evaluation code to use rlang 0.4.0 (i.e., curly-curly) syntax where appropriate.
- Fix build checks
Updates to the codebook function
I'm finally to the point where I'm running the `codebook()` function and I've run into a couple of issues.
- I got an error that says `Error in group_by(): ! Must group by variables found in .data. ✖ Column .x is not found.`
- I want to change the example code to use the study data
But there are some bigger, more fundamental changes I want to make too.
Remove the path argument
Currently, the `codebook()` function essentially expects you to pass it the data frame twice in two different ways:
- It expects you to pass an in-memory (i.e., in the global environment) version of the data frame to the `df` argument. This version of the data frame is the one that gets most of the action inside the `codebook` function.
- It expects you to pass a path to an on-disk (i.e., saved as .csv, .rds, etc.) version of the data frame to the `path` argument. It looks like this is only used to gather the last modified date value for the metadata table in the codebook.
So, the only purpose of the path argument is to gather the last modified date value for the metadata table in the codebook; yet, it comes with several downsides.
- Having a `df` and `path` argument in the `codebook()` function is confusing for the user. What if there isn't an on-disk version of the data frame? What if there are multiple on-disk versions of the data frame saved in different formats? Which one do you use?
- I had to copy a bunch of code from `utils:::format.object_size`. There's a note that says I copied the code because CRAN won't allow me to use `:::` inside of my function. All of this code can be removed if I get rid of the `path` argument.
Instead, I can add a `Last updated` value to the metadata table. This would get around the downsides of using the path argument, and the date the codebook was last updated is probably no less useful than the date the file was last modified in most cases.
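As a rough sketch of the idea, the date can come from `Sys.Date()` with no file path involved. The `meta` data frame and its column names below are assumptions for illustration, not the package's actual metadata structure.

```r
# Hypothetical metadata table; only the row labels come from this issue
meta <- data.frame(
  label = c("Dataset name:", "Last updated:"),
  value = c("study", format(Sys.Date(), "%Y-%m-%d")),
  stringsAsFactors = FALSE
)
```

Because the value is computed when the codebook is built, it works the same whether or not an on-disk copy of the data exists.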
The group_by() error
After removing the path argument, I came back to the group_by error. I don't have it completely figured out yet, but it seems like removing all of the `enquo()` syntax and replacing it with the curly-curly syntax is changing the way that `.x` is being passed down through the cascade of functions that create the summary tables.
For example, `id`, the first column in `study`, gets passed to the `cb_add_summary_stats()` function inside of a loop inside of the `codebook` function via the `.x` argument (i.e., `cb_add_summary_stats(col_nms[[i]])`).
It then gets passed to the `cb_summary_stats_many_cats()` function inside of the `cb_add_summary_stats()` function via the `.x` argument (i.e., `cb_summary_stats_many_cats(df, .x, n_extreme_cats)`).
It appears as though this is where the problem is. It's getting passed to `cb_summary_stats_many_cats()` as a literal `.x` instead of a quosure. So, I think I need to add the `enquo()` syntax back to `cb_add_summary_stats()`.
The fix for the group_by error was to use `.data[[.x]]` syntax instead of `{{ .x }}` syntax. I documented the solution here
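A minimal sketch of the pattern, assuming column names arrive as strings from a loop over `names()`. The helper `count_values()` and the toy `study` tibble are hypothetical; only the `.data[[.x]]` idiom and the `col_nms[[i]]` loop shape come from the discussion above.

```r
library(dplyr)

# Hypothetical helper: .x holds a column name as a string, so it is looked up
# with the .data pronoun rather than embraced with {{ }}
count_values <- function(df, .x) {
  df %>%
    dplyr::group_by(.data[[.x]]) %>%
    dplyr::summarise(n = dplyr::n())
}

study <- dplyr::tibble(id = 1:4, sex = c("F", "M", "F", "F"))
col_nms <- names(study)

results <- list()
for (i in seq_along(col_nms)) {
  results[[col_nms[[i]]]] <- count_values(study, col_nms[[i]])
}
```

The curly-curly form is meant for bare column names typed by the caller; when the name is already a string stored in a variable, `.data[[.x]]` is the form that resolves it against the data frame.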
dplyr Error in `stop_vctrs()`: ! `x` must be a vector, not a <> object when a class is added
The `cb_add_summary_stats()` function adds a new class to each of the summary stats data frames. The `cb_summary_stats_to_ft()` function uses that class to determine which method to use to make a flextable from the summary stats data frame.
When the data frame with the added class was passed to the line `dplyr::mutate(across(everything(), as.character))` in `cb_summary_stats_to_ft.summary_many_cats`, it was returning the following error: `Error in stop_vctrs(): ! x must be a vector, not a <tbl_df/tbl/data.frame/make_char> object.` After Googling a little bit, I found the following in the breaking changes section of the changelog for dplyr 1.0.0:
Extending data frames requires that the extra class or classes are added first, not last. Having the extra class at the end causes some vctrs operations to fail with a message like: `Input must be a vector, not a <data.frame/...> object`
Adding the new class to the front of the class list fixes the problem.
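A short sketch of the working order, assuming dplyr >= 1.0.0. The `stats` tibble is toy data; the class name `summary_many_cats` is taken from the method name mentioned above.

```r
library(dplyr)

# Toy summary table standing in for the real summary stats data frame
stats <- dplyr::tibble(cat = c("a", "b"), n = c(3L, 1L))

# Prepend the extra class so the data frame classes stay at the end
class(stats) <- c("summary_many_cats", class(stats))

# With the class in front, the mutate() call from the method runs normally
out <- stats %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), as.character))
```

Appending instead, as in `class(stats) <- c(class(stats), "summary_many_cats")`, is the ordering the changelog warns about.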
Now, I think I need to add a logical vector and a pure time vector for testing.
Checking to make sure the user isn't piping the data frame into codebook
One of the codebook checks is to make sure the user doesn't pipe the data frame into the `codebook` function. When they do, `Dataset name:` in the metadata table (below) is ".". I wanted to fix that, but I don't think it's going to be possible (https://stackoverflow.com/questions/42560389/get-name-of-dataframe-passed-through-pipe-in-r). So, I updated the message to be a little clearer instead.
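The behavior can be reproduced in a few lines, assuming the dataset name is captured with `deparse(substitute())` (an assumption about the implementation, not something confirmed here). The helper `get_name()` is hypothetical.

```r
library(magrittr)

# Hypothetical helper that records the name of the object it was given
get_name <- function(df) {
  deparse(substitute(df))
}

direct <- get_name(mtcars)        # called directly with the data frame
piped  <- mtcars %>% get_name()   # data frame piped in with %>%
```

Called directly, the helper sees `mtcars`; through `%>%`, it only sees the placeholder `.`, which is why the name cannot be recovered inside the function.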
- Remove the path argument
- Add last updated date to metadata table
- Fix group_by error
- Change the example code to use the study data
- Add logical vector to study
- Add a time vector to study
- Add likert to study
- Deal with time vector in cb_add_summary_stats
- Create test for cb_add_summary_stats
- Improve documentation for the cb_add_summary_stats arguments - use language in test
- Create test for codebook
- Figure out what to do with `officer::body_remove()` on line 57 of `codebook`
- Run checks
- Push commits
Clean up the Word document formatting
- Remove NULLs from the codebook tables
- Fix the Frequeny typo
- Remove or add defaults for column description, source information, and column type
- Time doubles all of the attributes in the column attributes table. Concatenate into a single string.
- Fix the double space between "All" and "20" for the mode value of the time column
Add an example of using data imported from Stata
I think completing this will also help with #9
Haven labeled columns
When you import data from SAS, Stata, or SPSS using Haven, it adds two classes to variables with value labels: `haven_labelled` and `vctrs_vctr`. Passing these columns to `codebook()` results in the following error:
Error in cb_add_summary_stats(., col_nms[[i]]) :
Column sex is of unknown type. Please set the col_type attribute
One way to get around this is simply to set the `col_type` attribute like this:
study <- study %>%
cb_add_col_attributes(sex, col_type = "Categorical")
However, because Haven labeled data is so common, we decided to specifically look for and remove those classes in `cb_summary_stats.R`. It should not remove those classes from the column generally -- just for the process of determining the column type and calculating descriptive statistics.
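One way this could look, as a base R sketch. The helper `drop_haven_classes()` is hypothetical, and the `sex` vector is built by hand to mimic a Haven import rather than read from a real file.

```r
# Hand-built stand-in for a Haven-imported column with value labels
sex <- structure(
  c(1, 2, 1),
  labels = c(Female = 1, Male = 2),
  class  = c("haven_labelled", "vctrs_vctr", "double")
)

# Hypothetical helper: returns a copy without the two Haven classes
drop_haven_classes <- function(x) {
  keep <- setdiff(oldClass(x), c("haven_labelled", "vctrs_vctr"))
  oldClass(x) <- if (length(keep) > 0) keep else NULL
  x
}

# Work on the copy when determining column type; the original keeps its classes
sex_plain <- drop_haven_classes(sex)
```

Because R copies on modification, `sex` itself is untouched and other attributes such as `labels` carry over to `sex_plain`.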
Using Haven labels
When we import data from Stata, SAS, or SPSS with labels, the attributes are called `$label` for variable labels and `$labels` for value labels. Currently, `codebook()` cannot automatically make use of those attributes because it only recognizes the attributes `description`, `source`, and `col_type`. It's relatively easy to manually set the value of the `description` attribute to the value of the `label` attribute like this:
attr(study$sex, "description") <- attr(study$sex, "label")
This can be extended with a for loop. However, because Haven labeled data is so common, we decided to specifically look for `$label` and `$labels` in `cb_get_col_attibutes`.
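A sketch of the for loop version in base R. The toy `study` data frame and its label text are made up for illustration.

```r
# Toy data frame with a variable label on one column only
study <- data.frame(id = 1:3, sex = c(1, 2, 1))
attr(study$sex, "label") <- "Sex of participant"

# Copy each column's label attribute, when present, to description.
# exact = TRUE stops attr() from partially matching "labels" (value labels).
for (col in names(study)) {
  lab <- attr(study[[col]], "label", exact = TRUE)
  if (!is.null(lab)) {
    attr(study[[col]], "description") <- lab
  }
}
```

Columns without a `label` attribute are skipped, so they keep whatever `description` they already had, if any.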
- Figure out if I need to do anything special to use the data in inst/extdata
- Build to make sure there are no issues with the files that are being added (e.g., coderbookr_graphics.pptx)