brad-cannell/codebookr

Create a complete README example of using codebook

Closed this issue · 7 comments

Create a simple example of using codebook from start to finish to go on README. For now, I think that's all we'll need. We can add a vignette later if we really need to.

Note: Because this was my first attempt at using codebook since it was in bfuncs (and the code was somewhat outdated), a lot of changes came up in the process of creating a simple example for README. Rather than create a separate issue for each of them, I document them below.

I have all of the codebook R scripts moved over to the project files. I haven't used them in a long time, though. I can simultaneously relearn how to use the functions and create a complete example directly on the README.

Working on README and got kind of off track. Here's what's going on.
I decided to change all the codebook prefixes to cb (5e3092f and 9071bdb)

  • Then I ran document() to update the documentation
    • That raised a warning: Topic 'cb_summary_stats_to_ft': no parameters to inherit with @inheritParams. I wasn't sure how to fix it or how big of a deal it was.
      • So, I ran check(), which raised another warning: '::' or ':::' imports not declared from: ‘bfuncs’. But, bfuncs is not supposed to be in codebook anymore. So I searched for the file(s) using bfuncs. It was just one: cb_summary_stats_few_cats()
        • Then, I tried to replace bfuncs::freq_table() with freqtables::freq_table(). But I thought, "If I'm going to do that, I should go ahead and create a unit test to help future proof it."

That's where I'm at now. I need to work my way back up through this list and finish README.

  • Remove bfuncs
  • Clean up check() warnings
  • Fix inherits
  • Run document

Categorical stats and factors

While running check() I came across a problem with cb_summary_stats_few_cats. After changing sex from a character vector to a factor vector in the example study data, this chunk of code no longer worked:

# Change the category label for missing values from NA to "Missing"
dplyr::mutate(cat = tidyr::replace_na(cat, "Missing"))

It doesn't work because if tidyr::replace_na were to change NA to "Missing" it would essentially be adding a new factor level, which is beyond its scope. So, we have to first change the variable to a character vector and then change NA to "Missing". This does NOT change the variable from a factor vector to a character vector in the main data frame -- only in the frequency table data frame that will be inserted into the codebook.

Here is the new code:

dplyr::mutate(
      cat = as.character(cat),
      cat = tidyr::replace_na(cat, "Missing")
    )

We also made a similar change to cb_summary_stats_many_cats.
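
As a rough illustration of the fix in isolation, here is a minimal sketch using a made-up frequency table (not the study data or the actual internals of cb_summary_stats_few_cats). The tibble and its values are assumptions for demonstration only.

```r
library(dplyr)
library(tidyr)

# Made-up frequency table with a factor category column (illustrative only)
freq <- tibble::tibble(
  cat = factor(c("Female", "Male", NA)),
  n   = c(10L, 8L, 2L)
)

# "Missing" is not an existing level of the factor, which is why it
# cannot simply be filled in while cat is still a factor
"Missing" %in% levels(freq$cat)

# Convert to character first, then replace NA with "Missing"
freq_fixed <- freq %>%
  mutate(
    cat = as.character(cat),
    cat = replace_na(cat, "Missing")
  )
```

Only the copy in freq_fixed becomes a character column; the original freq$cat is still a factor.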

After making this change and running check() again, there was an error raised by the following code from cb_summary_stats_many_cats:

lowest <- df %>%
    dplyr::group_by({{ .x }}) %>%
    dplyr::summarise(n = n()) %>%

Essentially, the error wanted me to add dplyr:: to n(). Instead, I decided to replace this code with

lowest <- df %>%
    dplyr::count({{ .x }}) %>%

(The same applies to the code that calculates highest)
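
For reference, a small sketch of why the two forms are interchangeable here. The data frame is made up, and the first version uses the namespaced dplyr::n() that the check was asking for.

```r
library(dplyr)

# Made-up data (illustrative only)
df <- tibble::tibble(x = c("a", "b", "a", NA))

# Original approach, with n() namespaced
by_summarise <- df %>%
  group_by(x) %>%
  summarise(n = dplyr::n())

# Replacement: count() does the grouping and tallying in one step
by_count <- df %>%
  count(x)
```

Both results have one row per value of x with the tally in a column named n.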

  • Fix unit test for cb_summary_stats_few_cats
  • Make sure cb_summary_stats_many_cats works with factors
  • Create a unit test for cb_summary_stats_many_cats
  • Fix build checks

Update to rlang 0.4.0

While trying to make this change, I found that rlang::sym(.x) is no longer valid code. It returns the following error: Error in `rlang::sym()` at codebookr/R/cb_summary_stats_many_cats.R:19:2: Can't convert a function to a symbol. As of rlang 0.4.0, this line of code is unnecessary. I'm removing it from the code and replacing every !!x with {{ .x }} -- the new preferred tidy evaluation syntax.
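
A minimal sketch of the curly-curly pattern, assuming rlang 0.4.0 or later is installed. The count_by() helper is hypothetical and only stands in for the kind of function being updated; it is not part of codebookr.

```r
library(dplyr)

# Hypothetical helper (not a codebookr function): the caller passes a bare
# column name, and {{ .x }} forwards it to dplyr without enquo() and !!
count_by <- function(df, .x) {
  df %>%
    count({{ .x }})
}

out <- count_by(mtcars, cyl)
```

This only works when the caller supplies an unquoted column name, which matters for the group_by() error described further down.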

  • Update all of the tidy evaluation code to use rlang 0.4.0 (i.e., curly-curly) syntax where appropriate.
  • Fix build checks

Updates to the codebook function

I'm finally to the point where I'm running the codebook() function and I've run into a couple of issues.

  • I got an error that says Error in `group_by()`: ! Must group by variables found in `.data`. ✖ Column `.x` is not found.
  • I want to change the example code to use the study data

But there are some bigger, more fundamental changes I want to make too.

Remove the path argument

Currently, the codebook() function essentially expects you to pass it the data frame twice in two different ways:

  1. It expects you to pass an in-memory (i.e., in the global environment) version of the data frame to the df argument. This version of the data frame is the one that gets most of the action inside the codebook function.
  2. It expects you to pass a path to an on-disk (i.e., saved as .csv, .rds, etc.) version of the data frame to the path argument. It looks like this is only used to gather the last modified date value for the metadata table in the codebook.

So, the only purpose of the path argument is to gather the last modified date value for the metadata table in the codebook; yet, it comes with several downsides.

  1. Having a df and path argument in the codebook() function is confusing for the user. What if there isn't an on-disk version of the data frame? What if there are multiple on-disk versions of the data frame saved in different formats? Which one do you use?
  2. I had to copy a bunch of code from utils:::format.object_size. There's a note that says I copied the code because CRAN won't allow me to use ::: inside of my function. All of this code can be removed if I get rid of the path argument.

Instead, I can add a Last updated value to the metadata table. This would get around the downsides of using the path argument, and the date the codebook was last updated is probably no less useful than the date the file was last modified in most cases.
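
Something along these lines is what I have in mind. The field names and table layout below are placeholders, not the actual metadata table built inside codebook().

```r
# Hypothetical metadata rows; the real table in codebook() may look different
metadata <- data.frame(
  Field = c("Dataset name:", "Last updated:"),
  Value = c("study", format(Sys.Date(), "%Y-%m-%d")),
  stringsAsFactors = FALSE
)
```

Because the date comes from Sys.Date() at the time the codebook is built, no on-disk file is needed.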

The group_by() error

After removing the path argument, I came back to the group_by error. I don't have it completely figured out yet, but it seems like removing all of the enquo() syntax and replacing it with the curly-curly syntax is changing the way that .x is being passed down through the cascade of functions that create the summary tables.

For example, id, the first column in study, gets passed to the cb_add_summary_stats() function inside of a loop inside of the codebook function via the .x argument (i.e., cb_add_summary_stats(col_nms[[i]])).

It then gets passed to the cb_summary_stats_many_cats() function inside of the cb_add_summary_stats() function via the .x argument (i.e., cb_summary_stats_many_cats(df, .x, n_extreme_cats)).

It appears as though this is where the problem is. It's getting passed to cb_summary_stats_many_cats() as a literal .x instead of a quosure. So, I think I need to add the enquo() syntax back to cb_add_summary_stats().

The fix for the group_by error was to use .data[[.x]] syntax instead of {{ .x }} syntax. I documented the solution here
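
A simplified sketch of the pattern, assuming the column names arrive as strings from a loop over names, as they do with col_nms[[i]]. The count_col() function is a hypothetical stand-in for the inner summary functions, and mtcars stands in for the study data.

```r
library(dplyr)

# Hypothetical stand-in for the inner summary function.
# .x is a string such as "cyl", so index the .data pronoun with it
# instead of embracing it with {{ .x }}.
count_col <- function(df, .x) {
  df %>%
    group_by(.data[[.x]]) %>%
    summarise(n = dplyr::n())
}

col_nms <- c("cyl", "gear")
results <- vector("list", length(col_nms))

# Mirrors the loop in codebook(): each name is passed down as a string
for (i in seq_along(col_nms)) {
  results[[i]] <- count_col(mtcars, col_nms[[i]])
}
```

The {{ .x }} form expects a bare column name typed by the caller, while .data[[.x]] looks a column up by the string stored in .x, which is what a loop over names provides.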

dplyr Error in stop_vctrs():! x must be a vector, not a <> object when a class is added

The cb_add_summary_stats() function adds a new class to each of the summary stats data frames. The cb_summary_stats_to_ft() function uses that class to determine which method to use to make a flextable from the summary stats data frame.

When the data frame with the added class was passed to the line dplyr::mutate(across(everything(), as.character)) in cb_summary_stats_to_ft.summary_many_cats, it returned the following error: Error in stop_vctrs(): ! x must be a vector, not a <tbl_df/tbl/data.frame/make_char> object. After Googling a little bit, I found the following in the breaking changes section of the changelog for dplyr 1.0.0:

Extending data frames requires that the extra class or classes are added first, not last. Having the extra class at the end causes some vctrs operations to fail with a message like: Input must be a vector, not a <data.frame/...> object

Adding the new class to the front of the class list fixes the problem.
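
In sketch form, with a made-up summary table (the class name matches the method mentioned above, but the data are illustrative only):

```r
library(dplyr)

# Made-up summary stats table (illustrative only)
stats <- tibble::tibble(cat = c("a", "b"), n = c(1L, 2L))

# Prepend the extra class so the tibble/data.frame classes stay at the end
class(stats) <- c("summary_many_cats", class(stats))

# With the class in front, the mutate() call that was failing goes through
out <- stats %>%
  mutate(across(everything(), as.character))
```

Appending instead, as in c(class(stats), "summary_many_cats"), is the ordering the changelog warns about.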

Now, I think I need to add a logical vector and a pure time vector for testing.

Checking to make sure the user isn't piping the data frame into codebook

One of the codebook checks is to make sure the user doesn't pipe the data frame into the codebook function. When they do, Dataset name: in the metadata table (below) is ".". I wanted to fix that, but I don't think it's going to be possible (https://stackoverflow.com/questions/42560389/get-name-of-dataframe-passed-through-pipe-in-r). So, I updated the message to be a little more clear instead.
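
The underlying behavior can be reproduced with a few lines. The get_df_name() helper here is hypothetical and only mimics capturing the name with deparse(substitute()); it is not the actual check in codebook().

```r
library(magrittr)

# Hypothetical helper that captures the name of its argument
get_df_name <- function(df) {
  deparse(substitute(df))
}

study <- data.frame(id = 1:3)

direct <- get_df_name(study)        # called directly: the name is recoverable
piped  <- study %>% get_df_name()   # magrittr pipe: the argument is `.`
```

Here direct is "study" and piped is ".", so all the function can do is detect the "." and tell the user to pass the data frame directly.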

  • Remove the path argument
  • Add last updated date to metadata table
  • Fix group_by error
  • Change the example code to use the study data
  • Add logical vector to study
  • Add a time vector to study
  • Add likert to study
  • Deal with time vector in cb_add_summary_stats
  • Create test for cb_add_summary_stats
  • Improve documentation for the cb_add_summary_stats arguments - use language in test
  • Create test for codebook
  • Figure out what to do with officer::body_remove() on line 57 of codebook
  • Run checks
  • Push commits

Clean up the Word document formatting

  • Remove NULLs from the codebook tables
  • Fix the Frequeny typo
  • Remove or add defaults for column description, source information, and column type
  • Time doubles all of the attributes in the column attributes table. Concatenate into a single string.
  • Fix the double space between "All" and "20" for the mode value of the time column

Add an example of using data imported from Stata

I think completing this will also help with #9

Haven labeled columns

When you import data from SAS, Stata, or SPSS using Haven, it adds two classes to variables with value labels: haven_labelled and vctrs_vctr. Passing these columns to codebook() results in the following error:

Error in cb_add_summary_stats(., col_nms[[i]]) : 
Column sex is of unknown type. Please set the col_type attribute

One way to get around this is simply to set the col_type attribute like this:

study <- study %>% 
  cb_add_col_attributes(sex, col_type = "Categorical")

However, because Haven labeled data is so common, we decided to specifically look for and remove those classes in cb_summary_stats.R. It should not remove those classes from the column generally -- just for the process of determining the column type and calculating descriptive statistics.
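
One possible way to do that on a working copy is sketched below. The column is a hand-built stand-in with the same class vector Haven produces for a labelled numeric column (so the example does not need haven installed), and strip_haven_classes() is a hypothetical helper, not the code in cb_summary_stats.R.

```r
# Hand-built stand-in for a column imported with haven (illustrative only)
sex <- structure(
  c(1, 2, 1, NA),
  labels = c(Female = 1, Male = 2),
  class  = c("haven_labelled", "vctrs_vctr", "double")
)

# Hypothetical helper: return a copy without the haven classes.
# unclass() drops the whole class attribute but keeps other attributes
# such as `labels`; the column in the data frame itself is not modified.
strip_haven_classes <- function(x) {
  if (inherits(x, "haven_labelled")) {
    x <- unclass(x)
  }
  x
}

sex_plain <- strip_haven_classes(sex)
```

The copy can then go through the usual column type detection, while the original sex column keeps its haven classes.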

Using Haven labels

When we import data from Stata, SAS, or SPSS with labels, the attributes are called $label for variable labels and $labels for value labels. Currently, codebook() cannot automatically make use of those attributes because it only recognizes the attributes description, source, and col_type. It's relatively easy to manually set the value of the description attribute to the value of the label attribute like this:

attr(study$sex, "description") <- attr(study$sex, "label") 

This can be extended with a for loop. However, because Haven labeled data is so common, we decided to specifically look for $label and $labels in cb_get_col_attibutes.
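
For anyone who wants the manual route, the loop could look roughly like this. The data frame and label text are made up for illustration.

```r
# Made-up data with Haven-style attributes (illustrative only)
study <- data.frame(id = 1:3, sex = c(1, 2, 1))
attr(study$sex, "label")  <- "Sex of participant"
attr(study$sex, "labels") <- c(Female = 1, Male = 2)

# Copy each variable label into the description attribute.
# attr() partially matches attribute names by default, so exact = TRUE
# keeps "label" from picking up "labels" on columns with only value labels.
for (col in names(study)) {
  lbl <- attr(study[[col]], "label", exact = TRUE)
  if (!is.null(lbl)) {
    attr(study[[col]], "description") <- lbl
  }
}
```

Columns without a label attribute, like id here, are left untouched.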

  • Figure out if I need to do anything special to use the data in inst/extdata
  • Build to make sure there are no issues with the files that are being added (e.g., coderbookr_graphics.pptx)