luisDVA/unheadr

Make many header rows into column names

Closed this issue · 3 comments

Really like this package, but, I ran into a case I could not handle. Sometimes, there are many rows at the head of a data frame that contain information on column names. They can be ragged (e.g., some cells have entries, and some do not). I used this package as inspiration to write a function to deal with this in some data processing I'm doing, but realized, you might want it to be in the package! Rather than fork and pull, here's what I wrote - modify as you like, and I think it could find a happy home here. Or maybe I missed work you already did (hope not!)

#' Make many initial row headings into names
#'
#' @param adf A `data.frame` or `tibble` object whose top few rows are the 
#' new column names.
#' @param nrows How many rows at the top are to be used to create new column 
#' names.
#' @param sliding_headers Defaults to TRUE. Many data sets only enter 
#' part of a column name once in a single row, and then have several subrows 
#' beneath them. E.g., Row 1 has "Fertilized" in column 1 and "Unfertilized" 
#' in Row 1 Column 4. Row 2 then has the same two entries repeated twice, 
#' such as "Caged" and "Uncaged". We want "Fertilized_Caged", 
#' "Fertilized_Uncaged", "Unfertilized_Caged", "Unfertilized_Uncaged". 
#' This is not always the case, however, so, a TRUE/FALSE switch.
#' @param sep How should different cells be combined? Defaults to "_".
#'
#' @return The original data frame, but with new names and the 
#' top `nrows` rows removed
#' @export
#'
#' @examples
#' 
#' test_df <- tibble::tribble(
#' ~x1,    ~x2,    ~x3,    ~x4,       ~x5,
#' "",    "Yes",  "",     "No",       "",
#' "",    "We",   "Have", "Bananas",  "",
#' "Why",  "",    "",     "",         "Today",
#' "1",    "2",   "3",    "4",        "5"
#' )
#' 
#' print(test_df)
#' print(many_name_rows(test_df, 3))

many_name_rows <- function(adf, 
                           nrows, 
                           sliding_headers=TRUE,
                           sep = "_"){
  
  #get the top rows of the data frame that contain the column names
  new_names <- adf[seq_len(nrows),] %>%
    dplyr::mutate(ROW_ID = 1:nrows)
  
  #pivot the data longer for easy manipulation
  #and give a column number, as column names are often
  #not properly sortable later. Also, make "" into NA
  #for easier string substitution later
  new_names <- new_names %>%
    tidyr::pivot_longer(cols = -ROW_ID,
                        names_to = "cols",
                        values_to = "vals") %>%
    dplyr::mutate(vals = gsub("^$", NA, vals),
                  col_num = rep(seq_len(dplyr::n_distinct(cols)),
                                dplyr::n()/dplyr::n_distinct(cols)))
  
  #if some header entries are only put in once, but should be copied
  #across many columns, slide them down using tidyr::fill
  if(sliding_headers){
    new_names <- new_names %>%
      dplyr::group_by(cols) %>%
      tidyr::fill(vals, .direction = "down") %>%
      dplyr::ungroup()
  }
  
  #collapse the column names together into single elements
  new_names <- new_names %>%
    dplyr::group_by(col_num) %>%
    dplyr::summarize(vals = paste0(vals, collapse = sep),
                     .groups = "keep") %>%
    dplyr::mutate(vals = gsub("NA_", "", vals),
                  vals = gsub("_NA", "", vals) #end of row
    )
  
  #pivot wide using the correct column order
  new_names_wide <- new_names %>%
    tidyr::pivot_wider(names_from = col_num,
                       values_from = vals)
  
  #fix the names of the old data frame
  names(adf) <- new_names_wide[1,]
  
  #strip the first few rows of the data frame
  #and return
  adf[-seq_len(nrows),]
  
}

I realize I might have re-invented unbreak_rows a wee bit... but not entirely, as particularly the fill issue is different, and this method doesn't rely on character matching. Hope it helps! I couldn't get unbreak_rows to do what I needed, and this seemed similar.

Thanks for the input and for letting me know that you use the package. I do realize that I tend to over-rely on regex matching so this looks very promising. Let me play around with your function and hopefully once I understand the workflow I can add it (with proper attribution). I've actually had to deal with these ragged headers as well.
cheers

I see it's in now! Fantastic!