sfirke/janitor

Feature Request: A function for quick basic standardization of an otherwise tidy (almost) df

AarshBatra opened this issue · 2 comments

Hello,

I would like to propose the addition of a new function called standardize_tidy_df_cols(df, ...). This function is designed to standardize the columns of a tidy dataset. As we know, a tidy dataset is one that:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value

But, even within a tidy data frame there may be a lot of frustrating issues and similar to how clean_names function standardizes column names, this new proposed function (standardize_tidy_df_cols(df, ...)) that I am working on will standardize columns of a df, which although messy, follows the 3 principles of a tidy df as mentioned above.

For example, here is an otherwise tidy data, but with a lot of potential issues:

df <- tibble(
  sno = 1:5,
  uid = c("00001", "00002", "001", "01234", "12333"),
  col1 = as.character(runif(5)),
  col2 = c("A", "0.5", "B", "1.2", "NA"),
  col3 = c("1", "0.5", "1", "1.2", NA),
  col4 = c("-2.5", "3.0", "4.2", "5.7", "1"), 
  col5 = c("-2", "-2", "4.2", "5.7", "NA"), 
  col6 = c("A", "0.5", "B", "1.2", NA),
  col7 = c("11A", "0.5, a, b", "B", "1.2", NA), 
  col8 = c("11", "6E+09", "6e-04", "NA", NA), 
  col9 = c(NA, NA, NA, NA, NA), 
  col10 =  c("NA", "NA", "NA", "NA", "NA"),  
  log_col = c(TRUE, FALSE, TRUE, FALSE, TRUE), 
  log_col2 =  c("TRUE", "FALSE", "TRUE", "FALSE", "TRUE")
)

Here, are some of the issues:

  • uid col has leading zeros, so should not be coerced to numeric, as otherwise it will alter the numbers.
  • There are two types of NAs: NA, "NA".
  • col3, col4, col5 have numeric values stored in strings and are of class character, but should be of class numeric
  • col8 has numeric values stored in strings, with some scientific notation also stored in strings and as a result is of class character, but should be of class numeric.
  • Two types of logical columns, one that has logical values stored as strings and the other one as actual logical values.

The function that I am writing and improving on currently, fixes all of the above issues and has a parameter to preserve those columns on which the transformations should not be applied, so in the above example, the uid's can be kept as class character, for example, by specifying in the function call. It also takes care of things like: replacing "NA" with to NA in a numeric column, and replacing "NA" with "" in a character column.

I can imaging people getting stuck in similar problems, which are not specific to my dataset, but largely applies to a lot of datasets, and instead of coercing to right types, and fixing these one at a time, especially when there are 100s of columns, they can possibly use this function first and then see where they are at and hopefully it will take care of a lot of their issues.

This function of course doesn't fix everything and also requires that a data frame supplied to it should follow the tidy data principles, but in my experience, I have come across a lot of datasets that follow these principles, but then need much more tidying.

Do you think there is value in adding such a function to janitor?

Please let me know and we can chat, thanks!

Best,
Aarsh

Hi and thanks for the suggestion! Here is my take on designing function(s) that tackle this problem.

  1. I would make this many functions that are as discrete / separate as possible.
  2. I would separate the examination portion from the fixing portion.

As a user my desired workflow would be to run a suite of examination checks, get a list of issues with suggested fixes, and then be able to specify a set of those fixes to be applied. I would not be comfortable applying a set of changes without review, and I expect in many cases the user will need to intervene manually after the examination function finds a possible issue.

You might leverage an R package that does assertive checks for the inspection part. I remember assertive and assertr from years ago but I'm not currently on what's leading these days.

I think there's value in this and in theory it's in-scope for the janitor package but:

  • It will be a lot of work and I can't commit to reviewing and supporting it. I expect it will take many cycles to get it from good to great (as say clean_names changed many times over the years) and the number of possible data quality checks is vast.
  • A package that does these things might already exist and you could improve it instead of starting from scratch. I would start by doing exhaustive research
  • No harm to developing it as a standalone package. If it got to a great place and you still wanted to merge it into janitor, it could happen then

In summary, I think this could be a valuable suite of tools, but a huge undertaking to do a rigorous and complete job. I'd suggest first making sure it doesn't already exist, then if it doesn't, iterating on this for a while. If you end up developing this, I'd be interested in taking a look - feel free to post back here and I'll see it.

I'm going to close this issue as unplanned, but feel free to reply if you have a question or want to discuss further. Good luck with this effort!

Thanks for your reply, @sfirke, this is helpful. I'll keep that in mind for future, while I improve on this.