Feature Request: A function for quick basic standardization of an otherwise tidy (almost) df
AarshBatra opened this issue · 2 comments
Hello,
I would like to propose the addition of a new function called standardize_tidy_df_cols(df, ...)
. This function is designed to standardize the columns of a tidy dataset. As we know, a tidy dataset is one that:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value
But, even within a tidy data frame there may be a lot of frustrating issues and similar to how clean_names
function standardizes column names, this new proposed function (standardize_tidy_df_cols(df, ...)
) that I am working on will standardize columns of a df, which although messy, follows the 3 principles of a tidy df as mentioned above.
For example, here is an otherwise tidy data, but with a lot of potential issues:
df <- tibble(
sno = 1:5,
uid = c("00001", "00002", "001", "01234", "12333"),
col1 = as.character(runif(5)),
col2 = c("A", "0.5", "B", "1.2", "NA"),
col3 = c("1", "0.5", "1", "1.2", NA),
col4 = c("-2.5", "3.0", "4.2", "5.7", "1"),
col5 = c("-2", "-2", "4.2", "5.7", "NA"),
col6 = c("A", "0.5", "B", "1.2", NA),
col7 = c("11A", "0.5, a, b", "B", "1.2", NA),
col8 = c("11", "6E+09", "6e-04", "NA", NA),
col9 = c(NA, NA, NA, NA, NA),
col10 = c("NA", "NA", "NA", "NA", "NA"),
log_col = c(TRUE, FALSE, TRUE, FALSE, TRUE),
log_col2 = c("TRUE", "FALSE", "TRUE", "FALSE", "TRUE")
)
Here, are some of the issues:
uid
col has leading zeros, so should not be coerced tonumeric
, as otherwise it will alter the numbers.- There are two types of NAs:
NA
,"NA"
. col3
,col4
,col5
have numeric values stored in strings and are of classcharacter
, but should be of classnumeric
col8
has numeric values stored in strings, with some scientific notation also stored in strings and as a result is of classcharacter
, but should be of classnumeric
.- Two types of
logical
columns, one that has logical values stored as strings and the other one as actual logical values.
The function that I am writing and improving on currently, fixes all of the above issues and has a parameter to preserve those columns on which the transformations should not be applied, so in the above example, the uid's can be kept as class character, for example, by specifying in the function call. It also takes care of things like: replacing "NA"
with to NA
in a numeric
column, and replacing "NA"
with ""
in a character
column.
I can imaging people getting stuck in similar problems, which are not specific to my dataset, but largely applies to a lot of datasets, and instead of coercing to right types, and fixing these one at a time, especially when there are 100s of columns, they can possibly use this function first and then see where they are at and hopefully it will take care of a lot of their issues.
This function of course doesn't fix everything and also requires that a data frame supplied to it should follow the tidy data principles, but in my experience, I have come across a lot of datasets that follow these principles, but then need much more tidying.
Do you think there is value in adding such a function to janitor?
Please let me know and we can chat, thanks!
Best,
Aarsh
Hi and thanks for the suggestion! Here is my take on designing function(s) that tackle this problem.
- I would make this many functions that are as discrete / separate as possible.
- I would separate the examination portion from the fixing portion.
As a user my desired workflow would be to run a suite of examination checks, get a list of issues with suggested fixes, and then be able to specify a set of those fixes to be applied. I would not be comfortable applying a set of changes without review, and I expect in many cases the user will need to intervene manually after the examination function finds a possible issue.
You might leverage an R package that does assertive checks for the inspection part. I remember assertive
and assertr
from years ago but I'm not currently on what's leading these days.
I think there's value in this and in theory it's in-scope for the janitor package but:
- It will be a lot of work and I can't commit to reviewing and supporting it. I expect it will take many cycles to get it from good to great (as say
clean_names
changed many times over the years) and the number of possible data quality checks is vast. - A package that does these things might already exist and you could improve it instead of starting from scratch. I would start by doing exhaustive research
- No harm to developing it as a standalone package. If it got to a great place and you still wanted to merge it into janitor, it could happen then
In summary, I think this could be a valuable suite of tools, but a huge undertaking to do a rigorous and complete job. I'd suggest first making sure it doesn't already exist, then if it doesn't, iterating on this for a while. If you end up developing this, I'd be interested in taking a look - feel free to post back here and I'll see it.
I'm going to close this issue as unplanned, but feel free to reply if you have a question or want to discuss further. Good luck with this effort!
Thanks for your reply, @sfirke, this is helpful. I'll keep that in mind for future, while I improve on this.