Data Cleaning vs Exploratory

Question

Opened this issue 2 years ago · 1 comment

Hi Dr. McGowan,

I'm using your tidycode package for my independent study. I ran it on one of my R scripts, written in tidyverse syntax, and compared the result to my own (eyeballed) classification. There is one discrepancy: I would classify the functions as "Exploratory", but the tidycode package gave "Data Cleaning". I recreated those lines, replacing my dataset with the built-in mtcars dataset, and obtained the same result (functions such as summarize() and mean() are classified as Data Cleaning rather than Exploratory):

library(tidyverse)
data(mtcars)

mtcars %>% summarize(mean(hp, na.rm = TRUE))
mtcars %>% group_by(cyl) %>% summarize(mean(wt, na.rm = TRUE))

Does the package classify all dplyr functions as Data Cleaning? Is there any way we can remedy this? Thank you.

Answer 1 · 2022-03-23T22:53:22.000Z

The classifications are crowdsourced: the "score" is the proportion of classifications that assigned a given function to that class. If you need something different for a specific purpose, you can create your own classification lexicon and apply that instead. The package doesn't classify all dplyr functions as "Data Cleaning", but the classification method could definitely be improved! One idea would be to have various "lexicons" based on the context. Currently we only have two: one crowdsourced from the "general public" (participants were mostly recruited via Twitter), and one from members of Jeff Leek's lab at the time.
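
As a rough sketch of the custom-lexicon route, the snippet below rebuilds the two pipelines from the question, extracts their function calls with tidycode, and joins them against a hand-written lexicon. The `my_lexicon` table and its labels are hypothetical, made up for illustration. The tidycode calls (`read_rfiles()`, `unnest_calls()`, `get_classifications()`) are used the way the package README shows them, so it is worth checking them against your installed version.

```r
library(dplyr)
library(tidycode)

# Write the two pipelines from the question to a temporary script,
# since read_rfiles() takes file paths.
script <- tempfile(fileext = ".R")
writeLines(c(
  "mtcars %>% summarize(mean(hp, na.rm = TRUE))",
  "mtcars %>% group_by(cyl) %>% summarize(mean(wt, na.rm = TRUE))"
), script)

# One row per function call found in the script.
calls <- read_rfiles(script) %>%
  unnest_calls(expr)

# Built-in crowdsourced lexicon: every class a function received,
# with its score (proportion of classifications).
builtin <- get_classifications("crowdsource", include_duplicates = TRUE) %>%
  filter(func %in% c("summarize", "mean"))

# Hypothetical custom lexicon (an assumption, not part of tidycode):
# any data frame with `func` and `classification` columns will join.
my_lexicon <- data.frame(
  func = c("summarize", "mean", "group_by"),
  classification = c("exploratory", "exploratory", "data cleaning")
)

# Apply the custom lexicon; calls without an entry (e.g. `%>%`) drop out.
custom <- calls %>%
  inner_join(my_lexicon, by = "func") %>%
  select(func, classification)

print(builtin)
print(custom)
```

Printing `builtin` next to `custom` makes it easy to see how the crowdsourced scores are split for `summarize()` and `mean()`, and whether overriding them is justified for your scripts.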