topepo/DC-ML-2018

Co-branded cars mean some duplicates within car_data


Hi,

Several co-branded cars (GMC/Chevy, Ford/Lincoln, GMC/Cadillac) appear in the current car_data, so it contains some duplicates.

Not sure if you're planning on using this data for the book, but I thought I would point it out. The snippet below finds examples. Caveat: we can't simply use this code to exclude observations, but it at least gets us a list to review:

car_train %>% group_by(mpg, model_year) %>% filter(n()>1) %>% arrange(mpg) %>% View
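
For anyone reading along, here is a minimal sketch of what that pipeline does, run on a small made-up data frame. The `toy_cars` rows and their values are hypothetical and only mimic the shape of the real data; nothing here comes from `car_train`.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
toy_cars <- tibble(
  division   = c("Chevrolet", "GMC", "Ford", "Lincoln", "Honda"),
  carline    = c("Tahoe", "Yukon", "Edge", "MKX", "Civic"),
  model_year = c(2016, 2016, 2017, 2017, 2017),
  mpg        = c(22.5, 22.5, 27.1, 27.1, 45.3)
)

# Keep only rows whose (mpg, model_year) pair occurs more than once.
flagged <- toy_cars %>%
  group_by(mpg, model_year) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(mpg)

print(flagged)
```

Note that the filter returns every member of a repeated group, not just the extra copies, which is why it yields a review list rather than an exclusion list.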

Thanks a bunch for the course and really interesting information you presented!

Tony

Some of these got by since I looked for unique combinations of these four variables:

    mpg = comb_unadj_fe___conventional_fuel, 
    mpg_city_un = city_unrd_adj_fe___conventional_fuel, 
    mpg_hwy_un = hwy_unrd_adj_fe___conventional_fuel,
    mpg_comb = comb_unrd_adj_fe___conventional_fuel,

Your flag catches 465 cars whereas the same group_by using all four catches 321 cars (which is still bad).

I guess my inclination is to use your filter since they are effectively the same. This generates:

> filtered <- 
+     car_data %>% 
+     group_by(mpg, model_year, cylinders, gears, aspiration) %>% 
+     slice(1) %>%
+     arrange(model_year, division, carline)
> 
> table(car_data$model_year)

2015 2016 2017 2018 
1024  646 1015  609 
> 
> table(filtered$model_year)

2015 2016 2017 2018 
 955  606  949  558
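
A rough sketch of the same keep-one-row-per-group approach on invented data, in case it helps to see the mechanics in isolation. The `toy_cars` frame is hypothetical and uses fewer grouping columns than the real call; only `group_by()`, `slice()`, and `arrange()` are taken from the transcript above.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
toy_cars <- tibble(
  division   = c("Chevrolet", "GMC", "Ford", "Lincoln", "Honda"),
  carline    = c("Tahoe", "Yukon", "Edge", "MKX", "Civic"),
  model_year = c(2016, 2016, 2017, 2017, 2017),
  mpg        = c(22.5, 22.5, 27.1, 27.1, 45.3),
  cylinders  = c(8, 8, 4, 4, 4)
)

# Keep the first row of each group; rows that share every grouping
# value collapse to a single representative.
toy_filtered <- toy_cars %>%
  group_by(mpg, model_year, cylinders) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(model_year, division, carline)

# Rows dropped per model year.
dropped <- table(toy_cars$model_year) - table(toy_filtered$model_year)
print(dropped)
```

Because `slice(1)` keeps whichever row comes first within each group, the surviving brand of a co-branded pair depends on the row order of the input.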