DS4PS/cpp-525-fall-2020

Lab 6 - Error with Predict

AprilPeck opened this issue · 6 comments

@Dselby86
I am trying to do question 3d using the predict() function:

data2 <- with(data, 
              data.frame(Democrats = mean(data$Democrats),
                         Evangelics = mean(data$Evangelics),
                         Catholics = mean(data$Catholics),
                         Media = quantile(data$Media, .95),
                         Merck = mean(data$Merck)))

data2$Adopt_prob <- predict(log, newdata = data2, type = "response")

But keep getting the following error:

Error in `$<-.data.frame`(`*tmp*`, Adopt_prob, value = c(`1` = 0.411174458260132, : replacement has 49 rows, data has 1

My data2 data frame looks good, but it seems like the predict function is trying to return too many rows.

lecy commented

What does your model look like?

I would try to simplify as much as possible to see if you can diagnose the problem yourself.

I'm not sure what purpose with() is serving in your data frame construction since the data.frame() function is used to build a new data frame:

data2 <- data.frame(
  Democrats  = mean(data$Democrats),
  Evangelics = mean(data$Evangelics),
  Catholics  = mean(data$Catholics),
  Media      = quantile(data$Media, .95),
  Merck      = mean(data$Merck)
)

I suspect it might have affected the object. With your previous code, what do the following return?

library( dplyr )
library( pander )

class( data2 )
dim( data2 )
data2 %>% pander()

Do those change?

@lecy It's giving me the same error, even without the "with" function. (I wasn't sure why it was there either, but it was in the lab sample code and I was trying everything I could think of.)

Class = data.frame
dim = 1 5
data2: [screenshot of the one-row data frame]

I can just use the formula for this question, but run into the same problem with q3f.

lecy commented

What is your model?

@lecy I feel stupid saying this, but I don't know what you're asking.

I think this is what you're looking for...
[screenshot of the model code]

lecy commented

Got it - the problem is the variable names.

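The predict() function matches the variable names in the model formula with the variable (column) names in the new dataset in order to create the y-hat value. A term like data$Democrats is not a column in data2, so predict() evaluates it against the original data frame and returns one prediction per original row (49) instead of one for your single new row.

You should use the first version of glm() here, where you use the variable names directly and tell it which data frame you are using (data=dat).

Otherwise the variables in your model will be named data$Democrats instead of Democrats, etc.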

m <- glm( Adoption ~ Democrats + Evangelics + Catholics + Media + Merck, data=dat, family="binomial" )
m2 <- glm( dat$Adoption ~ dat$Democrats + dat$Evangelics + dat$Catholics + dat$Media + dat$Merck, family="binomial" )

> summary( m )

Call:
glm(formula = Adoption ~ Democrats + Evangelics + Catholics + 
    Media + Merck, family = "binomial", data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8892  -0.5959  -0.2235   0.4907   2.4277  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept)  7.7176493  2.9623236   2.605  0.00918 **
Democrats   -0.7160462  1.9068860  -0.376  0.70728   
Evangelics  -6.0438723  2.6817890  -2.254  0.02422 * 
Catholics    1.5925736  2.6708639   0.596  0.55099   
Media       -0.0151480  0.0047374  -3.198  0.00139 **
Merck       -0.0002314  0.0003379  -0.685  0.49348   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.906  on 48  degrees of freedom
Residual deviance: 37.604  on 43  degrees of freedom
AIC: 49.604

Number of Fisher Scoring iterations: 6

> summary( m2 )

Call:
glm(formula = dat$Adoption ~ dat$Democrats + dat$Evangelics + 
    dat$Catholics + dat$Media + dat$Merck, family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8892  -0.5959  -0.2235   0.4907   2.4277  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)   
(Intercept)     7.7176493  2.9623236   2.605  0.00918 **
dat$Democrats  -0.7160462  1.9068860  -0.376  0.70728   
dat$Evangelics -6.0438723  2.6817890  -2.254  0.02422 * 
dat$Catholics   1.5925736  2.6708639   0.596  0.55099   
dat$Media      -0.0151480  0.0047374  -3.198  0.00139 **
dat$Merck      -0.0002314  0.0003379  -0.685  0.49348   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.906  on 48  degrees of freedom
Residual deviance: 37.604  on 43  degrees of freedom
AIC: 49.604

Number of Fisher Scoring iterations: 6
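
The two models have identical coefficients, so the difference only shows up when you call predict(). Here is a minimal sketch of that, assuming the built-in mtcars data as a stand-in for the lab data (am plays the role of Adoption, wt and hp are the predictors; new.dat, p1 and p2 are just illustrative names):

```r
# Assumption: mtcars stands in for the lab data set.
dat <- mtcars

# Version 1: bare variable names plus data=, so the model terms are "wt" and "hp".
m  <- glm( am ~ wt + hp, data=dat, family="binomial" )

# Version 2: dat$ inside the formula, so the model terms are "dat$wt" and "dat$hp".
m2 <- glm( dat$am ~ dat$wt + dat$hp, family="binomial" )

# One-row data frame describing the scenario to predict.
new.dat <- data.frame( wt = mean(dat$wt), hp = quantile(dat$hp, 0.95) )

# m finds columns wt and hp in new.dat and predicts for that one row.
p1 <- predict( m, newdata=new.dat, type="response" )

# m2 looks for dat$wt and dat$hp, which resolve to the original data frame,
# so new.dat is effectively ignored (R only issues a warning about it).
p2 <- suppressWarnings( predict( m2, newdata=new.dat, type="response" ) )

length( p1 )   # one prediction for the single row in new.dat
length( p2 )   # one prediction per row of dat
```

Assigning p2 to a new column of new.dat would fail with the same "replacement has ... rows, data has 1" error, because the vector is as long as the original data. Comparing names( coef( m ) ) with names( new.dat ) is a quick way to check that the names line up before calling predict().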