Two-Way-ANOVA results do not show p-values.
FriedrichLidenbach opened this issue · 3 comments
Sup. I was trying to perform a Two Way Anova on this very simple .CSV file:
block,color,response
a,red,1.9
b,red,2.6
c,red,3.4
(And so on...)
v,red,4.5
w,red,3.4
x,red,1.1
a,green,9.1
b,green,6.4
c,green,1.2
(And so on...)
v,green,17
w,green,9.8
x,green,6.8
a,blue,7.6
b,blue,5.6
c,blue,14
(And so on...)
v,blue,19
w,blue,9
x,blue,6.8
Of course, the (And so on...) entries are mine for brevity and not in the actual file. I can provide the entire file if needed, althought the values follow the pattern shown above. It is saved in this directory: '/home/f/Downloads/color-anova-example.csv . I tried:
import pingouin as pg
import pandas as pd
df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
anova = pg.anova(dv='response', between=['color', 'block'], data=df)
print(anova)
And got:
/home/f/.local/lib/python3.10/site-packages/pingouin/parametric.py:1093: RuntimeWarning: invalid value encountered in scalar divide
ms_resid = ss_resid / df_resid
Source SS DF MS np2
0 color 857.199103 2 428.599551 1.0
1 block 1481.825128 23 64.427179 1.0
2 color * block 514.571297 46 11.186333 1.0
3 Residual 0.000000 0 NaN NaN
So, no p-unc and that weird warning. I thought maybe there was something wrong with the file, but then I tried running it on Rstudio and using statsmodels and it worked just fine:
R-Studio: Code
data = read.csv('/home/f/Downloads/color-anova-example.csv')
modelo = lm(data$response ~ data$block + data$color)
anova = aov(modelo)
summary(anova)
R-Studio result:
> summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
data$block 23 1481.8 64.4 5.759 2.42e-07 ***
data$color 2 857.2 428.6 38.315 1.61e-10 ***
Residuals 46 514.6 11.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Statsmodels code:
import pingouin as pg
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
model = ols('response ~ C(color) + C(block)', data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
print(anova)
Statsmodels result:
sum_sq df F PR(>F)
C(color) 857.199103 2.0 38.314573 1.606109e-10
C(block) 1481.825128 23.0 5.759455 2.415486e-07
Residual 514.571297 46.0 NaN NaN
[Process exited 0]
What is perplexing is that the ANOVA result doesn't show any errors, a NAN or a nonsensical value, it just omits the p-unc column all together. Weirder is that this problem only presents itself on the Two Way ANOVA, if I remove either of the factors and do a One Way ANOVA, it works just fine:
Code for Pingouin One Way ANOVA:
import pingouin as pg
import pandas as pd
df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
anova = pg.anova(dv='response', between=['color'], data=df)
print(anova)
Result for Pingouin One Way ANOVA
Source ddof1 ddof2 F p-unc np2
0 color 2 69 14.813375 0.000004 0.300393
[Process exited 0]
I am using Pingouin 0.5.4 installed trough pip on Linux Mint 21.3 and Kernel 6.5.0-27, on a Dell Latitude Laptop using with Python version 3.10.12.
Besides the error I just wanted to thank you guys fro your work on the project. Pingouin is by far my favorite statistics library. Thanks!
Hi,
Besides the error I just wanted to thank you guys fro your work on the project. Pingouin is by far my favorite statistics library.
Thanks for the kind words, appreciate it! I only have time to do basic maintenance and Q&A these days unfortunately, so the library hasn't really evolved or improved in two years.
That is such a weird error — I've never seen it. Could you actually share a more complete dataset here? I'd like to see if I can fully reproduce your error. Thanks
Sure, it is actually a public dataset available in this link:
https://researchguides.library.vanderbilt.edu/ld.php?content_id=19195103
From this site:
https://researchguides.library.vanderbilt.edu/c.php?g=156859&p=1171834
And just in case, this are the complete contents of the file:
block,color,response
a,red,1.9
b,red,2.6
c,red,3.4
d,red,0.8
e,red,5.3
f,red,1.5
g,red,4.5
h,red,2.6
i,red,1.16
j,red,1.3
k,red,2
l,red,2
m,red,6.7
n,red,1.44
o,red,2
p,red,2
q,red,1
r,red,1
s,red,1.5
t,red,2
u,red,4.1
v,red,4.5
w,red,3.4
x,red,1.1
a,green,9.1
b,green,6.4
c,green,1.2
d,green,5.7
e,green,17.7
f,green,6.4
g,green,16.6
h,green,8.3
i,green,4.9
j,green,1
k,green,1
l,green,3
m,green,23
n,green,6.13
o,green,9
p,green,2
q,green,4
r,green,4.5
s,green,10
t,green,4
u,green,27.2
v,green,17
w,green,9.8
x,green,6.8
a,blue,7.6
b,blue,5.6
c,blue,14
d,blue,6.8
e,blue,18.5
f,blue,7.2
g,blue,19.5
h,blue,10.5
i,blue,5.27
j,blue,6
k,blue,8
l,blue,7.5
m,blue,23
n,blue,5.8
o,blue,11
p,blue,9
q,blue,6
r,blue,6
s,blue,9.5
t,blue,8
u,blue,25.6
v,blue,19
w,blue,9
x,blue,6.8
Apologies for the slow reply. The issue comes from the fact that the two-way ANOVA includes an interaction term (color * block
) which cannot be calculated in this dataset since there is only one observation for each level of the interaction term (e.g. the combination (a, blue
) only has one unique value).
The R and statsmodels code that you provided are additive linear regression models that do not match the standard definition of the two-way factorial anova (see here and here)
Changing the statsmodels code to a factorial model (*
instead of +
) leads to an error:
Thanks,
Raphael