pairwise_cor() and NA values
dmklotz opened this issue · 5 comments
I've been experimenting with widyr, and it appears to work perfectly for the standard datasets (e.g. iris, gapminder). However, I'm getting unexpected results when my real data contains any missing values. As an example:
library(dplyr)
library(tidyr)
library(widyr)
set.seed(999)
test <- data.frame(name = c(rep("a", 4), rep("b", 3), rep("c", 5)),
value = runif(12),
index = c(1:4, 1:3, 1:5), stringsAsFactors = FALSE)
name value index
1 a 0.38907138 1
2 a 0.58306072 2
3 a 0.09466569 3
4 a 0.85263123 4
5 b 0.78674676 1
6 b 0.11934226 2
7 b 0.60644699 3
8 c 0.08095691 1
9 c 0.39077242 2
10 c 0.61947239 3
11 c 0.28568882 4
12 c 0.55112274 5
If I pivot this to wide format, there will of course be some NA values:
test_wide <- test %>% spread(key = name, value = value)
index a b c
1 0.38907138 0.7867468 0.08095691
2 0.58306072 0.1193423 0.39077242
3 0.09466569 0.6064470 0.61947239
4 0.85263123 NA 0.28568882
5 NA NA 0.55112274
If I try to calculate the correlations using widyr, I obtain the following:
test %>% pairwise_cor(name, index, value, use = "pairwise")
# A tibble: 6 x 3
item1 item2 correlation
<chr> <chr> <dbl>
1 b a -0.2858011
2 c a -0.5929580
3 a b -0.2858011
4 c b -0.3067296
5 a c -0.5929580
6 b c -0.3067296
However, if I calculate the correlations on the wide format, the results are very different:
test_wide %>%
select(-index) %>% cor(., use = "pairwise") %>%
as.data.frame %>%
mutate(item1 = row.names(.)) %>%
gather(key = item2, value = correlation, -item1) %>%
filter(item1 != item2)
item1 item2 correlation
1 b a -0.6169583
2 c a -0.4615488
3 a b -0.6169583
4 c b -0.3437614
5 a c -0.4615488
6 b c -0.3437614
And to confirm that this latter piping method is correct:
cor(test_wide$a, test_wide$b, use = "p")
[1] -0.6169583
Am I missing something really basic here?
I think this is because widyr's pairwise_cor fills NAs with 0s while converting it to a wide format.
If you do the same with your piping method, you get this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(widyr)
set.seed(999)
test <- data.frame(name = c(rep("a", 4), rep("b", 3), rep("c", 5)),
value = runif(12),
index = c(1:4, 1:3, 1:5), stringsAsFactors = FALSE)
test
#> name value index
#> 1 a 0.38907138 1
#> 2 a 0.58306072 2
#> 3 a 0.09466569 3
#> 4 a 0.85263123 4
#> 5 b 0.78674676 1
#> 6 b 0.11934226 2
#> 7 b 0.60644699 3
#> 8 c 0.08095691 1
#> 9 c 0.39077242 2
#> 10 c 0.61947239 3
#> 11 c 0.28568882 4
#> 12 c 0.55112274 5
# Use fill = 0 while spreading into wide format.
test_wide <- test %>% spread(key = name, value = value, fill = 0)
test_wide
#> index a b c
#> 1 1 0.38907138 0.7867468 0.08095691
#> 2 2 0.58306072 0.1193423 0.39077242
#> 3 3 0.09466569 0.6064470 0.61947239
#> 4 4 0.85263123 0.0000000 0.28568882
#> 5 5 0.00000000 0.0000000 0.55112274
test %>% pairwise_cor(name, index, value)
#> # A tibble: 6 x 3
#> item1 item2 correlation
#> <chr> <chr> <dbl>
#> 1 b a -0.286
#> 2 c a -0.593
#> 3 a b -0.286
#> 4 c b -0.307
#> 5 a c -0.593
#> 6 b c -0.307
test_wide %>%
select(-index) %>%
cor(., use = "pairwise") %>%
as.data.frame %>%
mutate(item1 = row.names(.)) %>%
gather(key = item2, value = correlation, -item1) %>%
filter(item1 != item2)
#> item1 item2 correlation
#> 1 b a -0.2858011
#> 2 c a -0.5929580
#> 3 a b -0.2858011
#> 4 c b -0.3067296
#> 5 a c -0.5929580
#> 6 b c -0.3067296
Thanks for figuring that out! I see where that happens now in the code for widely_
.
Maybe that works for other widyr functions, but I don't think it's mathematically correct to impute zero for all missing values when calculating correlations.
Similar issue as above. I'm writing with regards to your un_votes tutorial.
Following your tutorial, we get
cors <- un_votes %>%
mutate(vote = as.numeric(vote)) %>%
pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)
## # A tibble: 39,800 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 Slovakia Czech Republic 0.989
## 2 Czech Republic Slovakia 0.989
## 3 Lithuania Estonia 0.971
## 4 Estonia Lithuania 0.971
## 5 Lithuania Latvia 0.970
## 6 Latvia Lithuania 0.970
## 7 Germany Liechtenstein 0.968
## 8 Liechtenstein Germany 0.968
## 9 Slovakia Slovenia 0.966
## 10 Slovenia Slovakia 0.966
## # ... with 39,790 more rows
If we check one pair (Lithuania-Estonia) using the cor
function in base R, we get
un_wide <- un_votes %>%
select(rcid, country, vote) %>%
mutate(vote=as.numeric(vote)) %>%
spread(key=country, value=vote)
cor(un_wide$Lithuania, un_wide$Estonia,
use = "pairwise.complete.obs")
[1] 0.9629791
The correlation stat (0.9629791) is not equal to what widyr::pairwise_cor
gives us (0.971).
And the problem seems to be that widyr::pairwise_cor
is treating missing values as 0s. If we change all NAs to 0s, we get:
un_wide <- un_votes %>%
select(rcid, country, vote) %>%
mutate(vote=as.numeric(vote)) %>%
spread(key=country, value=vote) %>%
mutate_all(funs(replace(., is.na(.), 0)))
cor(un_wide$Lithuania, un_wide$Estonia,
use = "pairwise.complete.obs")
[1] 0.9714049
This is what widyr::pairwise_cor
has for Estonia-Lithuania too.
It would be great if you could add a parameter to widyr::pairwise_cor
so that we can control how R treats NAs. Right now the use
parameter seems overridden. Thanks!
I can try and work on this soon! Thanks for pointing this out :)
This should be solved if my current PR is approved! Sorry for the inconvenience and thank you for pointing it out!