juliasilge/widyr

pairwise_cor() and NA values

dmklotz opened this issue · 5 comments

I've been experimenting with widyr, and it appears to work perfectly for the standard datasets (e.g. iris, gapminder). However, I'm getting unexpected results when my real data contains any missing values. As an example:

library(dplyr)
library(tidyr)
library(widyr)

set.seed(999)
test <- data.frame(name = c(rep("a", 4), rep("b", 3), rep("c", 5)),
                                   value = runif(12),
               index = c(1:4, 1:3, 1:5), stringsAsFactors = FALSE)

   name      value index
1     a 0.38907138     1
2     a 0.58306072     2
3     a 0.09466569     3
4     a 0.85263123     4
5     b 0.78674676     1
6     b 0.11934226     2
7     b 0.60644699     3
8     c 0.08095691     1
9     c 0.39077242     2
10    c 0.61947239     3
11    c 0.28568882     4
12    c 0.55112274     5

If I pivot this to wide format, there will of course be some NA values:

  test_wide <- test %>% spread(key = name, value = value)

  index          a         b          c
      1 0.38907138 0.7867468 0.08095691
      2 0.58306072 0.1193423 0.39077242
      3 0.09466569 0.6064470 0.61947239
      4 0.85263123        NA 0.28568882
      5         NA        NA 0.55112274

If I try to calculate the correlations using widyr, I obtain the following:

test %>% pairwise_cor(name, index, value, use = "pairwise")

# A tibble: 6 x 3
  item1 item2 correlation
  <chr> <chr>       <dbl>
1     b     a  -0.2858011
2     c     a  -0.5929580
3     a     b  -0.2858011
4     c     b  -0.3067296
5     a     c  -0.5929580
6     b     c  -0.3067296    

However, if I calculate the correlations on the wide format, the results are very different:

test_wide %>%
  select(-index) %>% cor(., use = "pairwise") %>% 
  as.data.frame %>%
  mutate(item1 = row.names(.)) %>% 
  gather(key = item2, value = correlation, -item1) %>% 
  filter(item1 != item2)

  item1 item2 correlation
1     b     a  -0.6169583
2     c     a  -0.4615488
3     a     b  -0.6169583
4     c     b  -0.3437614
5     a     c  -0.4615488
6     b     c  -0.3437614

And to confirm that this latter piping method is correct:

cor(test_wide$a, test_wide$b, use = "p")
[1] -0.6169583

Am I missing something really basic here?

I think this is because widyr's pairwise_cor fills NAs with 0s while converting it to a wide format.

If you do the same with your piping method, you get this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(widyr)

set.seed(999)
test <- data.frame(name = c(rep("a", 4), rep("b", 3), rep("c", 5)),
                   value = runif(12),
                   index = c(1:4, 1:3, 1:5), stringsAsFactors = FALSE)
test
#>    name      value index
#> 1     a 0.38907138     1
#> 2     a 0.58306072     2
#> 3     a 0.09466569     3
#> 4     a 0.85263123     4
#> 5     b 0.78674676     1
#> 6     b 0.11934226     2
#> 7     b 0.60644699     3
#> 8     c 0.08095691     1
#> 9     c 0.39077242     2
#> 10    c 0.61947239     3
#> 11    c 0.28568882     4
#> 12    c 0.55112274     5

# Use fill = 0 while spreading into wide format.
test_wide <- test %>% spread(key = name, value = value, fill = 0)

test_wide
#>   index          a         b          c
#> 1     1 0.38907138 0.7867468 0.08095691
#> 2     2 0.58306072 0.1193423 0.39077242
#> 3     3 0.09466569 0.6064470 0.61947239
#> 4     4 0.85263123 0.0000000 0.28568882
#> 5     5 0.00000000 0.0000000 0.55112274

test %>% pairwise_cor(name, index, value)
#> # A tibble: 6 x 3
#>   item1 item2 correlation
#>   <chr> <chr>       <dbl>
#> 1 b     a          -0.286
#> 2 c     a          -0.593
#> 3 a     b          -0.286
#> 4 c     b          -0.307
#> 5 a     c          -0.593
#> 6 b     c          -0.307

test_wide %>%
  select(-index) %>%
  cor(., use = "pairwise") %>%
  as.data.frame %>%
  mutate(item1 = row.names(.)) %>% 
  gather(key = item2, value = correlation, -item1) %>% 
  filter(item1 != item2)
#>   item1 item2 correlation
#> 1     b     a  -0.2858011
#> 2     c     a  -0.5929580
#> 3     a     b  -0.2858011
#> 4     c     b  -0.3067296
#> 5     a     c  -0.5929580
#> 6     b     c  -0.3067296

Thanks for figuring that out! I see where that happens now in the code for widely_.

Maybe that works for other widyr functions, but I don't think it's mathematically correct to impute zero for all missing values when calculating correlations.

Similar issue as above. I'm writing with regards to your un_votes tutorial.

Following your tutorial, we get

cors <- un_votes %>%
  mutate(vote = as.numeric(vote)) %>%
  pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)


## # A tibble: 39,800 x 3
##    item1          item2          correlation
##    <chr>          <chr>                <dbl>
##  1 Slovakia       Czech Republic       0.989
##  2 Czech Republic Slovakia             0.989
##  3 Lithuania      Estonia              0.971
##  4 Estonia        Lithuania            0.971
##  5 Lithuania      Latvia               0.970
##  6 Latvia         Lithuania            0.970
##  7 Germany        Liechtenstein        0.968
##  8 Liechtenstein  Germany              0.968
##  9 Slovakia       Slovenia             0.966
## 10 Slovenia       Slovakia             0.966
## # ... with 39,790 more rows

If we check one pair (Lithuania-Estonia) using the cor function in base R, we get

un_wide <- un_votes %>%
  select(rcid, country, vote) %>%
  mutate(vote=as.numeric(vote)) %>%
  spread(key=country, value=vote)

cor(un_wide$Lithuania, un_wide$Estonia,
    use = "pairwise.complete.obs")

[1] 0.9629791

The correlation stat (0.9629791) is not equal to what widyr::pairwise_cor gives us (0.971).

And the problem seems to be that widyr::pairwise_cor is treating missing values as 0s. If we change all NAs to 0s, we get:

un_wide <- un_votes %>%
  select(rcid, country, vote) %>%
  mutate(vote=as.numeric(vote)) %>%
  spread(key=country, value=vote) %>%
  mutate_all(funs(replace(., is.na(.), 0)))

cor(un_wide$Lithuania, un_wide$Estonia,
    use = "pairwise.complete.obs")

[1] 0.9714049

This is what widyr::pairwise_cor has for Estonia-Lithuania too.

It would be great if you could add a parameter to widyr::pairwise_cor so that we can control how R treats NAs. Right now the use parameter seems overridden. Thanks!

I can try and work on this soon! Thanks for pointing this out :)

This should be solved if my current PR is approved! Sorry for the inconvenience and thank you for pointing it out!