not all patterns with NA counted?
timbp opened this issue · 3 comments
It seems that if a variable has missing values, not all patterns are counted. Is this intended?
```
> g1 = gammaCKpar(dfA$firstname, dfB$firstname)
> g2 = gammaCKpar(dfA$lastname, dfB$lastname)
> tc = tableCounts(list(g1, g2), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
> tc
     gamma.1 gamma.2 counts
[1,]       0       0 172338
[2,]       1       0    271
[3,]       2       0   2170
[4,]       0       1     50
[5,]       0       2    120
[6,]       1       2      1
[7,]       2       2     50
attr(,"class")
[1] "fastLink"    "tableCounts"
```
No missing values in these two variables. Counts sum to 175000 (== 500 * 350), and pattern (2, 2) has count of 50.
Add middlename, which has missing values:
```
> g3 = gammaCKpar(dfA$middlename, dfB$middlename)
> t = tableCounts(list(g1, g2, g3), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
> t
      gamma.1 gamma.2 gamma.3 counts
 [1,]       0       0       0 115305
 [2,]       1       0       0    193
 [3,]       2       0       0   1477
 [4,]       0       1       0     39
 [5,]       0       2       0     79
 [6,]       1       2       0      1
 [7,]       0       0       1     24
 [8,]       0       0       2    816
 [9,]       1       0       2      2
[10,]       2       0       2     10
[11,]       0       2       2      1
[12,]       2       2       2     43
[13,]       0       0      NA  50690
[14,]       1       0      NA     68
[15,]       2       0      NA    615
[16,]       0       1      NA     10
[17,]       0       2      NA     37
attr(,"class")
[1] "fastLink"    "tableCounts"
```
Counts now sum to 169410 so it appears 5590 pairs have not been counted. Pattern (2, 2, 2) has count of 43, but there are no other patterns starting (2, 2, ...) so 7 pairs that match on both firstname and lastname do not seem to appear in this table.
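The arithmetic above can be reproduced from the printed table alone. This is a minimal base R sketch, assuming the counts are typed in by hand from the output; the names `tc3`, `n_pairs`, and `marginal` are made up for illustration and are not part of fastLink.

```r
# Pattern counts copied by hand from the three-variable tableCounts() output above.
tc3 <- data.frame(
  gamma.1 = c(0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 0, 2, 0, 1, 2, 0, 0),
  gamma.2 = c(0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 2),
  gamma.3 = c(0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, NA, NA, NA, NA, NA),
  counts  = c(115305, 193, 1477, 39, 79, 1, 24, 816, 2, 10, 1, 43,
              50690, 68, 615, 10, 37)
)

n_pairs       <- 500 * 350          # nrow(dfA) * nrow(dfB), as stated above
total         <- sum(tc3$counts)    # pairs accounted for in the table
missing_pairs <- n_pairs - total    # pairs that do not appear in any pattern

# Collapse over gamma.3. If every pair were counted, this would
# reproduce the two-variable table (g1, g2) exactly.
marginal <- aggregate(tc3["counts"],
                      by = tc3[c("gamma.1", "gamma.2")],
                      FUN = sum)

print(total)
print(missing_pairs)
print(marginal)
```

In the collapsed table, pattern (2, 2) has a count of 43 rather than the 50 reported by the two-variable table, which is the 7-pair gap described above.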
When I wrote my own code (in Julia) to count patterns, I got the following result:
```
0 0 0       115305
1 0 0          193
2 0 0         1477
0 1 0           39
0 2 0           79
1 2 0            1
0 0 1           24
0 0 2          816
1 0 2            2
2 0 2           10
0 2 2            1
2 2 2           43
0 0 missing  56193
1 0 missing     76
2 0 missing    683
0 1 missing     11
0 2 missing     40
2 2 missing      7
```
Differences from the fastLink results are all in the patterns containing missing values.
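For comparison, here is one way to count patterns in base R so that a missing comparison is treated as its own category. This is only a sketch on made-up toy data: it is neither the Julia code mentioned above nor how fastLink counts patterns internally, and the names `gamma` and `pattern_counts` are hypothetical.

```r
# Toy agreement vectors for six record pairs (made-up values, not from dfA/dfB).
# 0 = disagree, 2 = agree, NA = comparison not possible (missing value).
gamma <- data.frame(
  gamma.1 = c(2, 2, 0, 0, 2, 0),
  gamma.2 = c(2, 2, 0, 0, 2, 0),
  gamma.3 = c(2, NA, 0, NA, NA, 2)
)

# useNA = "ifany" makes table() keep NA as a separate level
# instead of silently dropping pairs that contain a missing value.
pattern_counts <- as.data.frame(
  table(gamma, useNA = "ifany"),
  responseName = "counts",
  stringsAsFactors = FALSE
)

# Keep only the patterns that actually occur.
pattern_counts <- pattern_counts[pattern_counts$counts > 0, ]

print(pattern_counts)

# Every pair must land in exactly one pattern.
stopifnot(sum(pattern_counts$counts) == nrow(gamma))
```

The final check is the same invariant used in this report: the pattern counts should always sum to the number of record pairs.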
Hi,
Thanks a lot for raising this issue. Rest assured that we will take a close look. The counts for patterns that include a missing value should not miss pairs.
Ted
Hi,
Thanks again for raising this issue!
There was a problem with how missing values were handled in `gammaCKpar()`. The issue has been resolved, and if you install using `devtools`, your R code should produce the desired output.
If anything else comes up, please do not hesitate to reach out.
Ted
all looks good now