Possible Incorrect Accuracy Estimates from HIBAG - Please help!
It seems that HIBAG does NOT include samples in the accuracy calculation when they carry alleles that are NOT found in the training model. As an example, I have a training dataset with NO copies of the A23:17 allele, but my test dataset has many copies of that allele. I see that every sample with at least one copy of A23:17 has been removed from the accuracy calculation. I am not sure whether this is intended or whether I am missing something.
In your example, I don't see any problem:
library(HIBAG)

# Predicted HLA-A genotypes for three samples
pred_val <- list(locus="A", value=data.frame(
    sample.id=c("HG01890", "HG01894", "HG01896"),
    allele1=c("01:01", "23:01", "23:01"),
    allele2=c("30:01", "23:01", "23:01"),
    stringsAsFactors=FALSE)
)
class(pred_val) <- "hlaAlleleClass"

# True genotypes: same as predicted, except allele2 of HG01896 is 23:17
true_val <- pred_val
true_val$value[3, 3] <- "23:17"

hlaCompareAllele(true_val, pred_val)
hlaCompareAllele() outputs:

$overall
  total.num.ind crt.num.ind crt.num.haplo   acc.ind acc.haplo call.threshold
1             3           2             5 0.6666667 0.8333333              0
  n.call call.rate
1      3         1

$confusion
       True
Predict 01:01 23:01 23:17 30:01
  01:01     1     0     0     0
  23:01     0     3     1     0
  23:17     0     0     0     0
  30:01     0     0     0     1
  ...       0     0     0     0
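Note that acc.haplo=0.8333333 is computed correctly: total.num.ind is 3 and crt.num.haplo is 5, so 5 of the 6 alleles match and the sample carrying 23:17 is counted, not dropped.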
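The acc.ind and acc.haplo figures can also be reproduced by hand in base R, without calling HIBAG. This is only a minimal sketch: the helper `count_matches` and the variable names are hypothetical, and the matching rule (best of the two possible allele pairings per sample) is a simplification, not necessarily how hlaCompareAllele() is implemented internally.

```r
# Same example data as above, as plain data frames (no HIBAG needed).
pred <- data.frame(
  sample.id = c("HG01890", "HG01894", "HG01896"),
  allele1   = c("01:01", "23:01", "23:01"),
  allele2   = c("30:01", "23:01", "23:01"),
  stringsAsFactors = FALSE
)
truth <- pred
truth[3, "allele2"] <- "23:17"  # the allele missing from the training model

# Hypothetical helper: number of alleles shared by two unordered
# genotypes (0, 1 or 2), taking the better of the two pairings.
count_matches <- function(t1, t2, p1, p2) {
  direct  <- (t1 == p1) + (t2 == p2)
  swapped <- (t1 == p2) + (t2 == p1)
  max(direct, swapped)
}

matches <- mapply(count_matches,
                  truth$allele1, truth$allele2,
                  pred$allele1,  pred$allele2)

acc_haplo <- sum(matches) / (2 * nrow(truth))  # matched alleles / all alleles
acc_ind   <- mean(matches == 2)                # samples with both alleles right

print(c(acc.ind = acc_ind, acc.haplo = acc_haplo))
```

All three samples enter the denominator here, which is consistent with total.num.ind = 3 in the output above.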
By the way, according to the HLA P-group codes, A23:17 and A23:01 are in the same P group:
https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/wmda/hla_nom_p.txt
Older HLA typing techniques might not be able to differentiate A23:17 from A23:01.
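If the two alleles are treated as indistinguishable, one option is to collapse them to a common label before comparing. The sketch below assumes a hand-written, one-entry lookup (`p_group`, hypothetical) that encodes only the statement above; it is not parsed from hla_nom_p.txt, and `collapse` is an illustrative helper, not a HIBAG function.

```r
# Hypothetical one-entry lookup: map 23:17 onto 23:01.
# A real mapping would have to be built from hla_nom_p.txt.
p_group <- c("23:17" = "23:01")

# Replace any allele found in the lookup; leave all others unchanged.
collapse <- function(x) {
  hit <- x %in% names(p_group)
  x[hit] <- unname(p_group[x[hit]])
  x
}

true_alleles <- c("23:01", "23:17")  # true genotype of HG01896
pred_alleles <- c("23:01", "23:01")  # predicted genotype of HG01896

# After collapsing, the true and predicted genotypes agree.
same_after_collapse <- identical(sort(collapse(true_alleles)),
                                 sort(collapse(pred_alleles)))
print(same_after_collapse)
```

Whether collapsing is appropriate depends on the resolution of the typing used to produce the true genotypes.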
Sorry, I haven't had time to get back to this. I will look at it soon and either get back to you or close the issue. Apologies for the delay.