zhengxwen/HIBAG

Possible Incorrect Accuracy Estimates from HIBAG - Please help!


It seems that HIBAG excludes from the accuracy calculation any sample carrying an allele that is not present in the training model. For example, my training dataset has no copies of the A23:17 allele, but my test dataset has many copies of it. I see that every sample with at least one copy of A23:17 has been removed from the accuracy calculation. I am not sure whether this is intended or whether I am missing something.

In your example, I don't see any problem:

pred_val <- list(locus="A", value=data.frame(
	sample.id=c("HG01890", "HG01894", "HG01896"),
	allele1=c("01:01", "23:01", "23:01"),
	allele2=c("30:01", "23:01", "23:01"),
	stringsAsFactors=FALSE)
)
class(pred_val) <- "hlaAlleleClass"

true_val <- pred_val
true_val$value[3, 3] <- "23:17"

hlaCompareAllele(true_val, pred_val)

hlaCompareAllele() outputs:

$overall
  total.num.ind crt.num.ind crt.num.haplo   acc.ind acc.haplo call.threshold
1             3           2             5 0.6666667 0.8333333              0
  n.call call.rate
1      3         1

$confusion
       True
Predict 01:01 23:01 23:17 30:01
  01:01     1     0     0     0
  23:01     0     3     1     0
  23:17     0     0     0     0
  30:01     0     0     0     1
  ...       0     0     0     0

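Note that acc.haplo = 0.8333333 is correct: all 3 samples are counted (total.num.ind = 3), and 5 of the 6 haplotypes match. The sample carrying 23:17 is kept and contributes one mismatch, as the confusion table shows (true 23:17 predicted as 23:01).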
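The 5/6 figure can be cross-checked by hand without HIBAG. The following is only a minimal base-R sketch: the helper `count_matches` is hypothetical, and the counting rule it uses (compare the two alleles of each sample ignoring their order) is an assumption about how the accuracy is defined, not HIBAG's actual implementation.

```r
# True and predicted genotypes copied from the example above (base R only).
true_geno <- data.frame(
  sample.id = c("HG01890", "HG01894", "HG01896"),
  allele1   = c("01:01", "23:01", "23:01"),
  allele2   = c("30:01", "23:01", "23:17"),
  stringsAsFactors = FALSE
)
pred_geno <- true_geno
pred_geno$allele2[3] <- "23:01"  # the prediction for the unseen 23:17

# Hypothetical helper (not part of HIBAG): number of alleles (0, 1 or 2)
# matched in one sample, ignoring the order of the two alleles.
count_matches <- function(t1, t2, p1, p2) {
  direct  <- (t1 == p1) + (t2 == p2)
  swapped <- (t1 == p2) + (t2 == p1)
  max(direct, swapped)
}

matches <- mapply(count_matches,
                  true_geno$allele1, true_geno$allele2,
                  pred_geno$allele1, pred_geno$allele2)

# Haplotype accuracy: matched alleles over all alleles (2 per sample).
acc_haplo <- sum(matches) / (2 * nrow(true_geno))
# Individual accuracy: share of samples with both alleles matched.
acc_ind <- mean(matches == 2)

print(c(acc.ind = acc_ind, acc.haplo = acc_haplo))
```

Under that assumption, the denominator is 6 because every sample is retained, including the one with the allele absent from the prediction.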

By the way, according to the HLA P codes, A23:17 and A23:01 are in the same P group:
https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/wmda/hla_nom_p.txt
Old HLA typing techniques might not be able to differentiate A23:17 from A23:01.
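If the two alleles are treated as indistinguishable, one option is to collapse alleles to their P group before comparing. The sketch below is illustrative only: `p_group` is a toy lookup with the single entry mentioned in this thread, not a table parsed from hla_nom_p.txt, and `to_p_group` is a hypothetical helper rather than a HIBAG function.

```r
# Toy lookup (an assumption for illustration): maps an allele to the
# representative of its P group. Only the pair discussed in this thread is
# listed; a real table would have to be built from hla_nom_p.txt.
p_group <- c("23:17" = "23:01")

# Hypothetical helper: replace alleles found in the lookup, keep the rest.
to_p_group <- function(alleles) {
  hit <- alleles %in% names(p_group)
  alleles[hit] <- p_group[alleles[hit]]
  unname(alleles)
}

# Second allele of each sample from the example in this thread.
true_allele2 <- c("30:01", "23:01", "23:17")
pred_allele2 <- c("30:01", "23:01", "23:01")

# Agreement at the reported allele names vs. after collapsing to P groups.
acc_raw     <- mean(true_allele2 == pred_allele2)
acc_p_group <- mean(to_p_group(true_allele2) == to_p_group(pred_allele2))

print(c(raw = acc_raw, p.group = acc_p_group))
```

With this toy lookup the 23:17 versus 23:01 mismatch disappears after collapsing, which is the effect the P-group remark points at.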

Sorry, I haven't had time to get back to this. I will look at it soon and either get back to you or close the issue. Apologies for the delay.