Genentech/midasHLA

Handle missing values

xiw588 opened this issue · 9 comments

Hi I have a question about how this package handle missing values in HLA types. For example, a sample may have only one HLA classI A allele available, and no DRA alleles at all. Should I replace all the blanks with NA or so to run through the functions, such as hlaToVariable()?

Thanks a lot!

When I tried to apply the function hlaToVariable(), I kept receiving the following similar error messages:

Error: values: , , , , in hla_calls doesn't follow HLA numbers specification

Can you help me with this?

Hi @xiw588
The package can handle missing values. If you are loading your data using readHlaCalls function it has na.strings argument were you can specify which values should be interpreted as NA. By default these are "Not typed", "-", "NA".
To make it interpret blanks as NA values use something like readHlaCalls(input_file, na.strings="").

Hope it helps!

Thank you so much for the swift response, this is really helpful!! But I just realized that the issue is due to the wrong format of my data, which means the alleles don't match exactly the columns when there was missing. Here I attached a sample file. Do you know if there is any way to easily fix this so that every sample would have the same number of columns and can be feed into your package?

sample.xlsx

Hi I figured this out but came around another issue when applying hlaCallsGranthamDistance() to the dataframe that I used for hlaToVariable(). I received the following error message. Do you have any idea of what is the wrong here? Thanks
" Error in alignment[allele1, ] : invalid subscript type 'list' "

Unfortunately, I do not recognize this error message. Could you provide some data so I can reproduce this error?

I figured this out!! Thanks! Actually, I have another question regarding the calculation of HED in your package. I found inconsistent with using https://sourceforge.net/projects/granthamdist/files/, but it seems that you used the same formula of Grantham Distance. Can you clarify on this please? Thanks

@xiw588 while I am not familiar with the tool you have provided, my guess would be that the differences may steam from different choice of peptide region used in the calculation. For more details I am referring you to midasHLA documentation on hlaCallsGranthamDistance:

...
Grantham distance is calculated only for class I HLA alleles. First
exons forming the variable region in the peptide binding groove are selected.
Here we provide option to choose either "binding_groove" - exon 2 and
3 (positions 1-182 in IMGT/HLA alignments, however here we take 2-182 as many
1st positions are missing), "B_pocket" -  residues 7, 9, 24, 25, 34,
45, 63, 66, 67, 70, 99 and "F_pocket" - residues 77, 80, 81, 84, 95,
116, 123, 143, 146, 147. Then all the alleles containing gaps, stop codons or
indels are discarded. Finally distance is calculated for each pair.
...

But that is all guessing, unless you provide a concrete example that I can reproduce on my side it is impossible for me to tell what is happening.

Hi Thank you so much for your response. This could be a possibility and I am wondering if the normalization process play a role too. For example A0101 and A2402, I got 10.60773 from your algorithm and 0.138121547 from the other.

Ok, so eventually I've tried granthamdist pearl scripts you were referencing and i got the same result in both tools:

  1. granthamdist
    inp.txt
» perl CalculateIndividualDivergence.pl -f HLA_ClassI_CWDonly.fas -i inp.txt

***********
Command line: -f HLA_ClassI_CWDonly.fas -i inp.txt
Genotype file: inp.txt
Fasta file: HLA_ClassI_CWDonly.fas
Distance matrix used: AAdistMatrix_Grantham.cnv
***********

Calculating distances...

Calculations finished (1 genotypes processed) and output file created:
        inp.txt.IndividualDivergence.txt

» cat inp.txt.IndividualDivergence.txt
ID      AlleleNumber    Divergence_Average      Divergence_Sum  AlleleList
ID2     2       10.6077348066298        10.6077348066298        A2402   A0101
  1. midasHLA
> df <- data.frame(ID = c("S1", "S2"), A_1 = c("A*01:01", "A*24:02"), A_2 = c("A*24:02", "A*01:01"))
> hlaCallsGranthamDistance(df, "A")
   ID        A
1 S1 10.60773
2 S2 10.60773

I am closing this issue since it diverged from the original topic of handling missing values.