shimbalama/SBSC

locations with multiple ALTs produce duplicates in filtered outputs


e.g.

tail -n+2 chr22_KI270734v1_random_001_9_12_10.tsv|awk -F"\t" '{print $15"\t"$16"\t"$2"\t"$3"\t"$4}' |sort |uniq -c|sort -n|tail
      1 chr22_KI270734v1_random	122129	T	C	T
      1 chr22_KI270734v1_random	122134	T	C	T
      1 chr22_KI270734v1_random	122208	T	C	G
      1 chr22_KI270734v1_random	122233	G	C	G
      1 chr22_KI270734v1_random	122241	A	G	A
     64 chr22_KI270734v1_random	120663	G	A	G
     64 chr22_KI270734v1_random	120663	G	C	G
     64 chr22_KI270734v1_random	121454	T	AA	T
     64 chr22_KI270734v1_random	121692	G	TT	G
    729 chr22_KI270734v1_random	121191	T	AA	T

The last five rows have duplicates: 64 copies in the first four cases and 729 in the last. Since 64 = 2^6 and 729 = 3^6, I'm guessing this is something like a combinatorial problem, e.g. two records getting expanded to 64 and three to 729.
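
To make the arithmetic behind that guess concrete, here is a minimal sketch of how a cross product could produce those counts. It assumes, purely for illustration, that a site with k records ends up with six list-valued fields of length k that are each expanded independently. The function names and the choice of six fields are hypothetical and are not taken from the SBSC code.

```python
from itertools import product


def explode_independently(fields):
    """Return one row per combination of values across list-valued fields.

    This mimics expanding each field on its own instead of zipping the
    fields together record by record.
    """
    return [tuple(combo) for combo in product(*fields)]


def duplicate_count(k, n_fields=6):
    """Rows produced for a site with k records (hypothetical model).

    Assumption: each of n_fields columns holds k values, and the values are
    identical within a column, so every output row looks the same.
    """
    fields = [["x"] * k for _ in range(n_fields)]
    return len(explode_independently(fields))


print(duplicate_count(2))  # 64
print(duplicate_count(3))  # 729
```

If something like this is happening, pairing the fields record by record (for example with `zip`) would give k rows instead of k^6. This is only a model that matches the observed counts, not a diagnosis of the actual bug.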

These duplicates are not present in the unfiltered variants JSON.
chr22_KI270734v1_random.zip