
Issue with database or genotype calibration


Ron has a simulation where all of his false Atlas genotype calls have high genotype confidence

Results at 10x

Note that this is normalised, so there may be hundreds of correct calls and only 1 false call.

Thanks. Pretty strange. @ronald-jaepel any chance you could point me at the data or the command lines you ran? It would make life much easier if I could have a look at the raw data too.

This turns out to be just 2 false calls. Still odd, though.

Hey @Phelimb, I think we need @ronald-jaepel to tell us exactly which 2 calls were wrong at 10x, and which 5 at 5x.

OK, the genome used is at /data1/projects/ronald_jaepel/atlas_test/Simulation_Products/generated_genomes/2016_07_26_1353/ecoli_K12_MG1655_ref_all_families.fa
The reads are at /data1/projects/ronald_jaepel/atlas_test/Simulation_Products/simulated_reads/2016_07_26_1353/
The cortex graphs are at /data1/projects/ronald_jaepel/atlas_test/Simulation_Products/simulated_graphs/2016_07_26_1353/
The Atlas JSONs are at
/data1/projects/ronald_jaepel/atlas_test/JSONS/2016_07_26_1353/ *walk3.txt

The number after ecoli_all_families_ in each file name is the number of reads. For example, ecoli_all_families_33000.k31.ctx is the cortex graph for 10x coverage, ecoli_all_families_16000.k31.ctx is 5x, and ecoli_all_families_10000.k31.ctx is 3x.
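In case it helps when scripting over the graphs, here is a minimal Python sketch of that naming scheme. The regular expression and the `describe_graph` helper are my own illustration, inferred from the three file names above; they are not part of Atlas or cortex. The read-count to coverage mapping only covers the values stated in this thread.

```python
import re

# Nominal coverage per read count, as stated in this thread.
NOMINAL_COVERAGE = {33000: "10x", 16000: "5x", 10000: "3x"}

# Hypothetical pattern, inferred from the example file names above.
GRAPH_NAME = re.compile(r"ecoli_all_families_(\d+)\.k(\d+)\.ctx$")


def describe_graph(filename):
    """Return (read_count, kmer_size, nominal_coverage) for a graph file name.

    nominal_coverage is None when the read count is not one of the
    values listed in the thread.
    """
    match = GRAPH_NAME.search(filename)
    if match is None:
        raise ValueError(f"unexpected graph file name: {filename}")
    reads = int(match.group(1))
    kmer = int(match.group(2))
    return reads, kmer, NOMINAL_COVERAGE.get(reads)


if __name__ == "__main__":
    for name in ("ecoli_all_families_33000.k31.ctx",
                 "ecoli_all_families_16000.k31.ctx",
                 "ecoli_all_families_10000.k31.ctx"):
        print(name, describe_graph(name))
```

Read counts not listed above come back with `None` as the coverage rather than a guessed value.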

This is the list of the genes that should be in each sample:

Here's a better overview of the errors:

at coverage 2.6 : wrong gene found of aac6I Iai as Ib3 with certainty 10.1322008044
at coverage 2.6 : wrong gene found of aadA6aadA10 a as b with certainty 10.1322008044
at coverage 2.6 : wrong gene found of GES 24 as 6 with certainty 7.42415060331
at coverage 2.6 : wrong gene found of OXY5 1 as 2 with certainty 7.42415060331
at coverage 2.6 : wrong gene found of aadA11 a as b with certainty 4.7161004022
at coverage 2.6 : wrong gene found of vat A as APEC01 with certainty 2.0080502011
at coverage 2.6 : wrong gene found of LEN 24 as 6 with certainty 2.0080502011

at coverage 3.2 : wrong gene found of aac6I Iai as Ib3 with certainty 12.8402510055
at coverage 3.2 : wrong gene found of DHA 10 as 20 with certainty 2.0080502011
at coverage 3.2 : wrong gene found of GES 24 as 6 with certainty 7.42415060331
at coverage 3.2 : wrong gene found of OXY5 1 as 2 with certainty 10.1322008044
at coverage 3.2 : wrong gene found of aadA11 a as b with certainty 7.42415060331
at coverage 3.2 : wrong gene found of vat A as APEC01 with certainty 2.0080502011

at coverage 5.1 : wrong gene found of aac6I Iai as Ib3 with certainty 20.2644016088
at coverage 5.1 : wrong gene found of GES 24 as 7 with certainty 12.1402510055
at coverage 5.1 : wrong gene found of OXY5 1 as 2 with certainty 14.8483012066
at coverage 5.1 : wrong gene found of aadA11 a as b with certainty 12.1402510055
at coverage 5.1 : wrong gene found of vat A as APEC01 with certainty 4.0161004022

at coverage 10.5 : wrong gene found of aac6I Iai as Ib3 with certainty 53.3690542231
at coverage 10.5 : wrong gene found of vat A as APEC01 with certainty 7.33220080441
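To see which miscalls persist as coverage goes up, the lines above can be parsed and grouped by gene family. This is only a sketch: the field names (`family`, `truth`, `called`) reflect my reading of the "wrong gene found of X A as B" format, and the sample below contains just the 2.6x and 10.5x lines for aac6I and vat copied from this list.

```python
import re
from collections import defaultdict

# Assumed line format: "at coverage C : wrong gene found of FAMILY TRUTH as CALLED with certainty Q"
ERROR_LINE = re.compile(
    r"at coverage (?P<cov>[\d.]+) : wrong gene found of "
    r"(?P<family>\S+) (?P<truth>\S+) as (?P<called>\S+) "
    r"with certainty (?P<conf>[\d.]+)"
)


def parse_errors(text):
    """Turn the error overview into a list of dicts, skipping non-matching lines."""
    records = []
    for line in text.splitlines():
        m = ERROR_LINE.search(line)
        if m:
            records.append({
                "coverage": float(m.group("cov")),
                "family": m.group("family"),
                "truth": m.group("truth"),
                "called": m.group("called"),
                "confidence": float(m.group("conf")),
            })
    return records


def confidence_by_family(records):
    """Map family -> {coverage: confidence} to show how each error evolves."""
    table = defaultdict(dict)
    for r in records:
        table[r["family"]][r["coverage"]] = r["confidence"]
    return dict(table)


SAMPLE = """\
at coverage 2.6 : wrong gene found of aac6I Iai as Ib3 with certainty 10.1322008044
at coverage 2.6 : wrong gene found of vat A as APEC01 with certainty 2.0080502011
at coverage 10.5 : wrong gene found of aac6I Iai as Ib3 with certainty 53.3690542231
at coverage 10.5 : wrong gene found of vat A as APEC01 with certainty 7.33220080441
"""

if __name__ == "__main__":
    for family, by_cov in confidence_by_family(parse_errors(SAMPLE)).items():
        print(family, sorted(by_cov.items()))
```

Running it over the full list makes it easy to spot that the aac6I and vat miscalls appear at every coverage, with confidence growing as coverage increases.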

The aac6I Iai -> Ib3 error might arise because the aac families (aac6I, aac6IIb, etc.) haven't been merged into one gene family but are split across multiple families. The sample therefore contains several aac6-like alleles, which might lead to the error we see. I don't know about vat A, though.