Phelimb/atlas

Genotyping Errors

Closed this issue · 32 comments

I screened all versions of OXA by inserting them into the reference genome ecoli_K12_MG1655_ref and generating error-free paired-end reads with wgsim. The simulated reads were then genotyped with atlas; the results are summarised in the table below, sorted to first show the versions that were not found at all, then those that were found but whose version was called incorrectly, and finally those that were found and typed correctly.
atlas_OXA_detection.xlsx
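For reference, the setup can be sketched roughly as below. This is a reconstruction, not the exact commands: the file names, the insertion point, and the wgsim flags are my assumptions.

```python
# Sketch of how each simulated genome was built before running wgsim.
# Note: plain appending puts the allele at the very end of the genome,
# which (as discovered later in this thread) starves the last ~100 bp
# of read coverage.

def insert_allele(reference_seq, allele_seq):
    """Append one resistance allele to the reference chromosome."""
    return reference_seq + allele_seq

def write_fasta(path, name, seq, width=70):
    """Write a single-record fasta file, wrapped to `width` columns."""
    with open(path, "w") as fh:
        fh.write(f">{name}\n")
        for i in range(0, len(seq), width):
            fh.write(seq[i:i + width] + "\n")

# Hypothetical usage:
#   write_fasta("ecoli_oxa2.fa", "K12_plus_OXA-2",
#               insert_allele(k12_seq, oxa2_seq))
# then simulate error-free paired-end reads with wgsim, e.g.:
#   wgsim -e 0 -r 0 -R 0 -N 500000 ecoli_oxa2.fa r1.fq r2.fq
```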

Not found were:
184, 185, 266, 269, 270, 271, 272, 273, 274, 275, 277, 279, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 302, 303, 304, 305, 306, 307, 308, 421

ecoli_oxa2.txt (identified correctly)
ecoli_oxa4.txt (found but identified wrong)
ecoli_oxa184.txt (not found)

@Phelimb - any word on this? I'm surprised with 30x there are so many we do not detect at all. I took a look at the JSON output for one ( @ronald-jaepel - could you please add some example JSONS in a comment? Or add them to the original report above? ) - the one I looked at showed we only saw 8% of the kmers in the true allele, and this is with error-free reads.

Two diagrams, same data, different scales, depicting the certainty of the calls:
certainty_rightwrongmissed
certainty_rightwrongmissed2

Thanks @ronald-jaepel - very informative plots

All the mistakes are due to differences in the catalogues! Sorry about the confusion - I should have checked that earlier.
So, except for three genes, all faults were due to the genes in Nicole's database not being in the reference catalogue used by atlas.
The three exceptions are:
4, which is identical in both datasets and appears to be the only true mistake made by atlas. (Edit: this turned out to be a padding problem.)
67, which has a different sequence in our reference (16 SNPs) compared to Nicole's database. (In our reference catalogue, 67's sequence is identical to 69's.)
94, which is missing the final bases AGTTAGTTTATAGCCCGGTTCACTTTTTACCA in Nicole's database.

Sorry guys, I missed this when it was first posted.

Thanks for the update, Ronald. We should have a consistent way of versioning the database - which one of Nicole's did you use for the above analysis? I should have told you that the catalogue in atlas is an older one - sorry! It comes as an Excel spreadsheet, so it's a bit of a pain to convert, and I haven't updated it to the latest version yet.

I have a script to convert here:

https://cycloid.well.ox.ac.uk:9999/ipython/notebooks/phelim/parsing_gram_neg_resistance_profile.ipynb#

Though I think Nicole has sent around one recently that should be easier to parse so we might be better off starting from scratch writing a converter from spreadsheet -> fasta.
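A converter from scratch could start small, something like the sketch below. It assumes the spreadsheet has been exported to CSV first and has "gene", "version", and "sequence" columns - the real column layout in Nicole's spreadsheet may well differ. The header format mimics the `kpc?name=kpc&version=5` style that atlas's panel files use.

```python
# Minimal spreadsheet -> fasta converter sketch. Assumes a CSV export
# with "gene", "version" and "sequence" columns (column names are an
# assumption, not taken from Nicole's actual spreadsheet).
import csv

def csv_to_fasta(csv_path, fasta_path):
    with open(csv_path, newline="") as src, open(fasta_path, "w") as dst:
        for row in csv.DictReader(src):
            # Header mirrors the panel naming seen in atlas's covgs files.
            header = f"{row['gene']}?name={row['gene']}&version={row['version']}"
            dst.write(f">{header}\n{row['sequence'].strip().upper()}\n")
```

From there, the generated fasta could be committed to the catalogue repo so every change is tracked.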

I used "combined_resistance_database_09022016.xlsx"

I think we need spreadsheet -> fasta -> GitHub, and then refer to the catalogue by changelist. Do we want a separate repo for tracking catalogues?

@iqbal-lab I'd rather separate it from the predictor and atlas codebases so I think a separate repo is a good idea.

agreed

So, apparently wgsim has a problem with the position of genes in the genome:
I previously only appended them at the end, and when I repeat the experiment with OXA-4 at the end of the genome, it gets called as OXA-33 over and over again.
But: if I copy it just a few lines up, so that more than 100 bases follow it, it gets called correctly.

Ah ok. So we might as well pad the genes before and after in future. Well spotted.
So on error-free data 100% win for Atlas?
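Padding could be as simple as the sketch below. The 100 bp figure comes from the observation above; whether the padding should be random bases or real flanking sequence from the reference is an open choice, and this function is an illustration, not what atlas or wgsim does.

```python
# Sketch of padding a gene before inserting it into the simulated
# genome, so wgsim can generate full read coverage across both ends.
import random

def pad_gene(gene_seq, flank=100, rng=None):
    """Surround the gene with `flank` random bases on each side.

    Using a seeded RNG keeps the simulated genomes reproducible.
    """
    rng = rng or random.Random(0)
    pad = lambda: "".join(rng.choice("ACGT") for _ in range(flank))
    return pad() + gene_seq + pad()
```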

So, alignment of OXA-4 and OXA-33 shows that they are identical (i.e. have no SNPs or indels) except that OXA-33 is missing 17 bases at the end and 45 bases at the start compared to OXA-4. Therefore, calling 4 as 33 is really just due to the missing bits at either end.
http://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20160415-155706-0969-80534339-oy/aln-clustal
Mapping the reads generated by wgsim onto the reference genome also shows that the coverage drops from 3 to 1 reads per base over the last 24 bases. This screenshot shows the last 138 bp of OXA-4 appended to the reference genome and all reads that were mapped to it.
oxa4-coverage-lastbases

Can you email Nicole and cc me/Phelim to see if this is a bug? Looks to me like an error in the allele database

I have more insight into the genes in the catalogue and two questions before I write to Nicole:

regarding OXA-4 vs OXA-33:
Nicole's database just transferred the data from the accession number cited. OXA-33 is listed at NCBI as an "OXA-1-like" gene, which Nicole calls OXA-33.
(Source: http://www.ncbi.nlm.nih.gov/nuccore/AY008291 )

(Here's the alignment of OXA-4 vs OXA-33 taken from NCBI and Nicole's database:
http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=clustalo-I20160418-135502-0628-18285893-oy (the TAA where OXA-33 ends isn't in frame and is therefore not a stop codon).)

I'll ask her where she got the name OXA-33 - is there anything else I should ask?

regarding OXA-94, which is shorter in Nicole's database:
Her sequence stops after the stop codon; ours has a bit of 3' UTR included:
http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=clustalo-I20160418-133224-0060-23120442-pg

Should I write to her about that or does that just mean that our version is too long?

Thank you for working through this with me!

Sounds to me like OXA-33 doesn't really exist in any sense differently to OXA-4, except in Nicole's database. (Agree @Phelimb ?). Ask Nicole why we cannot delete it (cc me and Phelim).

Re: OXA-94 - I think if ours is identical up to stop codon, then we can just take our longer one.

I agree, and it looks like Nicole does too. I don't envy her trying to keep that database up to date - it's no easy task. Awesome job debugging this @ronald-jaepel - it will make all this comparison stuff much easier if the genes in the panel are unique!

OK, so now that OXA-33 can be deleted, we've resolved all errors made by atlas genotype on the first batch, right?

I have new data on the other genes (ctxm, kpc, shv, tem, (oxa)):
This time I used our reference catalogue to generate the simulated genomes, reads, etc.
Almost ALL genes were called correctly, except for the following 5:

Errors that are due to genes having alternative names (even with their own entry in Nicole's database specifying the alternative names):

  • shv121 got called as shv136
  • tem9 got called as tem187
    so these are true duplicates.

Errors that are due to the sequences being identical in our catalogue, although there exist different sequences in Nicole's database:

  • ctxm5 got called as ctxm4 (Nicole's CTX-M-4 is different from our CTXM-4. our CTXM-4 is identical to both CTXM-5s)
  • oxa67 got called as oxa69 (Nicole's OXA-67 is different from our OXA-67)
    so these are faults in our reference catalogue.

Errors that I can't figure out:

We definitely need some validation code to run on the Nicole db, and to check for new issues when the db is updated. What is "the other genes" - all genes? The big 5? Anyway, this is great news.

Yes, sorry, forgot to mention that. "Other genes" are the big 5 for which alleles are present in our gn-amr-genes.fasta catalogue. (edited my previous comment to make things clearer)

With a 5% error rate during read generation, atlas genotype made 6 errors out of the 704 genes in the reference catalogue:
(in addition to the 5 from above)

  • ecoli_oxa248 got called as oxa69 (99.88% DNA identity, 1 SNP)
  • ecoli_kpc5 got called as kpc2 (99.89% DNA identity, 1 SNP, 1 AA change)
  • ecoli_shv141 got called as shv163 (99.77% DNA identity, 2 SNPs at the very end, 1 AA change)
  • ecoli_shv144 got called as shv38 (99.88% DNA identity, 1 SNP, 1 AA change)
  • ecoli_tem122 got called as tem163 (99.88% DNA identity, 1 SNP, 1 AA change)
  • ecoli_tem68 got called as tem47 (99.88% DNA identity, 1 SNP, 1 AA change)

(wgsim was set to a base error rate of 0.05; the default is 0.02)

Interesting. It might be good to have a look at the coverages on the whole panel.

If you run with the --keep_tmp option then it doesn't remove the temporary files after completing.

In the temp directory there will be files called :sample_id_:kmer_size_:panel_name.covgs. The columns are:

gene_name    colour    median_covg    min_non_zero_covg    percent_coverage

It would be interesting to see the rows for the "truth" call and the one predicted by atlas.
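Pulling those rows out and lining up truth vs. call could look like the sketch below. The column order follows the comment above; I'm assuming the file is plain whitespace-delimited, and `compare` is a hypothetical helper, not part of atlas.

```python
# Sketch: parse a .covgs file into a dict keyed by panel entry name,
# so the "truth" allele and atlas's call can be compared side by side.

def read_covgs(path):
    rows = {}
    with open(path) as fh:
        for line in fh:
            name, colour, median, min_nz, pct = line.split()
            rows[name] = {
                "colour": int(colour),
                "median_covg": int(median),
                "min_non_zero_covg": int(min_nz),
                "percent_coverage": float(pct),  # a proportion in [0, 1]
            }
    return rows

def compare(rows, truth, called):
    """Return the coverage rows for the truth allele and the call."""
    return rows[truth], rows[called]
```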

First line is the "truth", second line is the prediction by atlas:
tmp_call_info.txt
kpc?name=kpc&version=5 0 5 1 0.986063
kpc?name=kpc&version=2 0 5 1 0.986063

oxa?name=oxa&version=248 0 5 1 0.980099
oxa?name=oxa&version=69 0 5 1 0.980099

shv?name=shv&version=144 0 6 1 0.994048
shv?name=shv&version=38 0 6 1 0.994048

shv?name=shv&version=141 0 5 1 0.990476
shv?name=shv&version=163 0 5 1 0.991637

tem?name=tem&version=68 0 6 1 0.985714
tem?name=tem&version=47 0 6 1 0.997619

tem?name=tem&version=122 0 5 1 0.978571
tem?name=tem&version=163 0 5 1 0.978571

tem?name=tem&version=177 0 4 1 0.704762
tem?name=tem&version=191 0 5 1 0.881793

So you have median coverage between 1 and 5 and PERCENT coverage <1% ?

I thought that was proportional coverage, i.e. a percent coverage of e.g. 98%, because I never saw a value above 1.

What I found today is that atlas walk does better than atlas genotype. It only makes 3 of the 7 errors listed above (including tem117 -> tem191) when run with one of the two files of paired-end reads. It currently crashes when I try to run it with both files, due to memory limitations.
Also, the mistakes change depending on which of the paired-end read files was used.

Paired_end_1 as single_read:
image
Paired_end_2 as single_read:
image
for this run I also have the covgs info:
kpc?name=kpc&version=5 0 1 1 0.676851

shv?name=shv&version=141 0 1 1 0.749398
shv?name=shv&version=163 0 1 1 0.752116

tem?name=tem&version=68 0 1 1 0.681928
tem?name=tem&version=47 0 1 1 0.712048

tem?name=tem&version=122 0 1 1 0.680723
tem?name=tem&version=192 0 1 1 0.702628

Yes, that's 99% not 0.99%.

Interesting that one of the problems seems to be that we fall back on the first version if two versions are indistinguishable.
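That fallback can be seen in miniature below. This is an illustration of the failure mode, not atlas's actual code: picking a winner by coverage alone means a tie is silently broken by panel order.

```python
# Illustration of the tie-break problem: if two versions have identical
# coverage statistics, max() keeps the first-seen candidate, so the
# reported allele depends on the order of the panel.

def call_allele(candidates):
    """candidates: list of (name, percent_coverage, median_covg) tuples."""
    # max() is stable: on a tie it returns the earliest candidate.
    return max(candidates, key=lambda c: (c[1], c[2]))[0]
```

A deterministic fix would be to detect exact ties and report all equally-supported versions rather than an arbitrary one.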

Happy and slightly surprised to see atlas walk doing so well - it's still a very rough algorithm!

What do you mean by "which part of the paired-end-reads"? are you not running with both read sets together? I'd recommend doing this.

Mistaking TEM122 for TEM192 is odd, given they only have 87.5% sequence identity.

Ping @Phelimb - outstanding genotyping bug here with errors which are a long way from the truth

Mistaking TEM117 for TEM192 (with 87.5% identity) couldn't be reproduced when the coverage was raised to 50x with 0.5% errors (instead of 5%) and trailing Ns were removed. It might have been a coverage issue.

Yeah, it might also just be a result of that particular simulation, i.e. low coverage on that particular gene by chance. With 20x and 0.5% errors that isn't implausible.

I think 50x is a reasonable coverage for these simulations. Most of the data we have are >50x and very few are < 20x. We should be able to do well for low coverage samples soon but I think we should debug issues at higher coverage first. @iqbal-lab do you agree?

I think there are two separate issues.

  1. What are the simulations for
  2. How should atlas cope with low coverage.

Item 2 I will raise a separate issue about - now is a good time to refresh and improve our approach to coverage and mixture, and we should have a probabilistic output we can look at with confidence. I'll include a proposal in that issue.

This simulation is intended to

  1. Confirm that with reasonable coverage (50x fine) we correctly genotype the allele type in clonal samples
  2. Determine whether the walk algorithm has misassembly issues.
    (@Phelimb - could you please document formally what the walk algorithm is in pseudocode and bring it to our next personal meeting btw?)

So I think this issue is OK to close.

@ronald-jaepel this bug is closed. Are there any examples of Atlas walk wrongly finding an allele using a sensible error rate and coverage? I was under the impression there were examples, but this bug was closed because it was not reproducible with error rate 0.5% and coverage 50x. @Phelimb