Quantifying precision and sensitivity of bambu using spike in data
apsteinberg opened this issue · 4 comments
Hi there,
Thank you for the wonderful tool. I was trying to characterize the precision and sensitivity of bambu in my hands using some of the long read spike in data you have provided in the sg-nex aws bucket. I wanted to do a similar test to what you have performed in Supp Fig 1m of your Nature Methods paper, in which you show enhanced performance of bambu without any annotations for the spike in sequences. However, I am finding that when I run bambu using the pre-trained model and also when training on the Homo_sapiens.GRCh38.91.gtf annotations, that none of the spike in sequences are found in the resultant gtf. I first used gffcompare to compare the gtf file for the spike ins you've provided in the sg-nex aws bucket, and then I took a look manually and also did not find the spike in sequences. This is the code I am running:
fa.file <- "APS008_Archive/ref_genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
gtf.file <"APS008_Archive/ref_genome/Homo_sapiens.GRCh38.91.gtf"
annotations <- prepareAnnotations(gtf.file)
se <- bambu(reads = sample.bam, annotations = annotations, ncore = ncores,
genome = fa.file, NDR = 0.1, quant = FALSE,
opt.discovery = list(fitReadClassModel = FALSE))
writeToGTF(se, outpath)
If you could please advise on what I'm doing wrong here, I'd greatly appreciate your help!
Cheers,
Asher
Hi Asher,
Thanks for using the tool. The reason no spike-ins are appearing in your output is the sequences that spike-ins are derived from are usually an artificial synthetic chromosome that is added to the fasta file. I see you are using a standard fasta file which does not contain these sequences. Therefore these spike-in reads either have nowhere to align to (if you started from fastq) or bambu does not consider them (if you are starting from the bam files). You may have noticed some warnings saying there are reads aligning to chromosomes not found in your provided genome.
To solve this, depending on the samples you are looking at, get the spike in fasta files for Sequins or the SIRVs from the website of the manufacturer of the spike-ins and append them to your fasta file. Then rerun your aligner (if starting from fastq) or bambu (if starting from bam).
Hope this helps,
Andre
Hi Andre,
Thanks for the quick response and for your help! I see -- I was starting with bam files. I understand why the reads would have nowhere to align to if I started from the fastq files, but starting from the bam files, why does bambu not consider the spike-in sequences if they are not contained in the standard fasta? And a follow-up question -- so basically, to do transcript discovery without annotations, I can use a reference gtf that doesn't include the spike-ins, but the reference fasta must include the synthetic chromosome?
In practice though, this shouldn't be a problem when doing novel transcript discovery on human genomes, because there will always be a corresponding chromosome for novel transcripts to match to?
Thanks again!
Cheers,
Asher
Hi Asher,
so basically, to do transcript discovery without annotations, I can use a reference gtf that doesn't include the spike-ins, but the reference fasta must include the synthetic chromosome?
Correct
We use the fasta file to extract the genome sequences for the aligned reads which we use for various purposes such as determining the strand of reads using the intron region and the presence of runs of A/Ts outside the aligned region that can causes biases, etc. As the reads need to be aligned anyway we assume the user will have access to the fasta file which they aligned the reads too so we hope this requirement is not too much of a burden.
I hope this answers your question, and all the best in replicating the figures. I will close this for now but feel free to open a new issue or recomment on this one if something else arises.
Kind Regards,
Andre Sim
Definitely makes sense. Thank you again for getting back to me so promptly!
Cheers,
Asher