Questions. Extracting reads for tx-models

Question

Questions. Extracting reads for tx-models

sschmeier opened this issue 6 years ago · 3 comments

Thanks for the tool. What I am particular interested is to extract the reads from the original bam files that were used to create a tx-model. I have used --verbose 2 and I think the information can be found in the output. Is this correct?

What I was wondering is what the best approach is when dealing with several bam-files, lets say replicates. I tried two way:

Run scallop on each file, merge the gtf files using gtfmerge. However, now the resulting gtf and gene-ids, do not match anymore the original ones found in the gtfs from the individual runs.
Merge the bam-files first to create one big sorted bam. Run scallop only once on this file. I hope that the resulting gft gene-ids are the ones found in the --verbose 2 log.

Do you have any suggestions on what the best approach here is? Many resulting tx-models are very similar between the two approaches but of course there are differences. However, currently I think the approach 1. cannot give me the read ids of the reads used for the final tx-models.

Thanks

Answer 1 · 2019-04-03T20:19:55.000Z

Hi there,
I have a similar question: What is the appropriate way to use scallop with multiple samples? For context, I have samples from 6 different tissue types, 10 individual samples of each. Do I run scallop on each bam file individually, and then use gffcompare to merge and evaluate the assembled transcripts? Or do I need to merge the bam files and run scallop on the merged file?
Thanks in advance for your help and for the great tool!
Sarah

Answer 2 · 2019-04-04T15:47:25.000Z

Hi Sarah and Sebastian,

Thank you for using Scallop!

For assembling multiple samples, I think you can try both ways: either first assemble individual sample and then merge the assembled transcripts, or first merging aligned reads and then assemble the combined bam file. For now do not have strong experimental results to conclude which way is better. We have some initial results suggesting that the first option (individual assembly followed by merging) works well (even better than some meta-assemblers in certain cases).

You can try our tool gtfmerge (https://github.com/Kingsford-Group/rnaseqtools) to merge the assembled transcripts from each individual sample. As Sebastian pointed out, the gene-id and transcript-id in the merged gtf file does not make much sense any more. As each transcript is annotated with coordinates, you can still (using other tools) cluster them into "genes" or map to existing gene annotations.

Best,
Mingfu

Answer 3 · 2019-05-14T04:06:45.000Z

Hey, any idea for the question of extracting reads that were used to build a tx-model?

Looking at the final merged tx and overlaying bam-files one can only guess which reads contributed to a tx...