DimmestP/chimera-quantseq

Check why only 1/5 of reads are counted

Closed this issue · 5 comments

The majority of reads are either ignored because they do not overlap with a feature or they are multi-mapped.

Check strand used to count reads.

Use abundant 3'UTR rather than abundant (for whole transcript not just 3'UTR).

Look at raw BAM files in IGB and pick out 5-10 read ids. (On each strands, in 3'UTR, or overlapping ect). Then extract these reads and run featureCounts to see behaviour.

Check which genes are ignored due to multi-mapping or no features.

I have checked that:

  • featureCounts includes strandedness
  • the gff annotation contains the ORF + 3'UTR
  • hisat-2 alignment contains 10% or less multi-aligned reads
  • enabled featureCounts to double count multi-aligned reads and reads overlapping multiple features

Even with all of this the best I can do is have 40% of reads counted. The rest do no overlap a single feature (ORF + 3'UTR + 5'UTR).

Tomorrow (30th Jun) will select a handful of reads and find where they are being mapped to.

I realised that read 1 for the pairs is reversely stranded in quantseq REV which is important for featureCounts. I have changed the standedness appropriately and now get up to 70% of reads counted. The majority of the remaining counts are considered 'multimapped' so aren't counted. I opened the supposed multimapped genes in IGB and can see they the vast majority of these genes are in regions without annotations I have no idea why they are flagged as multimapped rather than no feature. This has highlighted that some genes are missing from Edwards 3'UTR annotation, which I will quickly investigate.

I have created a bam file of the paired reads that were not assigned to a gene and visualised them in IGB. It appears the majority of these reads are mapped to transposable elements or to the ends of chromosomes. I am therefore happy with the current 70% of reads counted.

Can close if @ewallace is happy

70% of reads aligned and the rest multi-mapped to TEs and chromosome ends is great!

Bonus points for a summary figure showing mapping to TEs and chromosome ends. For this project this is only relevant as a quality control because our question is about 3' ends of a small set of target genes. But in principle it's nice to have a clear figure that communicates what went wrong with the rest. So, only for bonus points / low priority.