Check why only 1/5 of reads are counted
Closed this issue · 5 comments
The majority of reads are either ignored because they do not overlap with a feature or they are multi-mapped.
Check strand used to count reads.
Use abundant 3'UTR rather than abundant (for whole transcript not just 3'UTR).
Look at raw BAM files in IGB and pick out 5-10 read ids. (On each strands, in 3'UTR, or overlapping ect). Then extract these reads and run featureCounts to see behaviour.
Check which genes are ignored due to multi-mapping or no features.
I have checked that:
- featureCounts includes strandedness
- the gff annotation contains the ORF + 3'UTR
- hisat-2 alignment contains 10% or less multi-aligned reads
- enabled featureCounts to double count multi-aligned reads and reads overlapping multiple features
Even with all of this the best I can do is have 40% of reads counted. The rest do no overlap a single feature (ORF + 3'UTR + 5'UTR).
Tomorrow (30th Jun) will select a handful of reads and find where they are being mapped to.
I realised that read 1 for the pairs is reversely stranded in quantseq REV which is important for featureCounts. I have changed the standedness appropriately and now get up to 70% of reads counted. The majority of the remaining counts are considered 'multimapped' so aren't counted. I opened the supposed multimapped genes in IGB and can see they the vast majority of these genes are in regions without annotations I have no idea why they are flagged as multimapped rather than no feature. This has highlighted that some genes are missing from Edwards 3'UTR annotation, which I will quickly investigate.
I have created a bam file of the paired reads that were not assigned to a gene and visualised them in IGB. It appears the majority of these reads are mapped to transposable elements or to the ends of chromosomes. I am therefore happy with the current 70% of reads counted.
70% of reads aligned and the rest multi-mapped to TEs and chromosome ends is great!
Bonus points for a summary figure showing mapping to TEs and chromosome ends. For this project this is only relevant as a quality control because our question is about 3' ends of a small set of target genes. But in principle it's nice to have a clear figure that communicates what went wrong with the rest. So, only for bonus points / low priority.