Assess quality of GenSAS & GAWN annotations, choose one to use in analyses
Closed this issue · 15 comments
- running RNASeq alignment with each annotation data to see how alignment rates compare
- run BUSCO
Run RNAseq through HiSat / STAR with the annotated genomes to assess the quality of the annotations. Which one do we want to use -- GenSAS or GAWN? Potentially run BUSCO as well on the annotated genes.
@sr320 with @kubu4 will pull the fastas from the annotations
@ggoetznoaa will run Hisat and/or STAR
IGV is a possibility to further examine alignment of RNAseq to the annotated genes, if needed
I vote for GenSAS - #25 (comment)
GAWN if very wonky. @ggoetznoaa what is your assessment from alignment?
@sr320 I didn't realize the gtf/gff files were ready, I haven't done any alignments yet. Where can I grab them so I can do the alignments? @laurahspencer
@ggoetznoaa check out #25 for links to all annotation files from both GAWN and GenSaS, including the gene .gff's. Let me know if you have questions!
Ok, I see only one gff3 file for the GAWN output but I see two for the GenSAS. I'm assuming the correct one to use is the Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 file since the other one doesn't seem to include any notes about exons, CDS, or genes. Correct me if I'm wrong please.
@ggoetznoaa yes that's the correct .gff from GenSaS (the other one is for repeat elements). Thanks!
@sr320 @laurahspencer @kristamnichols
So here are the results from using STAR with the scaffold only reference + the gff files. Note, I had to convert the gff files to gtf files using gffread for STAR.
GAWN:
Started job on | May 18 07:48:03
Started mapping on | May 18 07:48:08
Finished on | May 18 20:26:45
Mapping speed, Million of reads per hour | 49.72
Number of input reads | 628684248
Average input read length | 291
UNIQUE READS:
Uniquely mapped reads number | 167964190
Uniquely mapped reads % | 26.72%
Average mapped length | 274.46
Number of splices: Total | 57667901
Number of splices: Annotated (sjdb) | 51288294
Number of splices: GT/AG | 54740890
Number of splices: GC/AG | 353673
Number of splices: AT/AC | 38370
Number of splices: Non-canonical | 2534968
Mismatch rate per base, % | 1.16%
Deletion rate per base | 0.17%
Deletion average length | 2.74
Insertion rate per base | 0.22%
Insertion average length | 3.24
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10658119
% of reads mapped to multiple loci | 1.70%
Number of reads mapped to too many loci | 478838
% of reads mapped to too many loci | 0.08%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 449217499
% of reads unmapped: too short | 71.45%
Number of reads unmapped: other | 365602
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
GENSAS
Started job on | May 18 07:48:26
Started mapping on | May 18 07:48:48
Finished on | May 18 20:22:40
Mapping speed, Million of reads per hour | 50.04
Number of input reads | 628684248
Average input read length | 291
UNIQUE READS:
Uniquely mapped reads number | 168192046
Uniquely mapped reads % | 26.75%
Average mapped length | 274.23
Number of splices: Total | 51522424
Number of splices: Annotated (sjdb) | 9160977
Number of splices: GT/AG | 49391809
Number of splices: GC/AG | 268932
Number of splices: AT/AC | 17672
Number of splices: Non-canonical | 1844011
Mismatch rate per base, % | 1.16%
Deletion rate per base | 0.17%
Deletion average length | 2.74
Insertion rate per base | 0.22%
Insertion average length | 3.23
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10201122
% of reads mapped to multiple loci | 1.62%
Number of reads mapped to too many loci | 515333
% of reads mapped to too many loci | 0.08%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 449406169
% of reads unmapped: too short | 71.48%
Number of reads unmapped: other | 369578
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
I used default settings. The only thing I could think of changing at this point would be sjdbOverhang.
Excerpt about it from the STAR manual.
--sjdbOverhang specifies the length of the genomic sequence around the annotated junction
to be used in constructing the splice junctions database. Ideally, this length should be equal
to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina
2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the
ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as
well as the ideal value.
What you see is all that I got out of STAR. There is a SJ.out.tab file for each run but that is talking about splice junctions. There is a quant mode I can run that will count the number of reads that map to 'genes'. Will that work for you?
With --quantMode GeneCounts option STAR will count number reads per gene while mapping.
A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-
end read are checked for overlaps. The counts coincide with those produced by htseq-count with
default parameters
Ok, so I remapped the reads to the scaffolds only reference using the gff files from GAWN and GENSAS using STAR's quantmode GeneCounts. Here is what I got.
From the STAR manual, so you understand the different columns
STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
GAWN:
Started job on | May 19 10:11:14
Started mapping on | May 19 10:11:21
Finished on | May 19 22:57:28
Mapping speed, Million of reads per hour | 49.24
Number of input reads | 628684248
Average input read length | 291
UNIQUE READS:
Uniquely mapped reads number | 167964190
Uniquely mapped reads % | 26.72%
Average mapped length | 274.46
Number of splices: Total | 57667901
Number of splices: Annotated (sjdb) | 51288294
Number of splices: GT/AG | 54740890
Number of splices: GC/AG | 353673
Number of splices: AT/AC | 38370
Number of splices: Non-canonical | 2534968
Mismatch rate per base, % | 1.16%
Deletion rate per base | 0.17%
Deletion average length | 2.74
Insertion rate per base | 0.22%
Insertion average length | 3.24
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10658119
% of reads mapped to multiple loci | 1.70%
Number of reads mapped to too many loci | 478838
% of reads mapped to too many loci | 0.08%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 449217499
% of reads unmapped: too short | 71.45%
Number of reads unmapped: other | 365602
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
GeneCounts
N_unmapped 450061939 450061939 450061939
N_multimapping 10658119 10658119 10658119
N_noFeature 36265737 82925982 82750532
N_ambiguous 78636285 48312084 48287630
N_genecounts 53062168 36726124 36926028
total 628684248 628684248 628684248
8.4% unstranded mapping
GENSAS
Started job on | May 19 10:11:19
Started mapping on | May 19 10:11:27
Finished on | May 19 22:49:15
Mapping speed, Million of reads per hour | 49.78
Number of input reads | 628684248
Average input read length | 291
UNIQUE READS:
Uniquely mapped reads number | 168192046
Uniquely mapped reads % | 26.75%
Average mapped length | 274.23
Number of splices: Total | 51522424
Number of splices: Annotated (sjdb) | 9160977
Number of splices: GT/AG | 49391809
Number of splices: GC/AG | 268932
Number of splices: AT/AC | 17672
Number of splices: Non-canonical | 1844011
Mismatch rate per base, % | 1.16%
Deletion rate per base | 0.17%
Deletion average length | 2.74
Insertion rate per base | 0.22%
Insertion average length | 3.23
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10201122
% of reads mapped to multiple loci | 1.62%
Number of reads mapped to too many loci | 515333
% of reads mapped to too many loci | 0.08%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 449406169
% of reads unmapped: too short | 71.48%
Number of reads unmapped: other | 369578
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
GeneCounts
N_unmapped 450291080 450291080 450291080
N_multimapping 10201122 10201122 10201122
N_noFeature 141759082 154858765 155055338
N_ambiguous 963641 462265 465057
N_genecounts 25469323 12871016 12671651
total 628684248 628684248 628684248
4.1% unstranded mapping
Ok, so update time. I've run both STAR using different settings as well as HISAT2 using default settings using both the GAWN and GENSAS gtf files. The STAR settings I used are as follows
--outFilterScoreMinOverLread 0.33 \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 2 \
I grabbed these from some searches on the internet, in some cases people went as far as settings the first two to 0.
STAR GAWN
37% overall mapping (33.88% unique, 3.10% multi)
STAR GENSAS
37% overall mapping (33.95% unique, 3.03% multi)
I used the default settings for HISAT2 and I used the hisat2_extract_exons.py and the hisat2_extract_splice_sites.py script to extract the exon and splice site info from the gtf files.
HISAT2 GAWN
Average mapping rate 23.64%
HISAT2 GENSAS
Average mapping rate 23.69%
Decision is to use the GenSaS annotation file.