Assess quality of GenSAS & GAWN annotations, choose one to use in analyses

Question

Assess quality of GenSAS & GAWN annotations, choose one to use in analyses

Closed this issue 2 years ago · 15 comments

Answer 1 · 2023-04-25T21:09:37.000Z

running RNASeq alignment with each annotation data to see how alignment rates compare
run BUSCO

Answer 2 · 2023-04-25T21:09:41.000Z

Run RNAseq through HiSat / STAR with the annotated genomes to assess the quality of the annotations. Which one do we want to use -- GenSAS or GAWN? Potentially run BUSCO as well on the annotated genes.
@sr320 with @kubu4 will pull the fastas from the annotations
@ggoetznoaa will run Hisat and/or STAR

Answer 3 · 2023-04-25T21:11:22.000Z

IGV is a possibility to further examine alignment of RNAseq to the annotated genes, if needed

Answer 4 · 2023-05-17T01:11:03.000Z

I vote for GenSAS - #25 (comment)

GAWN if very wonky. @ggoetznoaa what is your assessment from alignment?

Answer 5 · 2023-05-17T13:23:32.000Z

@sr320 I didn't realize the gtf/gff files were ready, I haven't done any alignments yet. Where can I grab them so I can do the alignments? @laurahspencer

Answer 6 · 2023-05-17T16:31:45.000Z

@ggoetznoaa check out #25 for links to all annotation files from both GAWN and GenSaS, including the gene .gff's. Let me know if you have questions!

Answer 7 · 2023-05-17T17:11:12.000Z

Ok, I see only one gff3 file for the GAWN output but I see two for the GenSAS. I'm assuming the correct one to use is the Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 file since the other one doesn't seem to include any notes about exons, CDS, or genes. Correct me if I'm wrong please.

Answer 8 · 2023-05-17T17:18:29.000Z

@ggoetznoaa yes that's the correct .gff from GenSaS (the other one is for repeat elements). Thanks!

Answer 9 · 2023-05-19T16:42:56.000Z

@sr320 @laurahspencer @kristamnichols
So here are the results from using STAR with the scaffold only reference + the gff files. Note, I had to convert the gff files to gtf files using gffread for STAR.

GAWN:

                                 Started job on |	May 18 07:48:03
                             Started mapping on |	May 18 07:48:08
                                    Finished on |	May 18 20:26:45
       Mapping speed, Million of reads per hour |	49.72

                          Number of input reads |	628684248
                      Average input read length |	291
                                    UNIQUE READS:
                   Uniquely mapped reads number |	167964190
                        Uniquely mapped reads % |	26.72%
                          Average mapped length |	274.46
                       Number of splices: Total |	57667901
            Number of splices: Annotated (sjdb) |	51288294
                       Number of splices: GT/AG |	54740890
                       Number of splices: GC/AG |	353673
                       Number of splices: AT/AC |	38370
               Number of splices: Non-canonical |	2534968
                      Mismatch rate per base, % |	1.16%
                         Deletion rate per base |	0.17%
                        Deletion average length |	2.74
                        Insertion rate per base |	0.22%
                       Insertion average length |	3.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	10658119
             % of reads mapped to multiple loci |	1.70%
        Number of reads mapped to too many loci |	478838
             % of reads mapped to too many loci |	0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	449217499
                 % of reads unmapped: too short |	71.45%
                Number of reads unmapped: other |	365602
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

GENSAS

                                 Started job on |	May 18 07:48:26
                             Started mapping on |	May 18 07:48:48
                                    Finished on |	May 18 20:22:40
       Mapping speed, Million of reads per hour |	50.04

                          Number of input reads |	628684248
                      Average input read length |	291
                                    UNIQUE READS:
                   Uniquely mapped reads number |	168192046
                        Uniquely mapped reads % |	26.75%
                          Average mapped length |	274.23
                       Number of splices: Total |	51522424
            Number of splices: Annotated (sjdb) |	9160977
                       Number of splices: GT/AG |	49391809
                       Number of splices: GC/AG |	268932
                       Number of splices: AT/AC |	17672
               Number of splices: Non-canonical |	1844011
                      Mismatch rate per base, % |	1.16%
                         Deletion rate per base |	0.17%
                        Deletion average length |	2.74
                        Insertion rate per base |	0.22%
                       Insertion average length |	3.23
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	10201122
             % of reads mapped to multiple loci |	1.62%
        Number of reads mapped to too many loci |	515333
             % of reads mapped to too many loci |	0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	449406169
                 % of reads unmapped: too short |	71.48%
                Number of reads unmapped: other |	369578
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

I used default settings. The only thing I could think of changing at this point would be sjdbOverhang.

Excerpt about it from the STAR manual.

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction
to be used in constructing the splice junctions database. Ideally, this length should be equal
to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina
2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the
ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as
well as the ideal value.

Answer 10 · 2023-05-19T16:47:52.000Z

Thanks - do you have information on the percent reads mapped to genes?

Answer 11 · 2023-05-19T16:54:11.000Z

What you see is all that I got out of STAR. There is a SJ.out.tab file for each run but that is talking about splice junctions. There is a quant mode I can run that will count the number of reads that map to 'genes'. Will that work for you?

With --quantMode GeneCounts option STAR will count number reads per gene while mapping.
A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-
end read are checked for overlaps. The counts coincide with those produced by htseq-count with
default parameters

Answer 12 · 2023-05-19T17:07:57.000Z

Yes please - looking for gene annotation accuracy here

On Fri, May 19, 2023 at 9:54 AM ggoetznoaa ***@***.***> wrote: What you see is all that I got out of STAR. There is a SJ.out.tab file for each run but that is talking about splice junctions. There is a quant mode I can run that will count the number of reads that map to 'genes'. Will that work for you? With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired- end read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/laurahspencer/DuMOAR/issues/24*issuecomment-1554907548__;Iw!!K-Hz7m0Vt54!n52C9eDbIo4OI-Hi9J_7xp6QH0a4Eu3j9ellQYsFlv3BAjxE9tGJQJbrCn6ti3jJ3Pwya5xlNn0gguNQEw-5CAM$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PNZA52XKEIDP3ZXCOZDXG6QT3ANCNFSM6AAAAAAXLQRIJ4__;!!K-Hz7m0Vt54!n52C9eDbIo4OI-Hi9J_7xp6QH0a4Eu3j9ellQYsFlv3BAjxE9tGJQJbrCn6ti3jJ3Pwya5xlNn0gguNQyPFXG2I$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Steven B. Roberts, Professor Associate Director - Graduate Program Coordinator School of Aquatic and Fishery Sciences University of Washington Fisheries Teaching and Research (FTR) Building - Office 232 1140 NE Boat Street - Seattle, WA 98105 robertslab.info - ***@***.*** - @sr320 vm:206.866.5141 - cell:360.362.3626 schedule a zoom call: https://d.pr/gsgxVJ

Answer 13 · 2023-05-22T14:23:34.000Z

Ok, so I remapped the reads to the scaffolds only reference using the gff files from GAWN and GENSAS using STAR's quantmode GeneCounts. Here is what I got.

From the STAR manual, so you understand the different columns

STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

GAWN:

                                 Started job on |	May 19 10:11:14
                             Started mapping on |	May 19 10:11:21
                                    Finished on |	May 19 22:57:28
       Mapping speed, Million of reads per hour |	49.24

                          Number of input reads |	628684248
                      Average input read length |	291
                                    UNIQUE READS:
                   Uniquely mapped reads number |	167964190
                        Uniquely mapped reads % |	26.72%
                          Average mapped length |	274.46
                       Number of splices: Total |	57667901
            Number of splices: Annotated (sjdb) |	51288294
                       Number of splices: GT/AG |	54740890
                       Number of splices: GC/AG |	353673
                       Number of splices: AT/AC |	38370
               Number of splices: Non-canonical |	2534968
                      Mismatch rate per base, % |	1.16%
                         Deletion rate per base |	0.17%
                        Deletion average length |	2.74
                        Insertion rate per base |	0.22%
                       Insertion average length |	3.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	10658119
             % of reads mapped to multiple loci |	1.70%
        Number of reads mapped to too many loci |	478838
             % of reads mapped to too many loci |	0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	449217499
                 % of reads unmapped: too short |	71.45%
                Number of reads unmapped: other |	365602
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

GeneCounts
N_unmapped     450061939 450061939 450061939
N_multimapping  10658119  10658119  10658119
N_noFeature     36265737  82925982  82750532
N_ambiguous     78636285  48312084  48287630
N_genecounts    53062168  36726124  36926028
total          628684248 628684248 628684248

8.4% unstranded mapping

GENSAS

                                 Started job on |	May 19 10:11:19
                             Started mapping on |	May 19 10:11:27
                                    Finished on |	May 19 22:49:15
       Mapping speed, Million of reads per hour |	49.78

                          Number of input reads |	628684248
                      Average input read length |	291
                                    UNIQUE READS:
                   Uniquely mapped reads number |	168192046
                        Uniquely mapped reads % |	26.75%
                          Average mapped length |	274.23
                       Number of splices: Total |	51522424
            Number of splices: Annotated (sjdb) |	9160977
                       Number of splices: GT/AG |	49391809
                       Number of splices: GC/AG |	268932
                       Number of splices: AT/AC |	17672
               Number of splices: Non-canonical |	1844011
                      Mismatch rate per base, % |	1.16%
                         Deletion rate per base |	0.17%
                        Deletion average length |	2.74
                        Insertion rate per base |	0.22%
                       Insertion average length |	3.23
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	10201122
             % of reads mapped to multiple loci |	1.62%
        Number of reads mapped to too many loci |	515333
             % of reads mapped to too many loci |	0.08%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	449406169
                 % of reads unmapped: too short |	71.48%
                Number of reads unmapped: other |	369578
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

GeneCounts
N_unmapped     450291080 450291080 450291080
N_multimapping  10201122  10201122  10201122
N_noFeature    141759082 154858765 155055338
N_ambiguous       963641    462265    465057
N_genecounts    25469323  12871016  12671651
total          628684248 628684248 628684248

4.1% unstranded mapping

Answer 14 · 2023-06-05T14:28:10.000Z

Ok, so update time. I've run both STAR using different settings as well as HISAT2 using default settings using both the GAWN and GENSAS gtf files. The STAR settings I used are as follows

    --outFilterScoreMinOverLread 0.33 \
    --outFilterMatchNminOverLread 0.33 \
    --outFilterMismatchNmax 2 \

I grabbed these from some searches on the internet, in some cases people went as far as settings the first two to 0.

STAR GAWN
37% overall mapping (33.88% unique, 3.10% multi)

STAR GENSAS
37% overall mapping (33.95% unique, 3.03% multi)

I used the default settings for HISAT2 and I used the hisat2_extract_exons.py and the hisat2_extract_splice_sites.py script to extract the exon and splice site info from the gtf files.

HISAT2 GAWN
Average mapping rate 23.64%

HISAT2 GENSAS
Average mapping rate 23.69%

Answer 15 · 2023-07-25T21:40:33.000Z

Decision is to use the GenSaS annotation file.