This project summarize the data analysis for Roza2019 field. This is a project to identify molecular markers associanted with drought stress by GWAS and GS. The population was genotyped by high deep GBS (40X) and DArT. Breeding Insight (BI) uses the ID: Project # 11 and DAl22-7535.
-
Roza2019_GWAS.R is the main file wich runs GWASpoly analysis using a phenotypic matrix of
$400 \times 23$ and a genotypic matrix of$82,156 \times 501$ in GWASpoly format 'ACGT'. -
Install and run
updog
to check LD decay. -
Try SNP-based association to Gene-based association.
- SNV: Single nucleotide variant is a variation of a single nucleotide in a population's genome.
- SNP: Single nucleotide polimorphism is the same variation of a single nucleotide in a population's genome, but it is limeted to germline DNA and must be present in at leat 1% of the population. MAF filtering 0.01.
505 DNA samples were sent to GBS to University of Minnesota Genomic Center. 505 FASTQ files were retrieved. Metrics can be found in GBS/metrics.csv
. Six samples had low reads yield. Subsequent analysis were done with 499 files.
ID | fastp |
---|---|
338 | 102328 |
287 | 127897 |
327 | 140080 |
217 | 140974 |
204 | 162640 |
258.A07 | 293989 |
Please check file bash_scripts/4.1_ngsep_NMGS.sh
java -Xmx50g -Xms40g -jar ${NGSEP} MultisampleVariantsDetector -maxAlnsPerStartPos 100 -maxBaseQS 30 -ploidy 4 -psp -knownSTRs ${STR} -r ${GENOME} -o Roza2019_01.vcf
java -Xmx50g -Xms45g -jar ${NGSEP} VCFFilter -q 40 -s -fi -m 250 -minRD 8 -i Roza2019_01.vcf -o ../4_vcf_monoploid/Roza2019_02.vcf
Oct 11, 2023 2:58:08 PM ngsep.vcf.
VCFFilter logParameters
INFO: Input file: Roza2019_01.vcf
Output file: ../4_vcf_monoploid/ \ Roza2019_02.vcf
Genotype filters
Minimum genotype quality: 40 -q 40
Minimum read depth: 8 -minRD 8
Variant context filters
Population data filters
Minimum samples genotyped: 250 -m 250
Keep only biallelic SNVs -s
Filter sites where only one allele is observed in the population -fi
java -Xmx50g -Xms45g -jar ${NGSEP} VCFImpute -i Roza2019_02.vcf -o Roza2019_03
Oct 12, 2023 8:46:36 AM ngsep.variants.imputation.GenotypeImputer logParameters
INFO: Input file: Roza2019_02.vcf
Prefix for output files: Roza2019_03
Parents: []
Number of haplotype clusters: 8
Window size: 5000
Overlap: 50
Average centimorgans per Kbp: 0.001 \
Roza2019_03_imputed.vcf contains 499 taxa and 243,454 sites.
java -Xmx50g -Xms45g -jar ${NGSEP} VCFFilter -minMAF 0.05 -i Roza2019_03_imputed.vcf -o Roza2019_04.vcf
INFO: Input file: Roza2019_03_imputed.vcf
Output file: Roza2019_04.vcf
Genotype filters
Minimum genotype quality: 0
Minimum read depth: 0
Variant context filters
Population data filters
Minimum minor allele frequency (MAF): 0.05 \
500 taxa = 100% taxa, 25 taxa = 5% taxa. At least 25 taxa contain minor allele. Roza2019_04.vcf contains 499 taxa and 62,839 sites.
java -Xmx50g -Xms45g -jar ${NGSEP} VCFConverter -GWASPoly -i Roza2019_03_imputed.vcf -o Roza2019_04
Roza2019_05_GWASPoly.txt
contains 499 taxa
Roza2019_05_GWASPoly.txt in ACGT
format is converted in numeric format using the function atcg1234
from Sommer R package. Intersect between numeric Format of GWASpoly matrix and phenotypic matrix produce a vector of 424 taxa. 424 taxa
The function snp.pruning
from ASRgenomics R package allows to remove snps in LD to reduce redundant markers.
Mpr <- snp.pruning(M = geno.ps, pruning.thr = 0.90, window.n = 100, by.chrom = F, overlap.n = 10, seed = 1208, iterations = 20)
.
Pruned matrix contains 424 taxa
New files are saved in Box: Roza2019_06_GWASPoly.txt
is the file for GWASpoly and Roza2019_06_GS.txt
is the file for GS. Both files have 424 taxa
Table 1. Summary of VCF files generated during the process.
ID | File | Ind | Markers | Imputed | Annotated |
---|---|---|---|---|---|
1 | Roza2019_01.vcf | 499 | 1847214 | F | F |
2 | Roza2019_02.vcf | 499 | 243454 | F | F |
3 | Roza2019_03_imputed.vcf | 499 | 243454 | T | F |
4 | Roza2019_04.vcf | 499 | 62839 | T | F |
5 | Roza2019_06_GWASPoly.txt | 424 | 51081 | T | F |
6 | Roza2019_07.vcf | 499 | 396663 | F | F |
7 | Dif_02_07 (indels + ssr) | 499 | 153209 | F | F |
8 | Roza2019_05.vcf | 499 | 51081 | T | F |
9 | Roza2019_06.vcf | 499 | 51081 | T | T |
Yield was collected from 2019 to 2023 in 436 accessions (Families). Yield of three replicates was modelled using SpATS to obtain BLUEs by each env (harvest).
Table 1. Yield harvest collected in Roza2019. ID correspond to a unique harvest dataset. Env correspond to a harvest time. Harv correspond to consecutive harvest by year or season. Drought stress (DS) is increasing every season. May and June were harvests without stress (NS), and July, August, and September were harvests with DS. DS1 and DS 2 correspond to drought stress factor. H2 is broad sense heritability.
ID | Trait | Env | Month | Year | Date_YMD | Harv | DS1 | DS2 | H2 |
---|---|---|---|---|---|---|---|---|---|
1 | Yield | may_20 | May | 2020 | 2020-04-28 | 1 | 0 | NS | 0.27 |
2 | Yield | jun_20 | Jun | 2020 | 2020-06-02 | 2 | 0 | NS | 0.43 |
3 | Yield | jul_20 | Jul | 2020 | 2020-07-01 | 3 | 1 | DS | 0.24 |
4 | Yield | aug_20 | Aug | 2020 | 2020-08-05 | 4 | 2 | DS | 0.31 |
5 | Yield | sep_20 | Sep | 2020 | 2020-09-01 | 5 | 3 | DS | 0.35 |
6 | Yield | may_21 | May | 2021 | 2021-05-18 | 1 | 0 | NS | 0.41 |
7 | Yield | jun_21 | Jun | 2021 | 2021-06-16 | 2 | 0 | NS | 0.33 |
8 | Yield | jul_21 | Jul | 2021 | 2021-07-13 | 3 | 1 | DS | 0.38 |
9 | Yield | aug_21 | Aug | 2021 | 2021-08-10 | 4 | 2 | DS | 0.3 |
10 | Yield | sep_21 | Sep | 2021 | 2021-09-16 | 5 | 3 | DS | 0.19 |
11 | Yield | jun_22 | Jun | 2022 | 2022-05-23 | 1 | 0 | NS | 0.28 |
12 | Yield | jul_22 | Jul | 2022 | 2022-06-28 | 2 | 1 | DS | 0.25 |
13 | Yield | aug_22 | Aug | 2022 | 2022-08-01 | 3 | 2 | DS | 0.22 |
14 | Yield | sep_22 | Sep | 2022 | 2022-08-29 | 4 | 3 | DS | 0.35 |
15 | Yield | may_23 | May | 2023 | 2023-05-22 | 1 | 0 | NS | 0.22 |
16 | Yield | jun_23 | Jun | 2023 | 2023-06-20 | 2 | 1 | NS | 0.2 |
17 | Yield | jul_23 | Jul | 2023 | 2023-07-17 | 3 | 2 | DS | 0.23 |
18 | Yield | aug_23 | Aug | 2023 | 2023-08-24 | 4 | 3 | DS | 0.09 |
19 | Yield | total_20 | − | 2020 | − | − | − | − | 0.42 |
20 | Yield | total_21 | − | 2021 | − | − | − | − | 0.46 |
21 | Yield | total_22 | − | 2022 | − | − | − | − | 0.3 |
22 | Yield | total_23 | − | 2023 | − | − | − | − | 0.24 |
GWAS results with BLUEs from raw yield have suspicious (inflated) results in aug_22, jun_20, jun_22, jun_23, may_21, may_23, sep_20, jul_22.
Samples considered outliers by qqplot with BLUEs
aug_22 | jul_22 | may_21 | jun_22 | may_22 | order | Ntimes | Plant_ID |
---|---|---|---|---|---|---|---|
347 | 347 | 347 | 111 | 347 | 347 | 5 | 584 |
84 | 172 | 322 | 347 | 172 | 197 | 2 | 423 |
299 | 363 | 197 | 363 | 197 | 84 | 2 | 186 |
212 | 84 | 258 | 322 | 329 | 172 | 2 | 274 |
363 | 2 | 570 | |||||
322 | 2 | 515 |
The problem is, if I remove the samples considered outliers, there are many genotypes where all three reps are considered outliers, and I am removing important information of extreme geneotypes.
Roza2019 is a half-sib population with 436 alfalfa half-sib families each one contains at least 18 individuals which were developed by nested associated mapping (NAM).
Insert table of population.
Roza 2019 NAM population was constructed using a drought resistant parent population from WA and UT (1608) and a drought sensitive parent population from Guardsman II synthetic cultivar (1613). Guardsman II was planted in the greenhouse and after an evaluation protocol and selected 30 genotypes that were most susceptible to drought stress. Seed of F1 population (1622) were collected of 1613 parent and labeled A-F while seeds collected of 1608 parent were labeled G-L.
Harvest was collected from 2020 to 2023. Data modeling can be done using two stages approach or random regression model. Because this is an experiment with repeated measurements (longitudinal data), Random Regression Analysis is a good option.
Single | Trial | Stagewise | ||||
---|---|---|---|---|---|---|
Roza2019 | 2020 | 2021 | 2022 | gen_month | gen_year | gen_overall |
May | May | - | ST2_may | ST3_20 | ST4_Yi | |
Jun | Jun | Jun | ST2_jun | ST3_21 | - | |
Jul | Jul | Jul | ST2_jul | ST3_22 | - | |
Aug | Aug | Aug | ST2_aug | - | - | |
Sep | Sep | Sep | ST2_sep | - | - | |
Total | Total | Total | - | - | - |
Note: obtain percentage of each yield in total yield. Harvest 1 + Harvest 2 + Harvest 4 = 100% (Season total yield)