TOUA_Glucosinolate_GWAS

Script (and partial data) repository for Gloss et al. 2021 "Genome-wide association mapping within a single Arabidopsis thaliana population reveals a richer genetic architecture for defensive metabolite diversity"

Files included in repo so far:

Population Genomic Analysis

vcf_subset.s - subset 1001G and TOUA for snp comparison, future analyses
snp_comp.R - short R script demonstrating SNP comparisons used in paper
1001_admixture.s - generate pruned marker set, run ADMIXTURE, and extract snp windows (1001 Genomes)
- 1001_batch_admixture.s - array job script to run ADMIXTURE (K=1-15)
- 1001_pop_strat_windows.sh - extract pruned marker set for 100KB regions around GWAS-inferred loci
TOUA_admixture.s - generate pruned marker set, run ADMIXTURE, and extract snp windows (TOUA)
- TOUA_batch_admixture.s - array job script to run ADMIXTURE (K=1-15)
- TOUA_pop_strat_windows.sh - extract pruned marker set for 100KB regions around GWAS-inferred loci
1001allele_freq_boss.sh - runs several helper scripts to downsample alleles and calculate af (1001 Genomes)
- generate_1001_subset.R - R script which generates random sample of 1001G accessions in Europe (n = num in TOUA)
- 1001allele_freq_worker.s - array job script to run genAlleleFrequencies.R on vcf part
- genAlleleFrequencies.r - R script which samples from smaller allele pool, calculates af
TOUA_allele_freq_boss.sh - runs several helper scripts to downsample alleles and calculate af (TOUA)
- TOUA_allele_freq_worker.s - array job script to run genAlleleFrequencies.R on vcf part
- genAlleleFrequencies.r - R script which samples from smaller allele pool, calculates af
calculate_TD.s - calculates Tajima's D for downsampled alleles in 1001 and TOUA
- fake_genotypes_from_freqs.pl - creates vcf-like genotype data from downsampled allele frequency files
- 1001_header.txt - simplified vcf header for 1001 Genomes vcf
- TOUA_header.txt - simplified vcf header for TOUA vcf
geo_boss.sh - Script to run array, parallelize distance calculations between accessions in Katz et al.
- ed_calc_worker.s - array job script which runs part of geo_fig_batch.R
- geo_fig_batch.R - R script to calculate (part) of Haversine distances between accessions

Population Genomic Plots

geo_plot.R - R script to generate figure 1a - proportion of non-matching GSL chemotypes
alleleFreq_and_TD_plots.R - R script to generate figure 1b, 1c - allele frequency spectrum and Tajima's D
population_stratifcation_plots.R - R script to generate figure S7 - divergence in allele frequency...

Glucosinolate Profiling

functions_metaboliteQuantification.R - Defines and describes custom R functions, which utilize various functions within xcms and related packages, for the peak quantification pipeline implemented in downstream scripts
script_metaboliteQuantification.R - Implementation of the peak quantification pipeline
script_representativeEicPlot.R - Production of a representative EIC plot, included as Fig. S2a
script_compareSliceAndPeakPicking.R - Production of plots comparing xcms peak integration and custom “slice” integration, included as Supplementary Note Fig. 2a

Phenotypic Analyses and GWAS Prep (TOU-A)

script_TOU_heritability.R - Load raw phenotype data, fit LMMs with covariates to get heritability estimates and per-accession BLUPs for each glucosinolate molecule
script_TOU_prepForGemma.R - Reformat BLUPs to prep input phenotype file for downstream GWAS using GEMMA
script_TOU_BLUPs2ratios4Gemma.R - Transform BLUPs back to original (linear) scale phenotypes, then calculate log2(ratios) of precursor:product abundances and prep input phenotype file for downstream GWAS using GEMMA

Phenotypic Analyses and GWAS Prep (other panels: Brachi, Katz, Wu datasets)

script_<DatasetName>_parseGlucPhenotypes.R - parse raw data from the published dataset indicated by <DatasetName> (for all datasets), fit LMMs with covariates to get heritability estimates and per-accession BLUPs for each glucosinolate molecule (for Brachi and Katz only)
script_<DatasetName>_prepForGemma.R - Reformat BLUPs (Brachi, Katz) or “means” (i.e., the single technical replicate reported per sample by Wu, which can be considered a mean of all the biological replicates pooled within in) to prep input phenotype file for downstream GWAS using GEMMA
script_<DatasetName>_prepForGemmaRatios.R - Transform BLUPs back to original (linear) scale phenotypes, then calculate log2(ratios) of precursor:product abundances and prep input phenotype file for downstream GWAS using GEMMA

Running GWAS

script_GWAS_kinship.txt - Kinship calculated for GWAS using GEMMA
script_GWAS_lmm.txt - Single-trait GWAS using GEMMA (LMM)
script_GWAS_mvlmm.txt - Multi-trait GWAS using GEMMA (mvLMM)

Parsing / Analyzing GWAS

GS_OX1_plots.R - R script to generate Figure S6 - Associations at a minor variant near GS-OX1

*** ADDITIONAL SCRIPTS - UPLOAD IN PROGRESS ***

HABIOTECH/TOUA_Glucosinolate_GWAS