/killifish-RADseq-4popQTL

Repo for the analysis for QTL of PCB resistance in four populations of killifish

Primary LanguageR

killifish-RADseq-4popQTL

Methods

Background :

We created four QTL mapping families by breeding a female fish from each of four pollution-resistant populations with a male fish from a sensitive population. Embryos from F2 intercross families were challenged with PCB-126 exposure during development. Developing embryos were phenotyped for sensitivity by scoring the severity of PCB-induced cardiovascular system developmental defects, which are characteristic of PCB-126’s mechanism of toxicity. We genotyped the ~48 most-resistant and ~48 most-sensitive individuals from each family.

Additional genotypes: In addition to genome-wide genotyping with RAD-seq, we also sought to confirm genotypes at two discrete loci: at the loci that encode aryl hydrocarbon receptor 2a/1a (AHR2a/1a) and aryl hydrocarbon receptor interacting protein (AIP). These loci encode proteins that are core components of the signaling pathway that is adaptively de-sensitized in resistant fish (Whitehead et al. 2012), and that show among the strongest signatures of natural selection in resistant populations (Reid et al. 2016). Since the adaptive haplotypes are not fixed in resistant populations (Reid et al. 2016), we sought to confirm whether adaptive haplotypes were segregating in our mapping families. We genotyped all ELR offspring for a deletion that spans the last exon of AHR2a and the first six exons of neighboring AHR1a (83 kb deletion) which exists at 81% frequency in wild ELR fish but is absent in sensitive fish from a nearby reference population (methods described in (Reid et al. 2016)).

Genotyping:

We used RAD-seq (Miller et al. 2007) for genotyping individuals from F2 intercross families and mapping family founders. DNA was extracted from frozen embryos with a proteinase-K digestion and the Qiagen DNAeasy kit. Sequencing libraries were prepared for RAD-seq by ligating individual barcodes and NEBnext Illumina oligonucleotides to genomic DNA at SbfI cut sites. Four lanes of paired-end (PE-100) sequence data (Illumina HiSeq 2500) were collected (one lane per plate of 96 samples) to enable genotyping of offspring and founders at RAD sites. After de-multiplexing by barcode (perl script authored by Mike Miller, UC Davis) link and evaluatied the quality of each sample with FASTQC, reads were aligned to a linkage-mapped Fundulus heteroclitus assembly ((Miller et al. 2019); EBI BioStudies accession S-BSST163; this map orders scaffolds from the Fundulus heteroclitus reference genome assembly Fundulus_heteroclitus-3.0.2, NCBI BioProject PRJNA177717). We assigned read group information and aligned reads with BWA-MEM using default parameters, marked duplicates with SAMBLASTER, sorted reads with SAMTOOLS, and flagged improperly paired reads with BAMTOOLS (Barnett et al. 2011; Faust and Hall 2014).link. For each chromosome, Freebayes called genotypes on all mapping populations simultaneously (joint genotyping)(Garrison 2018) link. We clustered sample genotypes with metric MDS for sample ID QAQC and to cluster genotypes at chromosome 5 to assign sex to the immature embryos to include sex as a covariate in the QTL search and model fit (Figure 2).

plot

Figure 2. Metric MDS on un-filtered SNPs. Sample ID was checked by clustering individuals with metric MDS and PLINKS relationship/covariance Identity-by-state calculation engine on marker data. Colored dots represent embryos from F2 intercross families. Colored text represent progenitors of mapping families. Clustering is largely driven by sharing more common markers in the offspring, which causes grand-parents (founders) to not cluster as tightly even though they are related to offspring as much as they are to one another. One grandparent of ELR was an incorrectly barcoded offspring sample (purple dots).

Bi-alleleic SNPs, repetitive sequence alignments, family-specific invariant sites were filtered and the remaining loci were split families into separate datasets with PLINK 1.9 (Chang et al. 2015) link. Genotypes were formatted for each mapping family to be loaded as independent crosses for QTL analysis in R/qtl (Broman et al. 2003) web,git,cran. In R/qtl, offspring genotypes were filtered based on the number of genotypes per locus (>12.5% missing data) and the level of segregation distortion (p<0.001) from a 1:2:1 ratio following R/qtl recommendations (Broman and Sen 2009) link. Outcrosses in R/qtl are assumed to be from inbred founders, so our analysis was limited to markers that were homozygous for alternate alleles in the founders (double heterozygotes in F1 parents). Our segregation distortion filter was conservative because selective genotyping is expected to lead to some distortion, particularly at QTL of large effect (Xu 2008). After filtration, markers that were genotyped in the founders and F2 embryos were assigned to a consistent allele (A or B) within and among linkage groups. Any markers that could not be confirmed with founder genotypes were filtered to those that were clearly linked with other markers in the linkage group (formLinkageGroups in R/qtl with a recombination frequency < 0.1 and LOD score of 10). We initially anchored markers to their physical position along chromosomes, but then used recombination frequency to reorder (assign a different order than the initial meiotic map) markers and estimated mapping distance in this order with the Kosambi mapping algorithm (Kosambi 1943). Differences from the initial meiotic map may be the result of genome structural variation within a mapping population or read mapping and genome assembly errors. The final filtered marker set sufficiently covered the reference genome physical map and provided a sensible mapping order by visualizing the pairwise recombination frequency and LOD linkage matrix for each chromosome, as well as the relationship between genetic and physical distances along chromosomes (see supplement). After filtering, the number of markers remaining for mapping were: 1,763 for the NBH family, 3,361 for the BRP family, 6,690 for the NEW family, and 10,388 for the ELR family. The genotypes at the AHR2a/1a locus (presence/absence of a deletion) were added to the QTL analysis with RAD-Tag markers to test for linkage with other chromosomal markers and for an association with the resistant phenotype.

After filtering down to a manageable number of markers (rQTL stores all markers in memory), I performed most of the QTL analysis and model building in interactive R sessions link. Prior to interval mapping, we performed a marker regression on un-filtered markers to set the priors for exploring QTL model-space for interval mapping (Figure 3). We refined markers by iteratively visualizing LOD of markers, offspring genotypes, and physical position of snps for each population to ensure that we had informative markers along each chromosome. We then performed interval mapping on the filtered markers in R/qtl to test the single QTL model on each chromosome via multiple imputation (n=500) and Haley-Knott regression, which estimates the most likely interval genotypes and QTL position between RAD markers from the log posterior distribution. If the LOD score for the single QTL test exceeded our permuted threshold (top 15% of the null distribution) in a single QTL scan, the position of the highest LOD score was added to the full QTL model.

We compared multiple single and full-QTL models bewetween methods (normal, binary, transformed data with parametric and non-parametric models). examined models that included only the initial two major QTL discovered by marker regression (including the addition of the large effect QTL as cofactors for scanning for additional QTL), as well as more inclusive models that included the minor effect QTL in the full-model that were discovered by single QTL scans. QTL were dropped if they did not significantly change the fit of the full-model in the drop-one-qtl analysis. The full model in R/qtl estimated the additive and dominance effect of each of the QTL in the model. We compared their relative effect sizes to illuminate the genetic architecture of resistance in each of the mapping families. However, the full-model of selectively genotyped crosses would be expected to overestimate the proportion of variance explained by the QTL.

Selection Scan Data:

To further refine my QTL to candidate genes that contribute to resistance, we intersected previously identified signatures of selection from population genomics studies (Reid et al. 2016) with the wider QTL peaks identified here by aligning both datasets to the same chromosomal coordinates. This approach has been effective for connecting QTLs to strong selection from domestication and breeding (Rubin et al. 2012). Signatures of selection were identified from whole-genome scans of diversity and divergence between each of the four resistant populations and nearby non-resistant populations (not performed by me or code in this repo)(Reid et al. 2016). In the population genomics studies, genomic scaffolds were scanned for extreme values of pi, Fst, and Tajimas D in 5kb windows to identify signatures of strong recent natural selection. The selection signature thresholds for each statistic were set using a demographic model to simulate a null distribution of each statistic. These thresholds were used to convert the statistics into an aggregate z-score. The Z-scores (which included the width and magnitude of the signature region) formed a composite rank for each signature of selection in each polluted population. Fst, delta pi, and composite ranks on the scaffold level genome assembly were lifted over to a physical position on the killifish chromosomal assembly link (Miller et al. 2016, Fundulus_heteroclitus-3.0.2, NCBI BioProject PRJNA177717) with liftover using a chain-file generated by ALLMAPS (Hinrichs et al. 2006; Tang et al. 2015) link. QTL intervals were evaluated against the 50 and 200 top-ranked composite signatures of selection from (Reid et al. 2016).

References

Barnett, D. W., E. K. Garrison, A. R. Quinlan, M. P. Stromberg, and G. T. Marth. 2011. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27:1691–1692.

Broman, K. W., H. Wu, S. Sen, and G. A. Churchill. 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19:889–890.

Broman, K. W., and S. Sen. 2009. A guide to QTL mapping with R/qtl. Springer, Dordrecht.

Chang, C. C., C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell, and J. J. Lee. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4.

Clark, B. W., C. W. Matson, D. Jung, and R. T. Di Giulio. 2010. AHR2 mediates cardiac teratogenesis of polycyclic aromatic hydrocarbons and PCB-126 in Atlantic killifish (Fundulus heteroclitus). Aquat. Toxicol. 99:232–240.

Faust, G. G., and I. M. Hall. 2014. SAMBLASTER: Fast duplicate marking and structural variant read extraction. Pp. 2503–2505 in Bioinformatics.

Garrison, E. 2018. freebayes: Bayesian haplotype-based genetic polymorphism discovery and genotyping.

Kosambi, D. D. 1943. The estimation of map distances from recombination values. Annals of Eugenics 12:172–175.

Miller, J. T., N. M. Reid, D. E. Nacci, and A. Whitehead. 2019. Developing a High-Quality Linkage Map for the Atlantic Killifish Fundulus heteroclitus. G3 9:2851–2862.

Miller, M. R., J. P. Dunham, A. Amores, W. A. Cresko, and E. A. Johnson. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research 17:240–248.

Oziolor, E. M., E. Bigorgne, L. Aguilar, S. Usenko, and C. W. Matson. 2014. Evolved resistance to PCB- and PAH-induced cardiac teratogenesis, and reduced CYP1A activity in Gulf killifish (Fundulus grandis) populations from the Houston Ship Channel, Texas. Aquat. Toxicol. 150:210–219.

Oziolor, E. M., N. M. Reid, S. Yair, K. M. Lee, S. Guberman VerPloeg, P. C. Bruns, J. R. Shaw, A. Whitehead, and C. W. Matson. 2019. Adaptive introgression enables evolutionary rescue from extreme environmental pollution. Science 364:455–457.

Reid, N. M., D. A. Proestou, B. W. Clark, W. C. Warren, J. K. Colbourne, J. R. Shaw, S. I. Karchner, M. E. Hahn, D. Nacci, M. F. Oleksiak, D. L. Crawford, and A. Whitehead. 2016. The genomic landscape of rapid repeated evolutionary adaptation to toxic pollution in wild fish. Science 354:1305–1308.

Rubin, C.-J., H.-J. Megens, A. M. Barrio, K. Maqbool, S. Sayyab, D. Schwochow, C. Wang, Ö. Carlborg, P. Jern, C. B. Jørgensen, A. L. Archibald, M. Fredholm, M. A. M. Groenen, and L. Andersson. 2012. Strong signatures of selection in the domestic pig genome. PNAS 109:19529–19536. National Academy of Sciences.

Tang, H., X. Zhang, C. Miao, J. Zhang, R. Ming, J. C. Schnable, P. S. Schnable, E. Lyons, and J. Lu. 2015. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 16:3.

Whitehead, A., D. A. Triant, D. Champlin, and D. Nacci. 2010. Comparative transcriptomics implicates mechanisms of evolved pollution tolerance in a killifish population. Molecular Ecology 19:5186–5203

Whitehead, A., W. Pilcher, D. Champlin, and D. Nacci. 2012. Common mechanism underlies repeated evolution of extreme pollution tolerance. P Roy Soc B-Biol Sci 279:427–433.

Xu, S. 2008. Quantitative Trait Locus Mapping Can Benefit From Segregation Distortion. Genetics 180:2201–2208. Genetics.