- Jon Moller (CARD, NIA, NIH, Bethesda, Maryland)
- Sam Stroupe (Texas A&M University)
- Sarah Fross (Texas A&M University)
- Shaghayegh Beheshti (Baylor College of Medicine)
- Pankhuri Wanjari (University of Chicago, Department of Pathology)
- Nha Van Huynh (University of Alabama at Birmingham)
Structural variants (SVs) are deviations from a reference genome sequence, typically spanning 50 base pairs (bp) or more. These variations can have significant implications for understanding genetic diversity and the mechanisms underlying various phenotypes. This project aims to develop a robust pipeline for detecting and cataloging identical SVs across different samples and databases, ultimately linking them to specific phenotypes.
Larger structural variants are present among human genomes. For example, human chromosomes can have:
- Missing segments (deletion variants)
- Duplicated segments (duplication variants)
- Inverted segments (inversion variants)
- Added segments (insertion variants)
- Segments transferred from other chromosomes (translocation variants)
Source: NIH Human Genomic Fact Sheet (Last updated: February 1, 2023)
The primary goal of this study is to identify and analyze SVs in novel and known genes, as well as established population SVs, to uncover new biological processes and associations. By cross-referencing SVs with phenotypic data, this pipeline seeks to establish a more comprehensive understanding of genotype-phenotype correlations.
- SV Detection: The pipeline will accurately identify SVs across multiple datasets, ensuring consistency and reliability in detecting known and novel variants.
- Phenotype Association: Each identified SV will be linked to phenotypic data, allowing for the correlation of specific genetic variations with particular traits or diseases.
- VCF File Output: The results will be condensed into a Variant Call Format (VCF) file summarizing the detected SVs and their associated phenotypes. Users can then input a patient ID to retrieve potential phenotypic outcomes based on the SVs identified in that patient.
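The patient ID lookup described in the last bullet could work roughly as in the minimal Python sketch below. The record layout (`sv_id`, `carriers`, `phenotypes`) and both function names are hypothetical, chosen only for illustration; how the pipeline actually encodes phenotypes in the VCF is not specified here.

```python
from collections import defaultdict


def build_index(records):
    """Map each patient ID to the phenotypes annotated on the SVs it carries.

    `records` is a hypothetical list of dicts, one per SV, with the keys
    "carriers" (patient IDs with the SV) and "phenotypes" (linked traits).
    """
    index = defaultdict(set)
    for rec in records:
        for patient_id in rec["carriers"]:
            index[patient_id].update(rec["phenotypes"])
    return index


def phenotypes_for(index, patient_id):
    """Return the sorted phenotypes linked to a patient's SVs (empty if none)."""
    return sorted(index.get(patient_id, ()))
```

Building the index once and querying it per patient avoids rescanning every SV record for each lookup.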
As an example, a previous study identified 11 SV loci associated with an increased risk for obesity, with an Odds Ratio exceeding 25% (DOI: 10.1371/journal.pone.0058048). This project aims to build upon such findings by extending the analysis to a broader set of SVs and phenotypes, facilitating the discovery of novel genetic contributors to complex traits.
We are validating our pipeline on the Project Adotto assembly-based variant calls from the GIAB tandem repeat benchmark (https://zenodo.org/records/6975244), beginning with insertion and deletion calls on chromosome 1. Filtering reduced the call set from 194,098 calls to 55,905 SVs after removing those under 50 bp in length, and then to 29,026 SVs after running Truvari's collapse function, keeping the most common allele in each cluster.
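The 50 bp length filter can be illustrated with a short Python sketch. This is not the pipeline's actual implementation: the function names are hypothetical, and it assumes sequence-resolved, single-ALT records where the SV length can be read from an `SVLEN` INFO field or, failing that, from the difference between the REF and ALT lengths.

```python
def sv_length(ref, alt, info):
    """Return the absolute SV length from SVLEN if present, else from REF/ALT."""
    for field in info.split(";"):
        if field.startswith("SVLEN="):
            # SVLEN is negative for deletions; take the first value if several.
            return abs(int(field[len("SVLEN="):].split(",")[0]))
    # Assumption: ALT is a literal sequence, not a symbolic allele like <DEL>.
    return abs(len(alt) - len(ref))


def filter_by_length(vcf_lines, min_len=50):
    """Yield header lines unchanged and records whose SV length is >= min_len."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        ref, alt, info = cols[3], cols[4], cols[7]
        if sv_length(ref, alt, info) >= min_len:
            yield line
```

The function works on any iterable of VCF text lines, such as an open file handle, so the filtered records can be written straight to a new file before the collapse step.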