COMP383 Project Repository

Overview:

A Genome Wide Association Study (GWAS) is a tool used to analyze whether certain variants are associated with a specific trait of interest. Prior to performing the association tests, quality control procedures must be implemented in order to ensure accurate, reproducible, and unbiased results. GWAS often uses linear regression to determine associations, and this requires the use of unrelated individuals. Therefore, related individuals in the GWAS data can pose a problem by confounding the results of the study. Currently, the PLINK software provides bioinformaticians a way to conduct quality control checks on the initial GWAS data before it can be used in the study. Through the use of Python, we developed an automated pipeline that incorporated the major quality control relationship check PLINK commands that need to be performed on the GWAS data, as well as corresponding R plots. This program will allow related individuals to be identified using the PLINK software and visualized from the IBD plots. Individual threshold values for filtering and the choice of viewing plots will be left to user discretion.

Software Requirements:

Python

PLINK

Input Files

.bim/.bed/.fam files

Output Files

After Step 1: .sexcheck

After Step 2: .imiss/.lmiss

After Step 3: .hwe

After Step 4:.pune.in /.prune.out

After Step 5: .genome

Functions

The PPRA tool requires the user to have PLINK, Python, and R installed. The PPRA assumes a certain file structure. pyPipeUtil.py is a python program designed to assist the user in making in file directory system for storing datasets as well as deleting working and output files from the file structure. Since the PPRA uses relative paths to find the data input files, it must be run from the source directory. The PPRA accepts command line arguments; these arguments include the name of the datasets, an optional geno override value (default geno=0.1), and an optional filter parameter (default being not to filter). If no parameters are provided, a help message is displayed.

As the PPRA executes, the PLINK commands associated with each task print to the output and can be used diagnosing any issues uncovered during processing. If tasks 01 to 05 fail, the PPRA fails, and returns the PLINK exit code which caused the failure. Note the PPRA can be piped into a log file to examine any diagnostic messages generated by PLINK.

The PPRA is broken up into 6 functions which are:

Task01_CheckSex
Task02_Geno
Task03_HWE
Task04_Prune
Task05_Relationship
Task06_MakePlots

Task01_CheckSex runs the following PLINK command: ..\programs\plink.exe --bfile ..\data\dataset1\1Raw\dataset1 --check-sex --missing --out ..\data\dataset1\3Wip\dataset1-report

Task02_Geno runs the following PLINK commands: ..\programs\plink.exe --bfile ..\data\dataset1\1Raw\dataset1 --geno .01 --make-bed --out ..\data\dataset1\3Wip\dataset1-report.geno.01 ..\programs\plink.exe --bfile ..\data\dataset1\3Wip\dataset1-report.geno.01 --missing --out ..\data\dataset1\3Wip\dataset1-report.geno.01

Task03_HWE runs the following PLINK command: ..\programs\plink.exe --bfile ..\data\dataset1\3Wip\dataset1-report.geno.01 --hardy --out ..\data\dataset1\3Wip\dataset1-report.geno.01

Task04_Prune runs the following command: ..\programs\plink.exe --bfile ..\data\dataset1\3Wip\dataset1-report.geno.01 --indep-pairwise 50 5 0.3 --out ..\data\dataset1\3Wip\dataset1-report.geno.01

Task05_Relationship runs the following command: ..\programs\plink.exe --bfile ..\data\dataset1\3Wip\dataset1-report.geno.01 --extract ..\data\dataset1\3Wip\dataset1-report.geno.01.prune.in --genome --min 0.05 --out ..\data\dataset1\3Wip\dataset1-report.geno.01.rc

Task06_MakePlots invokes the PIPELINE_Scripts.r. This file, which can be edited by the user, can be used to generate any desired plots for the analysis of the dataset. The provided R scripts produce the following plots:

A histogram of lmiss$F_MISS

The .lmiss file that was created after the variant missing call rate command was conducted was used to create a histogram of the F_MISS values in R. The number of SNPs present in the file before filtering was also calculated.

A histogram of imiss$F_MISS

Similar to the .lmiss file, the .imiss file that was created after the sample missing call rate command was then input into R to create a histogram of the F_MISS values. The number of individuals present after filtering was also counted.

A histogram of postImiss$F_MISS

After the missing call rates under 99% were removed, the new .lmiss file was read into R. A histogram of the new F_MISS values was then made, and the number of SNPs present after the filtering was also calculated.

A histogram of hwe$P

The .hwe file was the output of the hardy weinberg check command in PLINK and contained the exact hardy weinberg test statistics. The p-values from this file were used to plot a histogram. The number of percentage of SNPs that contained a p-value under 1e-06 was generated.

An unfiltered IBD plot of the dataset
An accompanying histogram of pi-hat of the unfiltered IBD plot

After the relationship check PLINK command, the newly generated .genome file was input in R in order to create an ibd plot based on the Z0 and Z1 scores. The Z0 score represents the probability that the IBD is 0, whereas the Z1 score represents the probability that the IBD is 1. Related individuals can be visualized based on their location on the IBD plot. A histogram was also created using the PI_HAT (proportion of IBD) values.