GWAS Quality Control (QC)

A complementary tutorial to the (2022) Truong, Woerner, Cherlin, et al. Paper. [PDF]

Basic Overview • Set Up • Workflow Steps • Related Resources

Contributing

Contributions are welcome! Please check out the todos below and feel free to open a pull request.

Todo

✅ Locate Jakob's .genome file: ~/group/personal/jakob/gwas/cphg/affy/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_pruned10_genome.genome
✅ Upload all the steps for .genome file at the end of the tutorial
✅ Upload heterogeneity and missingness code
✅Upload scree plot / pca plot code added as an R script that you can run easily -- decided to just keep in-line code
✅ Upload plink_pca code
✅Repeat to clarify specific Plink versions -- Aude did this!
Repeat QC step 18 to confirm Plink versions for documentation (Van)
Add file with simulated phenotypes
Figure out if possible to generalize plink_pca script
Make a note of ideal imputed graph vs the tutorial imputed R-sq graph
Upload drop_relateds.sh but we need to adapt the dependencies / paths
Check for and fix any remaining hardcoded paths
Check with journal about paywall issues that every institution seems to be experiencing
✅Add Aude to github contributors

Basic Overview

We chose a publicly available dataset from the International Genome Sample Resource (IGSR) (www.internationalgenome.org). IGSR created and maintains the 1000 Genomes Project (1kGP) to provide a public catalog of common human genetic variation and genotype data. The 1kGP dataset has been kept up to date with current reference data sets, thus it is available for both GRCh37 and GRCh38. The latter is utilized here because the 2014 update increased the quantity of loci represented, resolved more than 1000 issues from the previous version of the assembly, and overall provides a better basis for alignment and subsequent analysis. Additionally, IGSR’s continued efforts will lead to the incorporation of various populations to the data which were not previously captured.

Set Up

Set up working directory

👇 How to set up your directory structure for the GWAS QC workflow

# use -p option to create multiple directories at once
mkdir -p GWAS_QC/rawData GWAS_QC/preImputation GWAS_QC/preImputation/VCFfiles GWAS_QC/Imputed GWAS_QC/postImpuatation

.
├── GWAS_QC/ (current directory)
    ├── rawData/
    ├── preImputation/
    |    └── VCFfiles/
    ├── Imputed/
    └── postImputation/

Modules in BASH

👇 How to load modules

module load plink/1.9-20210416
module load bcftools/1.9
module load vcftools/0.1.12c # used by bcftools so may not need to be directly loaded
module load tabix/0.2.6 # this one might not work -- let me know if it's a problem use htslib instead (below)
module load htslib
module load perl
module load liftOver/20180423
module load R

Notes on PLINK v1.9 and v2.0

HEREIN, ALL USAGE OF PLINK 1.9 IS INDICATED BY "plink" AND PLINK 2.0 BY "plink2"
Not all commands are portable between PLINK version 1.9 and version 2.0. Since PLINK v2.0 is under heavy active development, the developers urge users to check certain results against an earlier, more widely-used version of PLINK. Some functions are available in v1.9 which are not in v2.0, and vice versa. Some of the same functions will produce different file formats and outputs as well.
Original version of PLINK: 1.07, https://zzz.bwh.harvard.edu/plink/plink2.shtml
Beta version: 1.90, https://www.cog-genomics.org/plink/1.9/
Alpha version: 2.00, https://www.cog-genomics.org/plink/2.0/

👇 PLINK developers' comments on 1.9 and 2.0 serving as complementary resources

The main difference is that plink 1.9 is essentially finished, while plink 2.0 is an alpha-stage program which will have significant unfinished components for a while to come. As a consequence, current development priorities for plink 2.0 are centered around things which are impossible to do with plink 1.9, such as handling multiallelic/phased variants and dosage data and reliably tracking REF/ALT alleles; while things that plink 1.9 already handles perfectly well, such as working with .ped/.map file pairs, have been deliberately deprioritized for now. So, you should stick to 1.9 as long as it's good enough for the jobs you need to perform. But once you need to do something outside 1.9's scope, you're likely to find that 2.0 already has the additional feature you need (or it'll be added quickly after you ask for it). https://www.biostars.org/p/299855/

Workflow Steps

PART 1 -- Get Data

Step 1 - Retrieve 1000 Genomes Data

👇 Steps and code

Download -- First, we need to download the publicly available dataset from the 1000 Genomes Project (1KGP). The data is Affy6.0 genotype data for 3,450 individuals with population-level data.
Files can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/
Two download options
- Download directly to your local computer by clicking the hyperlink for ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf.gz and move it to your working directory
- OR Dowload using wget command, for example:
Enter your GWAS_QC directory

cd GWAS_QC/rawData/

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf.gz

At this point ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf.gz should be in your GWAS_QC/rawData/ directory.

gunzip ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf.gz

Then recode the .vcf file as .bed, .bim, .bed.

plink --vcf ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped

When we check the directory, the following files should be unzipped:

ls

ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bed ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bim ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.fam ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.log ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf

Download the 1000 Genomes 20130606_g1k.ped
The files contained in this directory contain information about all the samples in the 1000 genomes project and their relations.

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

ls

ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bed ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bim ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.fam ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.log ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf 20130606_g1k.ped

Make and enter a PCA directory

mkdir PCA
cd PCA

Retrieve the following files from 1000 Genomes

wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
wget -r http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

Step 2 - Download other relevant data

👇 Steps and code

You will also want to dowload the pedigree information, which can be downloaded here:
- https://www.internationalgenome.org/faq/can-i-get-phenotype-gender-and-family-relationship-information-for-the-individuals/
- 20130606_g1k.ped

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

You also need to dowload the TOPMed reference panel and code from the McCarthy Tools website
- https://www.well.ox.ac.uk/~wrayner/tools/
- Midway down the page there are instructions for Usage with the TOPMed reference panel
- Download CreateTOPMed.zip from the McCarthy Tools website
- Click on the link to navigate to the Bravo site where ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz can be downloaded from
- Click on dbSNP in the top right panel and click "Download VCF" button to download ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz file
- Note: If you run the curl command that's given on the site, the filename will be different
- After downloading both files, move them to the GWAS_QC/rawData/ directory, and unzip the CreateTOPMed.zip file
- Then execute the following command to convert VCF to an HRC formatted reference legend according to the code provided by McCarthy Tools:
```
 ./CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
```
- If the shabang "./" doesn't work, you may need to run the command with "perl" instead
- By default this will create a file filtered for variants flagged as PASS only, if you wish to use all variants the -a flag overrides this. To override the default output file naming use -o filename.
- If you get an error in the above step, try this variation. Both were run successfully on a local computer and server using perl/5.30.0. Depending on your setup, this may take a few hours to run:
```
 perl CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz	
```
- Since it takes awhile to run, you may want to submit a job to the queue if you're working on a shared cluster

This is how the file system should look at the end of Step 2

.
├── GWAS_QC/
    ├── rawData/ (current directory)
    |    ├── 20130606_g1k.ped
    |    ├── ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bed
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bim
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.fam
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.log
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf
    |    ├── CreateTOPMed.pl
    |    ├── CreateTOPMed.zip
    |    ├── LICENSE.txt
    |    └── PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz
    ├──preImputation/
    |    └── VCFfiles/
    ├──Imputed/
    └── postImputation/

PART 2 -- Pre-Imputation

Step 1 - Enter the preImputation directory

👇 Steps and code

cd preImputation

Step 2 - Check heterogeneity and missingness

👇 Steps and code

First, run plink commands to calculate heterogeneity and missingness for the data

To calculate heterogeneity, you can use the --het command in Plink. The command would look like this: plink --bfile [file_name] --het

This will calculate the heterozygosity for each individual in the dataset and write the results to a file called "[file_name].het".

module load plink/1.9
plink --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped --het --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het

Output:

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.log.
Options in effect:
  --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped
  --het
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het

128235 MB RAM detected; reserving 64117 MB for main workspace.
905788 variants loaded from .bim file.
3450 people (0 males, 0 females, 3450 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3450 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996712.
905788 variants and 3450 people pass filters and QC.
Note: No phenotypes present.
--het: 868601 variants scanned, report written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.het .

To calculate missingness, you can use the --missing command in Plink. The command would look like this: plink --bfile [file_name] --missing

This will calculate the proportion of missing genotypes for each individual in the dataset and write the results to a file called "[file_name].lmiss"

You can also use the --missing2 command which will report the frequency of missing genotypes at each variant, the command is: plink --bfile [file_name] --missing2

This will give a file with extension .lmiss2 which will have frequency of missing genotypes for each variant

plink --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped --missing --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss

Output:

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.log.
Options in effect:
  --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped
  --het
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het

128235 MB RAM detected; reserving 64117 MB for main workspace.
905788 variants loaded from .bim file.
> 3450 people (0 males, 0 females, 3450 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3450 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996712.
905788 variants and 3450 people pass filters and QC.
Note: No phenotypes present.
--het: 868601 variants scanned, report written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.het .
	PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3

Here's the current directory structure within GWAS_QC/preImputation/

ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.het ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.log ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.log ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.imiss ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.lmiss

.
├── GWAS_QC/
    ├── rawData/
    |    ├── 20130606_g1k.ped
    |    ├── ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bed
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.bim
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.fam
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.log
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf
    |    ├── CreateTOPMed.pl
    |    ├── CreateTOPMed.zip
    |    ├── LICENSE.txt
    |    └── PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz
    ├──preImputation/ (current directory)
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.het
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.log
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.nosex
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.imiss
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.lmiss
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.log
    |    ├── ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.nosex
    |    └── sex_file.txt
    ├──Imputed/
    └── postImputation/

Plot heterogeneity vs. missingness
Needed:
- Code_Heterogeneity_Missingness.R
- ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_miss.imiss
- ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_het.het
Input file: ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped
Output file: ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_HM_plot.pdf

Rscript Code_Heterogeneity_Missingness.R ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_HM_plot.pdf

Figure 2 uses the same code as above, on example internal data

Step 3 - Update first column of file which has zeros

👇 Steps and code

cat ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.fam | awk '{print $1,$2,$2,$2}' > ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_toUpdate.txt

plink --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped --update-ids ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_toUpdate.txt --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.log.
Options in effect:
  --bfile ../rawData/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped
  --make-bed
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated
  --update-ids ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_toUpdate.txt

193057 MB RAM detected; reserving 96528 MB for main workspace.
905788 variants loaded from .bim file.
3450 people (0 males, 0 females, 3450 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.nosex .
--update-ids: 3450 people updated.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3450 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996712.
905788 variants and 3450 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.bed +
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.bim +
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.fam ... done.

Step 4 -- Add sex phenotype

👇 Steps and code

First make sex file from 20130606_g1k.ped file

join -1 1 -2 1 <(cat ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.fam |sort -k1,1) <(cat ../rawData/20130606_g1k.ped |awk -F '\t' '{print $2,$5}' |sort -k1,1) |awk '{print $1,$2,$7}' > sex_file.txt

Then update .fam file to include sex phenotype

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated --update-sex sex_file.txt --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated
  --make-bed
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex
  --update-sex sex_file.txt

193057 MB RAM detected; reserving 96528 MB for main workspace.
905788 variants loaded from .bim file.
3450 people (0 males, 0 females, 3450 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.nosex .
--update-sex: 3450 people updated.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3450 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 219657 het. haploid genotypes present (see
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.hh );
many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996711.
905788 variants and 3450 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.bed +
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.bim +
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.fam ...
done.

Next, check sex and remove sex inconsistencies

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex --check-sex

plink.sexcheck_listPLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex
  --check-sex
128235 MB RAM detected; reserving 64117 MB for main workspace.
905788 variants loaded from .bim file.
3450 people (1715 males, 1735 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3450 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 219657 het. haploid genotypes present (see plink.hh ); many commands
treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996711.
905788 variants and 3450 people pass filters and QC.
Note: No phenotypes present.
--check-sex: 36449 Xchr and 0 Ychr variant(s) scanned, 577 problems detected.
Report written to plink.sexcheck .

cat plink.sexcheck |  grep PROBLEM | sed 's/^ *//' > plink.sexcheck_list

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex --remove plink.sexcheck_list --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex
  --make-bed
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked
  --remove plink.sexcheck_list

193057 MB RAM detected; reserving 96528 MB for main workspace.
905788 variants loaded from .bim file.
3450 people (1715 males, 1735 females) loaded from .fam.
--remove: 2873 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 209918 het. haploid genotypes present (see
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.996672.
905788 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.bed
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.bim
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.fam
... done.

Step 5 -- Remove SNP variants that do not have SNP IDs

👇 Steps and code

echo . > noSNP
plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked --exclude noSNP --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked
  --exclude noSNP
  --make-bed
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots

193057 MB RAM detected; reserving 96528 MB for main workspace.
905788 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
--exclude: 887969 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 209889 het. haploid genotypes present (see
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
> Total genotyping rate is 0.996683.
887969 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.bed
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.bim
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.fam
... done.

Step 6 -- Run QC on data

👇 Steps and code

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots --geno 0.05 --mind 0.1 --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots
  --geno 0.05
  --make-bed
  --mind 0.1
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC

193057 MB RAM detected; reserving 96528 MB for main workspace.
887969 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 209889 het. haploid genotypes present (see
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.996683.
7979 variants removed due to missing genotype data (--geno).
879990 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.bed
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.bim
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.fam
... done.

head ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.bim

1       rs10458597      0       564621  T       C
1       rs12565286      0       721290  C       G
1       rs12082473      0       740857  A       G
1       rs3094315       0       752566  A       G
1       rs2286139       0       761732  T       C
1       rs11240776      0       765269  G       A
1       rs2980319       0       777122  T       A
1       rs2980300       0       785989  C       T
1       rs2905036       0       792480  T       C
1       rs11240777      0       798959  A       G

Step 7 -- LiftOver the data

👇 Steps and code

The TOPMed Imputation Server can accept either hg37 or hg38 so this liftOver step is redundant.
However, we included this step because often times we need to use hg38 for our data.
Please use your best judgement to determine if liftOver is right for you.
Let's proceed:
- First, make BED coordinate file

cat ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.bim | awk '{print "chr"$1, $4, ($4+1), $4, $2}' > liftover_input.bed

Check output

head liftover_input.bed

chr1 564621 564622 564621 rs10458597
chr1 721290 721291 721290 rs12565286
chr1 740857 740858 740857 rs12082473
chr1 752566 752567 752566 rs3094315
chr1 761732 761733 761732 rs2286139
chr1 765269 765270 765269 rs11240776
chr1 777122 777123 777122 rs2980319
chr1 785989 785990 785989 rs2980300
chr1 792480 792481 792480 rs2905036
chr1 798959 798960 798959 rs11240777

sed -i 's/chr23/chrX/g' liftover_input.bed #I don't think this is a problem with this dataset

Then, download the correct liftover file
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/
- dowload hg19ToHg38.over.chain.gz to your local computer
- If doing analysis on server, transfer file to working directory using scp
- unzip the data
Transfer file to working directory on server

scp /home/directory/path/hg19ToHg38.over.chain.gz /server/directory/path/

Unzip file

gunzip 	hg19ToHg38.over.chain.gz

Finally, perform the actual liftOver

liftOver liftover_input.bed hg19ToHg38.over.chain liftover_newmap.txt liftover_exclude.txt

You will end up with two output files liftover_newmap.txt and liftover_exclude.txt
liftover_newmap.txt will contain the coordinates of each SNP in assembly GRCh38 after the liftover in column 2

head liftover_newmap.txt

chr1    629241  629242  564621  rs10458597
chr1    785910  785911  721290  rs12565286
chr1    805477  805478  740857  rs12082473
chr1    817186  817187  752566  rs3094315
chr1    826352  826353  761732  rs2286139
chr1    829889  829890  765269  rs11240776
chr1    841742  841743  777122  rs2980319
chr1    850609  850610  785989  rs2980300
chr1    857100  857101  792480  rs2905036
chr1    863579  863580  798959  rs11240777

To double check that this worked correctly take the first SNP rs10458597 and look it up in dbSNP https://www.ncbi.nlm.nih.gov/snp/

Some formatting is required for both files before moving on to Step 8

sed -i 's/chr//g' liftover_newmap.txt
awk '{print $5,$2}' liftover_newmap.txt > update_map.txt
cat liftover_exclude.txt | grep -v "#" | awk '{print $5}' > exclude_liftover.txt

You should see that SNP rs10458597 is in position 629241 in assembly GRCh38 but position 564621 in assembly GRCh37
liftover_exclude.txt will contain the SNPs that were not able to be aligned during the liftOver process

head exclude_liftover.txt

Step 8 -- Exclude data

👇 Steps and code

Exclude any SNPs that do not liftOver and non-somatic chromosomes (X, Y)

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC --exclude exclude_liftover.txt --update-map update_map.txt --not-chr X, Y --make-bed --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC
  --exclude exclude_liftover.txt
  --make-bed
  --not-chr X, Y
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38
  --update-map update_map.txt

193057 MB RAM detected; reserving 96528 MB for main workspace.
843232 out of 879990 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
--update-map: 834872 values updated, 36408 variant IDs not present.
Warning: Base-pair positions are now unsorted!
--exclude: 834872 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997613.
834872 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bed
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bim
+
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.fam
... done.

Final pre-imputation variant count is 834,872 SNPs

wc -l <filename>

905788 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated.bim

905788 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex.bim

905788 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked.bim

887969 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots.bim

879990 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC.bim

834872 ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bim

Step 9 -- Principal Component Analysis (PCA)

👇 Steps and code

First, run PCA on data using the plink_pca.sh script

module load plink_pca
plink_pca.sh -b ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38 -g PCA_no1KG

...(some other printed output above this too)
128235 MB RAM detected; reserving 64117 MB for main workspace.
471514 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
Using 1 thread.
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997655.
471514 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
Pruned 19116 variants from chromosome 1, leaving 18784.
Pruned 19194 variants from chromosome 2, leaving 19285.
Pruned 16293 variants from chromosome 3, leaving 16882.
Pruned 15068 variants from chromosome 4, leaving 15207.
Pruned 15643 variants from chromosome 5, leaving 15452.
Pruned 16122 variants from chromosome 6, leaving 15251.
Pruned 13104 variants from chromosome 7, leaving 13343.
Pruned 12860 variants from chromosome 8, leaving 13062.
Pruned 11610 variants from chromosome 9, leaving 11571.
Pruned 13056 variants from chromosome 10, leaving 13071.
Pruned 12506 variants from chromosome 11, leaving 11918.
Pruned 11280 variants from chromosome 12, leaving 11950.
Pruned 9480 variants from chromosome 13, leaving 9483.
Pruned 7267 variants from chromosome 14, leaving 8155.
Pruned 6011 variants from chromosome 15, leaving 7515.
Pruned 6407 variants from chromosome 16, leaving 8114.
Pruned 4564 variants from chromosome 17, leaving 6281.
Pruned 6978 variants from chromosome 18, leaving 7804.
Pruned 2692 variants from chromosome 19, leaving 3985.
Pruned 5791 variants from chromosome 20, leaving 6533.
Pruned 3195 variants from chromosome 21, leaving 3735.
Pruned 2418 variants from chromosome 22, leaving 3478.
Pruning complete.  230655 of 471514 variants removed.
Marker lists written to
plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.ld_list.prune.in
and
plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.ld_list.prune.out
.
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.log.
Options in effect:
  --bfile preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38
  --extract plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.ld_list.prune.in
  --make-bed
  --out plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38
128235 MB RAM detected; reserving 64117 MB for main workspace.
471514 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
--extract: 240859 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.99744.
240859 variants and 2873 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bed
+
plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bim
+
plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.fam
... done.
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.nosex’: > No such file or directory
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.*.extracted.*’: No such file or directory
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bed’: No such file or directory
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bim’: No such file or directory
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.fam’: No such file or directory
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.log’: No such file or directory
par_f=/tmp/tmp.2OQ8Ahb50A
running smartpca...
*> *> *Registered S3 methods overwritten by 'tibble':
  method     from
  format.tbl pillar
  print.tbl  pillar
Created Scree plot file plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_ScreePlot.png
Created plot data file plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_variance.txt
rm: cannot remove ‘preld.plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.*’: No such file or directory

The command should have made an output of 5 files:

ls plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.geno*

plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.eval

plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.evec

plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.excluded

plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_ScreePlot.png

plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_variance.txt

Then, create Scree and PCA plots in R

library(cowplot); library(tidyverse)

pca <- read_table("plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.evec", col_names = FALSE)
pve <- read_table("plink_pca.ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_variance.txt")
names(pve) <- c("PC","Variance")
ped <- read.csv("20130606_g1k.ped", sep = "\t")

scree_plot <- ggplot(pve, aes(PC, Variance)) + 
  geom_line() + 
  geom_point(color = "black", size = 3) +
  labs(x = "Principal Component", y = "Proportion variance explained") + 
  theme_light() +
  theme(axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 14),
        axis.text = element_text(size = 14)) +
  scale_x_continuous(breaks=c(1,5,10,15,20))

pca_one_two <- pca[-1,1:3]
names(pca_one_two)  <- c("Individual.ID", "PC1","PC2")

pc_plot <- pca_one_two %>% 
  left_join(ped %>% 
              select(Individual.ID, Population), 
            by = "Individual.ID") %>% 
  # Collapse to Superpopulations
  mutate(Population = fct_collapse(Population, 
                                   AFR = c("YRI","LWK","GWD","MSL","ESN"),
                                   AMR = c("ASW","ACB","MXL","PUR","CLM","PEL"),
                                   EAS = c("CHB","JPT","CHS","CDX","KHV"), #,"CHD"), # Denver Chinese not in our data set
                                   EUR = c("CEU","TSI","GBR","FIN","IBS"),
                                   SAS = c("GIH","PJL","BEB","STU","ITU")),
         Population = factor(Population, levels = c("AFR","AMR","EAS","EUR","SAS"))) %>% 
  ggplot(aes(PC1, PC2, color = Population)) +
  geom_point(alpha = 0.8) + 
  theme_light() +
  theme(axis.text.y = element_text(angle = 90, size = 13, hjust = 0.5),
        axis.text.x = element_text(size = 13),
        axis.title = element_text(size = 15),
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 12)) +
  scale_color_manual(values = palette.colors(palette = "set1"))

# Figure dimensions: 7x7
plot_grid(scree_plot,
          pc_plot, ncol = 1,
          labels = "AUTO", label_size = 17, 
          label_x = -0.005, label_y = 1.02)

This generates a plot similar to Figure 6 from the paper

Step 10 -- Calculate frequency files and compare to TOPMed panel

👇 Steps and code

First, calculate frequencies
- Note: This needs to be done using Plink 1.9. Check Plink version by typing "plink" and "plink2" on the command line.

plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38 --freq --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq

PLINK v1.90p 64-bit (16 Apr 2021)              www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq.log.
Options in effect:
  --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38
  --freq
  --out ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq

128235 MB RAM detected; reserving 64117 MB for main workspace.
843232 variants loaded from .bim file.
2873 people (1713 males, 1160 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2873 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997609.
--freq: Allele frequencies (founders only) written to
ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq.frq
.

Examine the output

head ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq.frq

  CHR          SNP   A1   A2          MAF  NCHROBS
    1   rs10458597    T    C      0.00733     5730
    1   rs12565286    C    G      0.03481     5716
    1   rs12082473    A    G      0.07056     5726
    1    rs3094315    G    A       0.3353     5732
    1    rs2286139    C    T       0.4201     5732
    1   rs11240776    G    A     0.008188     5740
    1    rs2980319    A    T       0.2963     5740
    1    rs2980300    T    C       0.4287     5678
    1    rs2905036    C    T      0.03063     5746

Then, compare variants to TOPMED panel
- The perl script is created by the Wayner Tools group: https://www.well.ox.ac.uk/~wrayner/tools/
- Make sure you've downloaded the following file: HRC-1000G-check-bim-v4.3.0.zip
- Unzip and add the HRC-1000G-check-bim.pl script to your rawData/ directory
- Remember that we already have PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab in rawData/ as well

perl ../rawData/HRC-1000G-check-bim.pl -r ../rawData/PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab -h -b ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38.bim -f ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38_freq.frq

Example output:

 Matching to HRC
 
 Position Matches
  ID matches HRC 0
  ID Doesn't match HRC 832317
  Total Position Matches 832317
 ID Match
  Position different from HRC 0
 No Match to HRC 10845
 Skipped (MT) 0
 Total in bim file 843162
 Total processed 843162 

  ID matches HRC 0
  ID Doesn't match HRC 832317
  Total Position Matches 832317
 ID Match
  Position different from HRC 0
 No Match to HRC 10845
 Skipped (MT) 0
 Total in bim file 843162
 Total processed 843162

 Indels 0

 SNPs not changed 1686
 SNPs to change ref alt 795615
 Strand ok 797298
 Total Strand ok 797301
 
 Strand to change 10
 Total checked 832317
 Total checked Strand 797308
 Total removed for allele Frequency diff > 0.2 163426
 Palindromic SNPs with Freq > 0.4 19737

 Non Matching alleles 15272
 ID and allele mismatching 15272; where HRC is . 0
 Duplicates removed 0

Run the Run-plink.sh script, which is generate by perl script
- Pulls out chromosome info and makes VCF files

sed -i 's/--recode vcf/--recode vcf --output-chr chrM/g' Run-plink.sh
chmod +x ./Run-plink.sh
./Run-plink.sh

Example output:

Start time: Tue Oct 11 17:18:24 2022
128235 MiB RAM detected; reserving 64117 MiB for main workspace.
Using up to 56 threads (change this with --threads).
2873 samples (1160 females, 1713 males; 2873 founders) loaded from
/project/ritchie02/projects/cphg-gwas-qc/GWAS_QC/preImputation/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated.fam.
236 out of 21694 variants loaded from
/project/ritchie02/projects/cphg-gwas-qc/GWAS_QC/preImputation/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated.bim.
Note: No phenotype data present.
Writing
/project/ritchie02/projects/cphg-gwas-qc/GWAS_QC/preImputation/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated-chr22.fam
... done.
Writing
/project/ritchie02/projects/cphg-gwas-qc/GWAS_QC/preImputation/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated-chr22.bim
... done.
Writing
/project/ritchie02/projects/cphg-gwas-qc/GWAS_QC/preImputation/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated-chr22.bed
... done.
End time: Tue Oct 11 17:18:24 2022
PLINK v2.00a3LM 64-bit Intel (5 May 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Error: Duplicate --output-chr flag.

Step 11 - Sort and zip files to create VCF files for imputation

👇 Steps and code

Flip files

export BCFTOOLS_PLUGINS=/appl/bcftools-1.9/libexec/bcftools/

for i in {1..22}; do bcftools +fixref ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated-chr$i'.vcf' -Ov -o ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated_flipped_chr$i'.vcf' -- -d -f ~/group/projects/PMBB/QC_Imputation/scripts/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -m flip; done

Example Output:

...repeated multiple times...
# SC, guessed strand convention
SC      TOP-compatible  0
SC      BOT-compatible  0
# ST, substitution types
ST      A>C     2       0.8%
ST      A>G     28      11.9%
ST      A>T     1       0.4%
ST      C>A     3       1.3%
ST      C>G     3       1.3%
ST      C>T     81      34.3%
ST      G>A     73      30.9%
ST      G>C     3       1.3%
ST      G>T     5       2.1%
ST      T>A     0       0.0%
ST      T>C     35      14.8%
ST      T>G     2       0.8%
# NS, Number of sites:
NS      total           236
NS      ref match       236     100.0%
NS      ref mismatch    0       0.0%
NS      flipped         0       0.0%
NS      swapped         0       0.0%
NS      flip+swap       0       0.0%
NS      unresolved      7       3.0%
NS      fixed pos       0       0.0%
NS      skipped         0
NS      non-ACGT        0
NS      non-SNP         0
NS      non-biallelic   0

Sort VCF and zip

for i in {1..22}; do vcf-sort ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped_Updated_withsex_checked_noDots_QC_b38-updated_flipped_chr$i.vcf | bgzip -c > VCFfiles/ALL.wgs.nhgri_coriell_affy_6.20140825_ImputationInput_TOPMED_chr$i.vcf.gz; done

Step 12 -- If not already on local computer, copy VCF files to local computer in order to upload to TOPMed impuatation server

👇 Steps and code

The scp (secure copy) command can transfer files across hosts. Since we want to transfer an entire directory, you need to include the -r flag before executing

scp -r login@serveraddress:/~/GWAS_QC/VCFfiles/ ~/Desktop/

Step 13 -- Last Pre-Imputation step to calculate relateds which will be needed later for GWAS

👇 Steps and code

* First, apply a number of QC / prunning methods to data

# Get sex inconsistent samples (from raw)
plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped --check-sex --out prefix_checksex

awk '{ if ($4 == 0) { print $2"\t"$2} }' prefix_checksex.sexcheck > nosex.txt

# Remove sex inconsistent samples (from raw)
plink --bfile ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped --remove nosex.txt --make-bed --out prefix_nosex

# Call rate (95% snp call rate / 90% sample call rate)
plink --bfile prefix_nosex --geno 0.05 --mind 0.1 --make-bed --out prefix_call

# MAF
plink --bfile prefix_call --maf 0.05 --make-bed --out prefix_maf

# Pruning down to 99,999
plink --bfile prefix_maf --indep-pairwise 50 5 0.12831 --out prune_100k

# Keep those samples (Around 97k are autosomal)
plink --bfile prefix_maf --extract prune_100k.prune.in --make-bed --out prefix_pruned

# Make IBD file (-i in drop_relateds.sh)
plink --bfile prefix_pruned --genome --out prefix_pruned_100k_genome


cat  prefix_pruned_100k_genome.genome | awk '{print $2,$2,$4,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14}' > prefix_pruned_100k_genome_updated.genome

Manually change the header so it is FID1 IID1 FID2 IID2
- Steps
- i to initiate insert mode
- Delete headers 1-4
- Replace with FID1, IID1, FID2, IID2 (Space separated)
- esc to exit from the insert mode
- ZZ to close and save the file

vi prefix_pruned_100k_genome_updated.genome_updated.genome

PART 3 -- Genotype Imputation

Step 14 -- Imputation using TOPMed Imputation Server

Several tools exist specifically for genotype imputation such as the Michigan and Trans-Omics for Precision Medicine (TOPMed) Imputation Servers where one uploads the phased or unphased GWAS genotypes in order to receive the imputed genomes in return. Each imputation server varies in terms of speed and accuracy. One of the most important considerations in imputation is the composition of the reference panel. For our study, we selected the TOPMed Imputation Reference panel (version r2) because it is one of the most diverse reference panels available and contains information from 97,256 deeply sequenced human genomes containing 308,107085 genetic variants distributed across the 22 autosomes and the X chromosome.

👇 Steps for form entry

👇 TOPMed Server Output

Click on the "Jobs" tab at the top to view your current jobs
Select the relevant job run to view your results and download the completed data
The data will expire within a certain timeframe
There will be sections for: Input Validation, Quality Control, Quality Control (Report), Pre-phasing and Imputation, and Data Compression and Encryption

Step 15 -- Download Imputed Data to Working Directory

👇 Steps and code

Download the completed imputation files using the wget commands provided by TOPMed to the location where you'll be working with it
Change to the appropriate location

cd GWAS_QC/Imputed/

Copy imputed files to your working directory from the TOPMed Imputation Server. The output will be a curl command that you can copy & paste to run in your active directory. Below is an example:

curl -sL <URL> | bash

Unzip the files and enter the password that was emailed to you from TOPMed inside the quotes module load p7zip

for file in *.zip; do 7z e $file -p"<password>"; done

PART 4 -- Post-Imputation QC

Step 16 -- Merge Chromosomes

👇 Steps and code

After unzipping we need to merge the files to do downstream analsysis
First make mergelist.txt file

seq 2 22 | sed 's/^/chr/' > mergelist.txt

NEED PLINK1.9 for merge

module load plink/1.9

Merge chromosome files into 1 bim/bed/bam file THIS TAKES A LONG TIME

plink --bfile chr1 --merge-list mergelist.txt --make-bed --out merged

Step 17 -- Run pre-QC Commands

👇 Steps and code

NEED PLINK 2 for these QC Steps #Run QC Commands

First update the .fam files to have the correct ID orientation

head merged.fam

0 HG00096_HG00096 0 0 0 -9
0 HG00098_HG00098 0 0 0 -9
0 HG00101_HG00101 0 0 0 -9
0 HG00102_HG00102 0 0 0 -9
0 HG00105_HG00105 0 0 0 -9
0 HG00106_HG00106 0 0 0 -9
0 HG00107_HG00107 0 0 0 -9
0 HG00108_HG00108 0 0 0 -9
0 HG00109_HG00109 0 0 0 -9
0 HG00111_HG00111 0 0 0 -9

This code will reorient that .fam files to be usable for plink tools downstream analysis
Col 1 = FID
Col 2 = IID
Col 3 = XXX
Col 4 = XXX
Col 5 = XXX
Col 6 = XXX

cat merged.fam | awk '{print $1,$2}' | sed s'/_/ /g' | awk '{print $1,$2"_"$2,$2,$3}' > merged_toUpdate.txt
plink2 --bfile merged --update-ids merged_toUpdate.txt --make-bed --out merged_Updated
head merged_Updated.fam

HG00096	HG00096	0	0	0	-9
HG00098	HG00098	0	0	0	-9
HG00101	HG00101	0	0	0	-9
HG00102	HG00102	0	0	0	-9
HG00105	HG00105	0	0	0	-9
HG00106	HG00106	0	0	0	-9
HG00107	HG00107	0	0	0	-9
HG00108	HG00108	0	0	0	-9
HG00109	HG00109	0	0	0	-9
HG00111	HG00111	0	0	0	-9

ls

Should have the following files:

merged_Updated.bed
merged_Updated.bim
merged_Updated.fam
merged_Updated.log

If you don't, open the merged_Updated.log file and try to discern the error. It's possible you ran the code in the wrong directory.

Step 18 -- Run QC Commands

👇 Steps and code

Want to QC on
- geno 0.01 (geno filters out all variants with missing call rates exceeding the provided value to be removed)

plink2 --bfile merged_Updated --geno 0.01 --make-bed --out merged_Updated_1_geno
ls merged_Updated_1_geno.*

merged_Updated_1_geno.bed
merged_Updated_1_geno.bim
merged_Updated_1_geno.fam
merged_Updated_1_geno.log

Want to QC on
- mind 0.01 (filters out al lthe samples with missing call rates exceeding the provided value to be removed)
- maf 0.05 (filters out all variants with minor allele frequency below the provided threshold)
- hardy (writes a list of genotype counts and Hardy-Weinberg equilibrium exact test statistics to plink.hwe)

plink2 --bfile merged_Updated_1_geno --mind 0.01 --maf 0.05 --hwe 0.05 --make-bed --out merged_Updated_2_QC
ls merged_Updated2_QC.*

merged_Updated_2_QC.bed
merged_Updated_2_QC.bim
merged_Updated_2_QC.fam
merged_Updated_2_QC.log

Step 19 --Remove Relateds

👇 Steps and code

Actually remove the relateds

drop_relateds.sh -b merged_Updated_2_QC -i prefix_pruned_100k_genome_updated.genome_updated.genome -p merged_Updated_2_QC_remove_related

GWAS-Related Resources

We recommend the following resources and tutorials developed for performing GWAS. Due to time constraints, we were not able to fully vet every resource listed below.

Regenie

Regenie Documentation: https://rgcgithub.github.io/regenie/options/
Glow tutorial which implements a distribute version of the Regenie method: https://glow.readthedocs.io/en/latest/tutorial.html
DNAnexus GWAS on the Research Analysis Platform using Regenie: https://www.youtube.com/watch?v=762PVlyZJ-U

Acknowledgements

We thank all co-authors involved in the preparation of this protocol and manuscript: Van Q. Truong, Jakob A. Woerner, Tess A. Cherlin, Yuki Bradford, Daniel H. Hui, Anastasia M. Lucas, Manu K. Shivakumar, Rachit Kumar, Milton Pividori, Chelsea C. Okeh, S. Chris Jones, Abigail C. Bossa, Stephen D. Turner, Marylyn D. Ritchie and Shefali S. Verma
The authors are grateful to the following scientists for their contributions as part of the eMERGE network: Loren L. Armstrong, Christopher S. Carlson, Dana C. Crawford, Andrew T. Crenshaw, Mariza de Andrade, Kimberly F. Doheny, Jonathan L. Haines, Geoffrey Hayes, Gail Jarvik, Lan Jiang, Iftikhar J. Kullo, Rongling Li, Hua Ling, Teri A. Manolio, Martha Matsumoto, Catherine A. McCarty, Andrew N. McDavid, Daniel B. Mirel, Justin E. Paschall, Elizabeth W. Pugh, Luke V. Rasmussen, Russell A. Wilke, and Rebecca L. Zuvich
An additional thank you goes out to Aude Ikuzwe Sindikubwabo, PhD student in Genetics & Epigenetics, for her help in validating this Github workflow during her rotation

RitchieLab/GWAS-QC

GWAS Quality Control (QC)

A complementary tutorial to the (2022) Truong, Woerner, Cherlin, et al. Paper. [PDF]

Contributing

Todo

Basic Overview

Set Up

Set up working directory

Modules in BASH

Notes on PLINK v1.9 and v2.0

Workflow Steps

PART 1 -- Get Data

Step 1 - Retrieve 1000 Genomes Data

Step 2 - Download other relevant data

PART 2 -- Pre-Imputation

Step 1 - Enter the preImputation directory

Step 2 - Check heterogeneity and missingness

Step 3 - Update first column of file which has zeros

Step 4 -- Add sex phenotype

Step 5 -- Remove SNP variants that do not have SNP IDs

Step 6 -- Run QC on data

Step 7 -- LiftOver the data

Step 8 -- Exclude data

Step 9 -- Principal Component Analysis (PCA)

This generates a plot similar to Figure 6 from the paper

Step 10 -- Calculate frequency files and compare to TOPMed panel

Step 11 - Sort and zip files to create VCF files for imputation

Step 12 -- If not already on local computer, copy VCF files to local computer in order to upload to TOPMed impuatation server

Step 13 -- Last Pre-Imputation step to calculate relateds which will be needed later for GWAS

PART 3 -- Genotype Imputation

Step 14 -- Imputation using TOPMed Imputation Server

Step 15 -- Download Imputed Data to Working Directory

PART 4 -- Post-Imputation QC

Step 16 -- Merge Chromosomes

Step 17 -- Run pre-QC Commands

Step 18 -- Run QC Commands

Step 19 --Remove Relateds

GWAS-Related Resources

General

PLINK

SAIGE

HAIL

Regenie

Acknowledgements