This repository contains scripts that analyze low-coverage FASTQ files from Phase 3 of the 1000 Genomes Project and produce gVCF files. The scripts are designed to run on Amazon EC2 using StarCluster, on a custom AMI described below. For each sample, raw FASTQ files are aligned with bwa, duplicates are marked with samblaster, alignments are sorted and indexed with sambamba, and variants are called with GATK. The /data directory is an NFS-shared EBS volume containing the indexed reference genome (hs37d5.fa) and space for log files.
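
The per-sample steps described above can be sketched as a function that builds the shell commands in order. This is only an illustration, not the repository's actual scripts: the function name `build_commands`, the reference path `/data/hs37d5.fa`, the jar file name `GenomeAnalysisTK.jar`, the read-group fields, the output file names, and the thread count are all assumptions chosen for the example.

```python
import shlex

# Assumed locations; the README only states that the reference lives on /data
# and that the GATK jar is somewhere under /usr/local/bin/.
REFERENCE = "/data/hs37d5.fa"
GATK_JAR = "/usr/local/bin/GenomeAnalysisTK.jar"


def build_commands(sample, fastq1, fastq2, threads=4,
                   reference=REFERENCE, gatk_jar=GATK_JAR):
    """Return the shell commands for one sample, in execution order.

    Nothing is executed here; the caller decides how to run the strings.
    """
    q = shlex.quote
    unsorted_bam = f"{sample}.unsorted.bam"
    sorted_bam = f"{sample}.bam"
    gvcf = f"{sample}.g.vcf"
    # Illustrative read group; bwa expands the literal "\t" sequences itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    # Align, mark duplicates on the SAM stream, and convert to BAM.
    align = (
        f"bwa mem -t {threads} -R {q(read_group)} "
        f"{q(reference)} {q(fastq1)} {q(fastq2)}"
        f" | samblaster"
        f" | sambamba view -S -f bam -o {q(unsorted_bam)} /dev/stdin"
    )
    # Coordinate-sort, then index the sorted BAM.
    sort = f"sambamba sort -t {threads} -o {q(sorted_bam)} {q(unsorted_bam)}"
    index = f"sambamba index -t {threads} {q(sorted_bam)}"
    # Call variants in gVCF mode with GATK 3 HaplotypeCaller.
    call = (
        f"java -jar {q(gatk_jar)} -T HaplotypeCaller"
        f" -R {q(reference)} -I {q(sorted_bam)}"
        f" --emitRefConfidence GVCF -o {q(gvcf)}"
    )
    return [align, sort, index, call]


if __name__ == "__main__":
    # Hypothetical sample and file names, for illustration only.
    for command in build_commands("HG00096", "HG00096_1.fastq.gz",
                                  "HG00096_2.fastq.gz", threads=8):
        print(command)
```

Building the commands as strings keeps the sketch easy to inspect and log before anything is launched on the cluster.
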
The AMI contains the following software:
- Python3
  - retrying
  - awscli
  - boto3
- mdadm (software RAID)
- bwa version 0.7.15
- samblaster version 0.1.22
- sambamba version 0.6.3
- Java 8
- GATK version 3.5.0 (jar file in /usr/local/bin/)
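
When rebuilding the AMI, a quick check that the software listed above is present might look like the following sketch. It only tests for presence, not versions. The executable names (for example `aws` for awscli), the jar file name `GenomeAnalysisTK.jar`, and the helper name `missing_software` are assumptions made for this example; `mdadm` may also live outside the `PATH` of a non-root user.

```python
import importlib.util
import os
import shutil

# Executable names are assumed from the software list; adjust as needed.
REQUIRED_TOOLS = ["python3", "aws", "mdadm", "bwa", "samblaster", "sambamba", "java"]
REQUIRED_MODULES = ["retrying", "boto3"]
GATK_JAR = "/usr/local/bin/GenomeAnalysisTK.jar"  # assumed jar file name


def missing_software(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES, jar=GATK_JAR,
                     which=shutil.which,
                     find_spec=importlib.util.find_spec,
                     exists=os.path.exists):
    """Return the names of required tools, modules, or files that were not found."""
    missing = [tool for tool in tools if which(tool) is None]
    missing += [module for module in modules if find_spec(module) is None]
    if not exists(jar):
        missing.append(jar)
    return missing


if __name__ == "__main__":
    problems = missing_software()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All listed software was found.")
```

The lookup functions are passed in as parameters so the check can be exercised without the tools actually being installed.
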