Scripts for creating gVCF files from 1000 Genomes data using StarCluster.

1kgenomes-gvcfs

Overview

This repository contains scripts that process low-coverage FASTQ files from Phase 3 of the 1000 Genomes Project into gVCF files. The scripts are designed to run on Amazon EC2 using StarCluster, on a custom AMI described below. For each sample, raw FASTQ reads are aligned with bwa, duplicates are marked with samblaster, alignments are sorted and indexed with sambamba, and variants are called with the GATK. The /data directory is an NFS-shared EBS volume that holds the indexed reference genome (hs37d5.fa) and provides space for log files.
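The per-sample steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the function name build_commands, the sample and FASTQ file names, the read-group fields, the output file names, and the jar file name GenomeAnalysisTK.jar are all hypothetical, and the sketch assumes a single paired-end FASTQ pair per sample. It only builds the command strings; any further arguments the real scripts pass are omitted.

```python
import shlex

REFERENCE = "/data/hs37d5.fa"  # reference genome on the shared /data volume
GATK_JAR = "/usr/local/bin/GenomeAnalysisTK.jar"  # assumed jar file name


def build_commands(sample, fastq1, fastq2, threads=4, workdir="."):
    """Return the shell commands for one sample, in the order they would run."""
    q = shlex.quote
    unsorted_bam = f"{workdir}/{sample}.unsorted.bam"
    sorted_bam = f"{workdir}/{sample}.bam"
    gvcf = f"{workdir}/{sample}.g.vcf"
    # bwa itself expands the literal "\t" sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    return [
        # Align, mark duplicates on the SAM stream, and convert to BAM.
        f"bwa mem -t {threads} -R {q(read_group)} {q(REFERENCE)} {q(fastq1)} {q(fastq2)}"
        " | samblaster"
        f" | sambamba view -S -f bam -o {q(unsorted_bam)} /dev/stdin",
        # Coordinate-sort, then index the sorted BAM.
        f"sambamba sort -t {threads} -o {q(sorted_bam)} {q(unsorted_bam)}",
        f"sambamba index -t {threads} {q(sorted_bam)}",
        # Call variants in reference-confidence (gVCF) mode.
        f"java -jar {q(GATK_JAR)} -T HaplotypeCaller -R {q(REFERENCE)}"
        f" -I {q(sorted_bam)} --emitRefConfidence GVCF -o {q(gvcf)}",
    ]


# Hypothetical input file names, for illustration only.
for command in build_commands("NA12878", "NA12878_1.fastq.gz", "NA12878_2.fastq.gz"):
    print(command)
```

Each string could be passed to subprocess.run(..., shell=True, check=True). Because the first command is a pipe, a failure in bwa or samblaster is only detected if the shell runs with pipefail enabled.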

AMI Details

The AMI contains the following software:

  • Python 3
    • retrying
    • awscli
    • boto3
  • mdadm (software RAID)
  • bwa version 0.7.15
  • samblaster version 0.1.22
  • sambamba version 0.6.3
  • Java 8
  • GATK version 3.5.0 (jar file in /usr/local/bin/)
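
A quick way to confirm that a node booted from the AMI has the command-line tools listed above is to look them up on the PATH. The sketch below is only an illustration: the executable names in EXPECTED_TOOLS are assumptions (the list above names the software, not the binaries), and missing_tools is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumed executable names; adjust them if the AMI installs binaries under other names.
EXPECTED_TOOLS = ["bwa", "samblaster", "sambamba", "java", "mdadm", "aws"]


def missing_tools(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that `which` cannot find on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The lookup function is a parameter so the check can be exercised without the tools installed. This does not verify versions or the Python packages.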