HackBio Internship 2021: Genomics-One-B

HackBio is a virtual, regimented, practice-oriented research internship focused on equipping African scientists with advanced bioinformatics and computational biology skills. By the end of the internship, successful interns should have:

Honed their skills in a specific bioinformatics method
At least one peer-reviewed article to show for the internship experience

PROJECT WORKFLOW & DESIGN

Calling variants in non-diploid systems

By: Anton Nekrutenko and Alex Ostrovsky

Introduction

Many forms of life, from prokaryotes to viruses and their derivatives such as mitochondria, are non-diploid. In this tutorial, Team Genomics_One_B recreates the project named above, which works with four datasets obtained from human genomic DNA sequencing. The aim is to identify heteroplasmic variants within mitochondrial DNA using Galaxy tools.

The raw reads were downloaded from the following links:

https://zenodo.org/record/1251112/files/raw_child-ds-1.fq
https://zenodo.org/record/1251112/files/raw_child-ds-2.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-1.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-2.fq
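For readers who prefer to fetch the files outside Galaxy, the four links above can be generated and downloaded with a short script. This is a minimal sketch using only the Python standard library; the function names and the `raw_reads` output folder are illustrative assumptions, not part of the original workflow.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://zenodo.org/record/1251112/files"
SAMPLES = ("child", "mother")
MATES = (1, 2)


def read_urls():
    """Return the four raw-read URLs listed in this tutorial."""
    return [f"{BASE_URL}/raw_{s}-ds-{m}.fq" for s in SAMPLES for m in MATES]


def download_all(dest_dir="raw_reads"):
    """Download each file into dest_dir (hypothetical folder name),
    skipping files that are already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in read_urls():
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))  # needs network access
        paths.append(target)
    return paths
```

Calling `download_all()` would place the four FASTQ files in a local folder, from which they can be uploaded to Galaxy as described in Step 1.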

In this tutorial, we will cover:

STEP 1: IMPORTING DATASET

Download the datasets from the resource page
Click Upload Data on the Galaxy web page
Galaxy will ask whether the data comes from local files or from the web (this depends on where you saved the dataset)
After selecting the files, click Start. Once the import is complete, the dataset highlight turns green, as seen in the picture below.

STEP 2: QUALITY CHECK OF DATASET

It is important to check the quality of the data before proceeding with the analysis, to determine whether there is a problem with the dataset. Click on FASTA/FASTQ on the left-hand side, select 'FastQC Read Quality Check', and click Execute to run the check on the data.
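To give a feel for what a quality check measures, the sketch below computes the mean Phred score of each read in a FASTQ file. It is not a substitute for FastQC and covers only one of its metrics. It assumes well-formed four-line records and the common Phred+33 quality encoding; the function name is illustrative.

```python
import io


def mean_read_qualities(handle, offset=33):
    """Yield (read_name, mean Phred score) for each 4-line FASTQ record.

    Assumes Phred+33 encoding and no blank lines between records.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file
        handle.readline()  # sequence line (not needed here)
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        scores = [ord(ch) - offset for ch in quality]
        mean = sum(scores) / len(scores) if scores else 0.0
        yield header[1:], mean


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
results = list(mean_read_qualities(example))
print(results)  # [('read1', 40.0), ('read2', 20.0)]
```

To run it on a real file, pass an open file handle, for example `open("raw_child-ds-1.fq")`, in place of the in-memory example.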

STEP 3: MAPPING THE READS USING BWA MEM

The human genome ‘hg38’ was used as the reference genome. Because the data is paired-end, the datasets have to be supplied by selecting multiple datasets as follows:

First set of reads: both "ds-1" datasets, raw_child-ds-1.fq and raw_mother-ds-1.fq
Second set of reads: both "ds-2" datasets, raw_child-ds-2.fq and raw_mother-ds-2.fq

Set the read groups information option to “Set read groups (SAM/BAM specification)” and click Execute.
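The pairing described above (each sample's ds-1 file as the first set of reads, and its ds-2 file as the second) can be checked programmatically before mapping. The sketch below groups file names by sample; it assumes the `raw_<sample>-ds-<1|2>.fq` naming pattern used by the four datasets, and the function name is illustrative.

```python
import re
from collections import defaultdict

# Matches names such as raw_child-ds-1.fq (sample="child", mate="1").
FILENAME = re.compile(r"raw_(?P<sample>\w+)-ds-(?P<mate>[12])\.fq$")


def pair_reads(filenames):
    """Return {sample: (first_reads_file, second_reads_file)}."""
    mates_by_sample = defaultdict(dict)
    for name in filenames:
        match = FILENAME.search(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        mates_by_sample[match["sample"]][int(match["mate"])] = name
    pairs = {}
    for sample, mates in sorted(mates_by_sample.items()):
        if set(mates) != {1, 2}:
            raise ValueError(f"sample {sample} is missing a mate file")
        pairs[sample] = (mates[1], mates[2])
    return pairs


files = [
    "raw_child-ds-1.fq",
    "raw_child-ds-2.fq",
    "raw_mother-ds-1.fq",
    "raw_mother-ds-2.fq",
]
pairs = pair_reads(files)
```

Here `pairs["mother"]` holds the mother's first and second read files, in that order, which mirrors the two selection boxes in the BWA-MEM form.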

STEP 4: POST-PROCESSING MAPPED READS

Step 4.1: Merging BAM datasets

Select the Picard tool group
Click the MergeSamFiles tool, then select the datasets obtained from Step 3 as the input dataset collection.
Set the input parameters as seen in the image below, then click Execute.

Step 4.2: Removing duplicates using MarkDuplicates

Select Picard on the left side panel.
Click on MarkDuplicates.
In the SAM/BAM dataset or dataset collection box, select the merged dataset from Step 4.1. Set the input parameters as seen below and click Execute.
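The idea behind duplicate marking can be illustrated with a toy example: reads that map to the same place on the same strand are treated as copies of one fragment, and all but the best-scoring one are flagged. The sketch below is a deliberate simplification and not Picard's actual algorithm (which, among other things, considers both mates of a pair). The `Read` type and its fields are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record used only for this illustration."""
    name: str
    chrom: str
    start: int
    strand: str
    quality_sum: int  # sum of base qualities, used to pick the read to keep


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, start, strand) form a group; the read with the
    highest quality_sum is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.quality_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("a", "chrM", 100, "+", 300),
    Read("b", "chrM", 100, "+", 350),  # same position/strand as "a"
    Read("c", "chrM", 100, "-", 200),  # opposite strand, not a duplicate
    Read("d", "chrM", 250, "+", 310),
]
flagged = mark_duplicates(reads)
print(flagged)  # {'a'}
```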

Step 4.3: Left-aligning indels using BamLeftAlign Tool

Left-aligning indels is important for obtaining accurate variant calls (the BAM dataset generated by MarkDuplicates is used as input for this step)
Select the BamLeftAlign tool
Select the MarkDuplicates dataset as input and use the reference genome hg38 (input parameters as seen in the picture below), then click Execute
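Why left-alignment matters: inside a repeat, the same deletion can be written at several positions, and variant callers can only combine evidence if every read reports it at the same one. The sketch below shifts a deletion to its leftmost equivalent position. It is a minimal illustration of the principle, not the BamLeftAlign implementation, and the function name is an assumption.

```python
def left_align_deletion(ref, start, length):
    """Return the leftmost equivalent start (0-based) of a deletion.

    The deletion removes ref[start:start + length]. It can move one base
    to the left whenever the base entering the window on the left equals
    the base leaving it on the right, because the resulting sequence is
    unchanged.
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


ref = "GCACACACAT"
# A 2-base deletion reported at index 5 inside the CA repeat...
aligned = left_align_deletion(ref, start=5, length=2)
print(aligned)  # 1

# ...produces the same sequence as the left-aligned deletion at index 1.
assert ref[:5] + ref[7:] == ref[:aligned] + ref[aligned + 2:]
```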

Step 4.4: Filtering reads

Select Filter under BAMTools
Using the left-aligned dataset from Step 4.3, set the input parameters as seen in the picture below.
Click Execute (NB: the parameter 'would you like to set rules' should be set to NO)

STEP 5: CALLING NON-DIPLOID VARIANTS USING FREEBAYES

You can find the FreeBayes tool using the search box in Galaxy. Select the reference genome, the run mode, and the BAM input file. Set the parameters for the population model, allelic scope, and input filter options as seen in the images below.

STEP 6: FILTERING VARIANTS USING VCFfilter

Find the VCFfilter tool using the search box.
Using the dataset obtained from variant calling (Step 5), set the input parameters as seen below.
Click Execute

STEP 7: VISUALIZATION USING VCF.IOBIO AND IGV

Click on the processed VCF dataset; it will expand to show the display links.
Click on “display at vcf.iobio” at the bottom
Use the reference genome Human hg38 for comparison
The VCF dataset will be indexed before it is displayed
Repeat the process for IGV by clicking on "Display with IGV"

STEP 8: COMPARING FREQUENCIES

Although visualizing VCF datasets is a good way to get an overall picture, it hides many details. To explore the data further:

Convert the VCF dataset into a tab-delimited representation using VCFtoTab-delimited

As we opted for “Report data per sample” (four), this produces a dataset with many columns (in this tutorial, 62 columns were produced, of which only six are necessary)

Then cut these six columns out (refer to the image below)
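The column-cutting step can also be reproduced outside Galaxy. The sketch below keeps a chosen subset of columns from tab-delimited text by header name. The column names in the toy table (`CHROM`, `POS`, `SAMPLE`, `AO`, `DP`, `EXTRA`) are hypothetical stand-ins, since the exact six columns are shown only in the image; the counts are the ones reported in the interpretation section.

```python
import csv
import io


def cut_columns(tsv_text, keep):
    """Return rows containing only the columns named in keep, in that order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [[row[name] for name in keep] for row in reader]


# Toy table with hypothetical column names; EXTRA stands for the many
# columns that are not needed downstream.
table = (
    "CHROM\tPOS\tSAMPLE\tAO\tDP\tEXTRA\n"
    "chrM\t3243\tmother\t671\t2057\tx\n"
    "chrM\t3243\tchild\t694\t1035\ty\n"
)
rows = cut_columns(table, keep=["POS", "SAMPLE", "AO", "DP"])
for row in rows:
    print("\t".join(row))
```

Selecting by header name rather than by column number keeps the script valid even if the converter emits the columns in a different order.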

INTERPRETATION OF RESULT:

At position 3243, the mother sample has 671 G’s (‘G’ is the alternative allele) and the depth of coverage is 2057, so there are 2057 - 671 = 1386 A’s. At the same position, the child sample has 694 G’s and 1035 - 694 = 341 A’s.

Allele	A	G
Mother	1386	671
Child	341	694

We noticed a remarkable frequency change: the major allele in the mother, ‘A’, becomes the minor allele in the child.
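The shift is easier to see when the counts are expressed as allele frequencies. The short sketch below recomputes them from the counts in the table above; the function name is illustrative.

```python
def alt_frequency(alt_count, depth):
    """Fraction of reads at a site that carry the alternative allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_count / depth


# Counts at position 3243: G is the alternative allele.
mother_g = alt_frequency(671, 2057)
child_g = alt_frequency(694, 1035)

print(f"mother: G = {mother_g:.3f}, A = {1 - mother_g:.3f}")
print(f"child:  G = {child_g:.3f}, A = {1 - child_g:.3f}")
# mother: G = 0.326, A = 0.674
# child:  G = 0.671, A = 0.329
```

In other words, G rises from roughly one third of the reads in the mother to roughly two thirds in the child.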

To access our data and results on the drive click

Contributors:

Temmykeji - Graphic design of workflow, Dataset and FastQC
Solomon - Dataset, FastQC and GitHub Markdown
Rajeshcha44 - Mapping of reads using BWA-MEM
Nitigya-M - Mapping of reads using BWA-MEM
abdnahid_ - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter and GitHub Markdown
Mike - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter
Karteek - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter
Priyacomp - Variant calling of dataset
MANGAIYARKARASI - Variant calling of dataset
Pragna_lakshmi - Variant calling of dataset using FreeBayes and Comparing of frequencies using VCFtoTab-delimited
Naomi - Mapping of reads using BWA-MEM
Galaxy - Filtering of variant call dataset using FreeBayes
Aarathi04 - Filtering of variant call dataset using VCFfilter
Gautami(Team Leader) - Visualization using IGV and VCF.IOBIO; and Comparing of frequencies using VCFtoTab-delimited
ZubairAlam - Visualization using IGV and VCF.IOBIO
Shreyashi - Visualization using IGV and VCF.IOBIO
omimill - Comparing of frequencies using VCFtoTab-delimited and GitHub Markdown

mikemwanga/Genomics-One-B