HackBio Internship 2021: Genomics-One-B

hackbio image

HackBio is a virtual, regimented research internship that is practice-oriented and focused on equipping African scientists with advanced bioinformatics and computational biology skills. By the end of the internship, successful interns should have:

  • Honed their skills in a specific bioinformatics method
  • At least one peer-reviewed article to show for the internship experience

PROJECT WORKFLOW & DESIGN

hackbio ads

Calling variants in non-diploid systems

By: Anton Nekrutenko and Alex Ostrovsky

Introduction

Much of life, from prokaryotes to viruses and their derivatives such as mitochondria, is non-diploid. In this tutorial, Team Genomics_One_B recreates the project named above, working with four datasets obtained from human genomic DNA sequencing. The aim is to identify heteroplasmic variants within mitochondrial DNA using Galaxy tools.

The raw reads were downloaded from the following Zenodo URLs:

https://zenodo.org/record/1251112/files/raw_child-ds-1.fq
https://zenodo.org/record/1251112/files/raw_child-ds-2.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-1.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-2.fq
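
For readers who prefer to fetch the files with a script before uploading them to Galaxy, the following is a minimal sketch using only the Python standard library. The URLs are the ones listed above; the function names and the `raw_reads` folder are hypothetical choices made for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file-name pattern are taken from the list above.
ZENODO_BASE = "https://zenodo.org/record/1251112/files"
SAMPLES = ("child", "mother")
MATES = (1, 2)


def build_urls():
    """Return the four FASTQ URLs in the order listed above."""
    return [f"{ZENODO_BASE}/raw_{s}-ds-{m}.fq" for s in SAMPLES for m in MATES]


def download_all(dest_dir="raw_reads"):
    """Download every FASTQ file into dest_dir (a hypothetical folder name)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in build_urls():
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():  # skip files that are already present
            urlretrieve(url, target)
        paths.append(target)
    return paths
```

Calling `download_all()` requires network access and returns the local paths of the four files.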

This tutorial covers the following steps:

STEP 1: IMPORTING DATASET

  • Download the datasets from the resource page
  • Click Upload Data on the Galaxy web page
  • Galaxy asks whether the data come from local files or from the web; choose according to where the datasets are saved
  • After selecting the files, click Start. Once the import is complete, the dataset turns green in the history, as seen in the picture below.

gd

STEP 2: QUALITY CHECK OF DATASET

It is important to check the quality of the data before proceeding with the analysis, because this reveals whether there is a problem with the dataset. Click on FASTA/Fastq in the left-hand tool panel, select 'FastQC Read Quality Check', and execute. The tool then runs a quality check on the data.

gc

gd
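
To give a feel for one of the things a read quality check looks at, here is a small sketch that computes the mean base quality of each read in a FASTQ file. It is not FastQC itself. It assumes four-line FASTQ records and Phred+33 quality encoding, and the function name and toy reads are hypothetical.

```python
import io


def mean_read_qualities(handle, offset=33):
    """Yield (read_id, mean Phred score) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) stops the scan
            break
        handle.readline()  # sequence line, not needed here
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        if not quality:
            continue  # skip records without quality values
        scores = [ord(ch) - offset for ch in quality]
        yield header[1:], sum(scores) / len(scores)


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
toy_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
qualities = dict(mean_read_qualities(toy_fastq))
```

With a real file, `open("raw_child-ds-1.fq")` can be passed in place of the toy `StringIO` object.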

STEP 3: MAPPING THE READS USING BWA MEM

The human genome build ‘hg38’ was used as the reference genome. Because the data come from paired-end sequencing, the datasets have to be supplied by selecting multiple datasets as follows:

  • First set of reads: both ds-1 datasets, raw_child-ds-1.fq and raw_mother-ds-1.fq
  • Second set of reads: both ds-2 datasets, raw_child-ds-2.fq and raw_mother-ds-2.fq

Set the read groups information option to “Set read groups (SAM/BAM specification)”, then Execute.
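
Outside Galaxy, the same pairing and read group idea can be sketched as a `bwa mem` command line, where the `-R` option carries the read group header. The snippet below only builds the command; it does not run it. The reference path `hg38.fa`, the `PL:ILLUMINA` platform value, and the reuse of the sample name for the `ID` and `SM` fields are assumptions made for illustration, not settings confirmed by this tutorial.

```python
def read_group(sample, platform="ILLUMINA"):
    """Build a SAM @RG header line; ID and SM both reuse the sample name."""
    # bwa converts the literal two-character sequence \t into a TAB.
    return f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:{platform}"


def bwa_mem_command(reference, sample, reads_dir="."):
    """Return the argument list that maps one sample's paired-end reads."""
    fq1 = f"{reads_dir}/raw_{sample}-ds-1.fq"  # first set of reads
    fq2 = f"{reads_dir}/raw_{sample}-ds-2.fq"  # second set of reads
    return ["bwa", "mem", "-R", read_group(sample), reference, fq1, fq2]


# One command per sample; "hg38.fa" is a hypothetical indexed reference.
commands = {s: bwa_mem_command("hg38.fa", s) for s in ("child", "mother")}
```

Each list can be handed to `subprocess.run` with standard output redirected to a SAM file. The point of the sketch is that every sample's ds-1 file is paired with its own ds-2 file, and that the read group keeps the samples distinguishable after merging.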

img_20210820_131657

img_20210820_131439

STEP 4: POST-PROCESSING MAPPED READS

Step 4.1: Merging BAM datasets

  • Select the Picard tools

  • Click the Merge SAM Files tool, then add the datasets obtained from Step 3 as input.

  • Input parameters as seen in the image below, then Execute.

    41

Step 4.2: Removing duplicates using MarkDuplicates

  • Select Picard in the left side panel.
  • Click on MarkDuplicates.
  • In the SAM/BAM dataset or dataset collection box, select the merged BAM dataset from Step 4.1. Input parameters as seen below and Execute.

42

Step 4.3: Left-aligning indels using BamLeftAlign Tool

  • Left-aligning indels is important for obtaining accurate variant calls. The BAM dataset generated by MarkDuplicates is used in this step.
  • Select the BamLeftAlign tool
  • Input the MarkDuplicates dataset and use reference genome hg38 (input parameters as seen in the picture below), then Execute
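
To illustrate why left-alignment matters, the sketch below shifts a deletion inside a repeat to its leftmost equivalent position. It is a simplified illustration of the idea, not BamLeftAlign's actual implementation; the function names and the toy reference sequence are hypothetical.

```python
def left_align_deletion(reference, start, length):
    """Shift a deletion of `length` bases at 0-based `start` as far left as possible.

    The deletion can move one base left whenever the base just before it
    equals the last deleted base, because the resulting sequence is unchanged.
    """
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


def apply_deletion(reference, start, length):
    """Return the sequence that remains after removing the deleted bases."""
    return reference[:start] + reference[start + length:]


# A "CA" deletion reported at the right end of a CACACA repeat.
toy_reference = "GGCACACACT"
aligned_start = left_align_deletion(toy_reference, 6, 2)
```

Here the deletion reported at position 6 and the one at `aligned_start` describe the same resulting sequence. Placing every such indel at one canonical position keeps a variant caller from counting the same event as several different variants.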

43

Step 4.4: Filtering reads

  • Select Filter under BamTools
  • Using the MarkDuplicates dataset, input parameters as seen in the picture below.
  • Execute (NB: the parameter 'would you like to set rules' should be set to NO)

44

STEP 5: CALLING NON-DIPLOID VARIANTS USING FREEBAYES

Find the FreeBayes tool using the search box in Galaxy. Select the reference genome, the run mode, and the BAM input file. Set the parameters for the population model, allelic scope, and input filter options as seen in the images below.

51

52

53

54

STEP 6: FILTERING VARIANTS USING VCFFILTER

  • Find the VCFfilter tool using the search box.
  • Using the dataset obtained from variant calling (Step 5), input parameters as seen below.
  • Execute

61

62

STEP 7: VISUALIZATION USING VCF.IOBIO AND IGV

  • Click on the processed VCF dataset; it expands to show the display links.
  • Click on “display at vcf.iobio” at the bottom
  • Use the reference genome Human hg38 for comparison
  • The VCF datasets are indexed before they are displayed
  • Repeat the process for IGV by clicking on "Display with IGV"

2 _visualization_options_2

STEP 8: COMPARING FREQUENCIES

Visualizing VCF datasets is a good way to get an overall idea, but it hides many details. To explore the data further:

  • Convert VCF dataset into a tab-delimited representation using VCFtoTab-delimited

IMG-20210820-WA0017

Because we opted for “Report data per sample” (four), this produces a dataset with many columns. In this tutorial, 62 columns were produced, of which only six are needed.

IMG-20210820-WA0018

Then cut out these six columns (refer to the image below).

IMG-20210820-WA0019
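
The column cut can also be sketched in Python with the standard `csv` module. The column names below (`POS`, `REF`, `ALT`, `SAMPLE`, `AO`, `DP`), the `chrM` contig label, and the sample labels are assumptions for illustration; the real table has 62 columns, and the six to keep are the ones shown in the image above. The two data rows reuse the counts discussed in the interpretation section.

```python
import csv
import io

# A toy tab-delimited table standing in for the VCFtoTab-delimited output.
HEADER = ["CHROM", "POS", "ID", "REF", "ALT", "SAMPLE", "AO", "DP"]
ROWS = [
    ["chrM", "3243", ".", "A", "G", "mother", "671", "2057"],
    ["chrM", "3243", ".", "A", "G", "child", "694", "1035"],
]
TSV_TEXT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

# Hypothetical names for the six columns that are kept.
KEEP = ["POS", "REF", "ALT", "SAMPLE", "AO", "DP"]


def cut_columns(tsv_text, keep):
    """Return only the named columns from a tab-delimited table with a header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [[row[name] for name in keep] for row in reader]


selected = cut_columns(TSV_TEXT, KEEP)
```

Selecting by column name rather than by position keeps the cut readable, at the cost of depending on the exact header text.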

INTERPRETATION OF RESULT:

IMG-20210820-WA0020

At position 3243, the mother sample has 671 G’s (‘G’ is the alternative allele) at a depth of coverage of 2057, which leaves 2057 - 671 = 1386 A’s. At the same position, the child sample has 694 G’s at a depth of coverage of 1035, which leaves 1035 - 694 = 341 A’s.

Allele    A       G
Mother    1386    671
Child     341     694

We noticed a remarkable frequency change: the major allele in the mother, ‘A’, becomes the minor allele in the child.
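
The arithmetic behind this comparison can be checked with a short sketch. The counts are the ones reported above; the function names are hypothetical.

```python
def allele_counts(depth, alt_count):
    """Split the coverage depth into reference (A) and alternative (G) counts."""
    return {"A": depth - alt_count, "G": alt_count}


def alt_frequency(depth, alt_count):
    """Fraction of reads that carry the alternative allele."""
    return alt_count / depth


# (depth of coverage, number of G reads) at position 3243.
observations = {"mother": (2057, 671), "child": (1035, 694)}

counts = {name: allele_counts(*obs) for name, obs in observations.items()}
frequencies = {name: alt_frequency(*obs) for name, obs in observations.items()}
```

The G allele frequency comes out at roughly 0.33 in the mother and roughly 0.67 in the child, which is the shift from minor to major allele described above.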

To access our data and results on the drive click

Contributors:

  • Temmykeji - Graphic design of workflow, Dataset and FastQC

  • Solomon - Dataset, FastQC and Github Markdown

  • Rajeshcha44 - Mapping of reads using BWA-MEM

  • Nitigya-M - Mapping of reads using BWA-MEM

  • abdnahid_ - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter and GitHub Markdown

  • Mike - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter

  • Karteek - Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter

  • Priyacomp - Variant calling of dataset

  • MANGAIYARKARASI - Variant calling of dataset

  • Pragna_lakshmi - Variant calling of dataset using FreeBayes and Comparing of frequencies using VCFtoTab-delimited

  • Naomi - Mapping of reads using BWA-MEM

  • Galaxy - Filtering of variant call dataset using FreeBayes

  • Aarathi04 - Filtering of variant call dataset using VCFfilter

  • Gautami(Team Leader) - Visualization using IGV and VCF.IOBIO; and Comparing of frequencies using VCFtoTab-delimited

  • ZubairAlam - Visualization using IGV and VCF.IOBIO

  • Shreyashi - Visualization using IGV and VCF.IOBIO

  • omimill - Comparing of frequencies using VCFtoTab-delimited and Github Markdown