HackBio Internship 2021: Genomics-One-B
HackBio is a structured, virtual research internship that is practice-oriented and focused on equipping African scientists with advanced bioinformatics and computational biology skills. By the end of the internship, successful interns should have:
- Honed their skills in a specific bioinformatics method
- At least one peer-reviewed article to show for the internship experience
By: Anton Nekrutenko and Alex Ostrovsky
A range of biological systems, from prokaryotes down to viruses and a few others, are non-diploid. In this tutorial, Team Genomics_One_B recreates the above project, which involves working with four datasets obtained from human genomic DNA sequencing. The aim is to identify heteroplasmic variants within the mitochondrial DNA using Galaxy tools.
The raw reads were downloaded from the following links:
https://zenodo.org/record/1251112/files/raw_child-ds-1.fq
https://zenodo.org/record/1251112/files/raw_child-ds-2.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-1.fq
https://zenodo.org/record/1251112/files/raw_mother-ds-2.fq
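For readers who prefer to script the download rather than click through the links, the sketch below rebuilds the four URLs listed above and fetches them with Python's standard library. The names `dataset_urls`, `download_all` and the `raw_reads` folder are illustrative assumptions and are not part of the original tutorial.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://zenodo.org/record/1251112/files"
SAMPLES = ("child", "mother")
MATES = (1, 2)


def dataset_urls():
    """Return the four raw-read URLs in the order they are listed above."""
    return [
        f"{BASE_URL}/raw_{sample}-ds-{mate}.fq"
        for sample in SAMPLES
        for mate in MATES
    ]


def download_all(target_dir="raw_reads"):
    """Download every dataset into target_dir (needs network access)."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in dataset_urls():
        dest = out / url.rsplit("/", 1)[-1]  # keep the original file name
        urlretrieve(url, dest)
        paths.append(dest)
    return paths
```

Calling `download_all()` saves the four files locally, after which they can be uploaded to Galaxy from local files.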
In this tutorial, we will cover:
- Download the datasets from the resource page.
- Click "Upload Data" on the Galaxy web page.
- Galaxy will ask whether the data comes from local files or from the web; choose according to where you saved the datasets.
- After selecting the files, click "Start". Once the import is complete, the dataset highlight turns green, as seen in the picture below.
It is important to check the quality of the data before proceeding with the analysis, to determine whether there is a problem with the datasets. Click on FASTA/FASTQ in the left-hand panel, select 'FastQC Read Quality Check', and execute. The tool will run a check on the data.
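To give a feel for what a read quality check looks at, here is a toy sketch that decodes the quality line of each FASTQ record and flags reads with a low mean Phred score. It assumes Phred+33 encoding and uses an illustrative threshold of 20; FastQC itself reports many more metrics, and none of these function names come from it.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def low_quality_reads(handle, threshold=20):
    """Names of reads whose mean Phred score falls below the threshold."""
    return [
        name
        for name, _, quality in read_fastq(handle)
        if mean_phred(quality) < threshold
    ]


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
print(low_quality_reads(example))
```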
The human genome build ‘hg38’ was used as the reference genome. Because the data are paired-end, the reads have to be supplied by selecting multiple datasets as follows:
- First set of reads: both datasets 1 (raw_child-ds-1.fq and raw_mother-ds-1.fq)
- Second set of reads: both datasets 2 (raw_child-ds-2.fq and raw_mother-ds-2.fq)
Set the read groups information to “Set read groups (SAM/BAM specification)” and click Execute.
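The pairing of first and second reads described above can be sketched in a few lines. This is only an illustration of how the file names map to samples and mates; the pattern and the `pair_reads` helper are assumptions based on the four file names used in this tutorial.

```python
import re
from collections import defaultdict

# Matches the file names used in this tutorial, e.g. raw_child-ds-1.fq
NAME_PATTERN = re.compile(r"raw_(?P<sample>\w+)-ds-(?P<mate>[12])\.fq$")


def pair_reads(filenames):
    """Group files into {sample: (first_reads, second_reads)}."""
    mates = defaultdict(dict)
    for name in filenames:
        match = NAME_PATTERN.search(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        mates[match["sample"]][match["mate"]] = name
    return {sample: (found["1"], found["2"]) for sample, found in mates.items()}


files = [
    "raw_child-ds-1.fq",
    "raw_child-ds-2.fq",
    "raw_mother-ds-1.fq",
    "raw_mother-ds-2.fq",
]
for sample, (first, second) in pair_reads(files).items():
    print(sample, first, second)
```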
Step 4.1: Merging BAM datasets
- Select the Picard tool.
- Click the MergeSamFiles tool, then import the datasets obtained from Step 3 into the dataset collection.
- Input parameters as seen in the image below, then execute.
Step 4.2: Removing duplicates using MarkDuplicates
- Select Picard on the left side panel.
- Click on MarkDuplicates.
- In the SAM/BAM dataset or dataset collection box, select the merged dataset produced by MergeSamFiles.
- Input parameters as seen below and execute.
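As a rough illustration of the idea behind duplicate marking, the sketch below treats reads that share chromosome, start position and strand as copies of the same fragment and keeps only the one with the highest summed base quality. This is a simplified assumption: the real MarkDuplicates tool works on read pairs, unclipped coordinates and library information, and the `Read` tuple here is hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal read record, for illustration only.
Read = namedtuple("Read", "name chrom pos strand quality_sum")


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Reads sharing (chrom, pos, strand) are treated as copies of one fragment;
    the copy with the highest quality_sum is kept and the rest are flagged.
    """
    best = {}
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        current = best.get(key)
        if current is None:
            best[key] = read
        elif read.quality_sum > current.quality_sum:
            duplicates.add(current.name)
            best[key] = read
        else:
            duplicates.add(read.name)
    return duplicates
```

In this toy model a read is only flagged, not removed, which mirrors the fact that marking and removing duplicates are separate decisions.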
Step 4.3: Left-aligning indels using BamLeftAlign Tool
- Left-aligning indels is important for obtaining accurate variant calls. The BAM dataset generated by MarkDuplicates is used to run this step.
- Select the BamLeftAlign tool.
- Input the MarkDuplicates dataset and use the reference genome hg38 (input parameters as seen in the picture below), then execute.
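To show why left-alignment matters, the sketch below shifts a single indel as far left as it can go within a repeat, so that the same event always gets the same coordinates. BamLeftAlign does this on the alignments inside the BAM file; this example instead applies the same idea to one VCF-style record (1-based position, reference allele, alternative allele), and the `left_align` function is an illustrative assumption rather than the tool's code.

```python
def left_align(reference, pos, ref, alt):
    """Left-align one indel. pos is 1-based; ref/alt are VCF-style alleles."""
    changed = True
    while changed:
        changed = False
        # A shared trailing base means the event can be described further left.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A "CA" deletion reported at the right end of a CACACA repeat...
print(left_align("TGCACACAG", 6, "ACA", "A"))
# ...describes the same event as the left-most placement at position 2.
```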
Step 4.4: Filtering reads
- Select Filter under BAMTools.
- Using the MarkDuplicates dataset, input parameters as seen in the picture below.
- Execute (NB: the parameter 'Would you like to set rules?' should be set to No).
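The kind of test a read filter applies can be sketched on plain SAM text lines. The criteria below (paired, properly paired, minimum mapping quality of 20, optional reference name) are illustrative assumptions and are not the values from the picture; the flag bits and column positions follow the SAM format.

```python
PAIRED = 0x1       # SAM flag bit: read is paired
PROPER_PAIR = 0x2  # SAM flag bit: read is mapped in a proper pair


def keep_read(sam_line, min_mapq=20, reference=None):
    """Decide whether one SAM alignment line passes the illustrative filter."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
    if not (flag & PAIRED and flag & PROPER_PAIR):
        return False
    if mapq < min_mapq:
        return False
    # Optionally restrict to one reference sequence, e.g. the mitochondrion.
    return reference is None or rname == reference


line = "r1\t99\tchrM\t100\t60\t4M\t=\t200\t104\tACGT\tIIII"
print(keep_read(line, reference="chrM"))
```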
Navigate to the FreeBayes tool using the search box in Galaxy. Select the reference genome, the run mode, and the BAM input file. Set the population model, allelic scope, and input filter options as seen in the images below.
- Navigate to the VCFfilter tool using the search box.
- Using the dataset obtained from variant calling (step 5), input parameters as seen below.
- Execute.
- Click on the processed VCF dataset; it will expand to show the display links.
- Click on “display at vcf.iobio” at the bottom.
- Use the reference genome, Human hg38, for comparison.
- The VCF datasets will be indexed before they are displayed.
- Repeat the process for IGV by clicking on "Display with IGV".
Although visualizing VCF datasets is a good way to get an overall picture, it hides many details. To explore the data further:
- Convert the VCF dataset into a tab-delimited representation using VCFtoTab-delimited.
Because we opted for “Report data per sample” (four), this produces a dataset with many columns. In this tutorial, 62 columns were produced, of which only six are needed.
Then cut these columns out (refer to the image below).
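Cutting a few columns out of a wide tab-delimited table can be sketched with Python's `csv` module. The column names in the example (`CHROM`, `POS`, `REF`, `ALT`, `DP`, `AO`, `QUAL`) and the values in the example row are hypothetical stand-ins; the six columns actually kept in this tutorial are the ones shown in the image.

```python
import csv
from io import StringIO


def cut_columns(handle, wanted):
    """Return the rows of a tab-delimited table restricted to the wanted columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [[row[name] for name in wanted] for row in reader]


# Made-up one-row table with hypothetical column names and values.
table = StringIO(
    "CHROM\tPOS\tREF\tALT\tQUAL\tDP\tAO\n"
    "chrM\t100\tC\tT\t50\t200\t20\n"
)
print(cut_columns(table, ["CHROM", "POS", "REF", "ALT", "DP", "AO"]))
```

Selecting by column name rather than by column number keeps the selection stable if the order of the 62 columns changes.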
At position 3243, the mother sample has 671 G’s (‘G’ is the alternative allele) at a depth of coverage of 2057, which leaves 2057 - 671 = 1386 A’s. At the same position, the child sample has 694 G’s at a depth of 1035, which leaves 1035 - 694 = 341 A’s.
Sample | A count | G count |
---|---|---|
Mother | 1386 | 671 |
Child | 341 | 694 |
We observed a marked shift in allele frequency: the major allele in the mother (‘A’) becomes the minor allele in the child.
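The arithmetic behind this comparison can be reproduced in a few lines. The sketch below uses only the depth and G counts reported for position 3243; the helper names are illustrative.

```python
def allele_counts(depth, alt_count):
    """Return (ref_count, alt_count) from total depth and alternative-allele count."""
    return depth - alt_count, alt_count


def alt_frequency(depth, alt_count):
    """Fraction of reads at the site that carry the alternative allele."""
    return alt_count / depth


# (depth, G count) at position 3243, as reported above
samples = {"mother": (2057, 671), "child": (1035, 694)}

for name, (depth, g_count) in samples.items():
    a_count, _ = allele_counts(depth, g_count)
    freq = alt_frequency(depth, g_count)
    print(f"{name}: A={a_count} G={g_count} G frequency={freq:.3f}")
```

The G allele goes from roughly a third of the reads in the mother to roughly two thirds in the child, which is the shift described above.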
To access our data and results on the drive click
- Temmykeji: Graphic design of workflow, Dataset and FastQC
- Solomon: Dataset, FastQC and GitHub Markdown
- Rajeshcha44: Mapping of reads using BWA-MEM
- Nitigya-M: Mapping of reads using BWA-MEM
- abdnahid_: Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter and GitHub Markdown
- Mike: Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter
- Karteek: Merging BAM datasets with MergeSAMFiles, Removing duplicates with MarkDuplicates, Left-aligning indels with BamLeftAlign, Filtering reads with BAMTools filter
- Priyacomp: Variant calling of dataset
- MANGAIYARKARASI: Variant calling of dataset
- Pragna_lakshmi: Variant calling of dataset using FreeBayes and Comparing of frequencies using VCFtoTab-delimited
- Naomi: Mapping of reads using BWA-MEM
- Galaxy: Filtering of variant call dataset using FreeBayes
- Aarathi04: Filtering of variant call dataset using VCFfilter
- Gautami (Team Leader): Visualization using IGV and VCF.IOBIO; and Comparing of frequencies using VCFtoTab-delimited
- ZubairAlam: Visualization using IGV and VCF.IOBIO
- Shreyashi: Visualization using IGV and VCF.IOBIO
- omimill: Comparing of frequencies using VCFtoTab-delimited and GitHub Markdown