hj-pangolin: An HTML repository from fish546-2018

Developing SNPs in multiple species of pangolins for population differentiation

The goal of this project is to build a reproducible pipeline that takes whole genome sequence data from pangolins and identifies single nucleotide polymorphisms (SNPs) between two species.

Objectives

Create a well-documented and reproducible pipeline that:

Runs FastQC to check the quality of reads
Aligns fastq files to a reference genome using BWA
Identifies SNPs between species using FreeBayes and ANGSD
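
The three objectives above can be sketched as an ordered list of shell commands for one sample. This is a minimal illustration, not the project's actual scripts: it assumes paired-end reads, uses samtools (not named in this README) to sort and index the alignment, and writes VCF output to a hypothetical analyses/variants directory. ANGSD is left out because its options depend on the analysis being run. The function only builds the command strings; it does not execute anything.

```python
import shlex


def build_pipeline_commands(reference, read1, read2, sample,
                            outdir="analyses", threads=4):
    """Return the shell commands for one sample, in run order.

    Assumptions (not from the README): paired-end reads, samtools for
    sorting/indexing, and a hypothetical '<outdir>/variants' directory.
    """
    ref = shlex.quote(reference)
    r1, r2 = shlex.quote(read1), shlex.quote(read2)
    sam = shlex.quote(f"{outdir}/aligned-files/{sample}.sam")
    bam = shlex.quote(f"{outdir}/aligned-files/{sample}.sorted.bam")
    vcf = shlex.quote(f"{outdir}/variants/{sample}.vcf")
    fastqc_dir = shlex.quote(f"{outdir}/fastqc")
    return [
        # 1. Quality check of the raw reads
        f"fastqc -o {fastqc_dir} {r1} {r2}",
        # 2. Index the reference (only needed once per reference)
        f"bwa index {ref}",
        # 3. Align the reads; bwa mem writes SAM to standard output
        f"bwa mem -t {threads} {ref} {r1} {r2} > {sam}",
        # 4. Sort and index the alignment so the variant caller can read it
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        # 5. Call variants against the same reference
        f"freebayes -f {ref} {bam} > {vcf}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    for command in build_pipeline_commands(
        "reference.fa", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample1"
    ):
        print(command)
```

Printing the commands before running them makes it easy to review each step or paste the list into a job script.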

Repository Structure

data
README and files containing information about the data files. Raw data and reference genome files are too large to store on GitHub.

raw-data: contains .fastq.gz files
reference-genome: contains downloaded reference genome .fa and .gff.gz files

tutorials
Jupyter and R notebooks from tutorials in class.

BLAST tutorial

notebooks
Jupyter notebooks used for analyses.

Notebook containing the MD5 checksum check for the reference genome
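
A checksum check of this kind can be sketched in a few lines of Python. This is not the code from the notebook. The function names are hypothetical, and the expected checksum is assumed to be the value published alongside the reference genome download.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a large genome file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Return True if the file's MD5 equals the published value.

    The comparison ignores case and surrounding whitespace in
    expected_md5, since published checksums are often copied by hand.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch usually means the download was truncated or corrupted, so the file should be downloaded again before indexing.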

scripts
Bash scripts used to run analyses on Mox.

analyses
Results and intermediate files from analysis.

aligned-files: contains .sam and .bam files
fastqc: contains FastQC and MultiQC results
genome: contains scaffold length text file
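
As an illustration of how a scaffold length file like the one in the genome directory could be produced, the sketch below counts bases per scaffold in a FASTA file. It is an assumption about the approach, not the project's actual method, and the function name is hypothetical. It takes any iterable of text lines, so it works with an open .fa file or with gzip.open(path, "rt") for compressed input.

```python
def scaffold_lengths(fasta_lines):
    """Map each scaffold name to its sequence length.

    The scaffold name is the first whitespace-delimited token after '>'.
    Sequence lines are summed until the next header line.
    """
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # Tiny made-up FASTA, for illustration only
    example = [">scaffold1 example", "ACGTAC", "GT", ">scaffold2", "GGCC"]
    for scaffold, length in scaffold_lengths(example).items():
        print(f"{scaffold}\t{length}")
```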

Project Timeline

Week 4: Set up project directory and organization for running analyses on Mox

Week 5: Run FastQC on raw sequence files, using GNU parallel to learn how to split up commands

Week 6: Check md5sum of the downloaded reference genome and index reference genome for BWA

Week 7: Run BWA on fastq files for all 10 individuals

Week 9: Run FreeBayes and ANGSD on aligned bam files

Week 10: Visualize the process and results of the project
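
The Week 5 idea of splitting one job into independent per-file commands can also be sketched in Python. The project itself used GNU parallel; the version below swaps in Python's ThreadPoolExecutor purely for illustration, and the function names and worker count are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fastqc_command(fastq_path, outdir):
    """Build the FastQC command for a single file as an argument list."""
    return ["fastqc", "-o", outdir, fastq_path]


def run_all(commands, max_workers=4):
    """Run each command as its own process, at most max_workers at a time.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(command):
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Each file is processed independently, so a non-zero exit code for one file does not stop the others. Checking the returned list shows which files need to be rerun.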

Next Steps

Filter identified SNPs using various quality filters and identify the most informative SNPs
Use genome-aware primer design software to design primers around SNPs of interest
Sequence museum samples and re-analyze the data with the full dataset
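
The first of these steps, quality filtering, could start from something like the sketch below. It keeps biallelic or multiallelic SNP records from uncompressed VCF text whose QUAL value meets a threshold. The threshold of 30 and the function name are assumptions chosen for illustration, not values from this project, and ranking SNPs by informativeness is not covered.

```python
BASES = {"A", "C", "G", "T"}


def filter_snps_by_quality(vcf_lines, min_qual=30.0):
    """Return the VCF data lines that are SNPs with QUAL >= min_qual.

    Header lines (starting with '#'), records with missing QUAL ('.'),
    and indels or other non-SNP records are skipped.
    """
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # not a complete VCF record
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual == ".":
            continue
        # A SNP has a single-base REF and only single-base ALT alleles
        is_snp = ref.upper() in BASES and all(
            allele.upper() in BASES for allele in alt.split(",")
        )
        if is_snp and float(qual) >= min_qual:
            kept.append(line)
    return kept
```

In practice a filter like this is often combined with depth and missing-data filters, which would need the INFO and genotype columns as well.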

Adam Tusk / CC BY 2.0
