/TargetMalaria_UROP_Imperial_College

Analysis and Classification of Anopheles Populations in the AG1000-3R dataset using Intergenic SNPs. Target-Malaria Group in Burt Lab, Imperial College London

Objectives Covered

  • Exploratory Data Analysis
    • SNPs Filtering
      • Mega Base Pair Selection
      • MAF Filtering
      • LD Pruning
    • Minor Allele Frequency (MAF) Filtering - Exploring the right MAF threshold for filtering rare alleles while preserving private alleles
    • Unsupervised exploration - PCA and UMAP visualizations for 4.8 million SNPs and samples from 16 populations. UMAP hyperparameter tuning for chromosome arm 3R.
  • Classification of 13 Populations
    • Pipeline for population classification using genetic sequences
    • Further improvement through dimensionality reduction and domain-related techniques
  • Pairwise analysis of 66 population pairs
  • Exploring SNP contribution and importance for population differentiation
  • Generic Python functions to reproduce and automate most of the analyses
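
As an illustration of the MAF filtering objective above, here is a minimal NumPy sketch of a filter that drops rare variants but can keep private alleles. It is not taken from the repository's notebooks. The function names, the SNPs x haplotypes matrix layout, the default threshold, and the working definition of a private allele (a minor allele observed in exactly one population) are all assumptions made for this example.

```python
import numpy as np


def minor_allele_frequency(haps):
    """Per-SNP minor allele frequency for a 0/1 matrix of shape (SNPs, haplotypes)."""
    alt_freq = haps.mean(axis=1)
    return np.minimum(alt_freq, 1.0 - alt_freq)


def private_allele_mask(haps, pops):
    """True where the minor allele is observed in exactly one population.

    Assumption: this is the working definition of a private allele used here.
    """
    alt_freq = haps.mean(axis=1)
    # Recode each SNP so that 1 always marks the minor allele.
    minor = np.where((alt_freq > 0.5)[:, None], 1 - haps, haps)
    labels = np.unique(pops)
    # present[i, j] is True if SNP i carries the minor allele in population j.
    present = np.stack([minor[:, pops == p].any(axis=1) for p in labels], axis=1)
    return present.sum(axis=1) == 1


def maf_filter(haps, pops, threshold=0.01, keep_private=True):
    """Boolean mask of SNPs to keep (hypothetical helper, illustrative threshold)."""
    keep = minor_allele_frequency(haps) >= threshold
    if keep_private:
        keep |= private_allele_mask(haps, pops)
    return keep


# Toy data: 4 SNPs x 8 haplotypes, two populations of 4 haplotypes each.
toy_haps = np.array([
    [0, 0, 0, 0, 0, 0, 0, 1],  # rare, seen only in population B
    [0, 1, 0, 1, 1, 0, 1, 0],  # common
    [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic
    [1, 0, 0, 0, 1, 0, 0, 0],  # rare, shared by A and B
])
toy_pops = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

print(maf_filter(toy_haps, toy_pops, threshold=0.3))
# [ True  True False False]
```

With `keep_private=False` the first toy SNP is dropped as well, which is the trade-off the threshold exploration is concerned with.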

Dataset

  • MalariaGEN Ag1000G Phase 2 AR1 release
  • 2,284 haplotypes from 1,142 individual samples (two per individual) across 16 populations: BFcol, BFgam, AOcol, CIcol, CMgam, FRgam, GAgam, GHcol, GHgam, GM, GNcol, GNgam, GQgam, GW, KE, and UGgam
  • 4,836,295 Intergenic SNPs from chromosome arm 3R
  • Phased, biallelic haplotype data (each allele coded 0 or 1)
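
The relationship between the 2,284 haplotypes and the 1,142 individuals can be sketched as follows. This is a hedged example, not the repository's code: the function name is hypothetical, and it assumes the matrix is laid out as SNPs x haplotypes with the two haplotypes of each individual in adjacent columns. Check that ordering against the actual data files before relying on it.

```python
import numpy as np


def haplotypes_to_genotype_counts(haps):
    """Collapse a (SNPs, 2 * individuals) 0/1 haplotype matrix into per-individual
    alternate-allele counts (0, 1 or 2).

    Assumption: columns 2*i and 2*i + 1 belong to individual i.
    """
    n_snps, n_haps = haps.shape
    if n_haps % 2 != 0:
        raise ValueError("expected an even number of haplotype columns")
    # Group adjacent column pairs and sum the two alleles of each individual.
    return haps.reshape(n_snps, n_haps // 2, 2).sum(axis=2)


toy = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 0],
])
print(haplotypes_to_genotype_counts(toy))
# [[1 2]
#  [0 1]]
```

Applied to a matrix with 2,284 haplotype columns, the result has 1,142 columns, one per individual.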