MergeGenome - Toolkit for Merging VCF files

[Figure: MergeGenome diagram]

This repository contains a Python implementation of the MergeGenome toolkit, which underscores the importance of cleaning genomic sequences prior to analysis. MergeGenome integrates DNA sequences from a query dataset and a reference dataset in variant call format (VCF) while preserving data quality. It is a pipeline of steps for merging the two datasets: chromosome nomenclature standardization, removal of SNP ambiguities, detection of SNP flips, elimination of SNP mismatches, and detection and/or fixing of query/reference mismatches. MergeGenome works with DNA sequences from any organism, which addresses the common situation of having access to more than one source of data but being able to use only one of them for statistical analysis.
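To make the SNP categories above concrete, here is a minimal sketch of how a query SNP might be compared with the reference SNP at the same CHROM and POS. It assumes that "ambiguous" means a strand-ambiguous pair of complementary alleles (A/T, T/A, C/G, G/C), that a "flip" means REF and ALT are swapped between the two datasets, and that anything else that differs is a "mismatch". These definitions and the function names are assumptions made for illustration; they are not taken from the MergeGenome code.

```python
# Hypothetical helpers for illustration only; not part of the MergeGenome API.

# Assumption: a SNP is "ambiguous" when REF and ALT are complementary bases,
# so the strand cannot be inferred from the alleles alone.
AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def is_ambiguous(ref, alt):
    """Return True if (ref, alt) is a strand-ambiguous allele pair."""
    return (ref.upper(), alt.upper()) in AMBIGUOUS_PAIRS


def classify_snp(query, reference):
    """Classify a query SNP against a reference SNP at the same CHROM and POS.

    Both arguments are (REF, ALT) tuples. Returns one of
    "ambiguous", "match", "flip", or "mismatch".
    """
    q = tuple(allele.upper() for allele in query)
    r = tuple(allele.upper() for allele in reference)
    if is_ambiguous(*q) or is_ambiguous(*r):
        return "ambiguous"
    if q == r:
        return "match"
    if q == (r[1], r[0]):  # REF and ALT are swapped
        return "flip"
    return "mismatch"


labels = [
    classify_snp(("A", "G"), ("A", "G")),
    classify_snp(("G", "A"), ("A", "G")),
    classify_snp(("A", "T"), ("A", "T")),
    classify_snp(("A", "C"), ("A", "G")),
]
```

Under these assumptions, a cleaning step could drop SNPs labeled "ambiguous" or "mismatch" and swap the alleles (and genotype encoding) of those labeled "flip" before merging.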

This repository also includes the implementation of other common tasks related to merging genomic sequences, such as identifying the common markers (i.e. SNPs with identical CHROM, POS, REF, and ALT fields) between two datasets and subsetting the available data to those common markers.
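As an illustration of the common-marker definition above, the sketch below finds the positions of SNPs that share identical CHROM, POS, REF, and ALT fields in two datasets. Each dataset is represented as a plain list of tuples rather than a parsed VCF file, and the function name `common_marker_indexes` is hypothetical; the toolkit's own commands may work differently.

```python
# Hypothetical helper for illustration only; not part of the MergeGenome API.

def common_marker_indexes(query, reference):
    """Return indexes of markers present in both datasets.

    Each dataset is a list of (CHROM, POS, REF, ALT) tuples. The result is a
    pair of parallel lists: the i-th entry of each list points at the same
    marker in the query and in the reference, respectively.
    """
    # Map each reference marker to the index of its first occurrence.
    ref_index = {}
    for j, marker in enumerate(reference):
        ref_index.setdefault(marker, j)

    query_idx, reference_idx = [], []
    seen = set()
    for i, marker in enumerate(query):
        if marker in ref_index and marker not in seen:
            seen.add(marker)  # skip duplicated markers in the query
            query_idx.append(i)
            reference_idx.append(ref_index[marker])
    return query_idx, reference_idx


query = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
reference = [("1", 200, "C", "T"), ("2", 50, "G", "C"), ("1", 100, "A", "G")]

q_idx, r_idx = common_marker_indexes(query, reference)
# Subset each dataset to the common markers, in query order.
query_common = [query[i] for i in q_idx]
reference_common = [reference[j] for j in r_idx]
```

Note that the third query SNP is excluded because its ALT allele differs from the reference, even though CHROM, POS, and REF agree. The comparison is exact, so chromosome names must already use the same notation in both datasets.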

Preprocessing Steps

  1. Partition data into a separate VCF file per chromosome

  2. Rename chromosome notation

  3. Clean VCF files

  4. Impute

  5. Subset SNPs to common markers with another dataset

  6. Machine Learning Source Classifiers for SNP Filtering

  7. SNP Error Correction

  8. Concat VCF files

  9. Merge VCF files

Merging Evaluation

  1. Discriminator Performance

  2. Plot SNP means (comparison)

  3. Plot Principal Component Analysis (PCA)

Other utility commands

  1. Remove SNPs with different means

  2. Store indexes of common markers

  3. Store allele data (from VCF to NPY or H5)

  4. Store allele data (from NPY to VCF)

License

This project is under the CC BY-NC 4.0 license. See LICENSE for details.