visualization and re-calling of SNPs from vcf file
This set of R-scripts allows visualization and manipulation of SNP genotype data starting from a vcf file.
- import a vcf file containing the genotype information from a set of bam files, e.g., from FREEBAYES.
- optional extraction of read depth by locus (output = violin plots)
- optional visualization of depth by samples and loci (output = heatmap)
- optional check for polymorphic and biallelic loci
- extract the relevant data from the vcf file, format into a tidy dataframe and parse data.
- Transform data into matrix of samples by loci, 2 columns per locus; write to .csv for analysis or data checking.
- Generate similar matrix that also includes read counts for each allele. Write to .csv for data checking.
- Generate plots of allele counts for each genotype, and at different scales for data checking. Output = pdf of 3 plots per locus, with guide lines for allelic ratios of 0.3 and 0.4 to check how well genotypes fit different calling schemes (e.g., minimum depth, allelic ratio) 9-10) Modify genotypes based on applying new minimum depth (minDP), allelic ratios, and excluded loci. (based on .csv file of parameters for each locus).
- re-plot data as in 8 to visualize changes
- optional re-generate genotype matrix for data checking (repeat 9-10 above until all loci are acceptable)
- remove or re-call individual genotypes (needs work; very labor intensive)
- transform data for export as sample by locus matrix (ready for import by strataG for various population analyses)
- re-plot final data set for records
- export data as one row per genotype (locus, position, sample_ID, genotype, allele1 depth, allele2 depth), for storage in database.