Simulation of rare and common variants based on 1000 Genomes data

sim1000G

R package for simulating genetic marker data.

Author: Apostolos Dimitromanolakis

Maintainer: Apostolos Dimitromanolakis

Installation

The latest stable version is available from CRAN:

install.packages("sim1000G")

Quickstart

The following script simulates genotypes for 200 unrelated individuals, using the variants in the region covered by the example VCF file:

library(sim1000G)


# Read the example file included in sim1000G

examples_dir = system.file("examples", package = "sim1000G")
vcf_file = file.path(examples_dir,"region.vcf.gz")


# Alternatively provide a vcf file here:
# vcf_file = "~/fs/tmp/sim4/pop1/region-chr4-312-GABRB1.vcf.gz"

vcf = readVCF( vcf_file, maxNumberOfVariants = 600 , min_maf = 0.01, max_maf = 1)

startSimulation(vcf, totalNumberOfIndividuals = 1000)
ids = generateUnrelatedIndividuals(200)

genotype = retrieveGenotypes(ids)


With the simulated genotypes, we can compare the allele frequencies against those in the original VCF file:

# Compare MAF of simulated data and vcf
plot( apply(genotype,2,mean)/2 ,  apply(vcf$gt1+vcf$gt2,1,mean)/2 )
abline(0,1,lty=1,lwd=9,col=rgb(0,0,1,0.3))
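
To make the frequency calculation in the comparison above concrete, here is a minimal base R sketch on a made-up genotype matrix. The toy data and the names `toy_genotype` and `allele_freq` are illustrative assumptions, not part of sim1000G. It assumes, as the `apply` calls above suggest, that rows are individuals, columns are markers, and each entry counts copies of the alternate allele (0, 1 or 2).

```r
# Hypothetical toy data: 4 individuals (rows) x 3 markers (columns).
# Each entry is the number of alternate alleles carried (0, 1 or 2).
toy_genotype <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    0, 2, 0,
    1, 0, 0),
  nrow = 4, byrow = TRUE
)

# Every individual carries two alleles per marker, so the allele
# frequency is the mean genotype divided by 2.
allele_freq <- colMeans(toy_genotype) / 2
print(allele_freq)
```

`colMeans(x) / 2` gives the same values as `apply(x, 2, mean) / 2` used in the plot above. Points close to the diagonal line in that plot indicate that the simulated frequencies track those of the input file.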

The generated genotypes can also be shown as an image:

# Show the genotypes as an image (requires the gplots package)

gplots::heatmap.2(genotype, col=c("white","orange","red"), Colv=FALSE, trace="none")

We can also compute the correlation between the markers and show an LD plot of the region:

# LD plot of region

gplots::heatmap.2( cor(genotype)^2 , trace="none", col=rev(heat.colors(200)) , Rowv=FALSE, Colv=FALSE )
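
As a small illustration of what the `cor(genotype)^2` matrix in the LD plot above contains, the following base R sketch computes it on made-up data. The marker names `m1`, `m2`, `m3` and the variables `toy_genotype` and `ld_r2` are hypothetical. Note that the squared Pearson correlation of genotype counts is a genotype-level measure of LD, not one computed from phased haplotypes.

```r
# Hypothetical toy data: 6 individuals (rows) x 3 markers (columns).
toy_genotype <- cbind(
  m1 = c(0, 1, 2, 1, 0, 2),
  m2 = c(0, 1, 2, 1, 0, 2),  # identical to m1
  m3 = c(2, 0, 1, 1, 2, 0)
)

# Pairwise squared correlation between markers (columns)
ld_r2 <- cor(toy_genotype)^2
print(round(ld_r2, 4))

# m1 and m2 carry the same genotypes, so their r^2 is 1
print(ld_r2["m1", "m2"])
```

The result is a symmetric marker-by-marker matrix with ones on the diagonal, which is why the heatmap is drawn with row and column reordering turned off: markers stay in their original order along both axes.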


Documentation

More detailed documentation and code examples can be found at:

SimulatingFamilyData