Is there a way to specify percentage of causal variants in SNPs simulation process using sim1000G?
Closed this issue · 1 comments
BGerm95 commented
Dear @adimitromanolakis and all others,
I am using sim1000G to simulate data for case/control study. I can not figure it out how to manipulate with this code to be able to generate 10% or 50% causal SNPs.
Can anyone explain to me how I can modify code in a way that it will simulate:
- 50 % of causal SNPs ( exmpl. 24 causal variants and 24 non causal SNPs)
2. 10 % of causal SNPs (exmpl. 5 causal and 43 non causal SNPs)
I would like to reuse code from Example3:
library(sim1000G)
vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000
vcf = readVCF( vcf_file, maxNumberOfVariants = 442 ,min_maf = 0.0005,max_maf = 0.01) #lowest MAF
dim( vcf$gt1 ) #rows represent number of variants, columns represent number of individuals
## Download and use full chromosome genetic map
downloadGeneticMap(4)
readGeneticMap(4)
sample.size=3000
startSimulation(vcf, totalNumberOfIndividuals = sample.size)
data_sim = function(seed.num){
SIM$reset()
id = generateUnrelatedIndividuals(sample.size)
gt = retrieveGenotypes(id)
freq = apply(gt,2,sum)/(2*nrow(gt))
causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)
beta.sign = rep(1,45)
c.value = 0.402
beta.abs = c.value*abs(log10(freq[causal]))
beta.val = beta.sign*beta.abs
x.bar = apply(gt[,causal],2,mean)
x.bar = as.matrix(x.bar)
beta.val = t(as.matrix(beta.val))
#disease prvalance = 1%
#beta0 = -log(99)-beta.val %*% x.bar
#disease prvalance = 1.5%
beta0 = 0-beta.val %*% x.bar
eta = beta.val %*% t(gt[,causal])
eta = as.vector(eta) + rep(beta0,nrow(gt))
prob = exp(eta)/(1+exp(eta))
genocase = rep(NA, sample.size)
set.seed(seed.num)
for(i in 1:sample.size){
genocase[i] = rbinom(1, 1, prob[i])
}
case.idx = sample(which(genocase==1),1000)
control.idx = sample(which(genocase==0),1000)
return(rbind(gt[case.idx,],gt[control.idx,]))
}
I will be grateful for all suggestions. Thank you in advance.
adimitromanolakis commented
Hi,
you can modify the line:
causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)
This selects 45 causal variants, among variants that have non-zero
frequency (computed in the line above).
Best,
Apostolos
…On Wed, 28 Oct 2020 at 16:10, BGerm ***@***.***> wrote:
Dear @adimitromanolakis <https://github.com/adimitromanolakis> and all
others,
I am using sim1000G to simulate data for case/control study. I can not
figure it out how to manipulate with this code to be able to generate
*10%* or *50% causal SNPs*. I assume that can be set up in this part of
code.
freq = apply(gt,2,sum)/(2*nrow(gt))
causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)
Can anyone explain to me what those formulas means and how I can modify
code in a way that it will simulate:
1. 50 % of causal SNPs ( exmpl. 24 causal variants and 24 non causal
SNPs)
2. 10 % of causal SNPs (exmpl. 5 causal and 43 non causal SNPs)
I would like to reuse code from Example3:
library(sim1000G)
vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000
vcf = readVCF( vcf_file, maxNumberOfVariants = 442 ,min_maf = 0.0005,max_maf = 0.01) #lowest MAF
dim( vcf$gt1 ) #rows represent number of variants, columns represent number of individuals
## Download and use full chromosome genetic map
downloadGeneticMap(4)
readGeneticMap(4)
sample.size=3000
startSimulation(vcf, totalNumberOfIndividuals = sample.size)
data_sim = function(seed.num){
SIM$reset()
id = generateUnrelatedIndividuals(sample.size)
gt = retrieveGenotypes(id)
freq = apply(gt,2,sum)/(2*nrow(gt))
causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)
beta.sign = rep(1,45)
c.value = 0.402
beta.abs = c.value*abs(log10(freq[causal]))
beta.val = beta.sign*beta.abs
x.bar = apply(gt[,causal],2,mean)
x.bar = as.matrix(x.bar)
beta.val = t(as.matrix(beta.val))
#disease prvalance = 1%
#beta0 = -log(99)-beta.val %*% x.bar
#disease prvalance = 1.5%
beta0 = 0-beta.val %*% x.bar
eta = beta.val %*% t(gt[,causal])
eta = as.vector(eta) + rep(beta0,nrow(gt))
prob = exp(eta)/(1+exp(eta))
genocase = rep(NA, sample.size)
set.seed(seed.num)
for(i in 1:sample.size){
genocase[i] = rbinom(1, 1, prob[i])
}
case.idx = sample(which(genocase==1),1000)
control.idx = sample(which(genocase==0),1000)
return(rbind(gt[case.idx,],gt[control.idx,]))
}
I will be grateful for all suggestions. Thank you in advance.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEGJVY3SSCT35535WBZGTWLSNAQ5TANCNFSM4TCMBG4A>
.