zhanxw/rvtests

SKAT/CMC: Missing covariates are not imputed, but dropped

katyaorlova opened this issue · 2 comments

Thank you for creating and maintaining this software.

In the wiki, you state that

Note: Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA). They will be automatically imputed to the mean value in the data file.

However, samples with missing covariates are simply dropped from my analysis, per the .log file when running SKAT, CMC, FamCMC, FamSKAT:

[WARN] Total [ 63 ] samples are dropped from VCF file due to missing covariate.

How can I ensure that samples with missing covariates are not dropped?

For reference, here's a simplified version of my code when running FamSKAT + FamCMC:
rvtest --inVcf exons.vcf.gz --pheno phenos.txt --pheno-name dft --freqUpper 0.01 --impute drop --covar cov.txt --covar-name AgeAtExam,Sex,V7,V8,V9,WV,ChipNum,CohortNum,PC1_C12,PC2_C12,PC3_C12 --geneFile refFlat_hg19.txt.gz --burden famcmc --kernel famskat --kinship C1C2.kinship --numThread 3 --out output;
(Note: I tried removing the --impute drop flag, which prevents imputation of missing genotypes, but samples with missing covariates were still dropped.)
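
To see which samples are likely to trigger the warning, the covariate file can be scanned for non-numeric entries before running rvtest. This is only a minimal sketch: `list_missing_covar` is a hypothetical helper, not part of rvtests, and it assumes that cov.txt is whitespace-delimited, has a header row, and carries the sample ID in the second column.

```bash
# Hypothetical helper (not part of rvtests): print "sampleID<TAB>covariate" for
# every requested covariate whose value is not a plain number.
# Assumes a whitespace-delimited file with a header row and sample IDs in column 2.
list_missing_covar() {
  local file="$1" names="$2"
  awk -v names="$names" '
    NR == 1 {
      n = split(names, want, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i        # header name -> column index
      for (j = 1; j <= n; j++)
        if (!(want[j] in pos)) {
          print "unknown column: " want[j] > "/dev/stderr"
          exit 2
        }
      next
    }
    {
      for (j = 1; j <= n; j++) {
        v = $(pos[want[j]])
        # Anything that is not an integer, decimal, or exponent-style number
        # (NA, ".", empty) is reported as missing.
        if (v !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/)
          print $2 "\t" want[j]
      }
    }
  ' "$file"
}
```

For example, `list_missing_covar cov.txt AgeAtExam,Sex,V7 | cut -f1 | sort -u | wc -l` counts the distinct samples with at least one missing value in those columns. That count may differ from the number in the log if samples are also dropped for other reasons, such as being absent from the covariate file.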

Thank you in advance,
Katya

Yes, thank you for the quick reply; I ended up doing just that. I mostly wrote to double-check whether a different issue in my code was causing this, but it sounds like dropping samples with NA covariates is the default behavior.

Here's the code if anyone wants to save time:

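```r
# `cov` is assumed to be a data frame holding the covariate file,
# with missing values already read in as NA.
cols_to_impute <- c("V7", "V8", "V9")

for (col_name in cols_to_impute) {
  col <- cov[, col_name]
  col_mean <- mean(col, na.rm = TRUE)    # column mean over non-missing samples
  cov[is.na(col), col_name] <- col_mean  # fill missing entries with that mean
}
```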
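
The same mean imputation can also be applied directly to the covariate file from the shell, without loading it into R. This is a rough sketch under stated assumptions: `impute_covar_mean` is a hypothetical helper, not an rvtests feature, and it assumes a whitespace-delimited file with a header row in which any non-numeric entry marks a missing value.

```bash
# Hypothetical helper (not part of rvtests): replace non-numeric entries in the
# named columns with the mean of that column's numeric entries.
# The file is read twice: pass 1 accumulates sums, pass 2 writes imputed rows.
# Output rows are re-joined with single spaces.
impute_covar_mean() {
  local file="$1" names="$2"
  awk -v names="$names" '
    function is_num(v) {
      return v ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/
    }
    FNR == 1 {
      if (NR == 1) {                               # header, first pass
        n = split(names, want, ",")
        for (i = 1; i <= NF; i++) pos[$i] = i      # header name -> column index
        for (j = 1; j <= n; j++)
          if (!(want[j] in pos)) {
            print "unknown column: " want[j] > "/dev/stderr"
            exit 2
          }
      } else {                                     # header, second pass
        $1 = $1                                    # normalise separators
        print
      }
      next
    }
    NR == FNR {                                    # first pass: sum numeric values
      for (j = 1; j <= n; j++) {
        v = $(pos[want[j]])
        if (is_num(v)) { sum[j] += v; cnt[j]++ }
      }
      next
    }
    {                                              # second pass: fill in means
      for (j = 1; j <= n; j++) {
        c = pos[want[j]]
        if (!is_num($c) && cnt[j] > 0) $c = sum[j] / cnt[j]
      }
      $1 = $1
      print
    }
  ' "$file" "$file"
}
```

A call such as `impute_covar_mean cov.txt V7,V8,V9 > cov.imputed.txt` writes a new file (the output name here is arbitrary) that can then be passed to `--covar`. As in the R version, mean imputation is probably only sensible for continuous covariates, so categorical columns such as Sex or ChipNum are better handled separately.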

Thanks again,
Katya