tshmak/lassosum

[enhancement] Can you please provide more documentation for lassosum.pipeline

wavefancy opened this issue · 7 comments

Could you please provide more documentation for lassosum.pipeline, to help with understanding the results?

Thanks much!
Wallace

Sorry for the late reply. Because of a lack of time, many of the functions in lassosum are still undocumented. Are there any specific features you would like to see better documented?

Hi,
I think the contents of the results file need to be documented (for me this is required).
Thanks,
Deepak.

(...). Are there any specific features you would like to see better documented?

I am having problems understanding the functions, especially since other approaches (plink, LDpred and PRSice) use some of the same names with different meanings, for example "reference" and "test" files. In PRSice, "reference" refers to the LD panel, but here "reference" just means the training dataset, right?

Another question I have is: is relatedness in your test dataset a problem? In PRSice only founders are included in logistic regression (for binary phenotype), hence reduced sample size.

And another one: how are the phenotypes encoded? 0/1 or 1/2? Are missing phenotypes allowed? Are they encoded by -9?

Also, where can I check the strength of association between the PGS and the phenotype? Is there some measure such as Nagelkerke's R squared? And does a difference between the phenotype prevalence in the general population and in the provided dataset matter? I.e. I am asking about ascertainment error.

Thanks for raising these issues. "Reference" stands for reference panel, and you can call it the LD panel if you like since yes, the LD is the main thing we're taking from the panel. It doesn't have to be the training dataset, which isn't available in general anyway.

The issue of relatedness is a complicated one. As far as I know, the only substantial reference to this problem is Wray et al. (2013, "Pitfalls of predicting complex traits from SNPs", Nat Rev Genet), and even there the discussion is brief and no theoretical or empirical results are presented. Relatedness generally inflates estimates of the PGS R2 compared with the PGS R2 in an otherwise unrelated population. Therefore, whether this is a problem depends on exactly what your target population is, and also on whether the prediction R2 is important to you. However, to avoid criticism from others, you may want to exclude related individuals if there are not too many of them. lassosum doesn't have a function for this, but you can use the keep.test or remove.test option to keep/remove individuals whom you have filtered in/out in, say, plink.

lassosum doesn't use the plink convention of 1/2/-9 for case/control/missing. Using 1/2 instead of 0/1 should not affect results as the phenotype is only used for validation using correlation. However, -9 can mess things up. Use NA for missing as in R. I just put in a warning for this in the documentation.
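
To illustrate the point about coding, here is a minimal sketch in plain Python (used only for illustration; it does not call lassosum). The helpers `pearson` and `recode_plink_pheno` and the toy data are hypothetical. It shows that a correlation is unchanged by 1/2 versus 0/1 coding, while a -9 left in as a number distorts it.

```python
import math

def pearson(x, y):
    """Pearson correlation over pairs where neither value is None."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

def recode_plink_pheno(values, missing=-9):
    """Map plink-style 1/2/-9 (control/case/missing) to 0/1/None."""
    return [None if v == missing else v - 1 for v in values]

# Toy data (made up): one score per individual, plink-coded phenotype.
pgs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
pheno_plink = [1, 1, 2, 2, -9, 2]

pheno_01 = recode_plink_pheno(pheno_plink)

# 0/1 coding with the missing value dropped.
r_01 = pearson(pgs, pheno_01)
# 1/2 coding with the missing value dropped: same correlation.
r_12 = pearson(pgs, [None if v == -9 else v for v in pheno_plink])
# -9 treated as a real phenotype value: a different, misleading correlation.
r_raw = pearson(pgs, pheno_plink)

print(r_01, r_12, r_raw)
```

In R the equivalent of `None` is `NA`, so the recoding step amounts to replacing -9 with `NA` before passing the phenotype in.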

By default lassosum uses the correlation as a measure of fit, and the best correlation achieved is given in the best.validation.result object within the list returned by validate(). You can change the validation function from correlation to something else by specifying the validate.function option. However, this is somewhat advanced and you need to make sure your function doesn't return NA if missing values are encountered. Otherwise, you can simply calculate Nagelkerke R2 (in R) using your phenotype and the best.pgs that is returned.
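
For reference, the Nagelkerke R2 calculation mentioned above can be sketched as follows. This is an illustrative, self-contained Python version rather than anything provided by lassosum: the function names and toy data are hypothetical, the phenotype is assumed to be coded 0/1 with no missing values, and the score vector stands in for best.pgs. The logistic model is fitted with a simple Newton-Raphson loop, which assumes the cases and controls are not perfectly separated by the score.

```python
import math

def fit_logistic(x, y, max_iter=100, tol=1e-10):
    """Fit y ~ intercept + slope * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def log_lik(x, y, b0, b1):
    """Bernoulli log-likelihood of the logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def nagelkerke_r2(x, y):
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    n = len(y)
    prev = sum(y) / n
    ll0 = log_lik(x, y, math.log(prev / (1.0 - prev)), 0.0)  # intercept only
    b0, b1 = fit_logistic(x, y)
    ll1 = log_lik(x, y, b0, b1)
    cox_snell = 1.0 - math.exp(2.0 * (ll0 - ll1) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll0 / n)
    return cox_snell / max_cox_snell

# Toy data (made up): scores and 0/1 phenotypes for ten individuals.
pgs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.6, 0.3, 0.7]
pheno = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]

r2 = nagelkerke_r2(pgs, pheno)
print(r2)
```

In R itself the two log-likelihoods would normally come from fitting a null and a full logistic model with `glm(..., family = binomial)` instead of the hand-written fitting loop.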