Google-Health/genomics-research

Categorical covariates with more than two levels

Luming-L opened this issue · 2 comments

Hi,

My input file contains several categorical covariates with more than two levels. For example, the covariate smoker has three levels: "non-smoker", "past smoker" and "current smoker". When running DeepNull, I got this error:

Cast string to float is not supported

After a long search, I realised that the regression accepts only numeric inputs. It seems that converting strings to numbers is not built into DeepNull yet. Therefore, my questions are:

  • If I want to include categorical covariates with more than two levels, should I recode these categorical covariates before running DeepNull?
  • If yes, which encoding is more appropriate for an unordered categorical variable: one-hot encoding, dummy encoding, or something else?

Thanks!

Hi Luming,

Thank you for your interest in DeepNull.

It is true that DeepNull cannot handle string-valued categorical covariates. The same is true of most GWAS pipeline methods (EMMA, BOLT-LMM, GEMMA, REGENIE, etc.).

I would use the same encoding that you will use for your GWAS analysis. Personally, I prefer dummy encoding.

Thanks,
Farhad
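
For readers unsure what dummy encoding looks like in practice, here is a minimal sketch in plain Python. It is not part of DeepNull; the helper name dummy_encode, the output column names, and the toy values are all hypothetical and chosen only for illustration. A covariate with k levels becomes k - 1 numeric indicator columns, and the omitted level (here "non-smoker") serves as the reference.

```python
def dummy_encode(name, values, levels):
    """Dummy-encode `values`, treating levels[0] as the reference level.

    Hypothetical helper for illustration; not a DeepNull function.
    Returns a dict mapping new column names to lists of 0.0/1.0 floats.
    """
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"unexpected levels: {sorted(unknown)}")
    # One indicator column per non-reference level (k levels -> k - 1 columns).
    return {
        f"{name}_{level.replace(' ', '_')}": [
            1.0 if value == level else 0.0 for value in values
        ]
        for level in levels[1:]
    }


# Toy data (assumed): four samples of the three-level covariate from the question.
smoker = ["non-smoker", "past smoker", "current smoker", "non-smoker"]
levels = ["non-smoker", "past smoker", "current smoker"]

encoded = dummy_encode("smoker", smoker, levels)
for column, indicators in encoded.items():
    print(column, indicators)
```

One-hot encoding would keep all k indicator columns instead. In a linear model with an intercept those k columns sum to one and are collinear with the intercept, which is a common reason to prefer dummy encoding. With pandas, pandas.get_dummies(..., drop_first=True) produces the same kind of k - 1 column output.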

Hi Farhad,

Thanks for the clarification! Dummy encoding also makes sense for my analysis.

Cheers,
Luming