Japal/zCompositions

Unexpected Warning and Error Using lrEM Function with High z.warning Threshold and z.delete Set to FALSE


Hi! First of all, thank you for creating this package!

I'm encountering an issue with the lrEM function from the zCompositions package when handling a dataset containing a significant number of zeros. The warnings suggest that columns and rows with more than 80% zeros/unobserved values are being deleted, even though I have explicitly set z.warning to 0.992 and z.delete to FALSE. Additionally, the process fails with an error about undefined columns being selected.

Function Call and Warning Messages:

Here is the function call I used:

lrEM(df, 
     label = 0, 
     dl = rep(10, ncol(df)), 
     rob = TRUE, 
     ini.cov = "multRepl", 
     z.warning = 0.992, 
     z.delete = FALSE,
     closure = 1440)

And these are the warning messages received:

Warning: Column no. 4 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 5 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 8 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 10 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Column no. 11 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Warning: Row no. 513 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 1482 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 1503 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 2072 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 2169 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Row no. 2515 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
Error in `[.data.frame`(X.mr, , obs[[npat]]) : undefined columns selected

Expected Behavior:

With z.delete explicitly set to FALSE and a z.warning threshold of 0.992, my expectation was that no columns or rows would be deleted based on the proportion of zeros/unobserved values, and that I would not receive warnings indicating otherwise.

Observed Behavior:

  • Warning messages were received indicating the deletion of columns and rows with more than 80% zeros/unobserved values, contrary to the z.delete = FALSE setting.
  • An error occurred related to undefined columns selected, which might be a result of these unexpected deletions.

Additional Context:

  • My dataset includes a considerable amount of zero values, and retaining columns/rows with high proportions of zeros is crucial for my analysis.
  • I'm concerned that the deletion of these columns/rows could impact the integrity and outcome of my analysis.

I would greatly appreciate any guidance on why these deletions and warnings are occurring despite the z.delete setting, as well as any advice on resolving the issue or if there's a potential bug in the function handling.

Thank you for your time and assistance! :)

Thanks @ken1th, can you please confirm that you are using the latest version on CRAN? I very recently updated it to fix an issue related to those arguments.

Moreover, a reproducible example would help if possible.

Hi @Japal, thank you so much for your prompt reply! Yes, I'm using the newest version:

> packageVersion("zCompositions")
[1] ‘1.5.0.2’

A synthetic dataset is also attached.

Edited:
I ran into some issues when trying to attach the file in the comment, so I have uploaded it to https://github.com/ken1th/lrEM_issue/blob/main/synthetic_for_lrEM_issue.csv

Thanks @ken1th, and thanks again for spotting this. The issue was related to the internal call to multRepl() triggered by your choice of arguments. The z.warning/z.delete settings given at the top level were not being passed on to multRepl() internally, so the messages were actually produced by that function when it found zeros above its default threshold.
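
For illustration, the fix roughly amounts to forwarding those arguments in lrEM()'s internal call, along these lines (simplified; X, label and dl here stand for the function's internal objects, and the exact code differs):

# Simplified sketch: pass the user-supplied z.warning/z.delete down to the
# internal multRepl() call instead of leaving multRepl()'s defaults in place.
X.mr <- multRepl(X, label = label, dl = dl,
                 z.warning = z.warning,  # previously not forwarded
                 z.delete = z.delete)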

It is sorted now and the revised version 1.5.0-3 is already available here on GitHub for you to install; it has also been sent to CRAN, where it should be available in a few days.

Moreover, a couple of comments:

  1. You do not need to set closure = 1440 as your data are already closed (you should remove that argument from the call above; see the sketch after this list).
  2. You might be forcing multivariate imputation too much here. Several of your columns are nearly all zeros, and you have no complete samples to inform the parameters of the imputation model. I imagine your choice of settings was mainly aimed at getting the function to run, but you should consider whether it really makes sense to apply imputation, or multivariate imputation in particular, to this data set. For example, you will see that after the fix (and after removing closure = 1440 as noted) the function works, but I found it did not converge to a solution after running for several minutes (I stopped the process in R, and I would not be surprised if the maximum number of iterations is reached after a while or a computation error is thrown). If you really want this data set imputed, one of the univariate alternatives in zCompositions may be a more convenient option (see the sketch after this list), even if that is still debatable given the very large number of zeros in some variables. A meaningful, expert-driven pre-selection of variables, working with a reduced data set that retains as much relevant information as possible, would also be worth considering.
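
For reference, the corrected multivariate call (without closure) and one possible univariate alternative would look roughly like this (the dl and z.warning values are taken from your original call; multRepl() is shown purely as an example of a column-by-column method):

# Multivariate imputation, corrected call (closure = 1440 removed):
imp_lrEM <- lrEM(df, label = 0, dl = rep(10, ncol(df)),
                 rob = TRUE, ini.cov = "multRepl",
                 z.warning = 0.992, z.delete = FALSE)

# A univariate (column-by-column) alternative: simple multiplicative
# replacement of the zeros below the detection limits.
imp_multRepl <- multRepl(df, label = 0, dl = rep(10, ncol(df)),
                         z.warning = 0.992, z.delete = FALSE)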

@Japal Thank you so much for investigating and fixing this so fast!!!! That's impressive! And thank you for the advice, I will have a discussion with my teammates.

@Japal I ran into the following error after running the function on the synthetic dataset. Is it because of the convergence issue you mentioned in your previous reply?

Error in lm.wfit(x, y, w, method = "qr") : incompatible dimensions
In addition: Warning message:
In multRepl(X.old[misspat == npat, , drop = FALSE], label = NA,  :
  Column no. 2 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 3 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 4 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 5 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 6 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 7 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 8 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delete. Check out with zPatterns()).
Column no. 9 containing >99.2% zeros/unobserved values found (see arguments z.warning and z.delet [... truncated]

@ken1th, probably so. As I said, given those columns containing so many zeros in your data set, the operations based on conditional expectations are probably running into numerical issues and making the routine fail; apparently a zero matrix is produced at some point. If you insist on getting those imputed, one of the column-by-column procedures would probably survive technically.
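
As the warnings also suggest, it is worth inspecting the zero patterns first before deciding on an imputation strategy, for example:

# Summarise the patterns of zeros per variable and per sample; this helps
# to judge whether (multivariate) imputation is sensible for this data set.
zPatterns(df, label = 0)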

Thanks for your reply, @Japal ! I guess we may just group some variables together to reduce the proportion of zeros.

Yes, a meaningful amalgamation of variables may be a way out and I would recommend you consider this.
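
As a purely illustrative sketch (the column names and groupings below are hypothetical and should be chosen using domain knowledge):

# Amalgamate related parts by summing them before imputation, so that the
# combined variables contain far fewer zeros than the originals.
df_amalg <- data.frame(
  group_A = rowSums(df[, c("var1", "var2")]),          # hypothetical grouping
  group_B = rowSums(df[, c("var3", "var4", "var5")]),  # hypothetical grouping
  df[, setdiff(names(df), c("var1", "var2", "var3", "var4", "var5"))]
)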