NAs in metadata$Corr_matrix

Question

hopkinsjj9 opened this issue 5 years ago · 8 comments

Thank you for putting together a great package!

I'm getting "infinite or missing values in 'x'" errors when I try to send the following data through the process:
https://www.kaggle.com/pradeeptripathi/predicting-house-prices-using-r/data

library(dplyr)  # provides %>% and mutate_if()
train <- data.frame(readr::read_csv('../data/train.csv'))
str(train)
train <- train %>% mutate_if(is.character,as.factor)
str(train)

cleaned <- missCompare::clean(train,
var_removal_threshold = 0.5,
ind_removal_threshold = 0.8,
missingness_coding = -9)

# run clean() a second time to make sure
cleaned <- missCompare::clean(cleaned,
var_removal_threshold = 0.5,
ind_removal_threshold = 0.8,
missingness_coding = -9)

metadata <- missCompare::get_data(cleaned,
matrixplot_sort = TRUE,
plot_transform = TRUE)
Warning message:
In stats::cor(X, use = "pairwise.complete.obs", method = "pearson") :
the standard deviation is zero

simulated <- missCompare::simulate(rownum = metadata$Rows,
colnum = metadata$Columns,
cormat = metadata$Corr_matrix,
meanval = 0,
sdval = 1)
Error in eigen(if (doDykstra) R else Y, symmetric = TRUE) :
infinite or missing values in 'x'

I found two NAs in metadata$Corr_matrix, at Utilities/LotFrontage.
Not knowing exactly how to handle this, I just set them to zero (a hack):

colnames(metadata$Corr_matrix)[colSums(is.na(metadata$Corr_matrix)) > 0]
metadata$Corr_matrix[is.na(metadata$Corr_matrix)] <- 0

I can now restart at the simulate step, but there's got to be a better way.
Shouldn't clean() or get_data() take care of this somehow?

Thanks again
Jack Hopkins
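
For readers hitting the same error: one plausible mechanism (an assumption, not confirmed in this thread) is that with use = "pairwise.complete.obs" a column can be constant over the rows it shares with another column even though it varies overall, which leaves that pair with an NA correlation. A minimal base R sketch on made-up data (the column names u, f and g are hypothetical) that reproduces this and lists the affected pairs instead of overwriting them with zero:

```r
# Toy data: `u` only varies in a row where `f` is missing
df <- data.frame(
  u = c(1, 1, 1, 1, 2),
  f = c(10, 12, NA, 15, NA),
  g = c(3, 1, 4, 1, 5)
)

# Same call that appears in the warning above; the zero-sd warning is expected here
cm <- suppressWarnings(
  stats::cor(df, use = "pairwise.complete.obs", method = "pearson")
)

# Report each NA pair once (upper triangle only) rather than replacing it
na_pairs <- which(is.na(cm) & upper.tri(cm), arr.ind = TRUE)
data.frame(
  var1 = rownames(cm)[na_pairs[, "row"]],
  var2 = colnames(cm)[na_pairs[, "col"]]
)
```

Listing the pairs shows which columns to inspect or drop before calling get_data(), which may be safer than setting the NA entries to zero.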

Answer 1 · 2019-09-20T15:49:20.000Z

Hi - I will look into this issue early next week. Indeed it sounds like this is a bug and this should be handled inside one of the functions. Thanks for the heads up!

Answer 2 · 2019-09-25T11:10:50.000Z

Hi - I checked your problem. When calculating the correlation matrix, two features (Utilities and LotFrontage) produce NAs. The reason is that the feature Utilities has very small variance in this sample (of the 1460 observations, Utilities takes the value 1 in 1459 instances and the value 2 in only 1). I don't have a quick fix for you within missCompare, but you can solve this for now by removing the Utilities column from the data. This cleaning step should of course be done before the get_data() step.
Perhaps in the next version I can include some command for such cases in the clean() function.
Good luck with your analysis!
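
The workaround above could be sketched in base R roughly as follows. drop_near_constant() and its max_share threshold are hypothetical names introduced here for illustration, not part of missCompare, and the toy data only mimics the 1459-to-1 split described above:

```r
# Hypothetical helper: drop columns whose most frequent non-missing value
# accounts for more than `max_share` of the non-missing observations.
drop_near_constant <- function(df, max_share = 0.99) {
  top_share <- vapply(df, function(col) {
    counts <- table(col, useNA = "no")
    if (sum(counts) == 0) return(1)  # all-NA column: treat as constant
    max(counts) / sum(counts)
  }, numeric(1))
  df[, top_share <= max_share, drop = FALSE]
}

set.seed(1)
toy <- data.frame(
  Utilities = c(rep(1, 1459), 2),                 # 1459 ones and a single two
  LotArea   = rnorm(1460, mean = 10000, sd = 2000)
)

kept <- drop_near_constant(toy)
names(kept)
```

The 0.99 cutoff is arbitrary. A column that passes it can still be constant on the rows it shares with a partly missing column, so checking the correlation matrix for NAs afterwards remains worthwhile.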

Answer 3 · 2019-09-25T12:53:33.000Z

Thank you for looking into this. I started using another dataset, which got past this problem, only to run into another one. post_imp_diag performs a T-test which will break if a column only contains 1 NA y variable. My solution (oh no!) was to just use the median in those cases. It allowed me to check out the diagrams coming out of post_imp_diag with minimal impact (I hope). I appreciate your feedback and hope to see more great packages.
Jack Hopkins

Answer 4 · 2019-10-01T08:38:50.000Z

Hi Jack - could you clarify the statement "post_imp_diag performs a T-test which will break if a column only contains 1 NA y variable." and include an example? Does the problem occur when there is only 1 NA amongst the values of a variable? I'm having trouble understanding the "1 NA y variable" part.
Thanks,
Tibor
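
One possible reading of the report (an assumption; the internals of post_imp_diag are not shown in this thread) is that a two-sample t-test receives a group with a single value, which stats::t.test rejects under its default Welch settings. A minimal base R sketch with made-up numbers, plus a hypothetical guard named safe_t_test():

```r
observed <- c(5.1, 4.8, 5.6, 5.0, 4.9)
imputed  <- c(5.2)  # only one value was missing, so only one imputed value

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    stats::t.test(observed, imputed)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg

# Hypothetical guard: skip the test when either group has fewer than
# two non-missing values, returning NULL instead of raising an error
safe_t_test <- function(x, y) {
  if (sum(!is.na(x)) < 2 || sum(!is.na(y)) < 2) return(NULL)
  stats::t.test(x, y)
}

safe_t_test(observed, imputed)
```

Whether this matches the failure Jack saw is not established here; a reproducible example from the original data would settle it.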

Answer 5 · 2019-12-03T23:59:00.000Z

Hello Tirgit,

I have a question: I am trying to use impute_simulated, but I don't want to run all 16 MI methods. I want to choose only some of them. Can I do that?

Thanks,

Ahmad

Answer 6 · 2019-12-10T15:01:06.000Z

Hi Ahmad,

This is currently not possible; you have to run all 16 methods when you use this function. The next version of the package will make this an available option.

For now, though, you can do this using impute_data().

Best,
Tibor

Answer 7 · 2019-12-10T15:04:39.000Z

Thanks for your reply, then I will wait for the next version :)
Kind regards,
Ahmed R. Al-Saber

Answer 8 · 2021-07-19T22:51:58.000Z

Thank you for making this package! I have data (N ~ 13000) that is highly missing, monotone, and MNAR (for Gender (~10%) and Ethnicity (~80%)). I converted all chr features to fct, created the cleaned and metadata objects, and everything worked fine, but when I tried to create the simulated object, I got the error written in the header. I'm a little confused because other users here attribute that error to having NAs in their data, but I thought that the 'simulated' step removes the NA values for you and basically normalizes your initial dataset. Have I misunderstood?