Tirgit/missCompare

NAs in metadata$Corr_matrix

hopkinsjj9 opened this issue · 8 comments

Thank you for putting together a great package!

I'm getting "infinite or missing values in 'x'" errors when I try to run the following data through the process:
https://www.kaggle.com/pradeeptripathi/predicting-house-prices-using-r/data

library(dplyr)  # provides %>% and mutate_if()

train <- data.frame(readr::read_csv('../data/train.csv'))
str(train)
train <- train %>% mutate_if(is.character, as.factor)
str(train)

cleaned <- missCompare::clean(train,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)

# run clean() a second time, just to make sure
cleaned <- missCompare::clean(cleaned,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)

metadata <- missCompare::get_data(cleaned,
                                  matrixplot_sort = TRUE,
                                  plot_transform = TRUE)
Warning message:
In stats::cor(X, use = "pairwise.complete.obs", method = "pearson") :
the standard deviation is zero

simulated <- missCompare::simulate(rownum = metadata$Rows,
                                   colnum = metadata$Columns,
                                   cormat = metadata$Corr_matrix,
                                   meanval = 0,
                                   sdval = 1)
Error in eigen(if (doDykstra) R else Y, symmetric = TRUE) :
infinite or missing values in 'x'

I found two NAs in metadata$Corr_matrix, both for the Utilities/LotFrontage pair.
Not knowing exactly how to handle this, I just set them to zero (a hack):

colnames(metadata$Corr_matrix)[colSums(is.na(metadata$Corr_matrix)) > 0]
metadata$Corr_matrix[is.na(metadata$Corr_matrix)] <- 0

I can now restart at the simulate step.
But there's got to be a better way.
Shouldn't clean or get_data take care of this somehow?

Thanks again
Jack Hopkins

Hi - I will look into this issue early next week. Indeed it sounds like this is a bug and this should be handled inside one of the functions. Thanks for the heads up!

Hi - I checked your problem. When the correlation matrix is calculated, the pair of features Utilities and LotFrontage produces NAs. The reason is that Utilities has almost no variance in this sample (of the 1460 observations, Utilities takes the value 1 in 1459 instances and the value 2 in only 1 instance). I don't have a quick fix for you in terms of missCompare, but you can solve this for now by removing the Utilities column from the data. This is a cleaning step that should be done before the get_data() step, of course.
Perhaps in the next version I can include an option for such cases in the clean() function.
Good luck with your analysis!
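
The NA can be reproduced on toy data with base R alone. The sketch below uses made-up values, not the Kaggle data, and it assumes that the single row where Utilities equals 2 is also a row where LotFrontage is missing (this is not verified anywhere in this thread). Under that assumption, use = "pairwise.complete.obs" drops that row for this pair, Utilities becomes constant, and cor() returns NA with the warning shown above. The last lines apply the suggested workaround of dropping the column before the correlation step.

```r
# Toy data (hypothetical values): 'utilities' is 1 everywhere except one row,
# and that row has a missing 'lot_frontage'.
toy <- data.frame(
  utilities    = c(1, 1, 1, 1, 1, 2),
  lot_frontage = c(60, 80, 75, 65, 90, NA),
  lot_area     = c(8450, 9600, 11250, 9550, 14260, 10000)
)

# Same call that appears in the warning above. suppressWarnings() hides the
# "the standard deviation is zero" warning.
cm <- suppressWarnings(
  stats::cor(toy, use = "pairwise.complete.obs", method = "pearson")
)

# Positions of the NA entries, translated to variable names.
na_pos  <- which(is.na(cm), arr.ind = TRUE)
na_vars <- unique(colnames(cm)[na_pos[, 2]])
print(na_vars)

# Workaround: drop the near-constant column before computing correlations.
toy_reduced <- toy[, setdiff(names(toy), "utilities"), drop = FALSE]
cm_reduced  <- stats::cor(toy_reduced, use = "pairwise.complete.obs")
print(anyNA(cm_reduced))
```

On real data, the same which(is.na(...), arr.ind = TRUE) call on metadata$Corr_matrix shows which pairs are affected, which is more informative than overwriting the NAs with zero.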

Hi Jack - could you clarify the statement "post_imp_diag performs a T-test which will break if a column only contains 1 NA y variable." and include an example? Does the problem occur when there is only 1 NA among the values of a variable? I'm having trouble understanding the "1 NA y variable" part.
Thanks,
Tibor

Hello Tirgit,

I have a question. I am trying to run "impute_simulated", but I don't want to run all 16 MI methods; I want to choose only some of them. Can I do that?

Thanks,

Ahmad

Hi Ahmad,

This is currently not possible; you have to run all 16 methods when you use this function. The next version of the package will make this an available option.

For now, though, you can do this using impute_data().

Best,
Tibor

Thank you for making this package! I have data (N ~ 13000) with heavy, monotone, MNAR missingness (in Gender (~10%) and Ethnicity (~80%)). I converted all chr features to fct and created the cleaned and metadata objects, and everything worked fine. But when I tried to create the simulated object, I got the error described in the header. I'm a little confused, because other users here attribute that error to having NAs in their data, but I thought that the 'simulated' step removes the NA values for you and basically normalizes your initial dataset. Have I misunderstood?
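
Going by the first post in this thread, the eigen() error went away once the NAs in metadata$Corr_matrix were replaced, so the correlation matrix (rather than the data itself) is worth checking first. Below is a minimal sketch of such a check using base R only; check_cormat is a hypothetical helper, not part of missCompare, and the 3 x 3 matrix is made up for illustration.

```r
# Hypothetical helper (not part of missCompare): list the variable pairs whose
# correlation is NA, NaN or infinite. Only the upper triangle is scanned so
# each pair is reported once.
check_cormat <- function(cormat) {
  bad <- which(!is.finite(cormat) & upper.tri(cormat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cormat)[bad[, 1]],
    var2 = colnames(cormat)[bad[, 2]],
    stringsAsFactors = FALSE
  )
}

# Made-up 3 x 3 correlation matrix with one undefined pair.
m <- diag(3)
dimnames(m) <- list(c("a", "b", "c"), c("a", "b", "c"))
m["a", "b"] <- m["b", "a"] <- NA

problems <- check_cormat(m)
print(problems)
```

Running check_cormat(metadata$Corr_matrix) before simulate() would show whether any pair is undefined; an empty result suggests the cause lies elsewhere.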