stekhoven/missForest

Changing the order of columns containing a different number of NAs changes the imputation results

Hi,

Thank you for your very useful package.

The following code reproduces the problem described in the title:

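library(missForest)

set.seed(82)

# Introduce 10% missing values completely at random
df <- missForest::prodNA(iris, noNA = 0.1)

# NA counts for the two columns that are swapped later
table(is.na(df$Sepal.Length))
table(is.na(df$Sepal.Width))

# First column order
df <- df[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")]

# Fix the seed right before the imputation
seed <- 1684751298
set.seed(seed = seed)

df_imputed <- missForest(xmis = df)$ximp

mean(df_imputed$Sepal.Length)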

This gives 5.833959.

Then, if we run exactly the same code but first reorder the columns as follows:

df <- df[, c("Sepal.Width", "Sepal.Length", "Petal.Length", "Petal.Width", "Species")]

The resulting mean of Sepal.Length is 5.836454.

The two swapped variables do not contain the same number of NAs (15 vs. 12), and imputation is supposed to proceed in ascending order of the number of NAs per column, so the order in which the variables are imputed should be identical in both cases. Why do the results change?
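To make the ordering argument concrete, here is a minimal base-R sketch. The helper `na_order()` is hypothetical and only mirrors the rule stated above (sort columns by ascending NA count); it is not taken from the missForest source. The toy NA counts are illustrative, apart from the 15 and 12 reported above for Sepal.Length and Sepal.Width.

```r
# Hypothetical helper: column names sorted by ascending NA count.
# Assumption: this mirrors the ordering rule described in this issue;
# it is not copied from the missForest implementation.
na_order <- function(data) {
  na_counts <- colSums(is.na(data))
  names(na_counts)[order(na_counts)]
}

# Toy data with a distinct NA count in every column (illustrative values).
toy <- iris
toy$Sepal.Length[1:15] <- NA
toy$Sepal.Width[1:12]  <- NA
toy$Petal.Length[1:20] <- NA
toy$Petal.Width[1:18]  <- NA
toy$Species[1:10]      <- NA

# The two column layouts compared in this issue.
layout_a <- toy[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")]
layout_b <- toy[, c("Sepal.Width", "Sepal.Length", "Petal.Length", "Petal.Width", "Species")]

order_a <- na_order(layout_a)
order_b <- na_order(layout_b)

# TRUE: with no ties in the NA counts, the imputation order by name
# does not depend on the column layout.
identical(order_a, order_b)
```

If the same check returns TRUE on the `df` produced by `prodNA()` above, the difference in results would have to come from something other than the order in which the variables are imputed.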

Thank you very much for your reply.