strengejacke/sjmisc

row_means() proportion of datapoints


row_means() has an argument n which allows us to specify the proportion of values required per row to return a mean. For example, n=.75 in my understanding is supposed to return a mean only if at least 75% of values in that row are non-NA. The following behaviour is therefore contrary to what I expected:

> df<-data.frame(q1=c(1,2),q2=c(2,NA),q3=c(1,1))
> df
  q1 q2 q3
1  1  2  1
2  2 NA  1
> sjmisc::row_means(df,n=.75)
  q1 q2 q3 rowmeans
1  1  2  1 1.333333
2  2 NA  1 1.500000

I had expected the second entry of the rowmeans column to be NA, because only 2 out of 3 values in that row are non-NA, i.e. 67%, which is less than 75%. I realize I might be missing something about the intended behaviour of this function.
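To make the expectation concrete, here is a minimal base-R sketch of the strict reading, where a row gets a mean only if its share of non-NA values reaches the threshold. `strict_row_means` and `min_prop` are hypothetical names used for illustration only; they are not part of sjmisc.

```r
# Hypothetical helper (not part of sjmisc): strict proportion threshold.
strict_row_means <- function(data, min_prop) {
  valid_prop <- rowMeans(!is.na(data))    # share of non-NA values per row
  means <- rowMeans(data, na.rm = TRUE)   # mean of the available values
  means[valid_prop < min_prop] <- NA      # blank out rows below the threshold
  means
}

df <- data.frame(q1 = c(1, 2), q2 = c(2, NA), q3 = c(1, 1))
strict_row_means(df, min_prop = .75)
```

Under this reading the second row is set to NA, since 2/3 is below .75.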

> packageVersion('sjmisc')
[1] ‘2.8.6’

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

> 3 * .75
[1] 2.25

0.75 of 3 columns is 2.25, i.e. approx. 2 columns. So 75% of 3 columns is closer to 2 than to 3 columns.
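The arithmetic behind this can be checked directly. The snippet below only compares two ways of turning the proportion into a column count; that `row_means()` uses `round()` internally is an assumption inferred from this reply, not something verified against the sjmisc source.

```r
n_cols <- 3
threshold <- n_cols * .75           # 2.25

min_nearest <- round(threshold)     # 2: a row with 2 valid values still gets a mean
min_up      <- ceiling(threshold)   # 3: a row would need all 3 values

c(nearest = min_nearest, up = min_up)
```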

Hi strengejacke, thanks for your reply on this issue. I see the logic of rounding, but I think in a scientific context we need to round up (ceiling) instead. When calculating the average of, say, questionnaire responses in psychology, we want a value only if at least X% of the responses are valid. If I understand correctly, the current function does not allow that. I'm wondering whether we could add an argument that lets the user choose between rounding to the nearest integer and rounding up when calculating the required number of valid responses.
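As a rough illustration of what such an option might look like, here is a base-R sketch. The function `row_means_min` and its `rounding` argument are hypothetical; neither exists in sjmisc, and the real implementation may differ.

```r
# Hypothetical sketch (not the sjmisc API): let the caller pick how the
# proportion n is converted into a minimum number of valid values.
row_means_min <- function(data, n, rounding = c("round", "ceiling", "floor")) {
  rounding <- match.arg(rounding)
  min_valid <- switch(rounding,
    round   = round(ncol(data) * n),
    ceiling = ceiling(ncol(data) * n),
    floor   = floor(ncol(data) * n)
  )
  n_valid <- rowSums(!is.na(data))        # valid values per row
  means <- rowMeans(data, na.rm = TRUE)
  means[n_valid < min_valid] <- NA        # too few valid values: no mean
  means
}

df <- data.frame(q1 = c(1, 2), q2 = c(2, NA), q3 = c(1, 1))
row_means_min(df, n = .75, rounding = "round")    # row 2 keeps its mean
row_means_min(df, n = .75, rounding = "ceiling")  # row 2 becomes NA
```

With 3 columns and n = .75, `floor(2.25)` is also 2, so of the three options only `ceiling` enforces "at least 75% valid" strictly.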