descr does not calculate statistics (e.g. min, max) correctly if the column names contain exactly the same postfixes as the statistics function string (e.g. "column_min" or "column_max")
yenchiayi opened this issue · 0 comments
I have a small data.frame with dimension = (2, 3) as follows:
column0 | column1 | column2 |
---|---|---|
1 | 11 | 21 |
2 | 12 | 22 |
The descr function calculates everything correctly if I set column names as c("x", "x_1", "x_2"):
library(magrittr)  # provides the %>% pipe used below

df <- data.frame(
  x = 1:2,
  x_1 = 11:12,
  x_2 = 21:22
)
df %>%
  summarytools::descr(stats = c("min", "max", "n.valid", "skewness", "kurtosis"))
| | x | x_1 | x_2 |
---|---|---|---|
Min | 1.00 | 11.00 | 21.00 |
Max | 2.00 | 12.00 | 22.00 |
N.Valid | 2.00 | 2.00 | 2.00 |
Skewness | 0.00 | 0.00 | 0.00 |
Kurtosis | -2.75 | -2.75 | -2.75 |
However, if I set the column names to c("x", "x_min", "x_max"), descr computes the minimum and maximum incorrectly, along with the other statistics ("n.valid", "skewness", and "kurtosis").
df <- data.frame(
  x = 1:2,
  x_min = 11:12,
  x_max = 21:22
)
df %>%
  summarytools::descr(stats = c("min", "max", "n.valid", "skewness", "kurtosis"))
As the output below shows, the Min of the second column (x_max) is larger than its Max. N.Valid, Skewness, and Kurtosis are also wrong for the columns x_max and x_min.
| | x | x_max | x_min |
---|---|---|---|
Min | 1.00 | 21 | 1 |
Max | 2.00 | 2 | 1 |
N.Valid | 2.00 | 1 | 1 |
Skewness | 0.00 | NA | NA |
Kurtosis | -2.75 | NA | NA |
My preliminary guess is that the program fails to distinguish a column-name suffix (e.g. the "_min" in x_min) from the statistic name (e.g. min). The issue seems to arise around lines 367-373 of descr.R. Could you check that part and see what happens?
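To illustrate the kind of ambiguity I suspect, here is a minimal base-R sketch. It is not the code from descr.R; it only assumes that summary values are labelled "<column>_<stat>" and later split back into column and statistic. The names keys, naive_column, and safe_column are made up for this example.

```r
# Hypothetical illustration only; this is NOT taken from descr.R.
df <- data.frame(x = 1:2, x_min = 11:12, x_max = 21:22)
stats <- c("min", "max")

# Label every summary value as "<column>_<stat>".
keys <- as.vector(outer(names(df), stats, paste, sep = "_"))

# Naive parsing: cut the label at the first "_<stat>" that appears anywhere.
stat_pattern <- paste0("_(", paste(stats, collapse = "|"), ")")
naive_column <- sub(paste0(stat_pattern, ".*$"), "", keys)

# More careful parsing: only strip what follows the LAST underscore.
safe_column <- sub("_[^_]*$", "", keys)

print(data.frame(key = keys, naive_column, safe_column))
```

With the naive rule, every label is attributed to column x, so the values of x_min and x_max get mixed into the wrong cells. Splitting on the last underscore keeps the three columns apart here, although it would still break for a statistic whose own name contains an underscore.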
Thanks!