tidyverse/forcats

`fct_lump_prop()` has slight inconsistency between documentation and code

kjytay opened this issue · 4 comments

The documentation for fct_lump_prop says that it "lumps levels that appear fewer than prop * n times", where n is the length of the factor variable. However, levels that appear exactly prop * n times are lumped. For example:

x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))
fct_lump_prop(x, prop = 0.3)

In the example above, the level 2 appears exactly 0.3 of the time and hence should not be lumped.

There are two possible solutions. The first is to amend the documentation, but that would result in inconsistent behavior between fct_lump_prop and fct_lump_min. The other solution would be to amend the code for fct_lump_prop, in particular this line:

new_levels <- ifelse(prop_n > prop, levels(f), other_level)

The > should be replaced with >=.

It seems this one is my fault - I'd blame it on being a tidyverse developer day but it's a mistake nonetheless.

It doesn't appear there was a test for the exact match case previously, and the new test implemented reflects the behaviour you note; that levels with prop*n occurrences are lumped, which would imply the documentation is incorrect (and has an extra "in") but that would indeed be slightly inconsistent with fct_lump_min(), though perhaps that is better than a breaking change.

I'm happy to submit a PR resolving this if a maintainer would like to nominate a preferred solution.

Another apparently undocumented feature is that fct_lump_min() does nothing when there is only one lumping factor, something that does make sense to me, but, i feel it is inconsistent with, i.e. fct_other(), which does not complain about. Perhaps a kindly message("factor is unchanged because only one level would have been lumped") ?

There doesn't seem to be much reason to choose between < and <= so leaving the behaviour as is and changing the documentation would be a lower risk fix. @jonocarroll are you still interested in doing a PR?

Yes, I'll put something together.