sfirke/janitor

Feature request: allow users to opt-in for duplicate names in `make_clean_names()`

JasonAizkalns opened this issue ยท 12 comments

Feature requests

There are times when users may wish to opt-in to allow duplicate names when using make_clean_names(). One approach would be having an additional argument: make_clean_names(..., allow_dupes = FALSE, ...). Here is some support.

I believe a solution would be bypassing the final while loop in the make_clean_names() source (which was introduced in #358):

...
cased_names <- snakecase::to_any_case(made_names, case = case, 
    sep_in = sep_in, transliterations = transliterations, 
    parsing_option = parsing_option, numerals = numerals, 
    ...)
while (any(duplicated(cased_names))) {
    dupe_count <- vapply(seq_along(cased_names), function(i) {
        sum(cased_names[i] == cased_names[1:i])
    }, 1L)
    cased_names[dupe_count > 1] <- paste(cased_names[dupe_count > 
        1], dupe_count[dupe_count > 1], sep = "_")
}
cased_names

So effectively:

...
cased_names <- snakecase::to_any_case(made_names, case = case, 
    sep_in = sep_in, transliterations = transliterations, 
    parsing_option = parsing_option, numerals = numerals, 
    ...)

if (!allow_dupes) {
  while (any(duplicated(cased_names))) {
    dupe_count <- vapply(seq_along(cased_names), function(i) {
      sum(cased_names[i] == cased_names[1:i])
    }, 1L)
    cased_names[dupe_count > 1] <- paste(cased_names[dupe_count > 
                                                       1], dupe_count[dupe_count > 1], sep = "_")
  }
}
cased_names

That's a nicely done feature request. thank you. I think I'm sold. My reaction at first was, can't someone use snakecake::to_any_case ? But looking at the code for make_clean_names() it does a lot of other things first that someone might want.

It's easy to implement, as you show, and there are already so many arguments to that function that what's another one at the end ๐Ÿ˜†

Would you want to write a pull request that implements this new allow_dupes argument? I could answer questions & give support.

Thank you.

My initial reaction was similar -- I actually implemented my workaround with snakecase::to_any_case but to your point, it's not exactly the same and there are many things that clean_names()/make_clean_names() do that I prefer and I believe this makes it a more flexible and complete utility function.

I'll take a crack at a pull request sometime soon. Appreciate the help and support -- might take a few iterations to adhere to standards/docs, so send it back accordingly.

I agree that it makes sense as an option. I agree with @sfirke that while make_clean_names() has a lot of arguments, a few more aren't a problem. The main reason is that make_clean_names() is more of a detail-focused function, so having lots of controls makes sense. (The opposite is true for clean_names() which is more general use, so fewer direct arguments are preferred.)

@JasonAizkalns Thanks in advance for the PR!

Some things to look out for: document the new variable, describe the change & give yourself credit in NEWS.md, and add a new test here: https://github.com/sfirke/janitor/blob/main/tests/testthat/test-clean-names.R#L190 Thank you for working on this!

@sfirke, Based on the comment from @JasonAizkalns in #497, what would you think about the (minor) breaking change of using the unique_sep to do the de-duplication of names which would accomplish this goal and remove some code from janitor?

The cost is that it's a breaking change where the numbering of columns would be one lower (what is currently a_2 would become a_1).

@JasonAizkalns, thanks for proposing what feels like a great solution! @sfirke, I do think that it's the better solution.

@JasonAizkalns, can you please revise the PR code to use the unique_sep argument with make_clean_names() and remove the while loop? Some of the deduplication tests will also need to be revised, and please document it as a breaking change in the NEWS file.

FYI we had some discussion about unique_sep vs the while-loop a couple of years ago, I am re-linking here in case any of those points feel salient: #251 (comment)

Ah, @billdenney can you tell me more about your thoughts here: #251 (comment) ?

I don't remember the exact thoughts that I had. Based on what I wrote, I'm guessing that I was concerned that duplicate names may still slip through, but I don't know why I had that concern.

I just did a quick check, and I don't think using it would be a problem. I don't know if I checked it in the past and there was an issue or if it was concern about a possible bug if we didn't do a more extensive duplicate check.

So, I think let's go for it. I think that the tests should capture any issues, and if there are other issues that pop up, we can handle them either by reinstating a similar while loop or pushing the fix upstream.

Thanks Bill and Jason! I will review this week ๐Ÿ–๏ธ