Feature request: allow users to opt-in for duplicate names in `make_clean_names()`

Question

JasonAizkalns opened this issue 2 years ago · 12 comments

Feature requests

There are times when users may wish to opt in to allowing duplicate names when using make_clean_names(). One approach would be to add an argument: make_clean_names(..., allow_dupes = FALSE, ...). Here is some support.

I believe a solution would be to bypass the final while loop in the make_clean_names() source (which was introduced in #358):

...
cased_names <- snakecase::to_any_case(made_names, case = case, 
    sep_in = sep_in, transliterations = transliterations, 
    parsing_option = parsing_option, numerals = numerals, 
    ...)
while (any(duplicated(cased_names))) {
    dupe_count <- vapply(seq_along(cased_names), function(i) {
        sum(cased_names[i] == cased_names[1:i])
    }, 1L)
    cased_names[dupe_count > 1] <- paste(cased_names[dupe_count > 1],
        dupe_count[dupe_count > 1], sep = "_")
}
cased_names

So effectively:

...
cased_names <- snakecase::to_any_case(made_names, case = case, 
    sep_in = sep_in, transliterations = transliterations, 
    parsing_option = parsing_option, numerals = numerals, 
    ...)

if (!allow_dupes) {
  while (any(duplicated(cased_names))) {
    dupe_count <- vapply(seq_along(cased_names), function(i) {
      sum(cased_names[i] == cased_names[1:i])
    }, 1L)
    cased_names[dupe_count > 1] <- paste(cased_names[dupe_count > 1],
                                         dupe_count[dupe_count > 1], sep = "_")
  }
}
cased_names
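
To see the effect of the proposed flag in isolation, here is a minimal base R sketch. The helper name dedupe_names() is hypothetical and not part of janitor; it wraps only the de-duplication step quoted above and assumes the names have already been cased.

```r
# Hypothetical helper, NOT part of janitor: wraps the de-duplication step
# from the issue so the allow_dupes toggle can be tried on its own.
dedupe_names <- function(cased_names, allow_dupes = FALSE) {
  if (!allow_dupes) {
    while (any(duplicated(cased_names))) {
      # For each position, count how often that name has appeared so far.
      dupe_count <- vapply(seq_along(cased_names), function(i) {
        sum(cased_names[i] == cased_names[seq_len(i)])
      }, 1L)
      is_repeat <- dupe_count > 1
      # Append the running count to every repeated occurrence.
      cased_names[is_repeat] <- paste(cased_names[is_repeat],
                                      dupe_count[is_repeat], sep = "_")
    }
  }
  cased_names
}

input_names <- c("id", "value", "id", "id")
default_names <- dedupe_names(input_names)
duped_names <- dedupe_names(input_names, allow_dupes = TRUE)

print(default_names)  # "id" "value" "id_2" "id_3"
print(duped_names)    # "id" "value" "id" "id"
```

With allow_dupes = TRUE the loop is skipped entirely, so repeated names pass through unchanged.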

Answer 1 · 2022-12-01T16:46:30.000Z

That's a nicely done feature request. Thank you. I think I'm sold. My reaction at first was, can't someone use snakecase::to_any_case? But looking at the code for make_clean_names(), it does a lot of other things first that someone might want.

It's easy to implement, as you show, and there are already so many arguments to that function that what's another one at the end 😆

Would you want to write a pull request that implements this new allow_dupes argument? I could answer questions & give support.

Answer 2 · 2022-12-01T17:09:57.000Z

Thank you.

My initial reaction was similar. I actually implemented my workaround with snakecase::to_any_case, but to your point, it's not exactly the same: clean_names()/make_clean_names() do many things that I prefer, and I believe this change makes make_clean_names() a more flexible and complete utility function.

I'll take a crack at a pull request sometime soon. Appreciate the help and support -- might take a few iterations to adhere to standards/docs, so send it back accordingly.

Answer 3 · 2022-12-01T17:10:59.000Z

I agree that it makes sense as an option. I agree with @sfirke that while make_clean_names() has a lot of arguments, a few more aren't a problem. The main reason is that make_clean_names() is more of a detail-focused function, so having lots of controls makes sense. (The opposite is true for clean_names() which is more general use, so fewer direct arguments are preferred.)

@JasonAizkalns Thanks in advance for the PR!

Answer 4 · 2022-12-01T18:16:31.000Z

Some things to look out for: document the new argument, describe the change and give yourself credit in NEWS.md, and add a new test here: https://github.com/sfirke/janitor/blob/main/tests/testthat/test-clean-names.R#L190. Thank you for working on this!

Answer 5 · 2022-12-02T15:02:05.000Z

@sfirke, based on the comment from @JasonAizkalns in #497, what would you think about the (minor) breaking change of using unique_sep to de-duplicate names? That would accomplish this goal and remove some code from janitor.

The cost is that it's a breaking change where the numbering of columns would be one lower (what is currently a_2 would become a_1).
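
The numbering shift can be illustrated in base R. This is only a sketch: make.unique() is used here as a stand-in for a separator-based approach, and whether snakecase's unique_sep behaves identically is an assumption. The thread itself states only that a_2 would become a_1.

```r
x <- c("a", "a", "b", "a")

# Current behaviour: the count-based while loop quoted in the issue.
loop_names <- x
while (any(duplicated(loop_names))) {
  dupe_count <- vapply(seq_along(loop_names), function(i) {
    sum(loop_names[i] == loop_names[seq_len(i)])
  }, 1L)
  is_repeat <- dupe_count > 1
  loop_names[is_repeat] <- paste(loop_names[is_repeat],
                                 dupe_count[is_repeat], sep = "_")
}

# Stand-in for separator-based de-duplication (an assumption, see above).
sep_names <- make.unique(x, sep = "_")

print(loop_names)  # "a" "a_2" "b" "a_3"
print(sep_names)   # "a" "a_1" "b" "a_2"
```

The first occurrence keeps its name in both cases; only the suffix on later occurrences shifts down by one.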

Answer 6 · 2022-12-02T15:04:49.000Z

If you both feel good about it I can get behind the breaking change. Thanks both for the thoughtfulness here.

Answer 7 · 2022-12-02T15:23:00.000Z

@JasonAizkalns, thanks for proposing what feels like a great solution! @sfirke, I do think that it's the better solution.

@JasonAizkalns, can you please revise the PR code to use the unique_sep argument with make_clean_names() and remove the while loop? Some of the deduplication tests will also need to be revised, and please document it as a breaking change in the NEWS file.

Answer 8 · 2022-12-02T16:31:26.000Z

FYI we had some discussion about unique_sep vs the while-loop a couple of years ago, I am re-linking here in case any of those points feel salient: #251 (comment)

Answer 9 · 2022-12-02T16:43:08.000Z

Ah, @billdenney can you tell me more about your thoughts here: #251 (comment) ?

Answer 10 · 2022-12-02T17:27:09.000Z

I don't remember the exact thoughts that I had. Based on what I wrote, I'm guessing that I was concerned that duplicate names may still slip through, but I don't know why I had that concern.

I just did a quick check, and I don't think using it would be a problem. I don't know if I checked it in the past and there was an issue, or if it was a concern about a possible bug if we didn't do a more extensive duplicate check.

So, I think let's go for it. I think that the tests should capture any issues, and if there are other issues that pop up, we can handle them either by reinstating a similar while loop or pushing the fix upstream.
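
One way duplicates could slip through is when a generated suffix collides with a name that already exists. The sketch below is an illustration, not the check that was actually run: suffix_once() is a hypothetical helper that applies the loop body a single time, and make.unique() again stands in for a separator-based approach.

```r
x <- c("a", "a", "a_2")

# Hypothetical helper: one pass of count-based suffixing
# (the body of the while loop, run exactly once).
suffix_once <- function(nms) {
  dupe_count <- vapply(seq_along(nms), function(i) {
    sum(nms[i] == nms[seq_len(i)])
  }, 1L)
  is_repeat <- dupe_count > 1
  nms[is_repeat] <- paste(nms[is_repeat], dupe_count[is_repeat], sep = "_")
  nms
}

one_pass <- suffix_once(x)
print(one_pass)                       # "a" "a_2" "a_2"
print(anyDuplicated(one_pass) > 0)    # TRUE: the new suffix collided

# Stand-in separator-based approach; its result contains no duplicates.
sep_based <- make.unique(x, sep = "_")
print(anyDuplicated(sep_based) == 0)  # TRUE
```

A single pass is not enough in this case, which is why the existing code repeats until no duplicates remain. Any replacement needs the same guarantee, and the de-duplication tests are the place to confirm it.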

Answer 11 · 2022-12-02T18:59:29.000Z

Cool. I amended my PR. Added additional tests too for NULL and non-NULL. I think we may get a happy side effect too: a small speed boost.

Answer 12 · 2022-12-05T15:47:36.000Z

Thanks Bill and Jason! I will review this week 🖐️