ameenmaali/urldedupe

Eliminate duplicates that are not in query strings

Opened this issue · 8 comments

Hi,
thanks a lot for this tool, it is very useful!

I was wondering if it would be possible to also implement dedupe functionality for this kind of URL:

  • /product/1/buy/1
  • /product/1/buy/2
  • /product/1/
  • /product/2/

This should result in just:

  • /product/1/buy/1
  • /product/1/

It seems to me that this is not currently taken into consideration.

I would really like to contribute this myself, but my C++ knowledge is really rusty :)

Thanks again!

This functionality would probably need to be added as a switch, since if we consider somesite.com/product/1 and somesite.com/product/2 to be the same, we would also need to consider somesite.com/something/login and somesite.com/something/profile to be the same.
Anything else would need some heuristic for determining if the difference in the path is relevant or not, which would be complicated, error-prone and (probably) slow.

I'd propose adding a switch to the program to tell it which part of the path is irrelevant, like so:
-i 2 would make ex.com/product/1/whatever the same as ex.com/product/x/whatever
-i 2 -i 3 would make ex.com/product/1/2 the same as ex.com/product/x/y
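
As a rough illustration of the proposed switch, here is a minimal C++ sketch. It is not urldedupe's code: `split_path` and `path_key` are hypothetical helpers, positions are counted from 1 within the path (so `product` is position 1), and `*` is an arbitrary placeholder.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a path such as "/product/1/buy" into its non-empty segments.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> segments;
    std::stringstream ss(path);
    std::string segment;
    while (std::getline(ss, segment, '/')) {
        if (!segment.empty()) segments.push_back(segment);
    }
    return segments;
}

// Build a dedupe key in which every ignored position (1-based, mirroring the
// proposed "-i 2") is replaced by a "*" placeholder. Two paths with the same
// key would be treated as duplicates.
std::string path_key(const std::string& path,
                     const std::set<std::size_t>& ignored) {
    std::string key;
    const std::vector<std::string> segments = split_path(path);
    for (std::size_t i = 0; i < segments.size(); ++i) {
        key += '/';
        if (ignored.count(i + 1) != 0) {
            key += '*';
        } else {
            key += segments[i];
        }
    }
    return key;
}
```

With position 2 ignored, /product/1/whatever and /product/x/whatever share the key /product/*/whatever. The same setting also collapses /something/login and /something/profile, which is the trade-off described above.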

Yeah, I can understand.

I think it could be feasible if the check is applied only when the differing part of the path is a number.

1 is likely similar to 2 so one of them could be discarded.

Two different strings probably have different meanings, so both can be kept.
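
A minimal C++ sketch of this digit-only idea, under the assumption that a path segment counts as "a number" only when it consists entirely of decimal digits. The names `is_integer`, `normalize_path` and `dedupe`, and the `{int}` placeholder, are made up for illustration and are not taken from urldedupe.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// True when the segment is non-empty and made only of decimal digits.
bool is_integer(const std::string& s) {
    return !s.empty() &&
           std::all_of(s.begin(), s.end(), [](unsigned char c) {
               return std::isdigit(c) != 0;
           });
}

// Replace every all-digit segment with a fixed placeholder. Empty segments
// are skipped, so "/product/1/" and "/product/1" produce the same key.
std::string normalize_path(const std::string& path) {
    std::string key;
    std::stringstream ss(path);
    std::string segment;
    while (std::getline(ss, segment, '/')) {
        if (segment.empty()) continue;
        key += '/';
        key += is_integer(segment) ? std::string("{int}") : segment;
    }
    return key;
}

// Keep the first path seen for each normalized key, preserving input order.
std::vector<std::string> dedupe(const std::vector<std::string>& paths) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> kept;
    for (const std::string& p : paths) {
        if (seen.insert(normalize_path(p)).second) kept.push_back(p);
    }
    return kept;
}
```

Applied to the four paths from the first comment, this keeps /product/1/buy/1 and /product/1/. Segments that differ as strings, such as buy and sell, still produce different keys and are both kept.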

Thanks for the suggestion @simonebovi! I wanted to include this in the initial release but didn’t have much time to get it done. I have actually started working on this already. I’m focusing on integer differences only, as anything else would be very hard without context and would most likely cause more issues than it solves unless done extremely well.

Just added a PR to account for this. It checks for common assets to ignore (images, fonts), as well as integers in URLs. Tested it out and seems to be working well. @larskraemer, if you want to take a look: #9
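
For readers who want a feel for the asset check, here is a small C++ sketch. It is an assumption-laden illustration, not the code from the PR: the function name `is_static_asset` and the extension list are hypothetical, and the input is expected to be the path only (no query string).

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <cstddef>
#include <string>

// Returns true when the last path segment ends in one of a few image or font
// extensions. The list below is an illustrative guess, not the tool's list.
bool is_static_asset(const std::string& path) {
    static const std::array<const char*, 8> kExtensions = {
        "png", "jpg", "jpeg", "gif", "svg", "woff", "woff2", "ttf"};

    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');
    if (dot == std::string::npos) return false;
    // A dot that appears before the last slash belongs to a directory name.
    if (slash != std::string::npos && dot < slash) return false;

    // Compare the extension case-insensitively.
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return std::any_of(kExtensions.begin(), kExtensions.end(),
                       [&ext](const char* e) { return ext == e; });
}
```

URLs for which such a check returns true could simply be skipped before the integer comparison runs.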

PR Merged, feel free to give it a test @simonebovi!

Thanks a lot @ameenmaali,
I really love open source projects!

It seems to work way better now.

I just did some tests and found that, in my opinion, this can be improved further.

For example, these URLs are all still kept:

  • product/1/buy/1
  • product/1/buy/2
  • product/1/buy/3

These are basically the same, so I think only one of them should be kept.

However, a URL like this should be kept as well:

  • product/1/sell/1

Not sure if this is feasible though.

What do you think?

Thank you, I will check this out shortly. This should already be accounted for, so it may be a bug.

Hey @simonebovi, I just tried to test this out, and it seems to be working as expected. Can you try again to verify? Make sure to enable the -s flag when running. One known issue is that ports in the URL are not checked, so two copies of the same URL (with different or missing port numbers) may both show up. I will add that in a future update (#10)
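
One possible way to handle the port case, sketched in C++ under the assumption that URLs differing only by port should count as the same. `strip_port` is a hypothetical helper, not part of urldedupe, and it expects only the host[:port] part of the URL.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Remove a trailing ":<digits>" port from a host[:port] string so that
// "example.com" and "example.com:8080" compare equal. Input without a
// numeric port is returned unchanged.
std::string strip_port(const std::string& authority) {
    const std::size_t colon = authority.rfind(':');
    if (colon == std::string::npos || colon + 1 == authority.size()) {
        return authority;
    }
    const auto first = authority.begin() + static_cast<std::ptrdiff_t>(colon + 1);
    const bool all_digits =
        std::all_of(first, authority.end(), [](unsigned char c) {
            return std::isdigit(c) != 0;
        });
    return all_digits ? authority.substr(0, colon) : authority;
}
```

A stricter variant could drop only the scheme's default port (80 for http, 443 for https) and keep other ports distinct. Which behaviour is wanted is a design decision for the maintainer.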