Eliminate duplicates that are not in query strings
Hi,
thanks a lot for this tool, it is very useful!
I was wondering if it would be possible to also implement deduplication for this kind of URL:
- /product/1/buy/1
- /product/1/buy/2
- /product/1/
- /product/2/
This should result in just:
- /product/1/buy/1
- /product/1/
As far as I can tell, this case is not handled at the moment.
I would really like to contribute this myself, but my C++ knowledge is really rusty :)
Thanks again!
This functionality would probably need to be added as a switch, since, if we consider somesite.com/product/1 and somesite.com/product/2 to be the same, we would also need to consider somesite.com/something/login and somesite.com/something/profile to be the same.
Anything else would need some heuristic for determining if the difference in the path is relevant or not, which would be complicated, error-prone and (probably) slow.
I'd propose adding a switch to the program to tell it which part of the path is irrelevant, like so:
-i 2
would make ex.com/product/1/whatever the same as ex.com/product/x/whatever
-i 2 -i 3
would make ex.com/product/1/2 the same as ex.com/product/x/y
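To make the proposal concrete, here is a minimal sketch of how such a switch could build a comparison key. It is not code from the tool: `split_path`, `path_key`, and the `ignored` set (standing in for the values passed to the proposed `-i` switch) are hypothetical names, and segment positions are assumed to be 1-based, counted on the path only.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a URL path such as "/product/1/whatever" into its non-empty segments.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> segments;
    std::istringstream stream(path);
    std::string segment;
    while (std::getline(stream, segment, '/')) {
        if (!segment.empty()) segments.push_back(segment);
    }
    return segments;
}

// Build a dedupe key in which every segment whose 1-based position appears
// in `ignored` is replaced by "*". `ignored` is a hypothetical stand-in for
// the values given to the proposed -i switch.
std::string path_key(const std::string& path, const std::set<std::size_t>& ignored) {
    std::string key;
    const std::vector<std::string> segments = split_path(path);
    for (std::size_t i = 0; i < segments.size(); ++i) {
        key += '/';
        key += ignored.count(i + 1) ? std::string("*") : segments[i];
    }
    return key;
}
```

With `ignored = {2}`, both `/product/1/whatever` and `/product/x/whatever` map to the key `/product/*/whatever`, so only the first one seen would be kept. With an empty set, `/something/login` and `/something/profile` stay distinct.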
Yeah, I can understand.
I think it could be feasible if the check is applied only where the path segment is a number.
1 is likely similar to 2, so one of them could be discarded.
Two different strings probably have different meanings, so both can be kept.
Thanks for the suggestion @simonebovi! I wanted to include this in the initial release but didn’t have much time to get it done. I have actually started working on this already - I’m focusing on integer differences only, as anything else would be very hard without context and would most likely cause more issues than it solves unless done extremely well.
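For illustration, the integer-only idea could look roughly like the sketch below. This is an assumption about one possible approach, not the code from the PR: `is_integer`, `normalize_path`, `dedupe`, and the `{int}` placeholder are all hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// True if the segment is non-empty and consists only of decimal digits.
bool is_integer(const std::string& segment) {
    return !segment.empty() &&
           std::all_of(segment.begin(), segment.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Replace every all-digit path segment with a fixed placeholder.
// Empty segments are skipped, so "/product/1" and "/product/1/" share a key.
std::string normalize_path(const std::string& path) {
    std::string key, segment;
    std::istringstream stream(path);
    while (std::getline(stream, segment, '/')) {
        if (segment.empty()) continue;
        key += '/';
        key += is_integer(segment) ? std::string("{int}") : segment;
    }
    return key;
}

// Keep the first path seen for each normalized key, preserving input order.
std::vector<std::string> dedupe(const std::vector<std::string>& paths) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> kept;
    for (const std::string& path : paths) {
        if (seen.insert(normalize_path(path)).second) kept.push_back(path);
    }
    return kept;
}
```

Applied to the four URLs from the original request, this keeps `/product/1/buy/1` and `/product/1/`. Paths that differ in a non-numeric segment, such as `buy` versus `sell`, get different keys and are both kept.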
Just added a PR to account for this. It checks for common assets to ignore (images, fonts), as well as integers in URLs. Tested it out and seems to be working well. @larskraemer, if you want to take a look: #9
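As a rough idea of what an asset check can look like (a sketch only, not the PR's code): the function name `is_static_asset` and the extension list below are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Returns true if the URL's final path segment ends in a known image or
// font extension. The extension list is an illustrative assumption.
bool is_static_asset(const std::string& url) {
    static const std::unordered_set<std::string> extensions = {
        "png", "jpg", "jpeg", "gif", "svg", "ico", "woff", "woff2", "ttf", "eot"};
    // Look only at the path: cut off the query string and fragment first.
    const std::string path = url.substr(0, url.find_first_of("?#"));
    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');
    // No dot at all, or the last dot sits before the final '/'
    // (for example in the host name), so there is no file extension.
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return false;
    }
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return extensions.count(ext) > 0;
}
```

URLs for which this returns true could simply be skipped before deduplication. The comparison is case-insensitive, so `logo.PNG?v=2` is treated the same as `logo.png`.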
PR merged, feel free to give it a test @simonebovi!
Thanks a lot @ameenmaali,
I really love open source projects!
It seems to work way better now.
I just ran some tests and I think this could be improved further.
For example, these URLs are all still kept:
- product/1/buy/1
- product/1/buy/2
- product/1/buy/3
These are basically the same so I think only one of them should be kept.
However, a URL like this one should be kept as well:
- product/1/sell/1
Not sure if this is feasible though.
What do you think?
Thank you, I will check this out shortly. This should already be accounted for, so it may be a bug.
Hey @simonebovi, I just tested this, and it seems to be working as expected. Can you try again to verify? Make sure to enable the -s flag when running. One known issue is that ports in the URL are not checked, so the same URL may show up twice when the port number differs or is missing. I will add that in a future update (#10).
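One way the port issue could be approached is to drop the scheme's default port before comparing URLs. The sketch below is only an illustration under that assumption and is not what #10 implements: `strip_default_port` is a hypothetical helper, it handles only lowercase `http` and `https`, and it leaves non-default ports untouched.

```cpp
#include <cstddef>
#include <string>

// Remove ":80" from http URLs and ":443" from https URLs so that
// "http://ex.com:80/a" and "http://ex.com/a" compare equal.
// URLs without a scheme, or with any other port, are returned unchanged.
std::string strip_default_port(const std::string& url) {
    const std::size_t scheme_end = url.find("://");
    if (scheme_end == std::string::npos) return url;
    const std::string scheme = url.substr(0, scheme_end);
    const std::size_t host_start = scheme_end + 3;
    // The authority (host[:port]) ends at the first '/', '?' or '#'.
    std::size_t host_end = url.find_first_of("/?#", host_start);
    if (host_end == std::string::npos) host_end = url.size();
    const std::string authority = url.substr(host_start, host_end - host_start);
    const std::size_t colon = authority.rfind(':');
    if (colon == std::string::npos) return url;
    const std::string port = authority.substr(colon + 1);
    const bool is_default = (scheme == "http" && port == "80") ||
                            (scheme == "https" && port == "443");
    if (!is_default) return url;
    // Keep everything before the colon, then everything after the authority.
    return url.substr(0, host_start + colon) + url.substr(host_end);
}
```

Running each URL through a helper like this before building the dedupe key would make `https://ex.com:443/a` and `https://ex.com/a` collapse into one entry, while `http://ex.com:8080/a` would still be treated as a different URL.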