berzerk0/Probable-Wordlists

De-duplicate items

Closed this issue · 14 comments

Looks like there could be quite a few dupes in here; for instance, "password" is at 1 and 19: https://github.com/berzerk0/Probable-Wordlists/blob/master/Real-Passwords/WPA-Length/Top76-probable-WPA.txt

Good project though!

Would love to see a list of WPA-formatted passwords that come just from router/wifi sources, not user-passwords.

Duplication - this is me getting caught by the classic invisible line-ending difference between Windows and Linux (CRLF vs. LF).
Rev 1.1 will have this fixed in the main files, the Chunk files will take longer.

WPA-formatted sources - I have found Wordlists that include "WPA" in the title, but that isn't much of a guarantee that they exclusively come from router/wifi sources.

It is also possible (and equally not possible, as I am asserting this with zero evidence) that the trends for common passwords do not change dramatically if they are used for a Router or for an email address. It seems just as likely to me that people see it as a generic "password" rather than "the Wifi password."

I'll see if I can find some sources with more background, but I have doubts.

EDIT
Of course, today I went somewhere where the Guest Wifi password was "wireless guest"

Easy fix for the dupes that worked for me was issuing :%s/\r\+// in vim to kill the trailing carriage-return artifacts from Windows, and then issuing uniq -u passfile.txt > cleanpassfile.txt. Cool project.
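For anyone hitting the same CRLF problem outside of vim, a minimal command-line sketch (the filename passfile.txt here is just a placeholder) is to delete every carriage-return byte before de-duping:

```shell
# Delete Windows carriage returns (\r) so that "password" and
# "password\r" compare as the same line; tr removes every CR byte.
tr -d '\r' < passfile.txt > unix_passfile.txt
```

dos2unix does the same job on systems where it is installed.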

@WiseNerd So if you already fixed it, why not make a PR?

PR from me shortly for de-dupe. Great work.

@iancnorden You're gonna beat me to the punch!
I have the desktop chugging away, but won't be back to upload changes for a half day or so

Now it's a race! I had not realized the size, Git clone is still chugging away!

@blobgo well my MacBook's limited DDR2 memory would be neutered by sanitizing that entire thing, so I fixed a small part mostly out of curiosity. But I was hoping to save somebody some time nonetheless :)

De-dupes still running.

Initial De-Dupes (up to ~30 Million Non-Spec and WPA) are done, looks like I can't do the big ones in parallel - probably done by tomorrow.

Or so I thought, they didn't come out right.

@WiseNerd I was using

awk '!seen[$0]++' hasDupes > doesntHaveDupes 

which I assumed started at the top and worked its way down, but then for one of the files it popped "password" out of the 2nd slot. No way.

uniq 

only works if two lines are next to one another, unfortunately.
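For what it's worth, awk '!seen[$0]++' does read top-down and keep the first occurrence of each line, so a stray carriage return is a plausible culprit: "password" and "password\r" count as different lines. A hedged one-liner that strips the CR before de-duping (hasDupes and doesntHaveDupes are just placeholder names):

```shell
# Normalize trailing carriage returns, then keep only the first
# occurrence of each line; input order is otherwise preserved.
awk '{ sub(/\r$/, "") } !seen[$0]++' hasDupes > doesntHaveDupes
```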

I might just have to compile again from sources - unless @iancnorden 's experience comes up with a solid de-duping

Chewing on the folder with Top2Bill*

164/958 completed; started around 1400 Eastern.

If curious, thanks to https://github.com/ltdenard ... and this will have to continue overnight at this rate.

for f in $(ls -lha . | tail -n+4 | awk '{print $10}'); do sort -u "${f}" > /tmp/tmp1 && mv /tmp/tmp1 "./${f}"; done;
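A word of caution on that loop: parsing ls output breaks on unusual filenames, and sort -u re-orders lines alphabetically, which discards the by-probability ordering these wordlists are built around. A safer sketch (assuming the lists end in .txt) that at least avoids the ls parsing:

```shell
# Glob instead of parsing `ls`; note that sort -u still re-orders
# lines, so use this only where the original ordering does not matter.
for f in ./*.txt; do
  sort -u "$f" > /tmp/tmp1 && mv /tmp/tmp1 "$f"
done
```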

Can all unique combinations be put into a new file, or do you just want the duplicates removed?

For Rev 1.1 we aim to just remove the duplicates while otherwise preserving order.
The "duplicates" are likely illusory, where there probably are invisible newline characters splitting them up.
This has some effect on overall accuracy once they have been removed.

Rev 2.0 will have the newlines weeded out at the source, so this problem will not carry over.

De-Duped Rev 1.1 is live now, but does not contain the largest files.

Rev 1.2 will, in torrents with compression.

Closing this in light of the release of 1.1 and the impending release of 1.2