berzerk0/Probable-Wordlists

List not filtered properly

interlocuteur opened this issue · 5 comments

The 258Million list has not been filtered properly. It contains a lot of HTML tags.

Guess this one slipped by me, do you have a specific example?

It's possible these are legitimately being used as passwords - but that's very unlikely.

I don't have the file anymore, but you can search for angle brackets "<" and ">".
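A search like that could be sketched in Python as follows. This is only a rough illustration: the `TAG_LIKE` pattern and the `find_tag_like` helper are hypothetical names, and the definition of "looks like an HTML tag" is an assumption, not something the list's author has specified.

```python
import re
from typing import Iterable, Iterator

# Assumption: a line "looks like" it contains an HTML tag when it has "<",
# an optional "/", a letter, and then any non-bracket characters up to ">".
TAG_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")


def find_tag_like(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains something shaped like an HTML tag."""
    for line in lines:
        candidate = line.rstrip("\r\n")  # drop line endings, keep inner spaces
        if TAG_LIKE.search(candidate):
            yield candidate


sample = ["password1", "<br>", "</div>hello", "a<b", "1 < 2 > 0", "<3forever>"]
flagged = list(find_tag_like(sample))
```

To run it against the real list, pass an open file handle instead of `sample`, for example one opened with `encoding="utf-8", errors="replace"`, since leaked wordlists often contain bytes that are not valid UTF-8.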

This is tricky.
I can't be sure of the origin of those lines - they might be both html tags and passwords.

For Release 2.0, I erred on the side of inclusivity.

There are lines that look a lot like code, specifically HTML tags. The same is true for some email addresses. In many cases, these lines appeared in over 15 files in analysis, suggesting they are in fact passwords. This logic is not definitive, however.
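The "appeared in over 15 files" check can be illustrated with a small sketch. This is not necessarily how the analysis was actually performed; the function names, the in-memory `sources` mapping, and the `min_files` parameter are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_file_appearances(sources: Dict[str, Iterable[str]]) -> Counter:
    """Count, for each line, how many distinct source files contain it."""
    counts: Counter = Counter()
    for _name, lines in sources.items():
        # A set ensures a line repeated within one file is counted once.
        unique_lines = {line.rstrip("\r\n") for line in lines}
        counts.update(unique_lines)
    return counts


def seen_in_many_files(counts: Counter, min_files: int = 15) -> Set[str]:
    """Return lines that appear in more than `min_files` distinct files."""
    return {line for line, n in counts.items() if n > min_files}


# Toy data: "<b>" occurs in 16 hypothetical files, "<html>" in only 3.
sources = {f"leak_{i}.txt": ["<b>", "<b>", "hunter2"] for i in range(16)}
for i in range(3):
    sources[f"leak_{i}.txt"].append("<html>")

counts = count_file_appearances(sources)
kept = seen_in_many_files(counts)
```

With this toy data, `"<b>"` and `"hunter2"` pass the threshold while `"<html>"` does not, which mirrors the reasoning above: a tag-like string that shows up independently in many sources is harder to dismiss as a one-off parsing error.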

All of the source files on the list were already published, so this information is already available to the internet. With this in mind, I opted to include these lines. Most questionable lines do not appear until the list is already quite large.

This issue will remain open and we'll meditate upon it.

Troy Hunt's take on the problem:

Of course, it's possible people actually used these strings as passwords but applying a bit of Occam's Razor suggests that it's simply parsing issues upstream of this data set.

Frankly though, there's little point in removing a few million junk strings. It reduced the overall data size of [Troy's Pwned Passwords V2] by 0.69% and other than the tiny fraction of extra bytes added to the set, it makes no practical difference to how the data is used.

While it is highly likely that these aren't passwords, the very idea that they are not is based on the assumption that we have a good handle on what passwords are. This assumption, for the most part, is true.

However, INTENTIONALLY making passwords that don't look like passwords isn't without merit. I once worked at a company where we had reason to believe that keyloggers were installed on our systems. I had no idea what to do with this information, but it really bothered me. To cope with this, I came up with an idea to use the on-screen keyboard to create a password that looked like a URL.

Certainly, I can't be the only one to come up with the idea of making a password that contains some sort of camouflage. It is still most definitely more likely that these are simple "upstream parsing" issues, but including them has such a small impact on list performance that I say they are worth keeping.