Wordlists don't contain Non-ASCII Characters

Question

berzerk0 opened this issue 7 years ago · 3 comments

Americans aren't the only ones with passwords - why not have special wordlists that include non-ASCII Characters?

I'm glad you asked.

As my knowledge level increases, so does my ability to sort out lines. I have two methodologies that I will put to use for Rev 2.0.

1. Grep out passwords containing characters from different alphabets

If there is an alphabet published in Unicode on Wikipedia, I plan to grep for it.

The Ukrainian Alphabet is different than the Russian, which is different than the Belarusian, which is different than the Common Cyrillic, which is different than the Serbian, which is different than...
This means we could have NATIONALLY targeted lists based on predominant languages
This isn't only true for Cyrillic-based alphabets. Dano-Norwegian is a different alphabet than Swedish, English... etc.
At the very least by language family
My sources still bias towards English, so the ASCII-only lists may simply dwarf the others, but they should still be available.
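
To make method 1 concrete, here is a minimal Python sketch of the kind of filtering grep would do. It is only an illustration, not the project's actual tooling: the function name `split_by_alphabet` and the `UKRAINIAN_ONLY` set are made up for this example. The set holds the four letters (in both cases) that the Ukrainian alphabet has and the Russian one lacks, written as escapes so they cannot be confused with look-alike Latin letters.

```python
# Letters found in the Ukrainian alphabet but not the Russian one:
# ge with upturn, ukrainian ie, dotted i, yi (lowercase then uppercase).
# Hypothetical example set; a real list would be built per alphabet.
UKRAINIAN_ONLY = set("\u0491\u0454\u0456\u0457\u0490\u0404\u0406\u0407")


def split_by_alphabet(lines, alphabet):
    """Sort lines into three buckets.

    matched    -> contains at least one character from `alphabet`
    ascii_only -> contains nothing outside ASCII
    other      -> non-ASCII, but no character from `alphabet`
    """
    matched, ascii_only, other = [], [], []
    for line in lines:
        if any(ch in alphabet for ch in line):
            matched.append(line)
        elif line.isascii():  # str.isascii() needs Python 3.7+
            ascii_only.append(line)
        else:
            other.append(line)
    return matched, ascii_only, other
```

A line only lands in `matched` if it uses a letter that distinguishes the target alphabet, so a password written purely in letters shared by Russian and Ukrainian would fall into `other`. That is one reason grouping by language family may be the more realistic floor.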

2. Make Sub-set lists based on source name.

I have many sources with "Rus", "ru", and "Russian" in the title. These lists are presumably from Russian sources, so perhaps they should be amalgamated into a list of their own.
Some sources are obviously geared towards WPA, etc.
Caveat: Since my methodology is based on approximating accuracy using the number of files a given line appears in, these groups made of sub-set sources are likely to be precise, but inaccurate. An analogy would be me throwing darts. I might be landing them within a circle of less than 1", but the target is about 4ft over to the left.

In actuality, I'm awful at darts.

I welcome any suggestions - except on my darts game. I mean suggestions about the wordlists.
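
For method 2, a rough Python sketch of grouping sources by name and then ranking lines by how many of those files they appear in. This is an assumption about how it could be done, not the actual pipeline: `RUSSIAN_TAGS`, `source_matches`, `count_file_appearances`, and `subset_ranking` are all hypothetical names, and sources are modelled as a plain dict of file name to lines.

```python
import re
from collections import Counter

# Hypothetical tag set, taken from the name fragments mentioned above.
RUSSIAN_TAGS = {"rus", "ru", "russian"}


def source_matches(name, tags):
    """True if any token of the file name is one of the tags.

    Splitting on non-alphanumerics avoids matching "ru" inside
    unrelated words such as "brute".
    """
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    return any(token in tags for token in tokens)


def count_file_appearances(sources):
    """Count, for each line, how many sources contain it at least once."""
    counts = Counter()
    for lines in sources.values():
        counts.update(set(lines))  # set(): one vote per file
    return counts


def subset_ranking(sources, tags):
    """Rank lines using only the sources whose names match the tags."""
    subset = {name: lines for name, lines in sources.items()
              if source_matches(name, tags)}
    return count_file_appearances(subset).most_common()
```

Because the ranking only sees the matching files, a line can score highly just by being common in that small group, which is the precise-but-inaccurate effect described in the caveat.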

Answer 1 · 2017-06-07T15:01:23.000Z

Hey again,

Not sure if this has had much thought or updates, but I believe unicode.org maintains the 'official' lists of characters that can be rendered or used from other alphabets... such as Punycode to Unicode.
Good example:
https://unicode-table.com/en/#cyrillic

I believe these are sourced from: https://github.com/unicode-table/unicode-table-data which may have good per-language or per-character-set data to base an initial push on.

Answer 2 · 2017-06-07T17:41:04.000Z

Great find! I still plan on implementing this.

As a status update on this and Rev 2 generally, I have found plenty of sources and need to do a bit of sifting before repeating the process. I'd say Mid-July is a generous estimate for Rev 2 - meaning it may be sooner than that.

Answer 3 · 2018-02-20T23:32:25.000Z

"Mid July" haha.

The lists now contain non-ASCII characters.