Is there any way to filter out words that aren't appropriate in a professional setting?
matthewjones555 opened this issue · 5 comments
The dictionary contains words that are not appropriate in a professional setting. Here are examples that I have found so far:
rectum
foreskin
vagina
testical
penis
scrotum
While being perfectly fine anatomical words, they are entirely inappropriate in a professional setting. I really don't think I'd be able to get HR on board with the idea that "Vagina penis scrotum" is an acceptable password.
Please tell me there's an easy way to filter out all of these words, along with any others that I may have missed. Sadly, this is a total dealbreaker.
Just an FYI, my team have found more words that aren't suitable. This is what we have so far...
var excludedWords = new HashSet<string>(new[]
{
"rectum",
"foreskin",
"vagina",
"testical",
"penis",
"ovary",
"uterus",
"clitoris",
"urethra",
"prostate",
"testis",
"glans",
"scrotum",
"nipple",
"mammary",
"areola",
"anus",
"dick",
"fanny",
"cocaine",
});
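For anyone wanting to apply a set like this after generation, here is a minimal sketch that assumes nothing about the library's own API. `generate` stands in for whatever call produces a passphrase, and `PhraseFilter`, `ContainsExcludedWord` and `GenerateClean` are hypothetical names made up for illustration. It rejects any phrase containing an excluded word and tries again.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the passphrase generator's API.
public static class PhraseFilter
{
    private static readonly char[] Separators =
        { ' ', '\t', '-', '.', ',', '!', '?', ';', ':' };

    // True if any whole word in the phrase appears in the excluded set.
    // Case sensitivity follows the comparer the set was built with.
    public static bool ContainsExcludedWord(string phrase, ISet<string> excludedWords)
    {
        return phrase
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Any(word => excludedWords.Contains(word));
    }

    // Calls the supplied generator until it yields a phrase with no excluded
    // words, giving up after maxAttempts so a bad set cannot loop forever.
    public static string GenerateClean(
        Func<string> generate, ISet<string> excludedWords, int maxAttempts = 100)
    {
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            var phrase = generate();
            if (!ContainsExcludedWord(phrase, excludedWords))
            {
                return phrase;
            }
        }

        throw new InvalidOperationException(
            "No acceptable phrase was generated within the attempt limit.");
    }
}
```

Building the set with `StringComparer.OrdinalIgnoreCase` makes the check ignore capitalisation. This only matches exact words, so plurals and other inflections would need their own entries.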
I can see that some effort has been taken to remove words from the dictionary. Some notable missing words are:
sphincter
coitus
vulva
gusset
porn
finger
I do understand that this is a very difficult thing to achieve. I'm currently trying to filter these words myself, and it's not going to be a perfect solution. I suppose this is the problem when you're not in direct control of the source material.
Sorry I am not directly answering the question but wanted to provide a perspective here.
Also I agree that there is a great need for a work-safe passphrase generator (I work in schools and would love to see something similar), and I definitely support more discussion on this hence feeling the need to comment here.
-
Based on your first post it seems like you are thinking of this as a password generator. It's not; it is a passphrase generator. It's not sensible to use it to generate three-word phrases, and definitely don't trust the password strength box in KeePass. I am not sure under what unlikely circumstances the password "Vagina penis scrotum" would ever be generated?
-
You are always free to make your own dictionary (although I suspect you will read this page and reconsider :)
-
I may be totally wrong about this, but I would suspect that what you are asking is theoretically impossible, on the following grounds.
(a) It is impossible to determine a list of words that is universally NSFW. There are hundreds of thousands of words to look through and decide upon. Even if this work is done by a human, which is the most effective means, a doctor may see a word differently to a Christian minister, who may see a word differently to an Islamic minister, who may see a word differently to a school teacher. An alternative approach might be to try to acquire a dictionary of unsafe words. This is a non-starter; any set of words that can be downloaded (even an exclusion set) can vastly reduce entropy, precisely because that set is publicly downloadable. A second potential alternative is to introduce AI to do this job (i.e. determine the difference between "kneecap" and "hymen"), but then you are vastly reducing password entropy, making the tool unsuitable for use as a secure password generator. Note that entropy is already decreased by the need to place words within a grammatically correct structure. (I may be wrong; length of phrase has a significant effect on entropy, so maybe this can be blown out of the water with relative ease. But then the user needs to have some intuitive idea about how secure a 6 word phrase is, and reducing dictionaries using AI will affect that. Interested to hear from others on this.)
(b) This is a passphrase generator as opposed to a word generator, so even if you were able to agree a list of words that are unsuitable it would not prevent seemingly rude or inappropriate phrases like "he stacked her wibble gently upon the bishop". Context is almost impossible to predict.
Thanks for your thoughtful reply @hazymat.
I completely understand that this is a very complex problem with many facets. I've worked my way around it for now by manually filtering the dictionary.
As I observed in my second post, there's clearly some filtering already being done, as there are some notable omissions from it that would be considered NSFW.
I don't know what the dev team would have to say about this. I understand that this isn't an easy one to fix, but from the point of view of someone consuming their API, the only options available to me right now are some dirty hacks. I was hoping someone might chime in and say something like, "oh you're doing it wrong, you just need to call these methods which will filter it like this". Well, a guy can hope, can't he? :-)
Hi @matthewjones555 and @hazymat,
Thanks for raising this issue. It is a complex one without a solution that will satisfy everyone.
The policy I have taken from the beginning is that anatomical words are OK, but derogatory words are not (and I will not give examples of such words). I may not have been consistent in applying that policy, and over the last 10 years the meaning of words may have shifted such that new words are considered taboo in your context (eg: I'm surprised finger is taboo). But that's what I've aimed for.
Note that your list of "missing" words may simply be words I haven't manually added yet, rather than words I have deliberately excluded.
The best technical solution I can offer is to use the API or Console app with a custom dictionary. You don't need to create a new dictionary from scratch, just take the current public dictionary, remove anything you deem NSFW, and use that instead of the default.
PS > .\PassphraseGenerator.exe -d MyCustomDictionary.xml.gz
Actually, it's even easier than that - just replace dictionary.xml.gz with your modified dictionary.
From C# just use LoadDictionary() and point it to your custom dictionary. Here's an example from the console app, and from makemeapassword. You could also misuse the mutator interface to analyse generated phrases and potentially sanitise them (or scrub them entirely) - but that would be much more complex.
Given you are talking about a professional context, I assume you (or someone within your company) can control distribution of software, including the passphrase generator, and establish an approved list of taboo words. Simply publish a version with your sanitised dictionary and you're good to go! (And, if you're being really paranoid, block access to makemeapassword.ligos.net).
A note about the impact on entropy: removing ~50 words will make negligible difference. The generator counts possible combinations based on the phrase description to derive entropy, see Combination Counting for all the details. So you can compare actual figures for the standard and sanitised dictionaries. Having spent many hours manually adding words (and slowly watching the entropy increase), I can assure you that removing 50 words is a rounding error.
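As a rough illustration of the scale involved (this is not the generator's actual combination counting), the sketch below treats every word in a phrase as a uniform pick from one flat dictionary. `EntropyEstimate` is a hypothetical helper, and the figures in the note that follows are made up for the example.

```csharp
using System;

// Hypothetical back-of-envelope helper; the real generator counts
// combinations per phrase template rather than using one flat dictionary.
public static class EntropyEstimate
{
    // Bits of entropy from one uniform pick out of dictionarySize words.
    public static double BitsPerWord(int dictionarySize)
    {
        return Math.Log(dictionarySize, 2);
    }

    // Bits lost across a whole phrase when removedWords are dropped
    // from the dictionary before generation.
    public static double BitsLost(int dictionarySize, int removedWords, int wordsPerPhrase)
    {
        var before = BitsPerWord(dictionarySize);
        var after = BitsPerWord(dictionarySize - removedWords);
        return wordsPerPhrase * (before - after);
    }
}
```

With an assumed dictionary of 10,000 words and a six-word phrase, `EntropyEstimate.BitsLost(10000, 50, 6)` comes to roughly 0.04 bits, against a total of just under 80 bits for the phrase.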
A note about randomness: as @hazymat pointed out, there's always the possibility of a "suggestive" phrase, even if you remove taboo words. Your example of "finger" is really good - depending on context "finger" could be perfectly innocent (eg: "the finger scratched his nose" is a possible phrase) or highly suggestive (example left as an exercise for the reader). The generator errs on the side of randomness, rather than "niceness". However, you could misuse a custom mutator to detect suggestive phrases (good luck!!)
Let me know if you have any further questions or comments.
Murray
PS: note the "dev team" is singular. That is, it's just me!
I'm facing a similar issue at our company. I found this project https://github.com/danielmiessler/SecLists but it is missing words, for example from a recently generated password: "A4Masochists" 😄
//EDIT: sorry, my mistake. There is no indecent wordlist there, but maybe here:
https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
or
https://github.com/MauriceButler/badwords/blob/master/regexp.js