Dictionary handling _very_ slow.
btimby opened this issue · 5 comments
The first time I used the dictionary feature of this library, I pointed it at /usr/share/dict/words as outlined in the documentation. On my system, that file contains just shy of 0.5M words.
However, with a dictionary that size, the validation step takes several minutes as the file is loaded, parsed, and then searched for the word. This happens on every form submission, which renders the feature unusable. I don't have time to fix this right now, but one way to handle it is a preprocessing step that converts the dictionary to a searchable form. This could easily be integrated into one's deploy procedure, so that the dictionary is sourced from a plain text file whenever the code is deployed. Optionally, a management command could be added to perform this pre-processing.
An example of this type of operation can be taken from Postfix (the MTA). It uses the postmap command to convert text lists into searchable databases so that the MTA can do a huge number of lookups very quickly.
http://www.postfix.org/postmap.1.html
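As a rough illustration of the same idea in Python (the `dbm` module choice, file paths, and function names below are assumptions for the sketch, not anything this library currently provides), a deploy-time step could convert the plain word list into an on-disk hash once, so the validator only ever does key lookups:

```python
import dbm

def build_dictionary_db(text_path="/usr/share/dict/words",
                        db_path="/var/lib/myapp/words.db"):
    # Hypothetical one-off preprocessing step, analogous to postmap:
    # convert a plain-text word list into an on-disk hash.
    with open(text_path) as src, dbm.open(db_path, "n") as db:
        for line in src:
            word = line.strip().lower()
            if word:
                db[word.encode("utf-8")] = b"1"

def word_in_dictionary(word, db_path="/var/lib/myapp/words.db"):
    # At validation time only a single key lookup is needed,
    # instead of reading and parsing the whole file.
    with dbm.open(db_path, "r") as db:
        return word.strip().lower().encode("utf-8") in db
```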
A further optimization would be to trim words from the dictionary that are shorter than the configured minimum length. Potential passwords shorter than this length are rejected outright, so the existence of these words in the dictionary bloats it unnecessarily.
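For example, the preprocessing step could drop short entries up front (a minimal sketch; the `min_length` default here is arbitrary and would come from whatever setting the validator actually uses):

```python
def trim_word_list(words, min_length=6):
    # Words shorter than the configured minimum can never match an
    # acceptable password, so there is no point storing them.
    return [w for w in words if len(w) >= min_length]
```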
A bit more information about postmap:
Postfix uses a Berkeley DB (bsddb) hash to store this data. Here is some sample code to interact with this data.
It should be trivial to create and search this data using python-bsddb3. However, it might also be possible to simply pickle a set. Loading and searching the pickled set should be fast, especially if the first validation loads and caches the set and subsequent validations only search it.
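A minimal sketch of that pickled-set idea, assuming the set is built once at deploy time and cached at module level after the first validation (the paths and function names are illustrative):

```python
import pickle

_WORDS = None  # cached across validations within the same process

def load_words(pickle_path="/var/lib/myapp/words.pickle"):
    global _WORDS
    if _WORDS is None:
        # Only the first validation pays the cost of loading the file.
        with open(pickle_path, "rb") as f:
            _WORDS = pickle.load(f)  # assumed to be a set of lower-cased words
    return _WORDS

def is_dictionary_word(candidate):
    # Set membership is O(1), so this stays fast even for ~0.5M words.
    return candidate.strip().lower() in load_words()
```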
These are good thoughts. Would you mind creating a pull request for this?
What about simply loading the dict into memory as a list if the file is under some arbitrary size, e.g. 1MB? The way I see this function is that it must return a complete set of dictionary words:
```python
from django.utils.encoding import smart_text

def get_dictionary_words(self, dictionary):
    # Read the whole word list into memory, one entry per line.
    with open(dictionary) as f:
        return [smart_text(line.strip()) for line in f]
```
So for a small enough dictionary, keeping it resident in memory seems appropriate, while for a larger file, perhaps an iterator over the file is the best that can be done. A rough sketch of that cutoff follows.
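Something along these lines, assuming the same `smart_text` import as above (the 1MB threshold and helper name are as arbitrary as in the comment above):

```python
import os

SMALL_DICTIONARY_BYTES = 1 * 1024 * 1024  # arbitrary 1MB cutoff

def get_dictionary_words(self, dictionary):
    if os.path.getsize(dictionary) <= SMALL_DICTIONARY_BYTES:
        # Small file: keep the whole word list resident in memory.
        with open(dictionary) as f:
            return [smart_text(line.strip()) for line in f]

    # Large file: return a generator that streams the file instead
    # of materializing every word at once.
    def iter_words():
        with open(dictionary) as f:
            for line in f:
                yield smart_text(line.strip())
    return iter_words()
```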
Related follow-up PR: #65