A solution to the problem of finding five English words with 25 distinct characters, as posed in this video by Matt Parker: https://www.youtube.com/watch?v=_-AfhLQfb6w
I came to the same algorithm everyone else did independently, but I've added a few things that help (mainly #6). My python code uses the same algorithm as the c++, but has better comments and is easier to read.
- Depth first search. Pick one word at a time, check if it collides with previous word.
- Use bitmasks to represent the words for faster detection if a word reuses a letter.
- When picking the next word, only check words that are using the lowest bit (i.e. letter) missing. A tiny hitch is that you have to keep track of which letter you get to skip.
- Store each word in separate lists based the lowest bit. By separating them, you only have to check one list of words to use, dramatically reducing how many words to check at each layer of the depth search.
- Instead of putting the bits in alphabetic order, sort letters/bits based on their frequency in all words. When combined with the above steps, this makes the search start with the hard letters to use. Otherwise, you'll start with a word like
learn
which eats up tons of easy letters making it hard to find words that need them. - As in step 4, also separate lists based on if they've used the most common letters. This is equivalent to essentially precaching bitmask collision detections. Out of 5977 unique bitmasks, with steps 1-5, you completely skip and never check 409 words. With this step, 2230 are never checked directly. How many of the top bits to use affects performance. Too many and you waste time fumbling through lists. Too few and you miss out on the sweet caching opportunities. In my code, this is controlled by the
letter_buckets
variable/constant. Experimentally, 4-7 is pretty good. Actual best is language and hardware dependent. - Multithreading! I'm proud to say that my single threaded algorithm can hold it's own against oisyn/parkerwords which uses multithreading. But there's no reason this algorithm can't be multithreaded as well. I completely stole oisyn's implementation here, so I can't claim that as my own. It was eerie reading their code and seeing how similar it was to mine.
Example sorted word table into separate buckets. Rows are the least common letter used. Columns are which most common letters they don't use. This table only uses the top 3 bits. My c++ seems to prefer 6, and python likes 5. Other peoples' implementations effectivly use 0 and collapse each row into 1 bucket.
sea | ea | sa | a | se | e | s | {} | |
---|---|---|---|---|---|---|---|---|
q | moqui quick 10 more |
hdqrs qophs 9 more |
cequi coque 16 more |
queys quest 2 more |
antiq faqir 20 more |
qaids qiyas 10 more |
aequi caque 9 more |
sequa |
j | bijou brujo 30 more |
djins fujis 20 more |
bejig benjy 43 more |
ejusd jebus 15 more |
anjou arjun 62 more |
hajis jacks 14 more |
bejan djave 19 more |
hajes jades 7 more |
x | boxty bronx 25 more |
nyxis oxids 8 more |
bemix beryx 67 more |
boxes coxes 24 more |
acrux adfix 47 more |
axils axons 5 more |
adnex axile 23 more |
axels axers 6 more |
z | blitz bortz 29 more |
grosz liszt 12 more |
benzo bezil 41 more |
bizes cozes 14 more |
arzun azido 58 more |
azons czars 12 more |
adoze adzer 39 more |
adzes fazes 8 more |
v | bovid bovld 48 more |
divus intsv 23 more |
bevil bevor 92 more |
coves dives 32 more |
admov alvin 69 more |
alvus amvis 27 more |
above aevum 65 more |
avens avers 12 more |
f | clift comfy 101 more |
bumfs coifs 77 more |
befit befog 93 more |
chefs clefs 36 more |
adolf afgod 94 more |
afros alifs 50 more |
afire afley 43 more |
alefs cafes 9 more |
w | blowy blown 83 more |
blows brows 63 more |
below bewig 87 more |
bowse brews 43 more |
ablow adown 111 more |
aswim awols 57 more |
alowe awber 41 more |
askew awest 9 more |
k | bikol birky 120 more |
bilks birks 105 more |
becky biked 98 more |
becks bikes 49 more |
aking akron 130 more |
adusk amoks 61 more |
ackey acker 45 more |
alkes asked 9 more |
b | bichy bidry 117 more |
bhuts bilos 62 more |
becry becut 110 more |
belis belts 40 more |
abdom abhor 123 more |
abysm abris 37 more |
abend abide 55 more |
abets abies 7 more |
g | ching chung 121 more |
chugs clogs 67 more |
chego cheng 94 more |
doges dregs 30 more |
acing agony 110 more |
agios agism 45 more |
aegir agend 44 more |
aegis agers 9 more |
p | chimp chirp 104 more |
chips chops 80 more |
chelp cypre 92 more |
copes dopes 40 more |
acoup adopt 99 more |
alisp aphis 52 more |
adept ayelp 38 more |
adeps aesop 9 more |
m | chimu chirm 75 more |
chums comus 49 more |
celom chime 74 more |
cymes comes 35 more |
admin admit 85 more |
adsum alums 37 more |
admen ahmed 37 more |
acmes ahems 8 more |
h | child chino 66 more |
chins chits 47 more |
cheir chert 51 more |
chest choes 27 more |
achor ahind 65 more |
alish arish 35 more |
ached achen 23 more |
aches ashed 7 more |
y | cydon cindy 40 more |
cyrus cloys 34 more |
cedry ceryl 37 more |
cosey desyl 16 more |
acidy acryl 42 more |
acyls ayins 23 more |
acedy aiery 12 more |
ayens asyle 3 more |
d | cloud contd 23 more |
clods cords 29 more |
cetid cider 39 more |
cedis codes 21 more |
acold acrid 38 more |
acids adios 22 more |
acned acred 21 more |
aides andes 5 more |
c | clint cloit 18 more |
cions clons 21 more |
ceint cento 20 more |
ceils celts 16 more |
acoin acorn 23 more |
acost actus 15 more |
acier acone 13 more |
acies acnes 4 more |
u | rutin tourn 3 more |
inust irous 13 more |
enrut intue 13 more |
etuis euros 11 more |
alout altun 13 more |
ainus alnus 10 more |
alenu aleut 6 more |
aures salue 2 more |
n | inrol intro nilot |
insol instr 5 more |
eloin enrol 7 more |
elsin enols 7 more |
aloin altin 7 more |
airns anils 7 more |
aeron alien 7 more |
aeons anise 3 more |
t | lirot | islot isort 2 more |
eliot lerot liter |
islet iters 3 more |
ariot latro 2 more |
airts alist 4 more |
alert aliet 3 more |
aotes arest 2 more |
l | loris | oiler | liers lores |
aliso arils orals |
ariel leora |
aisle aloes arles |
||
o | ireos | orias | arose | |||||
i | aesir | |||||||
r | ||||||||
s | ||||||||
e | ||||||||
a |
- Use mmap to load the file. I'm on Windows so I can't, but I hear it can be an additional 10% speedup.
- Multithread the file parsing too. There's 370k words in the words list. If thread overhead isn't too much, this is easily parallelable.
- Get someone like stew675 or miniBill to implemented it along side all of their other hyper micro optimizations. hint hint, nudge nudge, wink wink.
- I suspect that ordering my top bits based on frequency alone may not be optimal. Just like the optimal wordle strategy should consider the combined effect of multiple guesses together instead of each guess individually.