A solution to the problem of finding five English words with 25 distinct characters, as posed in this video by Matt Parker: https://www.youtube.com/watch?v=_-AfhLQfb6w


Introduction

I came to the same algorithm everyone else did independently, but I've added a few things that help (mainly #6). My python code uses the same algorithm as the c++, but has better comments and is easier to read.

  1. Depth first search. Pick one word at a time, check if it collides with previous word.
  2. Use bitmasks to represent the words for faster detection if a word reuses a letter.
  3. When picking the next word, only check words that are using the lowest bit (i.e. letter) missing. A tiny hitch is that you have to keep track of which letter you get to skip.
  4. Store each word in separate lists based the lowest bit. By separating them, you only have to check one list of words to use, dramatically reducing how many words to check at each layer of the depth search.
  5. Instead of putting the bits in alphabetic order, sort letters/bits based on their frequency in all words. When combined with the above steps, this makes the search start with the hard letters to use. Otherwise, you'll start with a word like learn which eats up tons of easy letters making it hard to find words that need them.
  6. As in step 4, also separate lists based on if they've used the most common letters. This is equivalent to essentially precaching bitmask collision detections. Out of 5977 unique bitmasks, with steps 1-5, you completely skip and never check 409 words. With this step, 2230 are never checked directly. How many of the top bits to use affects performance. Too many and you waste time fumbling through lists. Too few and you miss out on the sweet caching opportunities. In my code, this is controlled by theletter_buckets variable/constant. Experimentally, 4-7 is pretty good. Actual best is language and hardware dependent.
  7. Multithreading! I'm proud to say that my single threaded algorithm can hold it's own against oisyn/parkerwords which uses multithreading. But there's no reason this algorithm can't be multithreaded as well. I completely stole oisyn's implementation here, so I can't claim that as my own. It was eerie reading their code and seeing how similar it was to mine.

Example sorted word table into separate buckets. Rows are the least common letter used. Columns are which most common letters they don't use. This table only uses the top 3 bits. My c++ seems to prefer 6, and python likes 5. Other peoples' implementations effectivly use 0 and collapse each row into 1 bucket.

sea ea sa a se e s {}
q moqui
quick
10 more
hdqrs
qophs
9 more
cequi
coque
16 more
queys
quest
2 more
antiq
faqir
20 more
qaids
qiyas
10 more
aequi
caque
9 more
sequa
j bijou
brujo
30 more
djins
fujis
20 more
bejig
benjy
43 more
ejusd
jebus
15 more
anjou
arjun
62 more
hajis
jacks
14 more
bejan
djave
19 more
hajes
jades
7 more
x boxty
bronx
25 more
nyxis
oxids
8 more
bemix
beryx
67 more
boxes
coxes
24 more
acrux
adfix
47 more
axils
axons
5 more
adnex
axile
23 more
axels
axers
6 more
z blitz
bortz
29 more
grosz
liszt
12 more
benzo
bezil
41 more
bizes
cozes
14 more
arzun
azido
58 more
azons
czars
12 more
adoze
adzer
39 more
adzes
fazes
8 more
v bovid
bovld
48 more
divus
intsv
23 more
bevil
bevor
92 more
coves
dives
32 more
admov
alvin
69 more
alvus
amvis
27 more
above
aevum
65 more
avens
avers
12 more
f clift
comfy
101 more
bumfs
coifs
77 more
befit
befog
93 more
chefs
clefs
36 more
adolf
afgod
94 more
afros
alifs
50 more
afire
afley
43 more
alefs
cafes
9 more
w blowy
blown
83 more
blows
brows
63 more
below
bewig
87 more
bowse
brews
43 more
ablow
adown
111 more
aswim
awols
57 more
alowe
awber
41 more
askew
awest
9 more
k bikol
birky
120 more
bilks
birks
105 more
becky
biked
98 more
becks
bikes
49 more
aking
akron
130 more
adusk
amoks
61 more
ackey
acker
45 more
alkes
asked
9 more
b bichy
bidry
117 more
bhuts
bilos
62 more
becry
becut
110 more
belis
belts
40 more
abdom
abhor
123 more
abysm
abris
37 more
abend
abide
55 more
abets
abies
7 more
g ching
chung
121 more
chugs
clogs
67 more
chego
cheng
94 more
doges
dregs
30 more
acing
agony
110 more
agios
agism
45 more
aegir
agend
44 more
aegis
agers
9 more
p chimp
chirp
104 more
chips
chops
80 more
chelp
cypre
92 more
copes
dopes
40 more
acoup
adopt
99 more
alisp
aphis
52 more
adept
ayelp
38 more
adeps
aesop
9 more
m chimu
chirm
75 more
chums
comus
49 more
celom
chime
74 more
cymes
comes
35 more
admin
admit
85 more
adsum
alums
37 more
admen
ahmed
37 more
acmes
ahems
8 more
h child
chino
66 more
chins
chits
47 more
cheir
chert
51 more
chest
choes
27 more
achor
ahind
65 more
alish
arish
35 more
ached
achen
23 more
aches
ashed
7 more
y cydon
cindy
40 more
cyrus
cloys
34 more
cedry
ceryl
37 more
cosey
desyl
16 more
acidy
acryl
42 more
acyls
ayins
23 more
acedy
aiery
12 more
ayens
asyle
3 more
d cloud
contd
23 more
clods
cords
29 more
cetid
cider
39 more
cedis
codes
21 more
acold
acrid
38 more
acids
adios
22 more
acned
acred
21 more
aides
andes
5 more
c clint
cloit
18 more
cions
clons
21 more
ceint
cento
20 more
ceils
celts
16 more
acoin
acorn
23 more
acost
actus
15 more
acier
acone
13 more
acies
acnes
4 more
u rutin
tourn
3 more
inust
irous
13 more
enrut
intue
13 more
etuis
euros
11 more
alout
altun
13 more
ainus
alnus
10 more
alenu
aleut
6 more
aures
salue
2 more
n inrol
intro
nilot
insol
instr
5 more
eloin
enrol
7 more
elsin
enols
7 more
aloin
altin
7 more
airns
anils
7 more
aeron
alien
7 more
aeons
anise
3 more
t lirot islot
isort
2 more
eliot
lerot
liter
islet
iters
3 more
ariot
latro
2 more
airts
alist
4 more
alert
aliet
3 more
aotes
arest
2 more
l loris oiler liers
lores
aliso
arils
orals
ariel
leora
aisle
aloes
arles
o ireos orias arose
i aesir
r
s
e
a

Future optimizations

  • Use mmap to load the file. I'm on Windows so I can't, but I hear it can be an additional 10% speedup.
  • Multithread the file parsing too. There's 370k words in the words list. If thread overhead isn't too much, this is easily parallelable.
  • Get someone like stew675 or miniBill to implemented it along side all of their other hyper micro optimizations. hint hint, nudge nudge, wink wink.
  • I suspect that ordering my top bits based on frequency alone may not be optimal. Just like the optimal wordle strategy should consider the combined effect of multiple guesses together instead of each guess individually.