Nothing ever finishes. In 2019, I wrote about solving the Spelling Bee puzzle in J. This was more of an exercise in tacit programming, as it was algorithmically less than ideal. My actual, performant solution is a server application, formerly written in Go, but now available in Haskell. This refactor hinges upon a brand new algorithm, which I figured was worth discussing in an agnostic manner here.
Words are sourced from /usr/share/dict/words. This list is filtered and used to construct a dictionary of valid puzzle answers. This is a fun problem because it's not concerned with how many times a letter appears in a word, only with the presence or absence of a character. The puzzle string "baeiouc" is satisfied by both "boa" and "baobab,"1 as each word is represented by the same set
{b,a,o} ⊆ {b,a,e,i,o,u,c}
With this in mind, every word that satisfies "baeiouc" is a set S where
S ⊆ {b,a,e,i,o,u,c} and b ∈ S
Removing the primary letter "b" and tallying 2^|{a,e,i,o,u,c}| = 2^6 yields 64 subsets of valid letters for this puzzle, or any puzzle.
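The set relationship above can be sanity-checked in a few lines of Python (the names here are illustrative, not from the actual implementation):

```python
# Both "boa" and "baobab" reduce to the same letter set {b, a, o},
# which is a subset of the puzzle letters and contains the primary "b".
puzzle = frozenset("baeiouc")
primary = "b"

def satisfies(word: str) -> bool:
    letters = frozenset(word)
    return letters <= puzzle and primary in letters

print(satisfies("boa"))     # True
print(satisfies("baobab"))  # True
print(satisfies("cat"))     # False: "t" is not a puzzle letter

# Removing the primary letter leaves 6 secondary letters,
# so there are 2**6 == 64 candidate subsets.
print(2 ** len(puzzle - {primary}))  # 64
```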
The dictionary is a map S → {W₁ … Wₙ}, where S is a set of letters, |S| ≤ 7, and each Wᵢ is a word represented by S.
Querying the dictionary with all 64 subsets of a puzzle string produces all valid words. Among these subsets will be a single set with a cardinality of 7, which contains every three-point value.
This solution is pretty good from a computational perspective. Every problem is solved in a precise number of iterations—no searching necessary. The words associated with a given subset can also be a single newline-intercalated string, since they will always be printed together. Lookups to the dictionary are also trivially parallelized, though I doubt it's even worth spinning up threads here. All that's required is an efficient way to key/query the map by subsets of letters.
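As a sketch of the querying step, the 64 subset keys can be enumerated directly with itertools; the toy dictionary below is a stand-in for the real one:

```python
from itertools import combinations

primary = "b"
secondary = "aeiouc"

# Every key is the primary letter plus some subset of the six
# secondary letters: summing C(6, k) over k gives 2**6 == 64 keys.
keys = [
    frozenset((primary,) + combo)
    for k in range(len(secondary) + 1)
    for combo in combinations(secondary, k)
]
print(len(keys))  # 64

# Toy dictionary: one letter set mapped to its newline-joined words.
dictionary = {frozenset("bao"): "boa\nbaobab"}
answers = [dictionary[k] for k in keys if k in dictionary]
print(answers)  # ['boa\nbaobab']
```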
When dealing with sets drawn from at most 64 distinct elements, a simple Int should spring to mind. The following pseudocode creates an integer hash from any string using bitwise OR logic.
hash ∷ Text → Hash
hash [] = 0
hash (α:ω) = (hash ω) | (1 << (charToInt α % 64))
In plain English: for every letter in a word, get the integer representation of that character and modulo it by 64 to get the nth bit in an integer. Set that bit with an OR operation. The final integer is a hash representing the set of letters in the word.
Calculating this hash for every word and then grouping the words by common integers produces the map.
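In runnable Python, the hash and the grouping step might look like this sketch (it assumes lowercase ASCII words, where ord(c) % 64 gives each letter a distinct bit from 33 to 58):

```python
from collections import defaultdict

def letter_hash(word: str) -> int:
    """OR together one bit per distinct letter in the word."""
    h = 0
    for ch in word:
        h |= 1 << (ord(ch) % 64)
    return h

def build_dictionary(words):
    """Group words by their letter-set hash, storing each group
    as a single newline-intercalated string."""
    groups = defaultdict(list)
    for w in words:
        groups[letter_hash(w)].append(w)
    return {h: "\n".join(ws) for h, ws in groups.items()}

d = build_dictionary(["boa", "baobab", "cab"])
print(d[letter_hash("boa")])  # prints "boa" and "baobab" on separate lines
```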
The 64 subsets of a puzzle string don't need to exist at once in memory. They can be dynamically generated in a recursive procedure, and looked up in the dictionary immediately. More pseudocode below.
query ∷ Int → Hash → Hash → [Text]
query n prim sec
| sec == 0 = []
| firstBitSet? sec = (lookup newPrim) : (query m prim newSec) ++ (query m newPrim newSec)
| otherwise = query m prim newSec
where m = n + 1
newSec = sec >> 1
newPrim = prim | (1 << n)
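The same recursion can be rendered as runnable Python; this is a sketch where lookup misses simply contribute nothing, and the bit positions match the ord(c) % 64 hashing used to build the map:

```python
def query(dictionary, n, prim, sec):
    """Recursively walk the active bits of sec. Each active bit splits
    the recursion: one branch includes the bit in prim, one doesn't."""
    if sec == 0:
        return []
    new_sec = sec >> 1
    if sec & 1:  # bit n of the original sec is active
        new_prim = prim | (1 << n)
        hit = dictionary.get(new_prim)
        found = [hit] if hit is not None else []
        return (found
                + query(dictionary, n + 1, prim, new_sec)
                + query(dictionary, n + 1, new_prim, new_sec))
    return query(dictionary, n + 1, prim, new_sec)

# Tiny demo: primary "b", secondaries "a" and "o", one dictionary entry.
B, A, O = (1 << (ord(c) % 64) for c in "bao")
demo = query({B | A | O: "boa\nbaobab"}, 0, B, A | O)
print(demo)  # ['boa\nbaobab']
```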
Maybe mixing Haskell and C syntax like this just makes everybody unhappy. So, in English.
Let prim be the hash of the primary letter in the puzzle string.

Let sec be the hash of the 6 secondary letters in the puzzle string.

Let n be an integer representing the currently focused bit during a given iteration of query. In the initial call to query, n is 0.

Scan sec one bit at a time, incrementing n during each recursion. Do this until all 6 active bits in sec have been encountered.

If bit n in sec is active, then let newPrim be prim with its nth bit set. Look up newPrim in the dictionary and cons the results against a split recursion: one branch of query called with newPrim, and another called with the old prim.

If bit n in sec is inactive, simply recurse to the next bit without doing anything.

This yields a list of 64 strings containing all valid puzzle answers. The very last entry in the list will be the 3-point values that contain all 7 letters.
Please refer to the linked Haskell repository for the actual implementation. The code may make more sense there, but it's also more complicated. This is a lot faster than some earlier stabs I took at the problem in 2018, but there may be more room for optimization: using unboxed arrays instead of a list, exploring alternatives to Text, seeing where laziness helps/hurts queries, etc. There are also plenty of nonsense queries among the 64 subsets, like consonant-only strings. Perhaps these can be avoided entirely through more intelligent hashing.