Killer bee

Nothing ever finishes. In 2019, I wrote about solving the Spelling Bee puzzle in J. This was more of an exercise in tacit programming, as it was algorithmically less than ideal. My actual, performant solution is a server application, formerly written in Go, but now available in Haskell. This refactor hinges upon a brand new algorithm, which I figured was worth discussing in an agnostic manner here.

The problem

There are seven letters: one primary letter and six secondary ones.
The player has to use these letters to make words. A letter can be used more than once.
Every word must be five letters or more.
Every word must contain the primary letter.
A valid word is worth one point. A valid word containing all seven letters is worth three points.

The input

A plaintext file of newline-delimited words, like the one found at /usr/share/dict/words. This list is filtered and used to construct a dictionary of valid puzzle answers.
A seven letter string, such as "putinae", where the first letter "p" represents the primary letter, and the following six "utinae" represent the secondary letters. These characters are run against the dictionary to fetch all words that satisfy the problem.

The solution

This is a fun problem because it's not concerned with how many times a letter appears in a word, only the presence or absence of a character. The puzzle string "baeiouc" is satisfied by both "boa" and "baobab,"¹ as each word is represented by the same set

{b,a,o} ⊆ {b,a,e,i,o,u,c}

With this in mind, every word that satisfies "baeiouc" is a set S where

S ⊆ {b,a,e,i,o,u,c} and b ∈ S

Removing the primary letter "b" and tallying 2^{|{a,e,i,o,u,c}|} yields 64 subsets of valid letters for this puzzle, or any puzzle.

The dictionary is a map S → {W₁ … W_n}, where S is a set of letters, |S| ≤ 7, and W_n is any word represented by S.

Querying the dictionary with all 64 subsets of a puzzle string produces all valid words. Among these subsets will be a single set with a cardinality of 7, which contains every three point value.

Dictionary implementation

This solution is pretty good from a computational perspective. Every problem is solved in a precise number of iterations—no searching necessary. The words associated with a given subset can also be a single newline-intercalated string, since they will always be printed together. Lookups to the dictionary are also trivially parallelized, though I doubt it's even worth spinning up threads here. All that's required is an efficient way to key/query the map by subsets of letters.

When dealing with a set |S| ≤ 64, simple Int types should spring to mind. The following pseudocode creates an integer hash from any string using bitwise OR logic.

hash ∷ Text → Hash
hash []    = 0
hash (α:ω) = (hash ω) | (1 << (charToInt α % 64))

In plain English: for every letter in a word, get the integer representation of that character and modulo it by 64 to get the nth bit in an integer. Set that bit with an OR operation. The final integer is hash representing the set of letters for the word.

Calculating this hash for every word and then grouping the words by common integers produces the map.

Querying the dictionary

The 64 subsets of a puzzle string don't need to exist at once in memory. They can be dynamically generated in a recursive procedure, and looked up in the dictionary immediately. More pseudocode below.

query ∷ Int → Hash → Hash → [Text]
query n prim sec
  | sec == 0         = []
  | firstBitSet? sec = (lookup newPrim) : (query m prim newSec) ++ (query m newPrim newSec) 
  | otherwise        = query m prim newSec
    where m       = n + 1
          newSec  = sec >> 1
          newPrim = prim | (1 << n)

Maybe mixing Haskell and C syntax like this just makes everybody unhappy. So, in English.

Let prim be the hash of the primary letter in the puzzle string.
Let sec be the hash of the 6 secondary letters in the puzzle string.
Let n be an integer representing the currently focused bit during a given iteration of query. In the initial call to query, n is 0.
Iterate through sec, one bit at a time, incrementing n during each recursion. Do this until all 6 active bits in sec have been encountered.
If bit n in sec is active, then let newPrim be prim with its nth bit set. Look up newPrim in the dictionary and cons the results against a split recursion: one branch of query called with newPrim, and another called with the old prim.
If bit n in sec is inactive, simply recurse to the next bit without doing anything.

This yields a list of 64 strings containing all valid puzzle answers. The very last entry in the list will be the 3 point values that contain all 7 letters.

Conclusion

Please refer to the linked Haskell repository for the actual implementation. The code may make more sense there, but it's also more complicated. This is a lot faster than some earlier stabs I've taken at the problem in 2018, but there may be more chances for optimization: using unboxed arrays instead of a list, exploring alternatives to Text, seeing where laziness helps/hurts queries, etc. There are also plenty of nonsense queries in the 64 subsets, like consonant-only strings. Perhaps these can be avoided entirely through more intelligent hashing.

Notes

Ignore the 5 letter limit here.