[R] R WORDLE tool concepts vary

Fri Feb 25 00:01:32 CET 2022

This forum often discusses different ways of looking at a problem that result in very different R formulations. Many may be RIGHT in a sense, but some clearly have tradeoffs, work only in limited circumstances, or come from some limited way of thinking. Note this message got to be a bit long, so feel free to skip it.

With this in mind, I saw the following article mentioned on R-bloggers about a sort of helper function that can be used as a way to identify what your next possible moves might be in the popular new game called WORDLE. The following article explains it well if interested:

https://www.r-bloggers.com/2022/02/wordle-solve-wordle-with-r/

I am not sure who the author is and clearly you can solve the problem in many ways using many languages. I looked at the code they used to learn from it. It is fairly compact and depends on another package that is used to look up potential words to see if they exist.

  library(hunspell)

  black <- ""

  exclude <- function(local, global = black) {
    LETTERS[!LETTERS %in% c(local, global)]
  }

  wordle <- function(one, two, three, four, five, include = "") {
    words <- apply(expand.grid(one, two, three, four, five), 1, \(x) paste(x, collapse = ""))
    words <- words[grepl(paste0("(?=.*", include, ")", collapse = ""), words, perl = TRUE)]
    words[hunspell_check(words)]
  }

I will not explain everything again, but each move in this game is a guess of a five-letter word. The feedback tells you which letters in your guess are NOT in the word of the day, which, if any, are correct and in the right position, and which are correct but in the wrong position. Based on the feedback, you try another 5-letter word that may enlighten you further. You get six tries. I do not generally need help and always guess it in no more than 4-5 tries, several times in 3 and once, weirdly, in 2. But it can be helpful to get a list of what words remain potentially valid, and thus this function.

In English, the code above accepts inputs that boil down to which of the 26 uppercase letters are allowed at this point in column ONE, and another set for each of columns two through five. It also uses two further pieces of info: one (oddly kept in a global variable) that keeps track of which letters are known not to be in the word at all, and another holding those known to be there at least one time.

The first line can be translated into English:

  words <- apply(expand.grid(one, two, three, four, five), 1, \(x) paste(x, collapse = ""))

Basically it makes a potentially huge data.frame containing every combination of the allowed letters. Early in the game that may mean something in the neighborhood of 20 possible letters for each of as many as five positions, which can mean creating a data.frame on the order of millions of rows. The five columns are then stitched together into the potentially millions of 5-letter strings. UGH!
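To put a number on that, here is a minimal sketch of the arithmetic, plus the same blow-up on a toy scale. The figure of 20 letters per position is just the rough estimate used above, and the toy letters are arbitrary.

```r
# Rows expand.grid() must build when each of five positions allows 20 letters.
n_letters <- 20
n_positions <- 5
n_rows <- n_letters^n_positions
n_rows  # 3200000 candidate strings

# The same growth on a toy scale: 3 letters in each of 2 positions.
toy <- expand.grid(c("A", "B", "C"), c("A", "B", "C"), stringsAsFactors = FALSE)
nrow(toy)  # 9 rows, i.e. 3^2

# Stitch each row into a string, as the original code does.
toy_words <- apply(toy, 1, paste, collapse = "")
```

Every one of those strings then has to be tested, even though almost none of them are words.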

The next line of code uses a regular expression to test that all letters KNOWN at that point to be in the answer are present, and any word missing one of them is filtered out.

  words <- words[grepl(paste0("(?=.*", include, ")", collapse = ""), words, perl = TRUE)]
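To see what that line builds, here is a small sketch with two known letters. The sample words are chosen only for illustration.

```r
# Each known letter becomes one lookahead; collapse glues them together.
include <- c("E", "S")
pat <- paste0("(?=.*", include, ")", collapse = "")
pat
# "(?=.*E)(?=.*S)"

# A word passes only if every lookahead succeeds, i.e. it has both letters.
hits <- grepl(pat, c("SWINE", "STAIR", "ELBOW"), perl = TRUE)
hits
# TRUE FALSE FALSE
```

The lookaheads need perl = TRUE because the default regex engine in R does not support them.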

Finally, the other package is asked to look up tons of remaining words and return those deemed to be TRUE words.

  words[hunspell_check(words)]

Although I appreciated the terseness of the code, I was appalled, especially at how long this ran at times. It seems too brute force! Why make all possible 5 letter words first, rather than select only 5 letter words that match a pattern?

My solution, in English, was to make a regular expression dynamically that looked like:

  "^[cond1][cond2][cond3][cond4][cond5]$"

With appropriate actual conditions, the above would match only five-letter words. If you said that the first character was the letter "S", the condition was written as [S], and if you had a vector of possible letters like c("A", "E", "I", "O", "U"), the condition would obviously be [AEIOU]. Finally, my version of the exclude function above adds a caret symbol, as in [^AEIOU], to mean the negation.
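As a small illustration of those three kinds of condition in one pattern, here is a sketch against a handful of sample words. The words and the any_letter helper are just for this example.

```r
# A class that allows any uppercase letter, spelled out to avoid locale surprises.
any_letter <- paste0("[", paste(LETTERS, collapse = ""), "]")

# First letter S, second a vowel, third NOT a vowel, then any two letters.
pat <- paste0("^", "[S]", "[AEIOU]", "[^AEIOU]", any_letter, any_letter, "$")

candidates <- c("SABLE", "SOBER", "STONE", "TABLE", "SALES")
matches <- grep(pat, candidates, value = TRUE)
matches
# "SABLE" "SOBER" "SALES"
```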

On top of that, why search the entire dictionary rather than one with just 5 letter words? On my machine, it searches a file called en_US.dic with 49,271 lines/words that also includes additional info on some words like:

  zucchini/MS
  zwieback/M
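As an aside, a five-letter list could in principle also be pulled straight from lines in that format. The sketch below works on a few sample lines rather than the real file, whose location varies by system; all sample lines other than the two quoted above are invented for illustration. The flags after the slash refer to affix rules, so the stems alone may miss inflected forms that hunspell_check would accept. This is NOT how I built my list.

```r
# Sample lines in the .dic layout; a real run would use readLines() on the file.
dic_lines <- c("swine/MS", "zucchini/MS", "zwieback/M", "can't", "zebra/MS")

# Drop everything from the slash onward, leaving the bare stem.
stems <- sub("/.*$", "", dic_lines)

# Keep stems made of exactly five letters (no apostrophes or digits).
five <- stems[grepl("^[[:alpha:]]{5}$", stems)]

# Uppercase to match the LETTERS-based conditions used elsewhere.
dict_sketch <- unique(toupper(five))
dict_sketch
# "SWINE" "ZEBRA"
```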

So the first thing I did was run the original wordle function above as if the game had just begun, and save the results in a character vector called dict. The length is down nicely to 6494 five-letter words. And the technique is now not to search the entire set of possible words, potentially repeatedly, in that larger framework, but to match one word at a time against a pattern and keep the results. This runs much quicker, even if the word list is now saved as a file. Here is my first attempt, and I am wondering if this approach has positives or negatives I have not considered.

  ##### WORDLE checker

  ## Find available WORDLE words given stated conditions
  ## assuming "dict" has been already initialized to
  ## contain all legal five-letter words.

  ## Function to gather all possible letters to exclude,
  ## either because they cannot be in the current column
  ## or in any part of the word. The letters selected
  ## are preceded by the ^ symbol to mean negation later.

  xcl <- function(local, global = black) {
    lett <- paste0(c(local, global), collapse="")
    toupper(paste0("^", lett))
  }

  ## Function to make a part of a regular expression containing
  ## the letters handed in between [] but note the first component
  ## will sometimes be a caret.

  re_prepare <- function(lettervec) {
    paste0("[", toupper(paste0(lettervec, collapse="")), "]")
  }

  ## Finally, a function that takes conditions describing what
  ## letters may be used (or not used) and more, and generates
  ## a regular expression that selects possible 5-letter words, 
  ## then removes any that do not have known letters and returns
  ## the result. Normally the result gets auto-printed unless diverted.

  wordle2 <- function(one="", two="", three="", four="", five="", include = "") {
    one.pat <- re_prepare(one)
    two.pat <- re_prepare(two)
    three.pat <- re_prepare(three)
    four.pat <- re_prepare(four)
    five.pat <- re_prepare(five)

    pat <- paste0("^", one.pat, two.pat, three.pat, four.pat, five.pat, "$")
    words <- grep(pat, dict, value=TRUE)
    words <- words[grepl(paste0("(?=.*", include, ")", collapse = ""), words, perl = TRUE)]
    return(words)
  }

Back to me. If I run the following:

  black <- c("O", "R", "A", "F", "L", "U", "T", "P")

  wordle2(
    one = "S",
    two = "W",
    three = "I",
    four = xcl("S"),
    five = "E",
    include = c("E", "S", "W", "I")
  )

It generates:

  [1] "SWINE"
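For the curious, the pattern behind that call can be rebuilt by hand. This sketch repeats the string pasting from xcl and re_prepare inline so it runs on its own, and it uses a three-word toy_dict (an assumption for illustration) in place of the real dict.

```r
black <- c("O", "R", "A", "F", "L", "U", "T", "P")

# What xcl("S") returns: a caret followed by the local and global exclusions.
four <- paste0("^", paste0(c("S", black), collapse = ""))

# One bracketed class per column, anchored at both ends.
classes <- c("S", "W", "I", four, "E")
pat <- paste0("^", paste0("[", classes, "]", collapse = ""), "$")
pat
# "^[S][W][I][^SORAFLUTP][E]$"

toy_dict <- c("SWINE", "SWIPE", "SWORE")
found <- grep(pat, toy_dict, value = TRUE)
found
# "SWINE"
```

SWIPE drops out because P is on the black list, and SWORE because its third letter is not I.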

Of course this would be late in the game. Earlier you get many potential answers and it remains up to the player to use one that helps them solve the puzzle by gathering more data and then updating some of the above variables and calling wordle2 again.

I am SURE plenty of others looking at these two solutions will find yet another way to do it, or will automate other parts, like asking you at the end of each round what the results showed and then updating the variables above so the next call is trivial!
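As a rough sketch of what such bookkeeping might look like: update_state, the state list and the feedback codes "G" (right spot), "Y" (wrong spot) and "B" (absent) are all hypothetical names made up for this example. It also glosses over the game's finer rules for guesses with repeated letters.

```r
# Hypothetical helper: fold one round of feedback into the running state.
update_state <- function(state, guess, feedback) {
  g <- strsplit(toupper(guess), "")[[1]]
  f <- strsplit(toupper(feedback), "")[[1]]
  for (i in seq_along(g)) {
    if (f[i] == "G") {
      # Letter is fixed in this column and known to be in the word.
      state$fixed[i] <- g[i]
      state$include <- union(state$include, g[i])
    } else if (f[i] == "Y") {
      # Letter is in the word, but not in this column.
      state$not_here[[i]] <- union(state$not_here[[i]], g[i])
      state$include <- union(state$include, g[i])
    } else {
      state$black <- union(state$black, g[i])
    }
  }
  # A letter seen as absent in one column but present elsewhere stays allowed.
  state$black <- setdiff(state$black, state$include)
  state
}

state <- list(fixed = rep("", 5), not_here = vector("list", 5),
              black = character(0), include = character(0))
state <- update_state(state, "STARE", "GBBBG")
```

From there, each column's argument would be either the fixed letter or an exclusion built from not_here and black, and include carries over as is.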

What, I wonder, is the thinking process? The anonymous sharer of the first version clearly knew about the grep family of functions as they used it. But they may have relied on the concept of building on a dictionary checker and thus headed down another path. I used grep twice. But what about generality and expansion?

Has anyone made a WORDLE variation that does exactly 4 letters or perhaps 7 letters? The original method starts to fail if you need to create a data.frame with N columns containing as many as 26**N rows: when N is a number like 7, we are as high as 8 billion rows! Brute force does not scale up well. And asking to spell-check even a fraction of that many candidate strings against a dictionary could take hours or more.
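For scale, the worst-case row counts are easy to compute directly; with all 26 letters allowed in every position, N = 7 comes to roughly 8 billion.

```r
# Worst-case number of candidate strings for word lengths 4, 5 and 7.
n <- c(4, 5, 7)
rows <- 26^n

# 26^4 = 456,976; 26^5 = 11,881,376; 26^7 = 8,031,810,176
format(rows, big.mark = ",", scientific = FALSE)
```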

My solution just makes a slightly longer Regular Expression with seven components, even when using the standard dictionary. It is simple enough that I assume the matcher abandons a word as soon as any letter fails to fit: the expression has no alternation or repetition, and it is anchored at the start (and end), so it never has to try multiple ways of matching. The number of items it tries is limited to the size of the dictionary, no matter how long the words being looked for are.
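A minimal sketch of that generalization, where the number of columns is simply the length of a vector. The name wordle_n, its arguments and the toy word list are hypothetical, and treating an empty string as "any letter" is a choice made for this sketch only.

```r
# Hypothetical helper: one character class per position, any word length.
wordle_n <- function(positions, include = character(0), dict) {
  # An empty entry means any letter is allowed in that column.
  positions[positions == ""] <- paste(LETTERS, collapse = "")
  pat <- paste0("^", paste0("[", toupper(positions), "]", collapse = ""), "$")
  words <- grep(pat, dict, value = TRUE)
  # Keep only words containing every letter known to be present.
  for (ch in toupper(include)) {
    words <- words[grepl(ch, words, fixed = TRUE)]
  }
  words
}

toy_dict <- c("SWINE", "SWIPE", "CABINET", "CABARET", "PLAN", "CLAN")

seven <- wordle_n(c("C", "A", "B", "", "^AEIOU", "E", "T"),
                  include = "N", dict = toy_dict)
seven
# "CABINET"

four <- wordle_n(c("", "L", "A", "N"), dict = toy_dict)
four
# "PLAN" "CLAN"
```

The work done is one pattern match per dictionary entry, whatever the word length.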

To end this monologue and invite discussion, I repeat that I am not suggesting anyone play WORDLE or create or use tools like this one. I am curious about ways people think and solve such problems. I am open to better ways.

Avi