from Guide to Hacking on Jan 7, 2024
Why Wordle dies in 2028
You've probably heard of Wordle, the puzzle game that boomed in popularity in late 2021 and early 2022. What you may not know is that the game has a limited lifespan.
Wordle was built to last 6 years
Wordle is a word puzzle game, where the goal is to guess a single 5-letter word, within 6 tries. Each try, the letters in your guess are highlighted green, yellow, or gray. These colors mean "correct letter in correct position", "correct letter in wrong position" or "letter not included".
Here's the problem though: There are only so many 5-letter words in the English dictionary, and each day, there's one Wordle puzzle. This means the number of possible words dictates how many days Wordle can run for.
- The original source code for Wordle included a fixed answer list according to redditor u/mucow. This list has 2,315 words, which lasts for just over 6 years.
- According to Wikipedia, Wordle launched in October 2021, which is about 2 years ago, leaving us with 4 years of runway left. This sets the death date for Wordle in spring of 2028.
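To sanity-check that estimate, here's a minimal sketch of the arithmetic. The exact launch day is my assumption (the source only says October 2021), so I picked the first of the month. With that assumption, the last puzzle lands in early 2028, in the same ballpark as the estimate above.

```python
from datetime import date, timedelta

ANSWER_COUNT = 2315            # size of the original answer list (from the article)
LAUNCH = date(2021, 10, 1)     # assumption: first day of the launch month

def last_puzzle_day(launch: date, answers: int) -> date:
    """Day on which the final unused answer is played, at one answer per day."""
    return launch + timedelta(days=answers - 1)

end = last_puzzle_day(LAUNCH, ANSWER_COUNT)
print(f"{ANSWER_COUNT / 365.25:.2f} years of puzzles, ending {end:%B %Y}")
```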
Granted, I'm being a bit dramatic. If Wordle lasts in the spotlight for that long, its publisher will likely just start repeating words. However, let's continue this thought experiment: Is Wordle really doomed to run out of words?
Expand the list of possible answers
One way to handle this, naturally, is to simply expand the number of words we can possibly include as Wordle solutions. There are a few different options already, without making up random words.
Option #1. Use guesses as answers. The original Wordle shipped with a list of possible 5-letter guesses. There are an additional 10,657 words outside of the answers list, which would significantly extend the runway of the game.
Initially, the New York Times appeared to take this approach by creating a new 14,855-word answers list, but in late 2022, it began to curate Wordle by hand so that "the game stays focused on vocabulary that's fun, accessible, lively and varied".
If we dig into that last answers list, we'll see why the New York Times switched to manual curation: Even the first few lexicographically-sorted words contain some wacky ones, like abmho, a unit of electrical conductance, and aalii, a species of bush in Hawaii. These are too uncommon to include as Wordle answers.
Option #2. Use popular dictionary additions. It turns out there isn't a clear way to define how many words there are in the English language, much less how many new words are being introduced [1].
As a rough proxy, we can look at the number of new dictionary entries. From 2017 to 2023, Webster added 690 words per year (250, 840, 640 + 530, 535, 520 + 455, 370, 690). Also from 2017 to 2023, the Oxford English Dictionary added 620 words per year [2]:
- 535 = 137 + 182 + 216
- 1,183 = 240 + 134 + 292 + 349 + 168
- 621 = 157 + 261 + 203
- 712 = 162 + 147 + 14 + 94 + 18 + 160 + 117
- 320 = 151 + 10 + 159
- 622 = 155 + 198 + 113 + 156
- 358 = 186 + 101 + 71
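The averages above are easy to double check. This is just a sketch over the numbers quoted in this post. I'm assuming the nine Webster figures are batches spread over the same seven years, which is the reading that reproduces the 690 figure.

```python
from statistics import mean

# Figures copied from the article, covering 2017 through 2023.
webster_batches = [250, 840, 640, 530, 535, 520, 455, 370, 690]  # assumed: 9 batches over 7 years
oed_yearly = [535, 1183, 621, 712, 320, 622, 358]                # one total per year

YEARS = 7
webster_avg = sum(webster_batches) / YEARS   # total additions divided by years
oed_avg = mean(oed_yearly)                   # plain average of the yearly totals
print(f"Webster: {webster_avg:.0f}/year, OED: {oed_avg:.0f}/year")
```

Both land in the 600 to 700 range, which is all we need for a rough proxy.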
Although some words such as astraphobia, the fear of lightning, are esoteric, many other new additions aren't. For example, allyship is a common term, likely more familiar than even some existing dictionary words such as "phooo". So, new words could be a source of common 5-letter words too.
8.18% of words in Webster's dictionary are 5 letters long. If we assume this percentage stays constant, then the 650 or so new entries added per year would introduce 53 new 5-letter words per year. This extends our runway by at most a few months, from spring to summer of 2028.
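Here's that back-of-the-envelope projection as a small sketch. It optimistically assumes every new 5-letter dictionary entry is usable as an answer, so treat the result as an upper bound rather than a forecast.

```python
FIVE_LETTER_SHARE = 0.0818   # share of 5-letter words in Webster's (from the article)
NEW_ENTRIES_PER_YEAR = 650   # rounded average of new dictionary entries per year
YEARS_LEFT = 4               # remaining runway estimated earlier in the article

new_per_year = FIVE_LETTER_SHARE * NEW_ENTRIES_PER_YEAR
extra_days = new_per_year * YEARS_LEFT   # one new answer buys one extra puzzle day
print(f"{new_per_year:.0f} new 5-letter words per year, "
      f"about {extra_days / 30:.0f} extra months at most")
```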
Option #3. Rebuild our list. Above, we noted that new additions to the dictionary may be popular enough to include as a Wordle solution. This goes for existing words in the dictionary too: Maybe there are other words we've glossed over.
- As an example, redditor u/hi_fiver found that the original Wordle answers list excluded the word "laser".
- Gerund forms of words or other conjugations could be included as well, such as "doing" or "owing".
So, there's a chance there are other commonly-used words that could be re-added. Let's dive deeper into this possibility: What if we rebuilt our Wordle list from scratch?
What qualifies for a Wordle word?
Before we rebuild the list of possible Wordle answers, are there insights from previous strategies that could guide automatic list design?
- According to Wikipedia, Wordle-creator Wardle's partner Shah trimmed down the list of words from 13,000 to 2,000 manually. This made the original list very subjective in its curation.
- The current strategy is not much different, as the New York Times assigned an editor to curate daily Wordle word selections. In this way, the current Wordle is also done manually.
So, unfortunately, prior work doesn't help us with automatic word selection. Let's instead create a set of rules we can enforce automatically.
- Common word. The Wordle solution should be a relatively un-surprising word, so the game is accessible to a wide audience — also so that you go "doh!" when you miss the word. Luckily, this challenge is fairly addressable: pick commonly-used words.
- No "s" or "es" plural forms. According to the New York Times, the answer will never be plural forms that end in "s" or "es". However, it may take on other plural forms. According to them, "the answer will never be FOXES or SPOTS, but it might be GEESE or FUNGI".
- No offensive or sensitive words. According to Screenrant, who likely found the information from u/randybruder on reddit, New York Times removed several words from the Wordle solutions list when it acquired the puzzle. We'll consider this canon.
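To make these rules concrete, here's a rough sketch of how they might look as a single filter. Everything in it is illustrative: the `is_candidate` name, the 1,000,000 threshold, the naive plural check, and the toy data are my own placeholders, not anything the New York Times uses.

```python
def is_candidate(word, frequency, vocabulary, blocklist, min_count=1_000_000):
    """Apply the three rules to one word. All thresholds are illustrative."""
    if len(word) != 5 or not word.isalpha():
        return False
    if frequency.get(word, 0) < min_count:          # rule 1: common word
        return False
    # Rule 2: drop "s"/"es" plurals whose singular is also a known word.
    if word.endswith("es") and word[:-2] in vocabulary:
        return False
    if word.endswith("s") and word[:-1] in vocabulary:
        return False
    return word not in blocklist                    # rule 3: sensitive words

# Toy data, invented for this example.
frequency = {"foxes": 5_000_000, "spots": 4_000_000, "geese": 2_000_000,
             "fungi": 1_500_000, "abmho": 10, "glass": 9_000_000}
vocabulary = set(frequency) | {"fox", "spot", "goose", "fungus"}

kept = sorted(w for w in frequency if is_candidate(w, frequency, vocabulary, set()))
print(kept)
```

The plural check only rejects a word when its singular is in the vocabulary, so "glass" survives while FOXES and SPOTS don't, and irregular plurals like GEESE and FUNGI pass through.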
Let's now apply these criteria towards building a new solutions list. To start, we'll need a list of words to begin whittling down.
Find lists of words
Turns out getting a list of "all" words isn't trivial. However, there are a number of possibilities.
Approach #1. Random online list. A list of 10,000 most common words from a random MIT user profile contains 1,379 5-letter words, but words include uncommon proper nouns like "zdnet". Although there are commonly-used proper nouns we'd like to keep, let's first try a brute force removal of all proper nouns using NLTK's part-of-speech (POS) tagger.
import nltk
import urllib.request

# Newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')

def read(fname):
    """Return the lines of a text file as a list of strings."""
    with open(fname) as f:
        return f.read().splitlines()

def report(name, words, dst):
    # Step 2: keep only 5-letter words.
    step2 = [word for word in words if len(word) == 5]
    # Step 3: drop anything the tagger marks as a proper noun (NNP, NNPS).
    step3 = [word for word, tag in nltk.tag.pos_tag(step2)
             if tag not in ('NNP', 'NNPS')]
    print(f"[{name}] {len(words)} => {len(step2)} => {len(step3)}")
    # Save and return the 5-letter list; the POS-filtered count is only reported.
    with open(dst, 'w') as f:
        for word in step2:
            f.write(f"{word}\n")
    return set(step2)

urllib.request.urlretrieve('https://www.mit.edu/~ecprice/wordlist.10000', 'raw-mit.txt')
mit = report('mit', read('raw-mit.txt'), 'out-mit.txt')
MIT = report('MIT', [w.title() for w in read('raw-mit.txt')], 'out-MIT.txt')
Running the above gives us the following outputs.
[mit] 10000 => 1379 => 1376
[MIT] 10000 => 1379 => 29
This means NLTK doesn't filter out proper nouns very well at all. In the first case, NLTK filters out just 0.2% of the 5-letter words as-is, reducing 1379 to 1376 words and retaining "zdnet". In the second case, I capitalize every word before applying the POS tagger, and now 97%+ of all words are filtered out, reducing 1379 to 29 words and eliminating words like "wrong". Equally unhelpful.
Approach #2. Official UNIX words. Unix and Unix-based operating systems come with a list of 235,000+ words used by spell-checking programs, found at /usr/share/dict/words. We can reuse the script above.
unix = report('Unix', read('/usr/share/dict/words'), 'out-unix.txt')
Running the above script gives us the following outputs.
[Unix] 235976 => 10239 => 8333
This list contains 10,239 5-letter words, and because proper nouns are often capitalized, NLTK filters this list down to 8,333. Unfortunately, the first few words "aalii", "abaff" are still completely unfamiliar, and I'm additionally not sure if these words are real per se. I also still don't trust the NLTK POS tagger, given its capitalization sensitivity.
Approach #3. Use a dictionary. Project Gutenberg makes Webster's unabridged dictionary freely available. There are a few tricky gotchas to be aware of: Every word in the CSV is denoted in all caps, on its own line, but certain entries are suffixes or prefixes. Additionally, different conjugations of words will be included in a bracket statement right underneath the word. The script below handles these cases.
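def read_websters(fname):
    # Helpers (remove_etymology, extract_plural, extract_conjugations,
    # is_valid_entry) are not shown here.
    words = set()
    pos_line = False  # True when the previous line was a headword
    for line in read(fname):
        if pos_line:
            # The line right under a headword lists plurals and conjugations.
            line = remove_etymology(line)
            words.update(extract_plural(line))
            words.update(extract_conjugations(line))
            pos_line = False
        if is_valid_entry(line):
            pos_line = True
            words.add(line)
    return words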
Running the script above, we'll see the following.
[websters] 99047 => 6425
According to the output above, the Gutenberg dictionary contains 5,954 5-letter words, or 6,425 if you include conjugations. Both numbers exclude simple plurals that end in "s".
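The helpers called in read_websters aren't listed in this post. As a rough idea of what the entry check could look like, here's a minimal sketch of a hypothetical `is_valid_entry`. It assumes headwords sit alone on a line in capital letters and that prefixes and suffixes carry a hyphen (for example "-NESS"). The real Gutenberg file likely has more edge cases than this handles.

```python
import re

# Hypothetical helper: a guess at the check, not the version used for the numbers above.
_HEADWORD = re.compile(r"[A-Z]+")

def is_valid_entry(line: str) -> bool:
    """Return True if the line consists only of capital letters.

    Affix entries such as "-NESS" or "ANTI-" contain a hyphen, and
    definition lines contain lowercase letters, so both are rejected.
    """
    return _HEADWORD.fullmatch(line.strip()) is not None

for sample in ["ABASE", "-NESS", "ANTI-", "Defn: To lower.", ""]:
    print(repr(sample), is_valid_entry(sample))
```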
Taking a look at the outputted words, this list still includes some unheard-of words, such as "buffo" and "furze", but there are also other promising words missing from Wordle: popular proper nouns such as "paris" and "aries", and other vaguely word-like words such as "unhat" and "unwit".
This looks promising, and I'm convinced that we can either start from Webster's 6,425-word list or use this to filter other lists. Let's now use word frequencies to assemble our final list.
Find word frequencies
Google Ngrams and Google Trends already determine word popularity in books and search respectively. Unfortunately, neither offers an API, so this is mostly a dead end, save for sketchy third-party libraries strewn across the web.
Option #1. Free samples from a paid resource. Fortunately, the Corpus of Contemporary American English (COCA) has already computed the most frequent words from 1 billion words gathered across books, Wikipedia, and other sources across the internet.
- They de-duplicated this list in a few different ways and made the top 5,000 of each list free on its samples page [4].
- Across all of COCA's list samples, there are only 1,458 5-letter words. The most common words include "about", "their", "would" and the least common in this list include "could", "these", "thing". This seems about right for commonly-used words, so quality is verified.
This list of 1,458 is much shorter than Wordle's original 2,315, but luckily, the least-common words on this list look fairly common. That means we can expand our list further.
After using more of COCA's sample data [3], COCA has now contributed a total of 1,917 5-letter words [5]. The COCA-based list is still shorter than the original Wordle list, so let's move on to our next option.
Option #2. Use a free dataset. Turns out a number of different researchers have collated word frequency lists or "popular word" lists from large corpuses already.
- Peter Norvig, Director of Research at Google, computed his own frequency lists from a trillion-word corpus. Among the 333,333 most frequent words, there are 4,300 5-letter words.
  - Just 1,683 of these occur over 1,000,000 times in the dataset. The least common words here are "abbot", "ovary", "thyme", and "pence", but a few less-common words litter the list, such as "tutti", "seton", and "lathe".
  - At around the 1,400-word mark, the list is looking much more Wordle-worthy, with all the words now looking extra familiar: "creme", "glide", "pivot", "sinus", "lilac" and more.
import csv

def read_norvig(path):
    # `websters` is the set of Webster's words built earlier (assumed lowercase).
    words = []
    with open(path) as f:
        for line in csv.reader(f, delimiter='\t'):
            word, count = line
            word = word.lower()
            if len(word) == 5 and word in websters:
                words.append(word)
            # Stop below 1,000,000 occurrences or once past 1,400 words.
            if int(count) < 1000000 or len(words) > 1400:
                break
    return words

urllib.request.urlretrieve('https://norvig.com/ngrams/count_1w.txt', 'raw-norvig.txt')
norvig = report('norvig', read_norvig('raw-norvig.txt'), 'out-norvig.txt')
Running the script above, we'll see the following.
[norvig] 1401
- Donald Knuth, emeritus Stanford professor, also computed a list of the most popular 5-letter words as part of the Stanford GraphBase. The list of 5,757 words contains odd entries like "spumy" and "gooks", which were removed after filtering with Webster's, reducing the list to 3,195 words.
  - At the 2,650 word mark, the list looks half-promising, with the last words being "owlet", "barmy", "catty", "wader", and "durst". Not words I could define off the top of my head, but certainly words I've seen.
  - At the 2,300 word mark, the list is better, with the last words being "nutty", "axial", "natal", "clomp", "gored". These are definitely common words that are Wordle-worthy.
import csv

def read_knuth(path):
    with open(path) as f:
        words = []
        for line in csv.reader(f):
            word = line[0].lower()
            # Stop once we are past the 2,300-word mark.
            if len(words) > 2300:
                break
            if len(word) == 5 and word in websters:
                words.append(word)
    return words

urllib.request.urlretrieve('https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt', 'raw-knuth.txt')
knuth = report('knuth', read_knuth('raw-knuth.txt'), 'out-knuth.txt')
Running the script above, we'll see the following.
[knuth] 2301
In sum, we have Norvig's 1,400 5-letter words and Knuth's 2,300. Together, the two lists produce a final Norvig-Knuth list of 2,446 5-letter words. This is not bad, but not great. In short, we've extended Wordle's lifetime from spring to summer of 2028.
Wrap it up with a hack
So, we actually don't need to build from scratch. The original Wordle answer list has already been curated; we're just looking for more.
Turns out the 2,446-word Norvig-Knuth list contributes 523 new 5-letter words to the original Wordle answer list, making our brand new combined Wordle answer list 2,838 words long.
urllib.request.urlretrieve('https://gist.githubusercontent.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b/raw/45c977427419a1e0edee8fd395af1e0a4966273b/wordle-answers-alphabetical.txt', 'wordle-answers.txt')
answers = set(read('wordle-answers.txt'))
print(f"[Wordle] original: {len(answers)}")
v2 = answers | knuth | norvig
print(f"[Wordle] with ours: {len(v2)}")
Running the script above, we'll see the following.
[Wordle] original: 2315
[Wordle] with ours: 2838
This includes words like "diced", "cooky", "molly", "xenon" [6]. That extends the runway of Wordle by 1.5 years until summer of 2029! 😂
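For a quick comparison, here's a small sketch that converts each list size from this post into years of daily puzzles. It assumes one answer per day with no repeats, and the dictionary labels are just my own names for the lists.

```python
DAYS_PER_YEAR = 365.25

# List sizes quoted in the article.
list_sizes = {
    "original Wordle": 2315,
    "Norvig-Knuth": 2446,
    "combined": 2838,
}

for name, size in list_sizes.items():
    print(f"{name:>16}: {size} words = {size / DAYS_PER_YEAR:.1f} years")

# Extra runway bought by the words that are new to the original list.
extra_years = (list_sizes["combined"] - list_sizes["original Wordle"]) / DAYS_PER_YEAR
print(f"extra runway: {extra_years:.1f} years")
```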
More importantly, through this process, we've found a large number of 5-letter words that could be added to Wordle, through some manual curation of our own. No amount of automatic curation, using word frequency, its existence in a dictionary, POS tags, etc. could produce the "perfect" list. So, it looks like the New York Times was right — you just might need a human curator after all … for now.
[1] A number of sources online quote a Harvard-Google study that found 1,022,000 words in the English language and a Global Language Monitor study that found 5,400+ new words per year, and a Quora answer claims 800-1,000 new dictionary entries are added each year. I couldn't find primary sources or clearly-explained methodologies on how these were computed.
[2] The Oxford English Dictionary actually added over 2,000 updates per year. However, these updates included new sub-entries and new meanings of existing words. To make this number more comparable, I went to each update and counted only entirely new entries.
[3] I additionally downloaded COCA's every-10th-word samples by clicking on the "XLSX" links by each of the 4 samples, then exporting these Excel files to CSVs. Across this new corpus, there are 3,918 5-letter words, but this now includes nonsense words like "dirck" and "mantz". Turns out that these words all come from COCA's "Word Forms" list, so we filter that list specifically. Filtering "Word Forms" with the Webster dictionary, the list shortens its contribution from 2,190 5-letter words to just 411. These words include "tutor", "codex" and "auger", but they also include words like "indow", "krang", and "ancon". To ensure quality, I'm excluding Word Forms for a total of 1,917 words.
[4] Click on any "Download" link to download all 4 samples. The other leading search results appear to come from these samples.
[5] I could in theory pay for their dataset, but according to their pricing page, I would need to pay $295. That's a pretty hefty price, so I continued my search elsewhere. To also avoid issues with producing a derivative work using their sample, I trashed this list.
[6] Granted, the list needs further curation. There are maybe an obvious 10 words that should be filtered out from this final subset. I won't publish the final list I got, just the code I used to get there.