from Guide to Hacking on Jan 21, 2024
Our AI overlords suck at 2048
2048, a small puzzle game, blew up in popularity almost exactly a decade ago. So, why talk about it now? Well, just recently, one adventurous soul tried to run ChatGPT on 2048, and the subsequent "GPT crushes my high score in 2048.io" Hacker News post was torn apart. In short, rather than showering praise on ChatGPT, everyone lamented that the original poster (OP) was likely just bad at 2048.
- "OP is just bad at 2048"
- "…are you just not very good at 2048?"
- "OP is just super bad at 2048"
- "But he's excellent at clickbait!"
There was a particular set of comments that caught my interest, however. In short, everyone was certain that even random inputs would crush both the poster and ChatGPT.
- "Random inputs may get you a higher score than that."
- "I just played 2 games randomly pressing all arrow keys with closed eyes and got a score of ~1100 in the first game and 1468 in the second game. OP's AI agent scored 1348"
- "A set of random inputs will likely beat 128."
It turns out that OP and the commenters were both wrong. To OP's chagrin, ChatGPT is indeed performing poorly, but contrary to the commenters' hunches, a random agent mashing all the keys isn't quite enough to beat it.
LLMs do poorly
Let's look at our "AI overlords," the Large Language Models that have been popularized in recent years. I plugged in both the state-of-the-art instruction-tuned open-source models1 and state-of-the-art closed-source models.
Here's how I prompted these models:
- I ran all models for 3 seeds — 123, 234, and 345.
- In the final version, I prompted models with an instruction, examples, and the current board state. I made sure each model understood these instructions, first.
- Model performance across seeds is highly variable, so I took the maximum of their scores. As we'll see later, even this wasn't enough to save the LLMs.
- I attempted other variants, such as feeding in previous moves, previous boards, or previous conversation. Additional context didn't improve the high score for these models.
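As a rough sketch of that setup, prompt construction might look like the following. The instruction wording, the worked example, and the function names here are my own guesses for illustration, not the exact prompts from the open-sourced repo:

```python
# Hypothetical sketch of the final prompt format: an instruction, a
# worked example, and the current board state.
INSTRUCTION = (
    "You are playing 2048. Reply with exactly one move: "
    "'up', 'down', 'left', or 'right'."
)

EXAMPLE = """\
Board:
|2 |2 |. |. |
|. |. |. |. |
|. |. |. |. |
|. |. |. |. |
Move: left
"""

def render_board(board):
    """Render a 4x4 board in the same pipe-delimited style used below."""
    return "\n".join(
        "|" + "|".join(f"{v or '.':<2}" for v in row) + "|" for row in board
    )

def build_prompt(board):
    """Assemble instruction, example, and current board into one prompt."""
    return f"{INSTRUCTION}\n\n{EXAMPLE}\nBoard:\n{render_board(board)}\nMove:"
```

The model's reply is then parsed for one of the four direction words and fed back into the game loop.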
The experiments culminated in this table of results. Both the game logic and integrations are open-sourced, so you can reproduce these numbers yourself.
LLM | Score | Tile |
---|---|---|
Mistral-7B | 332 | 32 |
OpenChat-7B | 612 | 64 |
GPT-3.5-Turbo | 800 | 64 |
GPT-4 | 1056 | 128 |
Mixtral-8x7B | 1600 | 128 |
In short, LLMs at best achieve a high score of 1,600 with the largest tile being 128. To make sure my Python version of 2048 was correct, I compared the above scores to minimum and maximum scores from my paper model in How to identify a fake 2048 score. These checks all passed, so the scores are valid.
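One handy property behind checks like these: creating a tile of value v scores v points and consumes two v/2 tiles. So, assuming every spawned tile is a 2, building a single 2^n tile contributes exactly (n - 1) * 2^n points. A sketch:

```python
def min_score_for_tile(tile):
    """Points earned building a single `tile`, assuming every spawn is
    a 2. Merging two tiles of value v/2 scores v points, so a 2**n tile
    contributes (n - 1) * 2**n points in total. Real scores run higher,
    since the rest of the board earns points too (and can run lower
    when 4s spawn, since 4s arrive "for free")."""
    n = tile.bit_length() - 1  # tile == 2**n
    return (n - 1) * tile
```

For example, Mixtral's 128 tile alone accounts for min_score_for_tile(128) = 768 of its 1,600 points; the rest comes from the other tiles on the board.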
To understand how "good" this is, let's look at the quality of the best model's moves. First, here are Mixtral's moves over the course of its 164-move game.
→→→→→→→→→→→→→→→→→→→↑→→→→→→→→→→→→→→
↑↑→→→→→→→→↑→→→→→→↑→→→→→→→→→→→→→→→→
→→→→→↑↑→→→→→→→→→→→→→→→→→→→↑→→→→→→→
→→→→→→↑→→→→→→↑↑↑↑↑↑↑↑→↑↑↑↑↑↑→↑↑↑↑↑↑→→
→→→→→↑→↑→→→↑→→→→↑→→→→→→→→
They're composed almost entirely of rights and ups. Even so, the variety here is an improvement over the next-best open-source model. OpenChat's moves for its best game, an 83-move affair, look like the following.
↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
Yes, you're seeing that right. OpenChat simply opted to mash the up button. I made sure that Huggingface's API cache was disabled, added different examples, and tried modifying the instructions. At the end of the day, OpenChat tends to place full faith in the up button.
Second, I looked at random game states. Here's the last game state for Mixtral.
|4 |8 |32|2 |
|4 |2 |16|32|
|2 |4 |8 |16|
|2 |4 |8 |16|
Clearly, it's not actually game over, because there are adjacent tiles that can be merged. Either up or down would do it. Unfortunately, Mixtral decides on neither and insists on "right", even after prompting it 5 times.
Technically, I could keep raising the temperature and resampling to finally get a move that works, but the point still stands: Mixtral isn't making terribly sensible decisions either. This makes me think Mixtral is basically picking randomly.
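To see which moves are actually legal on a board like the one above, we can check which directions would change it. Here's a minimal sketch (my own code, not the post's open-sourced game logic):

```python
def slide_left(row):
    """Slide and merge one row to the left, 2048-style (no spawn)."""
    tiles = [v for v in row if v]
    out = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)  # merge a pair
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(row) - len(out))

def legal_moves(board):
    """Return the set of directions that would change the board."""
    cols = [list(c) for c in zip(*board)]
    moves = set()
    if any(slide_left(r) != r for r in board):
        moves.add("left")
    if any(slide_left(r[::-1])[::-1] != r for r in board):
        moves.add("right")
    if any(slide_left(c) != c for c in cols):
        moves.add("up")
    if any(slide_left(c[::-1])[::-1] != c for c in cols):
        moves.add("down")
    return moves

# Mixtral's final board from above.
board = [
    [4, 8, 32, 2],
    [4, 2, 16, 32],
    [2, 4, 8, 16],
    [2, 4, 8, 16],
]
```

Running `legal_moves(board)` on Mixtral's final state confirms the point: only up and down change the board, and "right" does nothing.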
As we'll see shortly, of all the possible competitors, LLMs turn out to be the worst-performing agents, losing even to naive ones.
Even a naive agent is better
My suspicion is that Mixtral is simply picking randomly, or, as the commenters say, doing no better than random. Let's put that to the test.
For a fair comparison with the LLMs above, I'll run on the exact same seeds 123, 234, and 345. Then, I'll report the maximum scores just like before. Let's start with an agent that picks randomly: This agent scores just 764 points, attaining a maximum tile value of 64.
Out of the gate, things are looking good for Mixtral: random gets less than half of Mixtral's best score of 1,600. So at least so far, the commenters are wrong. But this doesn't mean Mixtral is good. As we'll see shortly, other similarly naive agents vastly outperform it. Let's add some other agents to the mix.
- Random right-up: Mixtral only picked between right and up, so let's have an agent pick randomly between right and up — instead of among all four moves.
- Cycle right-up: Picking randomly is likely no better than just alternating between moves, so let's also try cycling between right and up, infinitely.
- Cycle left-up-right-down: Cycle through up, left, down, and right, infinitely.
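These baselines take only a few lines to implement. Here's a sketch, assuming an agent is just a function from a board state to a direction string (the names are mine, not from the repo):

```python
import itertools
import random

def make_random_agent(directions, seed):
    """Agent that picks uniformly among `directions`, ignoring the board."""
    rng = random.Random(seed)
    return lambda board: rng.choice(directions)

def make_cycle_agent(directions):
    """Agent that cycles through `directions` forever, ignoring the board."""
    order = itertools.cycle(directions)
    return lambda board: next(order)

cycle_all = make_cycle_agent(["up", "left", "down", "right"])
moves = [cycle_all(None) for _ in range(6)]
```

Note that neither kind of agent even inspects the board argument; they're as naive as agents get.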
Again, for a fair comparison with the LLMs above, I ran these agents on the exact same seeds 123, 234, and 345. Here was each agent's high score.
Agent | Directions | Score | Tile |
---|---|---|---|
Random | up,right | 1632 | 128 |
Random | left,right,up,down | 764 | 64 |
Cycle | up,right | 1920 | 128 |
Cycle | left,right,up,down | 2880 | 256 |
It's now very clear: even though Mixtral beats the random baseline, it loses out to other very basic agents — agents so basic that they don't even look at the board. In summary, LLMs underperform even naive agents in 2048.
From the above, it looks like naively cycling through all of the possible moves performs best. Let's see if this holds up as we increase the sample size: I ran each agent on 1,000 games and report both the average and the highest scores.
Agent | Directions | Score (max) | Tile (max) | Score (avg) | Tile (avg) |
---|---|---|---|---|---|
Random | up,right | 3364 | 256 | 1293 | 108 |
Random | left,right,up,down | 2056 | 256 | 646 | 69 |
Cycle | up,right | 2976 | 256 | 1659 | 125 |
Cycle | left,right,up,down | 6364 | 512 | 1968 | 186 |
This gives us our best-performing automated agent thus far. By just cycling through all possible moves naively, we can achieve a score of 6,364 with the largest value tile being 512. This vastly outperforms the best LLM, Mixtral, by 4x+ in both total score and highest-value tile.
Algorithms are surprisingly good
A number of very effective algorithms have been presented over the years by nneonneo, ovolve, nicola17 and more — a good portion of them concentrated in a single Stack Overflow question.
Unlike general-purpose LLMs, these algorithms are designed specifically for 2048. Unlike our naive agents above, these algorithms use a combination of exploration and heuristics to explore, rank, and pick possible actions based on the board's current state.
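As a toy illustration of the heuristic idea (much simpler than the expectimax searches in that thread), here's a one-ply greedy agent that ranks each legal move by how many empty cells it leaves behind. The heuristic and all names are mine, purely for illustration:

```python
def slide_left(row):
    """Slide and merge one row to the left, 2048-style (no spawn)."""
    tiles = [v for v in row if v]
    out = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(row) - len(out))

def apply_move(board, direction):
    """Return the board after sliding in `direction`, without spawning."""
    if direction == "left":
        return [slide_left(list(r)) for r in board]
    if direction == "right":
        return [slide_left(list(r)[::-1])[::-1] for r in board]
    cols = [list(c) for c in zip(*board)]
    cols = apply_move(cols, "left" if direction == "up" else "right")
    return [list(r) for r in zip(*cols)]

def empty_cells(board):
    return sum(v == 0 for row in board for v in row)

def greedy_agent(board):
    """One-ply greedy: pick the legal move leaving the most empty cells."""
    candidates = [
        (empty_cells(b), d)
        for d in ("up", "left", "down", "right")
        if (b := apply_move(board, d)) != board
    ]
    return max(candidates)[1] if candidates else "up"
```

The real algorithms go much further: they search several plies deep, average over the random tile spawns, and use richer heuristics like monotonicity and smoothness of the board.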
To compare across algorithms, designers report the percentage of times their algorithm wins the game and reaches subsequent larger tiles. Here are some of the best algorithms from that thread. Each column is the percentage of time, across 100 games, that the algorithm reaches that value tile.
Designer | 2048 | 4096 | 8192 | 16384 | 32768 |
---|---|---|---|---|---|
aszczepanski | 100% | 100% | 100% | 97% | 64% |
nneonneo | 100% | 100% | 100% | 94% | 36% |
ovolve | ~90% | - | - | - | - |
nicola17 | ~85% | ~40% | - | - | - |
Note that the percentages above stop at 32,768: both of the top algorithm designers, nneonneo and aszczepanski ("cauchy"), note at the end of their posts that their algorithms don't yet attain the 65,536 tile.
This is really promising; let's now look at the highest scores these algorithms produce. After all, we're interested in the highest score.
These scores23, coupled with the previous remarks from the original designers, give us a good idea of where the state-of-the-art algorithms now stand: The best algorithms attain a score of 839,732 with the highest tile value being 32,768. This again outperforms all of our competitors so far, outstripping our naive automated algorithm by over 100x and LLMs by over 500x.
Humans still reign king
The highest human score turns out to be tough to find. (It sounds weird saying "human," but I gotta distinguish between automated agents and us somehow.) A significant portion of Reddit high scores are simply made up; I talk about how to tell if a score is fake in How to identify a fake 2048 score.
Fortunately for us, there's a "2048 masters"4 website with a public leaderboard. The leaderboard there is our gold standard for undo-less, purely-human gameplay. There, the highest human score is 840,076, set by Popescu. Although we can't see the highest-value tile Popescu achieved, we can guesstimate that it's 32,768.
Popescu achieved at least the 32,768 tile: Popescu's "Super Grandmaster" (Super GM) status, according to the 2048 Master's accreditation page, means they achieved the 32,768 tile twice. They likely achieved 32,768 again to reach their highest score yet.
Popescu did not achieve the 65,536 tile: Another Super GM u/733094_2 achieved a high score of 740,336 and likewise reached the 32,768 tile.
- To additionally reach 65,536 from this state, a player would need to increase their score by 65,536 + 32,768 + 16,384 + … + 8 + 4. Equivalently, that's a score increase of $\sum_{i=2}^{16} 2^i = \sum_{i=0}^{16} 2^i - 3 = (2^{17} - 1) - 3 = 2^{17} - 4 = 131,068$.
- That would bring the total score to 740,336 + 131,068 = 871,404, which is higher than Popescu's score of 840,076. We can roughly guess that Popescu didn't reach the 65,536 tile.
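That arithmetic is easy to double-check:

```python
# Score increase needed to go from the 32,768 tile to the 65,536 tile,
# assuming every spawn is a 2: 4 + 8 + ... + 32,768 + 65,536.
gain = sum(2 ** i for i in range(2, 17))
assert gain == 2 ** 17 - 4 == 131_068

# u/733094_2's score plus that gain exceeds Popescu's 840,076.
assert 740_336 + gain == 871_404
```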
So, in short, humans have attained a high score of 840,076, with the largest tile being 32,768. This again tops all of our competitors so far, barely edging out the best algorithms to date.
Humans may have had an edge in time though. Academic interest appears to have fizzled after 2016, whereas players continued to improve on the highest score until recently. Popescu set his own world record sometime between November 2020 and March 2021.
Takeaways
All of our competitors today — 2048-specific algorithms, human experts, and even naive agents — outperform the "AI overlord" LLMs. No amount of prompt engineering will overcome the 500x difference in scores, so we can rest easy for a bit: Algorithm designers and players alike still dominate 2048.
- Llama2 requires HuggingFace PRO to call its hosted API, so I've excluded it from these results. ↩
- nneonneo's score was produced by an independent researcher — as far as I can tell — running nneonneo's algorithm 1,000 times. According to the author Dr. Olson in his post, "The best instance built a 32,768 tile and stayed alive long enough to reach a score of 839,732." ↩
- aszczepanski's score is the reported average over 100 runs of the designer's best algorithm. However, I couldn't find the highest score, even in the author's paper. ↩
- The website shut down in May 2023, but its leaderboard is still online. In anticipation of the website eventually going down, I've linked to the Web Archive versions of the website, above. ↩
Want more tips? Drop your email, and I'll keep you in the loop.