from Guide to Hacking on Jan 21, 2024

Our AI overlords suck at 2048

2048, a small puzzle game, blew up in popularity almost exactly a decade ago. So, why talk about it now? Well, just recently, one adventurous soul tried to run ChatGPT on 2048, and the subsequent "GPT crushes my high score in 2048.io" Hackernews post was torn apart. In short, rather than showering praise on ChatGPT, everyone lamented that the original poster (OP) was likely just bad at 2048.

There was a particular set of comments that caught my interest, however. In short, everyone was certain that even random inputs would crush both the poster and ChatGPT.

It turns out that OP and the commenters were both wrong. To OP's chagrin, ChatGPT does indeed perform poorly, but contrary to the commenters' hunches, a random agent mashing all the keys isn't quite enough either.

LLMs do poorly

Let's look at our "AI overlords," the Large Language Models that have been popularized in recent years. I plugged in both the state-of-the-art instruction-tuned open-source models1 and state-of-the-art closed-source models.

Here's how I prompted these models:
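The exact prompt isn't reproduced here, but a minimal version might look like the following sketch. The wording and the one-word move format are my assumptions for illustration, not the prompt actually used in these experiments:

```python
def render_board(board):
    """Render a 4x4 board in the pipe-delimited style used in this post."""
    return "\n".join(
        "|" + "|".join(str(cell) if cell else " " for cell in row) + "|"
        for row in board
    )

def build_prompt(board):
    """Hypothetical prompt template -- the real experiment's wording may differ."""
    return (
        "You are playing the game 2048 on a 4x4 grid.\n"
        "Here is the current board (empty cells are blank):\n"
        f"{render_board(board)}\n"
        "Reply with exactly one move: up, down, left, or right."
    )
```

The model's one-word reply is then parsed and fed back into the game loop, and the updated board is re-rendered into the next prompt.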

The experiments culminated in this table of results. Both the game logic and integrations are open-sourced, so you can reproduce these numbers yourself.

LLM Score Tile
Mistral-7B 332 32
OpenChat-7B 612 64
GPT-3.5-Turbo 800 64
GPT-4 1056 128
Mixtral-8x7B 1600 128

In short, LLMs at best achieve a high score of 1,600 with the largest tile being 128. To make sure my Python version of 2048 was correct, I compared the above scores to minimum and maximum scores from my paper model in How to identify a fake 2048 score. These checks all passed, so the scores are valid.
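One such sanity check can be derived by hand: every merge scores the value it creates, so if every spawned tile is a 2, building a single 2^n tile costs exactly (n - 1) * 2^n points. Here's a small sketch of that lower bound — my own derivation, which may differ in detail from the paper model in the linked post (spawned 4s, which score nothing, would lower the bound):

```python
def min_score_for_tile(tile):
    """Minimum score needed to build `tile`, assuming every spawn is a 2.

    Building 2^n costs 2^n (the final merge) plus twice the cost of 2^(n-1),
    and the recurrence closes to (n - 1) * 2^n.
    """
    n = tile.bit_length() - 1  # tile == 2 ** n
    return (n - 1) * tile
```

For example, reaching a 128 tile requires at least 6 × 128 = 768 points under all-2 spawns, consistent with Mixtral's 1,600-point, 128-tile game.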

To understand how "good" this is, let's look at the quality of the best model's moves. Here are Mixtral's moves over the course of its 164-move game.

→→→→→→→→→→→→→→→→→→→↑→→→→→→→→→→→→→→

↑↑→→→→→→→→↑→→→→→→↑→→→→→→→→→→→→→→→→

→→→→→↑↑→→→→→→→→→→→→→→→→→→→↑→→→→→→→

→→→→→→↑→→→→→→↑↑↑↑↑↑↑↑→↑↑↑↑↑↑→↑↑↑↑↑↑→→

→→→→→↑→↑→→→↑→→→→↑→→→→→→→→

They're mostly rights, with the occasional up. That limited variety is already an improvement over the next-best open-source model: here are OpenChat's moves for its best game, over the course of its 83-move run.

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑

Yes, you're seeing that right. OpenChat simply opted to mash the up button. I made sure that Huggingface's API cache was disabled, added different examples, and tried modifying the instructions. At the end of the day, OpenChat tends to place full faith in the up button.
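One easy way to quantify this behavior is to tally each transcript's move distribution. A minimal sketch — the ten-move sample below is made up for illustration, not taken from the transcripts above:

```python
from collections import Counter

def move_distribution(moves):
    """Fraction of the game spent on each move, from an arrow transcript."""
    counts = Counter(moves)
    total = sum(counts.values())
    return {arrow: count / total for arrow, count in counts.items()}

# Made-up 10-move sample: nine ups and one right.
dist = move_distribution("↑↑↑↑↑↑↑↑↑→")
```

Running this over a transcript like OpenChat's would show a distribution collapsed almost entirely onto a single arrow, while Mixtral's spreads across two.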

Second, I looked at random game states. Here's the last game state for Mixtral.

|4 |8 |32|2 |
|4 |2 |16|32|
|2 |4 |8 |16|
|2 |4 |8 |16|

Clearly, it's not actually game over, because there are adjacent tiles that can be merged. Either up or down would do it. Unfortunately, Mixtral decides on neither and insists on "right", even after prompting it 5 times.

Technically, I could keep raising the temperature and resampling until a legal move finally appears, but the point still stands: Mixtral isn't making terribly sensible decisions either. This makes me think Mixtral is basically picking randomly.
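To confirm that this board isn't actually stuck, here's a small legal-move checker of my own (not part of the open-sourced game logic), run against Mixtral's final state from above:

```python
def legal_moves(board):
    """Return which of the four moves would change a square board."""
    moves = set()
    n = len(board)
    for i in range(n):
        for j in range(n - 1):
            # Horizontal neighbours in row i.
            a, b = board[i][j], board[i][j + 1]
            if a and a == b:
                moves |= {"left", "right"}   # adjacent equal tiles can merge
            if a == 0 and b:
                moves.add("left")            # tile can slide into the gap
            if a and b == 0:
                moves.add("right")
            # Vertical neighbours in column i.
            a, b = board[j][i], board[j + 1][i]
            if a and a == b:
                moves |= {"up", "down"}
            if a == 0 and b:
                moves.add("up")
            if a and b == 0:
                moves.add("down")
    return moves

mixtral_final = [
    [4, 8, 32, 2],
    [4, 2, 16, 32],
    [2, 4, 8, 16],
    [2, 4, 8, 16],
]
```

On this board, only "up" and "down" are legal — the "right" Mixtral insists on does nothing.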

As we'll see shortly, of all the possible competitors, the LLMs turn out to be the worst-performing agents — losing even to naive ones.

Even a naive agent is better

I have a suspicion that Mixtral is simply picking randomly, or — as the commenters say — does no better than random. Let's put that to the test.

For a fair comparison with the LLMs above, I'll run each agent on the exact same seeds 123, 234, and 345, then report the maximum scores just like before. Let's start with an agent that picks randomly among all four moves: this agent scores just 764 points, attaining a maximum tile value of 64.

Out of the gate, things are looking good for Mixtral. Random gets just half of Mixtral's best score of 1,600, so at least on this count the commenters are wrong. But that doesn't mean Mixtral is good: other, similarly naive agents vastly outperform it. Let's add some of those agents to the mix.

  1. Random right-up: Mixtral mostly picked between right and up, so let's pick randomly between just right and up — instead of among all possible moves.
  2. Cycle right-up: Picking randomly is likely no better than simply alternating between moves, so let's alternate between right and up, infinitely.
  3. Cycle left-up-right-down: Cycle through up, left, down, and right infinitely.
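These agents are simple enough to sketch in a few lines each. Here's a minimal version of my own — the generator interface is an illustrative choice, not necessarily how the open-sourced harness is structured:

```python
import itertools
import random

def random_agent(directions, seed=123):
    """Yield moves drawn uniformly at random from `directions`."""
    rng = random.Random(seed)  # seeded for reproducibility, e.g. 123, 234, 345
    while True:
        yield rng.choice(directions)

def cycle_agent(directions):
    """Yield moves by cycling through `directions` forever."""
    yield from itertools.cycle(directions)
```

Each turn, the game driver just pulls `next(agent)` and applies it; neither agent ever looks at the board.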

Again, for a fair comparison with the LLMs above, I ran these agents on the exact same seeds 123, 234, and 345. Here was each agent's high score.

Agent Directions Score Tile
Random up,right 1632 128
Random left,right,up,down 764 64
Cycle up,right 1920 128
Cycle left,right,up,down 2880 256

It's now very clear: even though Mixtral beats the random baseline, it loses out to other very basic agents — agents so basic that they don't even look at the board. In summary, LLMs underperform naive agents in 2048.

From the above, it looks like naively cycling through all of the possible moves performs the best. Let's see if this holds as we increase the sample size. I ran each agent on 1,000 games, then report the average as well as the highest scores.

Agent Directions Score (max) Tile (max) Score (avg) Tile (avg)
Random up,right 3364 256 1293 108
Random left,right,up,down 2056 256 646 69
Cycle up,right 2976 256 1659 125
Cycle left,right,up,down 6364 512 1968 186

This gives us our best-performing automated agent thus far. By just cycling through all possible moves naively, we can achieve a score of 6,364 with the largest value tile being 512. This vastly outperforms the best LLM, Mixtral, by 4x+ in both total score and highest-value tile.

Algorithms are surprisingly good

A number of very effective algorithms have been presented over the years by nneonneo, ovolve, nicola17, and more — a good portion of them concentrated in a single Stack Overflow question.

Unlike general-purpose LLMs, these algorithms are designed specifically for 2048. Unlike our naive agents above, these algorithms use a combination of exploration and heuristics to explore, rank, and pick possible actions based on the board's current state.
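nneonneo's actual solver is a heavily optimized expectimax with bitboards and a tuned heuristic, but to give a flavor of the approach, here's a toy, unoptimized expectimax sketch of my own. The free-cell heuristic and the search depth are illustrative choices, not the designers':

```python
def slide_left(row):
    """Slide and merge one row leftward; return (new_row, score_gained)."""
    tiles = [t for t in row if t]
    out, score, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)      # merge equal neighbours
            score += tiles[i] * 2         # merges score the value created
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(row) - len(out)), score

def move(board, direction):
    """Apply a move to a 4x4 board (list of lists); return (board, score)."""
    new, gained = [], 0
    if direction in ("left", "right"):
        for row in board:
            r = row[::-1] if direction == "right" else row
            r, s = slide_left(r)
            new.append(r[::-1] if direction == "right" else r)
            gained += s
    else:
        done = []
        for col in map(list, zip(*board)):   # work column-by-column
            c = col[::-1] if direction == "down" else col
            c, s = slide_left(c)
            done.append(c[::-1] if direction == "down" else c)
            gained += s
        new = [list(r) for r in zip(*done)]
    return new, gained

def chance_value(board, depth):
    """Expected value over all spawn positions (2 w.p. 0.9, 4 w.p. 0.1)."""
    empties = [(i, j) for i in range(4) for j in range(4) if board[i][j] == 0]
    if depth == 0 or not empties:
        return sum(c == 0 for row in board for c in row)  # free-cell heuristic
    total = 0.0
    for i, j in empties:
        for val, p in ((2, 0.9), (4, 0.1)):
            board[i][j] = val
            total += p * player_value(board, depth - 1)
            board[i][j] = 0
    return total / len(empties)

def player_value(board, depth):
    """Best achievable value over the player's legal moves."""
    best = 0.0
    for d in ("up", "left", "down", "right"):
        nb, gained = move(board, d)
        if nb != board:
            best = max(best, gained + chance_value(nb, depth))
    return best

def best_move(board, depth=1):
    """Pick the legal move with the highest expectimax value, or None."""
    options = {}
    for d in ("up", "left", "down", "right"):
        nb, gained = move(board, d)
        if nb != board:
            options[d] = gained + chance_value(nb, depth)
    return max(options, key=options.get) if options else None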

To compare across algorithms, designers report the percentage of times their algorithm wins the game and reaches subsequent larger tiles. Here are some of the best algorithms from that thread. Each column is the percentage of time, across 100 games, that the algorithm reaches that value tile.

Designer 2048 4096 8192 16384 32768
aszczepanski 100% 100% 100% 97% 64%
nneonneo 100% 100% 100% 94% 36%
ovolve ~90% - - - -
nicola17 ~85% ~40% - - -

Note that the percentages above stop at 32,768: both of the top algorithm designers — nneonneo and aszczepanski ("cauchy") — note at the end of their posts that their algorithms don't yet attain the 65,536 tile.

This is really promising; let's now look at the highest scores these algorithms produce. After all, we're interested in the highest score.

Designer Score
nneonneo 839732
aszczepanski 609104
ronenz 129892
nicola17 131040

These scores23, coupled with the previous remarks from the original designers, give us a good idea of where the state-of-the-art algorithms now stand: The best algorithms attain a score of 839,732 with the highest tile value being 32,768. This again outperforms all of our competitors so far, outstripping our naive automated algorithm by over 100x and LLMs by over 500x.

Humans still reign king

Turns out the highest human score is tough to find. It sounds weird saying "human," but I gotta distinguish between automated agents and us somehow. It turns out that a significant portion of Reddit high scores are simply made up — I talk about how to tell if a score is fake in How to identify a fake 2048 score.

Fortunately for us, there's a "2048 masters"4 website with a public leaderboard. The leaderboard there is our gold standard for undo-less, purely-human gameplay. There, the highest human score is 840,076, set by Popescu. Although we can't see the highest-value tile Popescu achieved, we can guesstimate it's 32,768.

Popescu achieved at least the 32,768 tile: Popescu's "Super Grandmaster" (Super GM) status, according to the 2048 masters accreditation page, means they achieved the 32,768 tile twice. They likely hit 32,768 again en route to their highest score yet.

Popescu did not achieve the 65,536 tile: Another Super GM u/733094_2 achieved a high score of 740,336 and likewise reached the 32,768 tile.

So, in short, humans have attained a high score of 840,076, with the largest tile being 32,768. This edges out all of our competitors so far, barely surpassing the best algorithms to date.

Humans may have had an edge in time though. Academic interest appears to have fizzled after 2016, whereas players continued to improve on the highest score until recently. Popescu set his own world record sometime between November 2020 and March 2021.

Takeaways

All of our competitors today — 2048-specific algorithms, human experts, and even naive agents — outperform the "AI overlord" LLMs. No amount of prompt engineering will overcome a 500x difference in scores, so we can rest easy for a bit: Algorithm designers and players alike still dominate 2048.


back to Guide to Hacking



  1. Calling the inference API for Llama2 requires HuggingFace PRO, so I've excluded it from these results. 

  2. nneonneo's score was produced by an independent researcher — as far as I can tell — running nneonneo's algorithm 1,000 times. According to the author Dr. Olson in his post, "The best instance built a 32.768 tile and stayed alive long enough to reach a score of 839,732." 

  3. aszczepanski's score is the reported average over 100 runs for the designer's best algorithm. However, I couldn't find the highest score even in the author's paper. 

  4. The website shut down in May 2023, but their leaderboard is still online. In anticipation of the website eventually going down, I've linked to the Web Archive versions of their website, above.