1-Introduction
In this blog post, I'll introduce a novel approach to evaluating LLM performance. While the idea is still preliminary and the experiment is not exhaustive, I hope it provides some valuable insights into assessing LLMs.
The Idea
Let me introduce this concept. It's straightforward: have LLMs play the snake game (not by writing code, but by manipulating the game as a human player would). If an LLM is sufficiently advanced, it should be able to achieve high scores in the game.
Origins of the Idea
This idea stems from a blog post about the Sonnet 3.5 Refresh Benchmark. The author used Claude Artifacts and a simple prompt—"Create an Asteroids game"—to compare the new Sonnet 3.5 with its previous version.
That's the inspiration for my idea. However, I want to take it a step further. Instead of just using a model to write code to develop a game, I propose having the LLM play the game like a human user.
Why the Snake Game
There are several reasons for choosing the Snake game. First, its rules are simple to grasp, yet it tests multiple dimensions at once: decision-making, planning, spatial cognition, and strategy formation. Second, it provides objective quantification through a clear scoring system based on how much food the snake collects. Its complexity is also adjustable: we can change the map size and the food-spawning rules to test different scenarios. Perhaps most importantly, this method avoids the data contamination problems that plague many public benchmarks.
2-Experiments
While my experimental approach may not be rigorously scientific, it offers valuable insights into LLM performance.
2.1-Setup
I tested a select group of models: chatgpt-4o-latest, o1-mini, claude-3-5-sonnet-20240620, claude-3-5-sonnet-20241022, and gpt-4 (with potential for future expansion). For all models, I set the temperature to 1.0 and the maximum tokens to 4096. The snake game's environment size is set to . The game interface is a text-based grid in the terminal window, with 'O' representing the snake's head, 'o' representing the snake's body, and '*' representing the food. The LLM receives various pieces of game information as input, such as the game boundaries, wall positions, and current state, and must respond with a direction (UP, DOWN, LEFT, or RIGHT) to move the snake. This setup allows for a clear assessment of the LLM's ability to interpret the game state and make strategic decisions. A minimal sketch of what this interaction loop might look like is shown below.
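To make the setup concrete, here is a minimal sketch of a text-grid environment and a single decision step. The 10x10 grid size, the `query_llm` callable, and the exact message format are illustrative assumptions, not the code used in the experiment.

```python
# Minimal sketch of the text-grid environment and one decision step.
# The grid size, query_llm() helper, and message format are assumptions.

GRID_W, GRID_H = 10, 10  # assumed size; the actual grid size may differ

def render(snake, food):
    """Render the game state as the text grid the LLM sees.
    'O' = head, 'o' = body, '*' = food, '.' = empty cell."""
    grid = [["." for _ in range(GRID_W)] for _ in range(GRID_H)]
    for x, y in snake[1:]:
        grid[y][x] = "o"
    hx, hy = snake[0]
    grid[hy][hx] = "O"
    fx, fy = food
    grid[fy][fx] = "*"
    return "\n".join("".join(row) for row in grid)

def parse_move(reply, fallback="UP"):
    """Extract the first valid direction from the model's reply."""
    for token in reply.upper().replace(",", " ").split():
        if token in ("UP", "DOWN", "LEFT", "RIGHT"):
            return token
    return fallback  # keep the game going if the reply is malformed

def step(snake, food, query_llm):
    """One turn: show the grid to the model and apply its chosen move."""
    state = render(snake, food)
    prompt = (f"Current board ({GRID_W}x{GRID_H}):\n{state}\n"
              "Reply with UP, DOWN, LEFT, or RIGHT.")
    move = parse_move(query_llm(prompt))
    dx, dy = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}[move]
    hx, hy = snake[0]
    return move, (hx + dx, hy + dy)

# Example with a stub "model" that always answers RIGHT:
if __name__ == "__main__":
    snake = [(4, 4), (3, 4), (2, 4)]  # head first
    food = (7, 4)
    move, new_head = step(snake, food, lambda p: "RIGHT")
    print(render(snake, food))
    print("model chose:", move, "-> new head:", new_head)
```

In the real experiment the `query_llm` callable would wrap the respective model's chat API; the collision and scoring logic then update the snake, respawn the food, and the loop repeats until the game ends.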
2.2-Prompt
I'll now share the prompt I used to instruct the LLM in making decisions. Here's the prompt:
As you can see, this prompt provides extensive information to the LLM, including the snake's head and body locations, wall positions, and game rules. LLMs use this comprehensive data to make their decisions.
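For illustration, a prompt template along the following lines would convey the same kinds of information. This is a hypothetical reconstruction based on the description above, not the exact prompt used in the experiment.

```python
# Hypothetical prompt template (an assumption), covering the information
# described above: board symbols, head/body positions, walls, and rules.
PROMPT_TEMPLATE = """You are playing the Snake game on a {width}x{height} grid.
Symbols: 'O' is the snake's head, 'o' is the snake's body, '*' is the food.

Current board:
{board}

Snake head position: {head}
Snake body positions: {body}
Walls surround the grid: moving outside the board ends the game.
Eating the food grows the snake by one segment and adds one point to your score.
Running into a wall or into your own body ends the game.

Choose the next move. Reply with exactly one word: UP, DOWN, LEFT, or RIGHT."""
```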
3-Results
In this section, I'll present the results of my experiment. The findings are as follows:
| Model | Round | Score | Failure Reason | Average Score |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | 1 | 7 | collision with the wall | 5.67 |
| | 2 | 2 | collision with the body | |
| | 3 | 8 | collision with the wall | |
| claude-3-5-sonnet-20240620 | 1 | 4 | collision with the wall | 3.33 |
| | 2 | 2 | collision with the wall | |
| | 3 | 4 | collision with the wall | |
| chatgpt-4o-latest | 1 | 0 | collision with the wall | 0 |
| | 2 | 0 | collision with the wall | |
| | 3 | 0 | collision with the wall | |
| gpt-4 | 1 | 1 | collision with the wall | 1.67 |
| | 2 | 2 | collision with the body | |
| | 3 | 2 | collision with the wall | |
| o1-mini | 1 | 11 | request timed out | 11 |
As we can see, o1-mini performs best in the snake game, achieving a score of 11. I only ran it once due to its slow response time. Claude outperforms GPT regardless of whether it's the old or new Claude version. The latest Claude version, claude-3-5-sonnet-20241022, shows impressive performance, especially considering it makes decisions much faster than o1-mini. I believe Claude's superior performance over GPT stems from reinforcement learning, which plays a crucial role in enhancing the model's planning, reasoning, and decision-making capabilities. This underscores the importance of reinforcement learning in developing LLMs capable of complex decision-making tasks. While these results are promising, it's important to note that this experiment is still in its early stages. Further testing with a wider range of models and more extensive gameplay scenarios could provide even more insights into the capabilities and limitations of different LLM architectures in handling dynamic, real-time decision-making tasks.
4-Conclusion
In conclusion, evaluating LLMs through their ability to play the Snake game offers a unique perspective on their performance, decision-making, and strategic planning capabilities. Although the experiment is in its initial phase and not exhaustive, it highlights the potential of using game-based assessments to delve deeper into understanding LLM competency. The results indicate that models like Claude, with reinforcement learning enhancements, are better equipped to handle dynamic and strategic tasks compared to others like GPT. This underscores the significance of reinforcement learning in advancing LLM development. Future research should focus on expanding the range of models tested and using more varied gameplay scenarios to further explore the potential and limitations of LLMs in real-time decision-making tasks.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/llm-snake-game
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please indicate the source when sharing!