🤩Oct 30, LLMs cannot Play the Snake Game

The blog introduces a novel method for evaluating LLM performance by having them play the Snake game, assessing their decision-making, planning, and strategy skills. The experiment tested several models, revealing that o1-mini performed best with a score of 11, while Claude models outperformed GPT models. The findings suggest that reinforcement learning significantly enhances LLMs' capabilities in dynamic decision-making tasks. Although preliminary, this approach highlights the potential of game-based assessments for deeper insights into LLM competencies, with recommendations for further testing across more models and scenarios.

July 16, LLMs Evals Thoughts

Evaluating LLMs is important for understanding their abilities and solving real business problems. A good evaluation requires sufficient and high-quality data samples, clear judging criteria, meaningful evaluation tasks, and frequent private benchmarks. The process should adapt to the development of LLMs over time.

July 5, LLMs Evaluation Benchmarks

As the capabilities of Large Language Models (LLMs) continue to evolve, many traditional evaluation benchmarks may require updates. With the rapid progress of these models, researchers are increasingly introducing new evaluation datasets. However, the specific dimensions these datasets assess are often unclear. In this blog, I explore a series of commonly referenced evaluation datasets and highlight the particular model capabilities each was designed to assess, though I do not cover every available dataset.
Chengsheng Deng
Latest posts
Mar 24 Notes on LightRAG
Mar 24, 2025
Dec 6, Some Tests on o1
Mar 14, 2025
Mar 10, Note on BIG-MATH
Mar 10, 2025
Mar 6, Note on QwQ-32B
Mar 6, 2025
Jan 21, Notes on DeepSeek-R1
Mar 6, 2025
The First Pages of 2025 - My January & February Story
Mar 5, 2025