The blog introduces a novel method for evaluating LLM performance by having them play the Snake game, assessing their decision-making, planning, and strategy skills. The experiment tested several models, revealing that o1-mini performed best with a score of 11, while Claude models outperformed GPT models. The findings suggest that reinforcement learning significantly enhances LLMs' capabilities in dynamic decision-making tasks. Although preliminary, this approach highlights the potential of game-based assessments for deeper insights into LLM competencies, with recommendations for further testing across more models and scenarios.
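To make the setup concrete, here is a minimal sketch of one decision step in such a game-based evaluation, assuming the board is serialized as plain text and the model is queried through the OpenAI chat API; the prompt wording, board encoding, and helper names are illustrative assumptions, not the blog's actual harness.

```python
# One decision step of a Snake-playing LLM evaluation (illustrative sketch).
# Assumption: the game state is rendered as text and the model is asked for a single move.
from openai import OpenAI

client = OpenAI()

def ask_for_move(board_text: str, model: str = "gpt-4o") -> str:
    """Ask the model for the snake's next move, given a text rendering of the board."""
    prompt = (
        "You are playing Snake on a grid. 'H' is the snake's head, 'S' its body, "
        "'F' the food, and '.' an empty cell.\n"
        f"{board_text}\n"
        "Reply with exactly one word: UP, DOWN, LEFT, or RIGHT."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    move = response.choices[0].message.content.strip().upper()
    # Fall back to a default move if the model returns anything unexpected.
    return move if move in {"UP", "DOWN", "LEFT", "RIGHT"} else "UP"
```

A full evaluation would loop this call against a game engine, apply each move, and record the final score per model.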
OpenAI has introduced its new o1 series models, large language models trained with reinforcement learning to enhance complex reasoning capabilities.
With the rapid development of LLMs, the community needs an efficient and accurate way to evaluate LLM performance automatically, since human annotation is tedious and time-consuming. LLM-as-a-Judge has emerged as a practical solution to this need.
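The pattern itself is simple: a strong model grades another model's output. Below is a minimal sketch assuming the judge scores a candidate answer against a reference on a 1-10 scale via the OpenAI chat API; the rubric, scale, and function names are my own illustrative choices, not any particular benchmark's protocol.

```python
# LLM-as-a-Judge sketch: a judge model rates a candidate answer against a reference.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, answer: str, reference: str, judge_model: str = "gpt-4o") -> int:
    """Return an integer score from 1 (poor) to 10 (excellent) for the candidate answer."""
    prompt = (
        "You are an impartial judge. Rate the candidate answer against the reference answer "
        "for correctness and helpfulness.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n\n"
        "Reply with a single integer from 1 to 10 and nothing else."
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 1  # treat unparseable judgments as the lowest score
```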
In this short blog, I will test Chameleon, the newest multimodal model from Meta. The baseline models I will compare against are GPT-4o, Gemini-1.5-pro, Yi-Vision, and Yi-Vision-with-TextGrad.
Evaluating LLMs is important for understanding their abilities and for solving real business problems. A good evaluation requires sufficient high-quality data samples, clear judging criteria, meaningful evaluation tasks, and regularly refreshed private benchmarks. The evaluation process should also adapt as LLMs develop over time.
As the capabilities of Large Language Models (LLMs) continue to evolve, many traditional evaluation benchmarks require updates. With the rapid progress of these models, researchers are introducing new evaluation datasets at a growing pace, yet the specific dimensions these datasets assess are often unclear. In this blog, I will walk through a series of commonly referenced evaluation datasets and highlight the particular aspects of model capability each was designed to assess, though I will not cover every available dataset.
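As a rough orientation, the sketch below maps a few widely cited benchmarks to the capability dimension each is usually taken to probe. The selection and phrasing are my own illustrative assumptions; the blog's actual coverage may differ.

```python
# Illustrative mapping from common benchmarks to the capability each is generally
# understood to measure; not an exhaustive or authoritative list.
BENCHMARK_DIMENSIONS = {
    "MMLU": "broad multi-subject knowledge and understanding",
    "GSM8K": "grade-school math word-problem reasoning",
    "HumanEval": "Python code generation and functional correctness",
    "HellaSwag": "commonsense inference about everyday situations",
    "TruthfulQA": "resistance to common misconceptions and falsehoods",
    "MT-Bench": "multi-turn instruction following, scored by an LLM judge",
}

for name, dimension in BENCHMARK_DIMENSIONS.items():
    print(f"{name}: {dimension}")
```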