Welcome to my latest notes on Grok 3. In this post, I share my observations and highlight some interesting test cases comparing Grok 3 with DeepSeek-R1 and o3-mini.

Information

xAI has introduced Grok 3 along with two beta reasoning models: Grok 3 (Think) and Grok 3 mini (Think). According to xAI, these models were trained with reinforcement learning (RL) at an unprecedented scale to refine their chain-of-thought processes, enabling advanced, data-efficient reasoning.
Below is a benchmark graph showing Grok 3's thinking model performance:
[Image: benchmark graph for the Grok 3 thinking models]
 
The general Grok 3 model, which has a context window of 1 million tokens, also performs strongly on benchmarks:
[Image: benchmark graph for the general Grok 3 model]

Interesting Test Cases

Dave W. Plummer ran a Breakout test with Grok 3. Here are the results:
 
The initial prompt was simple: "How about a colored version of Breakout?" The first revision requested, "Make the player move automatically under computer control, and make the ball go 10% faster each time it bounces off the paddle." The final revision addressed a gameplay issue: "Good, but the ball can get stuck in a vertical bounce. How did the original game handle that? Do the same! And make the player aim for remaining bricks."
For details, see: Breakout by Grok3
Theo (t3.gg) showed a case where Grok 3 struggled with coding. Here is his demonstration:
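To make the prompts concrete, here is a minimal Python sketch of the paddle-bounce rule they describe. It is not Grok 3's output and is not taken from Plummer's test. It assumes a common Breakout-style approach in which the rebound angle depends on where the ball hits the paddle, with a small minimum angle so the ball never travels perfectly vertically. The function name `bounce_off_paddle` and every parameter value are hypothetical.

```python
import math

def bounce_off_paddle(ball_x, paddle_center_x, paddle_width, speed,
                      speedup=1.10, max_angle_deg=60.0, min_angle_deg=10.0):
    """Return (vx, vy, new_speed) after the ball hits the paddle.

    Assumes screen coordinates where y grows downward, so a negative vy
    means the ball moves up. All defaults are illustrative assumptions.
    """
    # Hit position relative to the paddle centre, clamped to [-1, 1].
    offset = (ball_x - paddle_center_x) / (paddle_width / 2)
    offset = max(-1.0, min(1.0, offset))

    # Rebound angle measured from vertical, scaled by the hit position.
    angle_deg = offset * max_angle_deg
    # Enforce a minimum tilt so the ball cannot get stuck bouncing vertically.
    if abs(angle_deg) < min_angle_deg:
        angle_deg = math.copysign(min_angle_deg, offset)

    # "10% faster each time it bounces off the paddle."
    new_speed = speed * speedup
    angle = math.radians(angle_deg)
    return new_speed * math.sin(angle), -new_speed * math.cos(angle), new_speed

# A dead-centre hit still leaves at a slight angle, 10% faster than before.
vx, vy, new_speed = bounce_off_paddle(100, 100, 40, 5.0)
```

Under this rule, a computer-controlled paddle can "aim" at a brick by choosing which part of the paddle meets the ball.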
 
Alex Prompter ran Grok 3 and DeepSeek V3 side by side on the same set of prompts. For the full comparison, see: Grok 3 VS. DeepSeek V3
Andrej Karpathy conducted a thorough comparison between Grok 3, OpenAI's o1-pro, and DeepSeek-R1. His tests showed Grok 3's strong performance in reasoning tasks, such as generating a Settlers of Catan board and estimating the training FLOPs of GPT-2. However, the model struggled with complex spatial tasks, particularly generating an accurate SVG image of a pelican riding a bicycle. For the complete analysis, see: Grok 3 test by Andrej Karpathy
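For a sense of what the GPT-2 estimate involves, here is a back-of-the-envelope sketch in Python. It uses the common rule of thumb of about 6 FLOPs per parameter per training token. The dataset size, bytes per token, and epoch count below are illustrative assumptions, not figures from this post or from Grok 3's answer. Only the 1.5B parameter count of the largest GPT-2 model is a published figure.

```python
def training_flops(params: float, tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Approximate training compute with the ~6 * N * D rule of thumb
    (roughly 2 FLOPs per parameter per token forward, 4 backward)."""
    return flops_per_param_token * params * tokens

# Illustrative assumptions (hypothetical inputs for the estimate):
dataset_bytes = 40e9     # assume ~40 GB of training text
bytes_per_token = 4.0    # assume ~4 bytes per token on average
epochs = 10              # assumed number of passes over the data
params = 1.5e9           # largest GPT-2 model, 1.5B parameters

tokens = dataset_bytes / bytes_per_token * epochs   # 1e11 tokens seen
flops = training_flops(params, tokens)
print(f"{flops:.1e}")  # prints 9.0e+20, i.e. on the order of 1e21 FLOPs
```

The interesting part of such a task is less the arithmetic than choosing reasonable assumptions when the exact token count is not published, which is why it works as a reasoning test.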
 