Since OpenAI released its "o1-series" model, several teams have developed their own approaches to "deep thinking" models. DeepSeek introduced their o1-like model, DeepSeek-R1-Lite, Qwen released QwQ-32B-Preview, and the Intern team launched InternThinker.
In this blog, I will evaluate these models' performance and share my observations.
 

Question 1: What is the smallest integer whose square is between 15 and 30?

 
This is a tricky question for language models. They need to consider both positive and negative integers comprehensively, and understand that the definition of "smallest" differs when dealing with positive versus negative numbers.
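Before looking at the models' outputs, the answer is easy to verify mechanically. A minimal brute-force sketch (the range bound of 10 is an arbitrary choice, comfortably large enough here):

```python
# Enumerate integers n with 15 < n*n < 30, including negatives.
# The trap: the smallest such integer is negative, not 4.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(candidates)       # [-5, -4, 4, 5]
print(min(candidates))  # -5
```

Since (-5)² = 25 lies between 15 and 30, the correct answer is -5, not 4.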
DeepSeek-R1-Lite
Here is its chain of thought; I have highlighted some interesting passages.
 
It's fascinating to see how the model thinks step by step to arrive at its answer. From DeepSeek-R1-Lite's internal reasoning, we can see it considers negative numbers but shows uncertainty about the definition of integers and the question's expectations. This leads it to thoroughly present two possible answers. Mathematically, this approach is partially correct—since integers include both positive and negative numbers, we must consider negative values. Therefore, focusing only on positive numbers would yield an incorrect answer. I appreciate how DeepSeek-R1-Lite approaches the problem comprehensively by examining multiple conditions.
 
QwQ 32B Preview
 
 
I prefer QwQ 32B Preview's answer as it provides clear reasoning and arrives at a definitive solution. It demonstrates a solid understanding of integers and confidently identifies -5 as the smallest answer.
 
InternThinker
 
InternThinker fails to arrive at the correct answer, demonstrating lower performance than the other two models in this particular case. Based on its output, the training methodology appears to differ from that used by QwQ 32B Preview and DeepSeek-R1-Lite.
 

Question 2: Please exchange the second word and the second-to-last word in the following sentence: I want to go to school on Saturday.

 
This is a challenging question for most LLMs since they typically struggle with positional relationships in text. Let's see how our "deep-thinking" models handle it.
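The transformation itself is trivial to express in code, which is what makes the models' struggles interesting. A minimal sketch (the function name is mine, and punctuation handling is deliberately simplified):

```python
def swap_second_and_second_last(sentence: str) -> str:
    # Split into words, swap index 1 (second word) with
    # index -2 (second-to-last word), then rejoin.
    words = sentence.rstrip(".").split()
    words[1], words[-2] = words[-2], words[1]
    return " ".join(words) + "."

print(swap_second_and_second_last("I want to go to school on Saturday."))
# I on to go to school want Saturday.
```

The expected output is the ungrammatical-sounding "I on to go to school want Saturday." — which, as we'll see, is exactly what trips up some of the models.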
 
DeepSeek-R1-Lite
 
DeepSeek-R1-Lite demonstrates clear reasoning and solves this problem systematically. Its answer is correct.
 
QwQ 32B Preview
 
QwQ 32B Preview's internal thinking process is much longer than DeepSeek-R1-Lite's, even though it also reaches the correct answer in the end. While QwQ 32B Preview falls into a logic trap, second-guessing itself because the resulting sentence sounds unusual, DeepSeek-R1-Lite takes a more straightforward path to the correct solution. For this reason, I find DeepSeek-R1-Lite's answer more effective.
 
InternThinker
 
Like QwQ 32B Preview, InternThinker falls into the logic trap of questioning the grammaticality of the result. However, unlike QwQ 32B Preview, InternThinker ultimately fails to provide the correct answer.
 

Question 3: If it takes 1 hour to dry 25 clothes under the sun, how many hours does it take to dry 30 clothes?

 
This is a challenging problem for LLMs because most of them assume the drying time is directly proportional to the number of clothes. That assumption is wrong: clothes dry in parallel under the sun, so 30 clothes still take 1 hour (assuming enough drying space). Let's try this question on our "deep-thinking" models.
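The trap can be made explicit with two tiny functions contrasting the naive proportional model with the parallel-drying view (function names are mine, for illustration only):

```python
def naive_drying_time(n_clothes, rate_per_hour=25):
    # Incorrect model: treats drying as sequential work,
    # so time scales linearly with the number of clothes.
    return n_clothes / rate_per_hour

def parallel_drying_time(n_clothes, hours_per_batch=1):
    # Correct model: all clothes dry simultaneously in the sun,
    # so the count doesn't matter (given enough drying space).
    return hours_per_batch

print(naive_drying_time(30))     # 1.2 -- the common wrong answer
print(parallel_drying_time(30))  # 1
```

A model that answers "1.2 hours" has pattern-matched the question to a rate problem instead of reasoning about the physical setup.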
 
DeepSeek-R1-Lite
For this question, DeepSeek-R1-Lite struggles to find the correct answer. It arrives at a wrong conclusion despite considering real-world conditions. Although it thoroughly explores various scenarios and possibilities, its final answer remains incorrect.
 
QwQ 32B Preview
 
QwQ 32B Preview also struggles with this question. Though it occasionally approaches the correct answer, it repeatedly second-guesses itself and follows incorrect reasoning. Its performance mirrors that of DeepSeek-R1-Lite, as both models exhaustively consider various conditions, ultimately rejecting the correct solution in favor of an incorrect path.
 
InternThinker
 
InternThinker provides an incorrect answer to this question. Unlike the other models, it proceeds directly to a solution without exploring various real-world conditions and possibilities.

Insights

After testing these models on three tricky questions, I have discovered some interesting insights:
  • QwQ 32B Preview and DeepSeek-R1-Lite both perform well. Upon examining their internal thinking processes, I observed that they thoroughly explore all possible conditions to arrive at their answers.
  • InternThinker underperforms compared to the others, showing poor results across all three questions. Its responses follow an identical format regardless of the question, which suggests the team may have prioritized prompt engineering over thorough model training. This makes it fundamentally different from the o1 series, QwQ 32B Preview, and DeepSeek-R1-Lite.
 
 
 
 
 