Abstract
Inspired by Nezhurina et al. (2024), I use similar questions to evaluate several leading language models and probe their reasoning capabilities, so this post reads like a test report. The test is quite subjective, so if the results do not match your expectations, take them in stride.

1-Standard Prompt

1-1-Variation-1

💡
Alice has 3 sisters and she also has 4 brothers. How many sisters does Alice’s brother have?
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
In this section, although every model attempts to reason, only Qwen1.5-110B-Chat and Claude-3-opus-2024-0229 give the correct answer (4: Alice's 3 sisters plus Alice herself). Qwen1.5-110B-Chat is an open-source model from Alibaba, while Claude-3-opus-2024-0229 reasons in the classic CoT style even though the prompt contains no CoT trigger. The other models make mistakes during their reasoning.
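To make the expected answer concrete, here is a minimal sketch that lists the siblings and counts sisters from two points of view. It assumes the intended reading of the puzzle, in which all the children share the same parents; the helper names `build_family` and `count_sisters` are my own and purely illustrative.

```python
def build_family(sisters, brothers):
    """Return every child as (name, gender), with Alice included."""
    family = [("Alice", "F")]
    family += [(f"sister_{i}", "F") for i in range(1, sisters + 1)]
    family += [(f"brother_{i}", "M") for i in range(1, brothers + 1)]
    return family


def count_sisters(family, person):
    """Count the girls in the family other than `person`."""
    return sum(1 for name, gender in family if gender == "F" and name != person)


family = build_family(sisters=3, brothers=4)
print(count_sisters(family, "Alice"))      # 3, Alice does not count herself
print(count_sisters(family, "brother_1"))  # 4, Alice is a sister to her brother
```

The gap between the two counts is exactly the step that most models miss: Alice is not her own sister, but she is her brother's.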

1-2-Variation-2

💡
Alice has M sisters and she also has N brothers. How many sisters does Alice’s brother have?
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
In this section, although all of the models attempt to reason, im-a-good-gpt2-chatbot is the only one that gets the correct answer. Despite rumors that this model comes from OpenAI, neither ChatGPT-4 nor ChatGPT-4o gives the correct answer.
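Grading a symbolic answer by eye is error-prone, since a model may write "M + 1", "1 + M", or "(M) + 1". The sketch below is one possible way to check such an answer, assuming the expected result is M + 1 (Alice's M sisters plus Alice herself) and that N is at least 1. The function name `matches_m_plus_1` and the small character whitelist are my own assumptions, not part of the original test.

```python
import re

# Only M, N, digits, plus, minus, parentheses, and spaces are accepted.
ALLOWED = re.compile(r"[MN0-9+\-() ]+")


def matches_m_plus_1(expr, max_value=5):
    """Return True if `expr` equals M + 1 for every M and N on a small grid."""
    if not ALLOWED.fullmatch(expr):
        return False  # reject anything outside the tiny arithmetic whitelist
    for m in range(max_value + 1):
        for n in range(1, max_value + 1):
            try:
                value = eval(expr, {"__builtins__": {}}, {"M": m, "N": n})
            except Exception:
                return False  # malformed expression
            if value != m + 1:
                return False
    return True


print(matches_m_plus_1("M + 1"))  # True
print(matches_m_plus_1("1+M"))    # True
print(matches_m_plus_1("M"))      # False, the common mistake
```

This only works on the final expression, so the answer still has to be pulled out of the model's full reply by hand or by some other means.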

2-Variation Prompt

In this section, I move away from the standard prompt. Instead, I apply strategies such as few-shot prompting, CoT, and Self-Consistency to see whether the models can produce the correct answer.

2-1-Variation-1 Two-shot

💡
Alice has 2 sisters and she also has3 brothers. How many sisters does Alice's brother have?
##Answer: 3
Alice has 4 sisters and she also has 2 brothers. How many sisters does Alice's brother have?
## Answer: 5
Alice has M sisters and she also has N brother. How many sisters does Alice’s brother have?
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
With the two-shot prompt, both Yi-large-preview from 01.AI and Gemini-1.5-pro-api-0514 from Google answer the question correctly. I personally prefer Yi-large-preview's response because it correctly analyzes and reasons about the two examples provided. Surprisingly, both ChatGPT-4 and ChatGPT-4o from OpenAI fail to give the correct answer, whereas im-a-good-gpt2-chatbot again gets it right without hesitation.
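For anyone who wants to reproduce the two-shot setup with other numbers, a prompt in this format can be assembled programmatically. This is a rough sketch, not the exact script I used: `build_few_shot_prompt` is a hypothetical helper, the spacing is tidied compared with the prompt above, and each example answer is computed as the number of sisters plus one.

```python
QUESTION = (
    "Alice has {sisters} sisters and she also has {brothers} brothers. "
    "How many sisters does Alice's brother have?"
)


def build_few_shot_prompt(examples, target_sisters="M", target_brothers="N"):
    """Build a few-shot prompt from (sisters, brothers) example pairs."""
    lines = []
    for sisters, brothers in examples:
        lines.append(QUESTION.format(sisters=sisters, brothers=brothers))
        # The brother's sisters are Alice's sisters plus Alice herself.
        lines.append(f"## Answer: {sisters + 1}")
    # The final question is left unanswered for the model to complete.
    lines.append(QUESTION.format(sisters=target_sisters, brothers=target_brothers))
    return "\n".join(lines)


prompt = build_few_shot_prompt([(2, 3), (4, 2)])
print(prompt)
```

Passing a longer list of pairs turns the same helper into a three-shot or four-shot prompt.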

2-2-Variation-2 Zero-Shot CoT

💡
Alice has M sisters and she also has N brothers. How many sisters does Alice’s brother have? Think it step by step.
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
Zero-shot CoT is known for its "think step by step" trigger. With it, ChatGPT-4 and im-a-good-gpt2-chatbot are the only models that give the correct answer, and im-a-good-gpt2-chatbot once again does so without hesitation. Both Yi-large-preview and Gemini-1.5-pro-api-0514, which answered correctly with the two-shot prompt, now get it wrong.

2-3-Variation-3 One-Shot CoT

💡
Kitty has 4 sisters and she also has 3 brothers. How many sisters does Kitty’s brother have?
Since Kitty’s brother should share same sisters with Kitty, except for Kitty. So we need to count Kitty as another sister for her brother as well. Therefore, Kitty’s brother has 4+1 = 5 sisters.
Alice has M sisters and she also has N brothers. How many sisters does Alice’s brother have?
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
In this section, I use one-shot CoT to improve model performance. Most models now answer the question correctly. The exceptions are glm-4-0520, Yi-large-preview, Qwen2-72B-Instruct, and Qwen1.5-110B-Chat, which still fail even with a worked demonstration in the prompt.

2-4-Variation-4 Self-Consistency

💡
Alice has M sisters and she also has N brothers. How many sisters does Alice’s brother have? Show five thinking methods. The final answer should be the one with the highest frequency.
ChatGPT-4
ChatGPT-4o
Qwen2-72B-Instruct
Qwen1.5-110B-Chat
Claude-3-opus-2024-0229
Gemini-1.5-pro-api-0514
Yi-large-preview
DeepSeek-V2 Chat
glm-4-0520
im-a-good-gpt2-chatbot
In this section, I test the models with a Self-Consistency-style prompt. Most models conclude that Alice should not be counted as a sister from her brother's perspective, and therefore answer incorrectly. im-a-good-gpt2-chatbot is the exception: it answers the question correctly using five different methods.
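Self-Consistency as it is usually described samples several independent reasoning paths and keeps the most frequent final answer, whereas the prompt above asks the model to do all of that inside a single reply. The voting step itself is simple, and a minimal sketch might look like the following. The `sampled` answers are made-up placeholders to show the mechanics, not real model outputs, and `majority_answer` is a hypothetical helper.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer and its vote count."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Normalize spacing and case so "M + 1" and "m+1" count as the same vote.
    normalized = [a.replace(" ", "").upper() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes


# Placeholder final answers from five hypothetical reasoning paths.
sampled = ["M + 1", "M", "M+1", "M + 1", "M"]
print(majority_answer(sampled))  # ('M+1', 3)
```

Voting only helps when the correct answer is the most common one. If most paths agree that Alice is not her brother's sister, the majority vote simply confirms the wrong answer.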

3-Conclusion and Limitation

I tested many of the mainstream models currently available and found that most struggle to answer the Alice question and its variations correctly. Although they can carry out reasoning, they often make errors along the way. Techniques like CoT and few-shot prompting remain useful, as do other classic prompting methods.
However, this test has some limitations. First, I asked each question only once; repeating it several times would have helped account for instability in the models' outputs. Second, other prompting techniques that I did not test might lead the models to the correct answer. Lastly, a single puzzle like this is not a full measure of what an LLM can do: a model that struggles here can still be a powerful tool.
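A follow-up test could address the first limitation by asking each question several times and reporting the fraction of correct replies. The sketch below shows the idea under some loud assumptions: `ask_model` stands in for whatever client is used to query a model, `fake_model` is a canned stub rather than a real model, and the grading rule (compare the last line of the reply with the expected answer) is deliberately naive.

```python
from itertools import cycle


def final_line(reply):
    """Return the last line of a reply with spaces removed."""
    lines = reply.strip().splitlines()
    return lines[-1].replace(" ", "") if lines else ""


def accuracy_over_trials(ask_model, prompt, expected, trials=10):
    """Ask the same prompt `trials` times and return the share of correct replies."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    correct = sum(
        1 for _ in range(trials) if final_line(ask_model(prompt)) == expected
    )
    return correct / trials


# Canned stub that alternates between a right and a wrong reply.
_canned = cycle(["Alice is their sister too.\nM + 1", "Same as Alice.\nM"])


def fake_model(prompt):
    return next(_canned)


rate = accuracy_over_trials(
    fake_model, "Alice has M sisters and she also has N brothers. ...", "M+1", trials=4
)
print(rate)  # 0.5
```

Reporting a rate like this per model and per prompt would make the comparison less sensitive to a single lucky or unlucky reply.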
 