type
status
date
slug
summary
tags
category
password
icon
Author
Abstract
Introduction
This blog post offers a personal evaluation of two recently released language models:
DeepSeek-V2.5
and Reflection-70b
. I accessed the tested versions through their respective platforms—DeepSeek's official website for V2.5 and Hyperbolic for Reflection-70b.Test Cases and Results
Question 1
The answer from
DeepSeek-V2.5
is:The answer from
Reflection-70b
is:This is a classic trick question. Despite its longer output due to the inclusion of
&<thinking>
, <reflection>
, and <output>
tags, Reflection-70b
still fails to provide the correct answer.
Question 2
The answer from
DeepSeek-V2.5
is:Reflection-70b's response is as follows:
Clearly,
Reflection-70b
provides the correct answer, while DeepSeek-V2.5
misses the mark in this case. Reflection-70b
's approach stands out for its clear, structured method in tackling the question.Question 3
The response from
DeepSeek-V2.5
is:Reflection-70b
responds:This is clearly a trick question. The sentence "I have an apple" contains only four words, not five. While
Reflection-70b
correctly identifies this,DeepSeek-V2.5
fails to recognize the absence of a fifth word.<ins/>
Some Ideas Here
I think
Reflection-70b
is a not bad model from my personal tiny test and also, this test cannot show DeepSeek-V2.5
is an inferior model to Reflection-70b
because the system prompts of this two models are different. Actually, many researchers already show the technique, “Self-Reflection” does really improves the performance. I wonder that what if I use the same system prompt with Reflection-70b
on the DeepSeek-V2.5
. So, I wanna test the same question for DeepSeek-V2.5
for the same system prompt which is used for Reflection-70b
.
The system prompt for Reflection-70b
is shown as follows:Question 1
The answer from
DeepSeek-V2.5
which used the same system prompt for Reflection-70b
is:Question 2
The answer from
DeepSeek-V2.5
which used the same system prompt for Reflection-70b
is: Question 3
The answer from
DeepSeek-V2.5
which used the same system prompt for Reflection-70b
is:Some Thoughts
The results of using
Reflection-70b
's system prompts with DeepSeek-V2.5
are quite intriguing. All three questions were answered correctly, which is a significant improvement from the initial test. This outcome strongly suggests that the reflection-based prompt structure plays a crucial role in enhancing the model's performance. The ability to think through a problem, reflect on potential errors, and then provide a final output seems to contribute substantially to the accuracy and thoughtfulness of the responses. This observation highlights the importance of prompt engineering in maximizing the capabilities of language models, regardless of their underlying architecture or training data.<ins/>
- Author:Chengsheng Deng
- URL:https://chengshengddeng.com/article/test-deepseek-reflection
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!
Relate Posts