type
status
date
slug
summary
tags
category
password
icon
Author
Abstract

Introduction

This blog post offers a personal evaluation of two recently released language models: DeepSeek-V2.5 and Reflection-70b. I accessed the tested versions through their respective platforms—DeepSeek's official website for V2.5 and Hyperbolic for Reflection-70b.

Test Cases and Results

Question 1

The answer from DeepSeek-V2.5 is:
The answer from Reflection-70b is:
This is a classic trick question. Despite its longer output due to the inclusion of &<thinking>, <reflection>, and <output> tags, Reflection-70b still fails to provide the correct answer.

Question 2

The answer from DeepSeek-V2.5 is:
Reflection-70b's response is as follows:
Clearly, Reflection-70b provides the correct answer, while DeepSeek-V2.5 misses the mark in this case. Reflection-70b's approach stands out for its clear, structured method in tackling the question.

Question 3

The response from DeepSeek-V2.5 is:
Reflection-70b responds:
This is clearly a trick question. The sentence "I have an apple" contains only four words, not five. WhileReflection-70bcorrectly identifies this,DeepSeek-V2.5fails to recognize the absence of a fifth word.
<ins/>

Some Ideas Here

I think Reflection-70b is a not bad model from my personal tiny test and also, this test cannot show DeepSeek-V2.5 is an inferior model to Reflection-70b because the system prompts of this two models are different. Actually, many researchers already show the technique, “Self-Reflection” does really improves the performance. I wonder that what if I use the same system prompt with Reflection-70b on the DeepSeek-V2.5. So, I wanna test the same question for DeepSeek-V2.5 for the same system prompt which is used for Reflection-70b. The system prompt for Reflection-70b is shown as follows:

Question 1

The answer from DeepSeek-V2.5 which used the same system prompt for Reflection-70b is:
 

Question 2

The answer from DeepSeek-V2.5 which used the same system prompt for Reflection-70b is:

Question 3

The answer from DeepSeek-V2.5 which used the same system prompt for Reflection-70b is:

Some Thoughts

The results of using Reflection-70b's system prompts with DeepSeek-V2.5 are quite intriguing. All three questions were answered correctly, which is a significant improvement from the initial test. This outcome strongly suggests that the reflection-based prompt structure plays a crucial role in enhancing the model's performance. The ability to think through a problem, reflect on potential errors, and then provide a final output seems to contribute substantially to the accuracy and thoughtfulness of the responses. This observation highlights the importance of prompt engineering in maximizing the capabilities of language models, regardless of their underlying architecture or training data.
 
<ins/>
Sep 13, Notes on OpenAI o1 series modelsSep 3, Notes on Anthropic Prompt Tutorial
Loading...