What is DSPy?
DSPy is a framework developed at Stanford for programming language models rather than prompting them: it automatically optimizes the prompts and weights used with Large Language Models (LLMs). DSPy can improve the reliability of any model, whether it's GPT-4, LLaMA 3, or Mistral, on whatever task you need.
In this blog, I will demonstrate how to use DSPy with DeepSeek to solve the well-known Alice in Wonderland problem (Nezhurina et al. 2024), along with another tricky question that most LLMs struggle to answer correctly.
Alice in Wonderland problem
I have written about this problem several times, as I believe it's an excellent way to test a model's reasoning ability. Here is the problem:
Alice has 3 sisters and she also has 4 brothers. How many sisters does Alice’s brother have?
Let’s get the answer from DeepSeek without any optimization first.
DSPy supports many LM clients and also lets you define your own custom LM client. At a minimum, three methods should be implemented: __init__, basic_request, and __call__. If you want every feature, such as inspect_history, to work as well, you should also define inspect_history. A sketch of such a client is shown below.
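Here is a minimal sketch of what such a DeepSeek client might look like. The base class import, the endpoint URL, and the response fields are assumptions based on DSPy's custom LM client guide and DeepSeek's OpenAI-compatible API, so adjust them for your DSPy version and credentials.

```python
import requests
from dsp import LM  # legacy DSPy base class for custom LM clients (assumed)


class DeepSeekClient(LM):
    """Sketch of a custom LM client implementing __init__, basic_request, and __call__."""

    def __init__(self, model="deepseek-chat", api_key=None, **kwargs):
        self.model = model
        self.api_key = api_key
        self.provider = "default"
        self.kwargs = {"temperature": 0.0, "max_tokens": 1024, **kwargs}
        self.history = []  # kept so inspect_history can replay prompts and completions

    def basic_request(self, prompt, **kwargs):
        # One HTTP call to DeepSeek's (OpenAI-compatible) chat completions endpoint.
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            **{**self.kwargs, **kwargs},
        }
        response = requests.post(
            "https://api.deepseek.com/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
        ).json()
        self.history.append({"prompt": prompt, "response": response, "kwargs": kwargs})
        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        # DSPy expects a list of completion strings back from the client.
        response = self.basic_request(prompt, **kwargs)
        return [choice["message"]["content"] for choice in response["choices"]]
```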
Without any optimization, DeepSeek provides the answer: Alice's brother has the same number of sisters as Alice does, because they share the same sisters. Since Alice has 3 sisters, Alice's brother also has 3 sisters. It appears that DeepSeek has a sound reasoning process, but it delivers an incorrect result (the correct answer is 4, since Alice's brother also counts Alice herself among his sisters).
Now, let's utilize DSPy to optimize this answer and see if it can correct the mistake.
Let's examine what I've done; a sketch of the full setup is shown after this explanation. First, I defined a simple COT class that inherits from dspy.Module. Then, I loaded the GSM8K math dataset and sampled 50 instances for training. I chose BootstrapFewShot as my optimizer, set the metric to the predefined gsm8k_metric, and set max_bootstrapped_demos=10 and max_labeled_demos=10. There is a difference between these two hyperparameters: max_labeled_demos refers to the maximum number of labeled examples used directly as demonstrations for the student module, while max_bootstrapped_demos is the maximum number of demonstrations that will be bootstrapped. In this context, bootstrapping means generating new demonstrations from the predictions of a teacher module or a similar process; these bootstrapped demonstrations are then used alongside, or instead of, the manually labeled examples.
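The following sketch shows how this setup might look in code. The COT class, dataset sampling, and optimizer settings follow the description above; the deepseek client is the one sketched earlier, and minor details (such as the API key placeholder and exact slicing of the training set) are assumptions.

```python
import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
from dspy.teleprompt import BootstrapFewShot

# Simple chain-of-thought module.
class COT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)

# Use the custom DeepSeek client defined earlier (hypothetical key).
deepseek = DeepSeekClient(api_key="sk-...")
dspy.settings.configure(lm=deepseek)

# Load GSM8K and sample 50 training instances.
trainset = GSM8K().train[:50]

# Bootstrap up to 10 demos and keep up to 10 labeled demos.
optimizer = BootstrapFewShot(
    metric=gsm8k_metric,
    max_bootstrapped_demos=10,
    max_labeled_demos=10,
)
optimized_cot = optimizer.compile(COT(), trainset=trainset)

response = optimized_cot(
    question="Alice has 3 sisters and she also has 4 brothers. "
             "How many sisters does Alice's brother have?"
)
print(response.answer)
```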
After compiling with these demonstrations, DeepSeek now answers 4, which is correct. Let's examine the history to see how it arrived at this response.
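Assuming inspect_history is implemented on the custom client (as recommended above), the prompt that produced this answer can be displayed like this:

```python
# Show the most recent prompt/completion pair sent to DeepSeek.
deepseek.inspect_history(n=1)
```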
Don't be intimidated by this history; it is simply a set of few-shot demonstrations. DeepSeek used 10 example shots to learn from and ultimately solve our Alice problem. Imagine if we hadn't used DSPy for optimization: I might have needed to write those 10 examples into my prompt by hand to teach DeepSeek how to answer this question. Now, this entire process is automated.
Tricky Math Problem
The second problem is a math question, and in fact a very easy one. It reads as follows:
what is the smallest integer whose square is between 15 and 30? consider negative and positive integer?
Anyone who finished middle school can easily find the correct answer, which is -5: the integers whose squares fall between 15 and 30 are -5, -4, 4, and 5, and the smallest of these is -5. But this is not so easy for LLMs. Let's see what GPT-4o, DeepSeek, and Claude 3.5 Sonnet produce.
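For reference, a quick brute-force check (not part of the original experiment) confirms the intended answer:

```python
# Enumerate integers whose squares lie strictly between 15 and 30.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(candidates)       # [-5, -4, 4, 5]
print(min(candidates))  # -5
```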
For GPT-4o:
To find the smallest integer whose square is between 15 and 30, we need to identify the integer \( n \) such that \( 15 < n^2 < 30 \).
Let's start by checking the squares of integers:
- \( 1^2 = 1 \)
- \( 2^2 = 4 \)
- \( 3^2 = 9 \)
- \( 4^2 = 16 \)
- \( 5^2 = 25 \)
- \( 6^2 = 36 \)
We see that \( 4^2 = 16 \) and \( 5^2 = 25 \) are the squares that fall between 15 and 30. Therefore, the smallest integer \( n \) whose square is between 15 and 30 is 4.
Considering negative integers as well, the negative counterpart of 4 is -4. Since \((-4)^2 = 16\), it also falls within the range.
Thus, the smallest integer (considering both positive and negative) whose square is between 15 and 30 is \(-4\).
For DeepSeek:
To find the smallest integer whose square is between 15 and 30, we need to consider both negative and positive integers. Let's analyze this step by step.
- Identify the range for squares:
- We need the square of an integer to be greater than 15 and less than 30.
- Calculate the square roots:
- The square root of 15 is approximately 3.87.
- The square root of 30 is approximately 5.48.
- Determine the integers within this range:
- The integers between 3.87 and 5.48 are 4 and 5.
- Similarly, the negative integers between -3.87 and -5.48 are -4 and -5.
- Check the squares of these integers:
- \(4^2 = 16\)
- \(5^2 = 25\)
- \((-4)^2 = 16\)
- \((-5)^2 = 25\)
- Identify the smallest integer:
- Both 4 and -4 have squares that fall within the range of 15 to 30.
- Since we are looking for the smallest integer, we consider the negative integer first.
Therefore, the smallest integer whose square is between 15 and 30 is \(-4\).
For Claude 3.5 Sonnet:
Let's approach this step-by-step:
- We're looking for an integer x such that 15 < x² < 30.
- Let's consider positive integers first:
- 3² = 9, which is too small
- 4² = 16, which is within our range
- 5² = 25, which is also within our range
- 6² = 36, which is too large
- Now, let's consider negative integers:
- (-4)² = 16, which is within our range
- (-5)² = 25, which is also within our range
- The smallest absolute value that works is 4.
Therefore, the smallest integer whose square is between 15 and 30 is -4.
The positive 4 also works, but -4 is smaller in value.
Clearly, they all make mistakes. At the moment, I'm unsure why they can't handle this simple problem. However, I plan to use DSPy to see if it can help DeepSeek correct this issue.
I still use the COT module, but this time I choose BootstrapFewShotWithOptuna as my optimizer. Let's introduce it first. BootstrapFewShotWithOptuna is an optimizer that merges the concepts of BootstrapFewShot and Optuna hyperparameter optimization. It improves the few-shot learning process by running multiple trials to maximize the evaluation metric and by selecting the most suitable demonstrations for the task. Let's examine how well this method works; a sketch of the setup is shown below.
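Here is a sketch of how this optimizer might be wired up, reusing the COT module and GSM8K training set from before. The constructor arguments shown are assumptions, and the exact compile signature (for example, the max_demos argument) may differ between DSPy versions.

```python
from dspy.teleprompt import BootstrapFewShotWithOptuna

# Run several Optuna trials, each evaluating a different candidate set of demos.
optuna_optimizer = BootstrapFewShotWithOptuna(
    metric=gsm8k_metric,
    max_bootstrapped_demos=4,
    num_candidate_programs=8,
)
optimized_cot = optuna_optimizer.compile(
    COT(),
    trainset=trainset,
    max_demos=4,  # upper bound on demos per candidate program (version-dependent)
)
```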
After optimizing with BootstrapFewShotWithOptuna, DeepSeek produces the correct answer, -5. But how does it achieve this? Let's inspect the history. For simplicity, you can use print(response) to view the model's reasoning process. Interestingly, when I examine the prompt sent to the model, it uses only a single example, and DeepSeek then provides the correct answer. This is impressive, because I can't be sure that a manually written one-shot example would work as well on DeepSeek. After optimization, however, DSPy automatically selects the best example with which to prompt the model.
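For completeness, this is roughly how the optimized program can be queried and its reasoning displayed (the question string is the original prompt quoted above):

```python
response = optimized_cot(
    question="what is the smallest integer whose square is between 15 and 30? "
             "consider negative and positive integer?"
)
print(response)  # prints the rationale along with the final answer field
```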
Conclusions & Limitations
In this blog, I explore the use of DSPy to tackle two challenging problems. Despite its power, I've merely scratched the surface of what DSPy can do, having tested only two optimization methods. I encourage you to learn more about DSPy on its official website and GitHub, where you'll find extensive documentation.
However, this blog has its limitations. First, while DSPy is a robust framework for automatic prompt optimization, it is not a panacea. The underlying large language model (LLM) significantly influences its effectiveness, so consider using one of the strongest LLMs for complex problems.
Second, the two cases here are somewhat unusual. Despite testing many models, including GPT-4, I didn't achieve satisfactory results. In the second case in particular, originally posed by an X user, all of the top-rated models failed to provide the correct answer because they overlooked negative numbers. Even after modifying the question slightly, I didn't get the right answer. Less tricky problems may yield better results, though; interestingly, Google's Gemini 1.5 Pro answered it correctly.
Lastly, during the second test, the model initially provided the correct answer, but after optimization with DSPy it gave an incorrect one. This highlights the importance of both the optimization method and the training data when using DSPy. For simplicity, I used GSM8K, which is bundled with DSPy, for this experiment.
In conclusion, I hope you can appreciate the magic of DSPy.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/dspy
- Copyright: All articles in this blog, unless otherwise stated, are published under the BY-NC-SA license. Please indicate the source!