While this isn't my first blog post about DSPy, I've noticed recent updates to the DSPy documentation and GitHub repository, including a new optimization method called BootstrapFinetune.
So, in this post, let's explore this new optimizer, BootstrapFinetune.
 

Load the Dataset

 
For simplicity, I'm using the MMLU-Pro benchmark dataset, which I worked with in an earlier blog post.
 
 
I split the dataset into a trainset of the first 50 examples and a validation set of 299 examples. Let's examine what this data looks like.
 
 
The data is stored in an Example object, from which we can easily extract the question and answer.
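The loading and splitting steps above can be sketched as follows. This is a hypothetical reconstruction: the Hugging Face dataset id `TIGER-Lab/MMLU-Pro`, its column names, and the `to_example` helper are my assumptions, since the original code isn't shown.

```python
import dspy
from datasets import load_dataset

# Assumed dataset id and column names; adjust to match your copy of MMLU-Pro
raw = load_dataset("TIGER-Lab/MMLU-Pro", split="validation")

def to_example(row):
    # Fold the multiple-choice options into the question text
    options = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(row["options"])
    )
    return dspy.Example(
        question=f"{row['question']}\n{options}",
        answer=row["answer"],
    ).with_inputs("question")

data = [to_example(r) for r in raw]
trainset = data[:50]    # first 50 examples for fine-tuning
valset = data[50:349]   # next 299 examples for validation

# Each Example exposes its fields as attributes
print(trainset[0].question)
print(trainset[0].answer)
```

`with_inputs("question")` tells DSPy which fields are inputs; the remaining fields (here, `answer`) are treated as labels.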
 

Configure LM

 
When using BootstrapFinetune, it appears that gpt-4o-mini-2024-07-18 is the only compatible language model. (I've tested several models, and gpt-4o-mini-2024-07-18 is the only one that works successfully.)
Here is how to configure the model.
 
 
Let's test if the model works.
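Concretely, the configuration and a quick smoke test might look like this. This is a minimal sketch using DSPy's `dspy.LM` client; the `max_tokens` value is an arbitrary choice of mine, and an `OPENAI_API_KEY` environment variable is assumed.

```python
import dspy

# Note the snapshot id: gpt-4o-mini-2024-07-18
lm = dspy.LM("openai/gpt-4o-mini-2024-07-18", max_tokens=2000)
dspy.configure(lm=lm)

# Smoke test: call the LM directly and print its reply
print(lm("What is 2 + 2?"))
```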
 

Construct a Customized Module

 
Next, we'll design a module that specifies how the model should process and answer questions.
 
 
Let's test our SimpleQA module.
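A minimal sketch of such a module, assuming a simple `question -> answer` chain-of-thought signature (the exact signature used in the original isn't shown, and the sample question below is my own):

```python
import dspy

class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought program: question in, short answer out
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate_answer(question=question)

qa = SimpleQA()
pred = qa(question=(
    "Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn"
))
print(pred.answer)
```

The prediction object also carries the intermediate `reasoning` field that `ChainOfThought` generates before the final answer.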
 
 

Metric Function

Comparing the predicted and true answers, I found that they often contain the same content in different formats. So I need a metric function that accurately recognizes when two answers are equivalent.
 
 

Fine-tune LM

 
Now let's begin the fine-tuning process with the following code:
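In outline, the call looks something like this. It assumes the `SimpleQA` module, `trainset`, and an `answer_match` metric from the earlier sections, and the exact keyword arguments may differ across DSPy versions.

```python
import dspy

# BootstrapFinetune runs the program over the trainset, keeps the traces
# that pass the metric, and fine-tunes the underlying LM on those traces.
optimizer = dspy.BootstrapFinetune(metric=answer_match)
finetuned_qa = optimizer.compile(SimpleQA(), trainset=trainset)
```

Unlike prompt-level optimizers such as BootstrapFewShot, this one launches an actual fine-tuning job on the provider's side, so it takes considerably longer and incurs training costs.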
 
If everything works correctly, you'll see logs being generated. For simplicity, I won't show them here.
 

Evaluate

Let's evaluate both the optimized program and the original SimpleQA module to compare their effectiveness.
We'll start with the optimized program.
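A sketch of the evaluation harness, assuming the `valset`, `answer_match` metric, and `finetuned_qa` program from the earlier sections (the `num_threads` value is an arbitrary choice of mine):

```python
from dspy.evaluate import Evaluate

evaluate = Evaluate(
    devset=valset,
    metric=answer_match,
    num_threads=8,          # parallelize the 299 validation calls
    display_progress=True,
)
evaluate(finetuned_qa)      # scores the fine-tuned program
```

The same `evaluate` harness can be reused on the unoptimized module, e.g. `evaluate(SimpleQA())`, for the baseline comparison.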
 
After running on the valset, the model achieves an accuracy of 68.56%, correctly answering 205 out of 299 questions.
Now let's evaluate our original SimpleQA module:
 
 
The results show an accuracy of 67.56%, with the model correctly answering 202 out of 299 questions.
 

Insights

The performance improvement is modest: about one percentage point (68.56% vs. 67.56%). There are several potential reasons for this:
  • Training for only 1 epoch may not give the model sufficient time to learn effectively.
  • The training set is quite limited, containing just 50 questions, with only 33 correct answers — this may not provide enough examples for proper learning.
  • The fine-tuning approach has inherent limitations. The model may be overfitting to the small training set, resulting in minimal performance gains.
  • The evaluation metric may require further refinement to better handle various answer formats and provide more accurate performance measurements.