Abstract
While this isn't the first blog post about DSPy, I've noticed recent updates to the DSPy documentation and GitHub repository, including a new optimization method called BootstrapFinetune. So, in this post, let's explore this new optimizer, BootstrapFinetune.
Load the Dataset
For simplicity, I'm using the MMLU-Pro benchmark dataset, which I previously worked with in an earlier blog post.
I split the dataset into a trainset of the first 50 examples and a validation set of 299 examples. Let's examine what this data looks like.
The data is stored in an Example object, from which we can easily extract the question and answer.
Configure LM
When using BootstrapFinetune, gpt-4o-mini-2024-07-18 appears to be the only compatible language model. (I've tested several models, and gpt-4o-mini-2024-07-18 is the only one that works successfully.) Here is how to configure the model.
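A minimal configuration sketch using DSPy's current LM interface; it assumes an OpenAI API key is available in the environment, and the exact call may differ across DSPy versions.

```python
import dspy

# Assumes OPENAI_API_KEY is set in the environment.
lm = dspy.LM("openai/gpt-4o-mini-2024-07-18")

# Make this LM the default for all DSPy modules in this process.
dspy.configure(lm=lm)
```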
Let's test if the model works.
Construct a Customized Module
Next, we'll design a module that specifies how the model should process and answer questions.
Let's test our SimpleQA module.
Metric Function
Comparing the predicted and true answers, I found that they often contain the same content in different formats. So I need a metric function that accurately recognizes when two answers are equivalent.
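The actual metric isn't shown here, so below is one plausible sketch: normalize both answers by extracting a standalone multiple-choice letter (MMLU-Pro uses options A–J) and compare. The answer field name and the regex heuristic are assumptions.

```python
import re
from types import SimpleNamespace

def extract_choice(text: str) -> str:
    """Pull a standalone multiple-choice letter (A-J) out of free-form text.

    Falls back to the stripped, upper-cased text if no letter is found.
    Heuristic only: a stray standalone 'A' or 'I' in prose would also match.
    """
    m = re.search(r"\b([A-J])\b", text.strip().upper())
    return m.group(1) if m else text.strip().upper()

def answer_match(example, pred, trace=None):
    """DSPy-style metric: True when gold and predicted answers name the same choice."""
    return extract_choice(example.answer) == extract_choice(pred.answer)

# Quick check with stand-in objects that mimic Example/Prediction fields.
gold = SimpleNamespace(answer="(B)")
pred = SimpleNamespace(answer="The correct answer is B.")
print(answer_match(gold, pred))  # True
```

The (example, pred, trace) signature is the shape DSPy expects for metrics, so the same function can serve both the optimizer and the evaluator.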
Fine-tune LM
Now let's begin the fine-tuning process with the following code:
If everything works correctly, you'll see logs being generated. For simplicity, I won't show them here.
Evaluate
Let's evaluate both the optimized model and the original SimpleQA module to compare their effectiveness. We'll start with the optimized module's performance.
After running on the valset, the optimized model achieves an accuracy of 68.56%, correctly answering 205 out of 299 questions. Now let's evaluate our original SimpleQA module:
The results show an accuracy of 67.56%, with the model correctly answering 202 out of 299 questions.
Insights
The performance improvement is not particularly significant. There are several potential reasons for this:
- Training for only 1 epoch may not give the model sufficient time to learn effectively.
- The training set is quite limited, containing just 50 questions, with only 33 correct answers — this may not provide enough examples for proper learning.
- The fine-tuning approach has inherent limitations. The model may be overfitting to the small training set, resulting in minimal performance gains.
- The evaluation metric may require further refinement to better handle various answer formats and provide more accurate performance measurements.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/dspy-on-bootstrapfinetune
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the CC BY-NC-SA agreement. Please credit the source!