July 23, DSPy with GPT-4o-mini on MMLU-Pro

Setup

To use DSPy, install the DSPy package and make sure your OPENAI_API_KEY is set as an environment variable.
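A minimal sketch of the setup check, assuming the key is read from the environment. The helper names `has_openai_key` and `configure_dspy` are hypothetical. The `dspy.LM` and `dspy.configure` calls follow recent DSPy releases (older releases exposed a `dspy.OpenAI` client instead), so adjust them to the version you install.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present (hypothetical helper)."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


def configure_dspy(model="openai/gpt-4o-mini"):
    """Point DSPy at GPT-4o-mini. Assumes a recent DSPy release is installed."""
    import dspy  # imported lazily so the key check above works without DSPy

    if not has_openai_key():
        raise RuntimeError("Set OPENAI_API_KEY before configuring DSPy.")
    lm = dspy.LM(model)      # the model id string is an assumption
    dspy.configure(lm=lm)    # make this LM the default for all modules
    return lm
```

Calling `configure_dspy()` once at the top of a script or notebook should be enough; every DSPy module created afterwards then uses the configured model.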

With these preparations, you’ll be ready to leverage DSPy’s powerful optimization capabilities.

MMLU-Pro dataset

MMLU-Pro (Wang et al., 2024) is an advanced dataset that builds on the primarily knowledge-based MMLU benchmark. It introduces more complex, reasoning-intensive questions and increases the number of answer choices from four to ten.

The code for loading the dataset from Hugging Face:
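A minimal sketch of this step, assuming the Hugging Face `datasets` library, the repository id `TIGER-Lab/MMLU-Pro`, and a `category` field on each row. All three are assumptions rather than details confirmed in this post, and the helper names are hypothetical.

```python
def load_mmlu_pro(split="test"):
    """Download one split of MMLU-Pro from the Hugging Face Hub."""
    from datasets import load_dataset  # lazy import: needs the `datasets` package

    # "TIGER-Lab/MMLU-Pro" is the assumed repository id.
    return load_dataset("TIGER-Lab/MMLU-Pro", split=split)


def describe(rows):
    """Summarize a loaded split: total rows and rows per category."""
    counts = {}
    for row in rows:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return {"num_rows": len(rows), "per_category": counts}
```

`describe(load_mmlu_pro())` gives a quick overview of what was downloaded; it also works on any list of dictionaries that carry a `category` key.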

After executing this code, you will see the progress of the dataset download and the total number of entries in the dataset.

For convenience and simplicity, I am not using the entire dataset. Instead, I will use only the first 200 entries for this test.
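Taking the first 200 entries could look roughly like this. The field names `question`, `options`, and `answer` are assumed from the public MMLU-Pro schema, both helper names are hypothetical, and `to_examples` assumes DSPy is installed.

```python
def take_first(rows, n=200):
    """Return the first n rows as plain dicts with only the fields used later."""
    subset = []
    for i, row in enumerate(rows):
        if i >= n:
            break
        subset.append({
            "question": row["question"],
            "options": list(row["options"]),
            "answer": row["answer"],  # gold answer letter
        })
    return subset


def to_examples(rows):
    """Wrap plain dicts as DSPy examples, marking which fields are inputs."""
    import dspy  # lazy import so take_first works without DSPy

    return [dspy.Example(**row).with_inputs("question", "options") for row in rows]
```

Keeping the slicing separate from the DSPy conversion makes it easy to inspect the subset as ordinary Python data before any model is involved.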

Now, let's take a look at the dataset.

Evaluation Metric

It's time to define the evaluation metric: a function that checks whether the model's response matches the true answer.
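One possible shape for such a function, as a sketch rather than the exact check used for the numbers in this post. It assumes that both the gold label and the prediction expose an `answer` attribute holding an option letter, and it follows the `(example, pred, trace=None)` signature that DSPy metrics conventionally use. The names `normalize_choice` and `answer_match` are hypothetical.

```python
from types import SimpleNamespace


def normalize_choice(text):
    """Reduce responses such as 'b', '(B)' or 'B.' to the bare letter 'B'."""
    return text.strip().strip("().").strip().upper()


def answer_match(example, pred, trace=None):
    """Return True when the predicted letter equals the gold letter."""
    return normalize_choice(pred.answer) == normalize_choice(example.answer)


# Stand-in objects instead of real DSPy examples and predictions.
gold = SimpleNamespace(answer="B")
print(answer_match(gold, SimpleNamespace(answer="(b)")))  # True
print(answer_match(gold, SimpleNamespace(answer="C")))    # False
```

The comparison is deliberately strict: a response such as "B. Paris" does not count as a match, so the module should be prompted to return only the letter.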

Evaluation Pipeline

Now, the evaluation pipeline needs to be set up.
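As a rough sketch, the pipeline boils down to averaging the metric over a held-out set. `mean_score` below is a plain-Python illustration of that idea, not DSPy code, while `build_evaluator` wraps DSPy's `Evaluate` class. Both function names and the thread count are assumptions.

```python
def mean_score(program, devset, metric):
    """Conceptual view of an evaluation run: average the metric over the set.

    Simplification: the program is called with the whole example here,
    whereas DSPy passes only the example's input fields.
    """
    if not devset:
        raise ValueError("devset must not be empty")
    results = [bool(metric(example, program(example))) for example in devset]
    return sum(results) / len(results)


def build_evaluator(devset, metric, num_threads=4):
    """Create DSPy's evaluator. Assumes DSPy is installed."""
    from dspy.evaluate import Evaluate  # lazy import

    return Evaluate(
        devset=devset,
        metric=metric,
        num_threads=num_threads,
        display_progress=True,
        display_table=False,  # report only the final score, not the table
    )
```

The evaluator returned by `build_evaluator` is callable: passing it a DSPy module runs that module on every example in `devset` and reports the aggregate score.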

CoT module

DSPy offers many built-in modules. Here, we use chain-of-thought (CoT) because it is an effective and straightforward way to improve model performance.
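A minimal sketch of a CoT module for multiple-choice questions, assuming the signature string `question, options -> answer`. The signature and both helper names are illustrative assumptions. `dspy.ChainOfThought` is the built-in module referred to above.

```python
import string


def format_options(options):
    """Label answer choices 'A. ...', 'B. ...' so the model can reply with a letter."""
    return "\n".join(
        f"{letter}. {text}" for letter, text in zip(string.ascii_uppercase, options)
    )


def build_cot(signature="question, options -> answer"):
    """Create a chain-of-thought module. Assumes DSPy is installed and configured."""
    import dspy  # lazy import so format_options works without DSPy

    return dspy.ChainOfThought(signature)
```

With a configured language model, `build_cot()(question=..., options=format_options(choices))` returns a prediction whose `answer` field can be passed to the metric.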

Evaluation

After setting up the module, evaluation pipeline, and evaluation metric, we can proceed with the evaluation to see how the model performs on the test set.

It will output the final metric for the model's responses. For simplicity, I will not display the detailed output table of the evaluation, only the final result.

Optimization & Evaluation

There are many optimization methods to choose from in DSPy. To compare them, I have selected two: BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna.

BootstrapFewShotWithRandomSearch

For simplicity, the output for this code is shown in [[#A.]]. There are some important hyperparameters to note in this method:

max_labeled_demos: the maximum number of demonstrations randomly selected from the train set.

max_bootstrapped_demos: the maximum number of additional examples generated by the teacher model.

num_candidate_programs: the number of random candidate programs evaluated during the optimization.
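The three hyperparameters above can be passed to the optimizer roughly as follows. The values in `RANDOM_SEARCH_CONFIG` are illustrative placeholders, not the settings behind the results reported here, and the function name is hypothetical.

```python
# Illustrative values only; tune them for your own budget and dataset.
RANDOM_SEARCH_CONFIG = {
    "max_labeled_demos": 4,
    "max_bootstrapped_demos": 4,
    "num_candidate_programs": 10,
}


def optimize_random_search(program, trainset, metric, config=None):
    """Compile `program` with BootstrapFewShotWithRandomSearch. Assumes DSPy is installed."""
    from dspy.teleprompt import BootstrapFewShotWithRandomSearch  # lazy import

    settings = dict(RANDOM_SEARCH_CONFIG if config is None else config)
    optimizer = BootstrapFewShotWithRandomSearch(metric=metric, **settings)
    # Returns a new program with the selected demonstrations attached.
    return optimizer.compile(program, trainset=trainset)
```

Compilation calls the language model many times (once per candidate program and training example), so larger values for `num_candidate_programs` raise both cost and running time.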

Now, we can evaluate the optimized model to determine if the accuracy has improved.

The details of the evaluation for the optimized model are in [[#B]]. Accuracy improves from 0.66 to 0.75, a gain of 9 percentage points.

BootstrapFewShotWithOptuna

This method is very similar to BootstrapFewShot. It applies BootstrapFewShot and then uses Optuna to search across demonstration sets, running trials to maximize the evaluation metric and selecting the best demonstrations.
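A sketch of both the idea and the call. `pick_best` is a plain-Python stand-in for the trial loop: it scores every candidate and keeps the best one, whereas Optuna samples trials adaptively. `optimize_optuna` wraps the DSPy optimizer. It assumes DSPy and the `optuna` package are installed, and all parameter values and function names are illustrative.

```python
def pick_best(candidates, score):
    """Stand-in for the trial loop: return (best_candidate, best_score)."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        current = score(candidate)
        if current > best_score:
            best, best_score = candidate, current
    return best, best_score


def optimize_optuna(program, trainset, valset, metric,
                    max_demos=4, num_candidate_programs=10):
    """Compile `program` with BootstrapFewShotWithOptuna (illustrative values)."""
    from dspy.teleprompt import BootstrapFewShotWithOptuna  # lazy import

    optimizer = BootstrapFewShotWithOptuna(
        metric=metric,
        max_bootstrapped_demos=4,
        max_labeled_demos=4,
        num_candidate_programs=num_candidate_programs,  # number of trials
    )
    return optimizer.compile(
        program, max_demos=max_demos, trainset=trainset, valset=valset
    )
```

The result of `optimize_optuna(...)` would play the role of `optimized_cot_qa_optuna` in the evaluation that follows.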

Now, let's evaluate the optimized_cot_qa_optuna model to see its performance.

The details of the evaluation for this optimized model are in [[#C]]. Accuracy also improves, from 0.66 to 0.69, although this is still lower than the 0.75 reached with BootstrapFewShotWithRandomSearch.

Conclusion

In this blog post, I demonstrated how to use DSPy with GPT-4o-mini on a custom dataset. The results suggest that DSPy can noticeably improve the model's accuracy. According to the latest release, GPT-4o-mini achieved an overall score of 63.09 without DSPy. With DSPy, it reached an overall score of 75 here, although I tested only 100 questions.

This blog is not intended to compare the optimization methods in DSPy to determine which is the best. Instead, it aims to show you how to use them. If you're unsure which method to start with, I recommend trying BootstrapFewShotWithRandomSearch first. It is a powerful tool.

Appendix

A.

B.