Abstract
This post builds on my previous blog about GPT-4o-mini's performance on MMLU-Pro using BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna. In this continuation, I examine the newly introduced optimizers, MIPRO and MIPRO V2, to assess their optimization capabilities and the performance gains they may bring to GPT-4o-mini. Some of the code is reused from my previous blog, and the full code is available on Colab.
Setup
To use DSPy effectively, make sure you have your OPENAI_API_KEY ready, then install the DSPy package as follows:
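The original snippet isn't embedded in this export, so here is a minimal sketch of the setup, assuming the `dspy.OpenAI` client from the DSPy version available at the time of writing; the placeholder key is yours to replace:

```python
# Install DSPy first (published on PyPI as "dspy-ai"):
#   pip install dspy-ai

import os
import dspy

# Make the OpenAI key available to DSPy's client
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key

# Configure GPT-4o-mini as the default task model
gpt4o_mini = dspy.OpenAI(model="gpt-4o-mini", max_tokens=1000)
dspy.settings.configure(lm=gpt4o_mini)
```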
With these steps complete, you'll be ready to use the advanced optimization features DSPy offers.
Loading Dataset
For simplicity, I've chosen MMLU-Pro as the dataset for this test. Let's load the dataset and examine it.
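A minimal sketch of the loading step, assuming the Hugging Face `TIGER-Lab/MMLU-Pro` dataset; the 100/100 train/dev split sizes below are my assumption:

```python
import random
from datasets import load_dataset
import dspy

# Load the MMLU-Pro test split from the Hugging Face Hub
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(dataset)

# Wrap each row as a DSPy example; field names follow the MMLU-Pro schema
examples = [
    dspy.Example(
        question=row["question"],
        options=row["options"],
        answer=row["answer"],
    ).with_inputs("question", "options")
    for row in dataset
]

# Take a small random subset for training and evaluation
random.seed(42)
random.shuffle(examples)
trainset, devset = examples[:100], examples[100:200]
```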
When you execute this code, it will output the following:
Now, let's take a look at the dataset.
This code will display the structure of the first data sample in the trainset:
Answer the Question with CoT
Now, let's define the Chain of Thought (CoT) class and establish our evaluation metric.
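A sketch of how this can look; the module and metric names (`CoT`, `answer_match`) are mine, and the metric assumes the answer field holds the option letter:

```python
from dspy.evaluate import Evaluate

class CoT(dspy.Module):
    """Answer multiple-choice questions with chain-of-thought reasoning."""

    def __init__(self):
        super().__init__()
        # ChainOfThought inserts a reasoning step before producing the answer
        self.generate_answer = dspy.ChainOfThought("question, options -> answer")

    def forward(self, question, options):
        return self.generate_answer(question=question, options=options)

def answer_match(example, pred, trace=None):
    # Exact match on the answer letter (e.g. "A", "B", ...)
    return example.answer.strip() == pred.answer.strip()

# Evaluation harness over the dev set
evaluate = Evaluate(devset=devset, metric=answer_match,
                    num_threads=8, display_progress=True, display_table=5)
evaluate(CoT())
```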
In this section of code, I've established the CoT class, defined an evaluation metric, and set up the evaluation process. These components work together to assess GPT-4o-mini's performance in answering questions using the chain of thought approach.
The final result is 70.0, which shows that GPT-4o-mini achieves 70.0% accuracy on this MMLU-Pro subset.
What is MIPRO?
Before executing the MIPRO code, let's introduce the MIPRO optimizer. MIPRO is a sophisticated tool designed to optimize prompts for language model programs. It enhances these programs' performance by refining both the instructions and the few-shot demonstrations used in the prompts. This dual optimization ensures that the prompts are more effective and tailored to the language model programs' specific tasks.
Although MIPRO V2 is the newly released optimizer, I'll first explore MIPRO V1 (simply referred to as MIPRO). There are several important arguments you should know when using MIPRO:
- `prompt_model` is responsible for generating and refining the prompts. It creates new instructions and few-shot examples for use in the pipeline, essentially crafting the prompts that guide the task model.
- `task_model` is the model that actually performs the tasks. It uses the prompts generated by the `prompt_model` to execute tasks such as answering questions or generating text.
- `num_candidates` is the number of new prompts generated during the optimization process. It determines how many candidate prompts will be evaluated to find the best-performing one based on the specified metric.
Optimization with MIPRO V1 and Evaluation
First, we need to set the `prompt_model`. Since this model is responsible for refining the prompts, I've chosen chatgpt-4o-latest as the `prompt_model`. Then, I set `num_candidates=5`. The code is as follows:
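A sketch of the MIPRO compile call under these settings; the trial count and demo limits below are my assumptions, and the exact keyword arguments may differ across DSPy releases:

```python
from dspy.teleprompt import MIPRO

# chatgpt-4o-latest writes the prompts; GPT-4o-mini executes the task
prompt_model = dspy.OpenAI(model="chatgpt-4o-latest", max_tokens=1000)

teleprompter = MIPRO(
    metric=answer_match,
    prompt_model=prompt_model,
    task_model=gpt4o_mini,
    num_candidates=5,
)

compiled_cot = teleprompter.compile(
    CoT(),
    trainset=trainset,
    num_trials=10,                 # assumption: number of optimization trials
    max_bootstrapped_demos=3,      # assumption
    max_labeled_demos=3,           # assumption
    eval_kwargs=dict(num_threads=8, display_progress=True),
)
```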
After executing this code, we'll observe the optimization process. It appears as follows:
Let’s evaluate the optimized result.
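Reusing the evaluation harness sketched earlier, this is simply:

```python
# Score the MIPRO-optimized program on the same dev set
evaluate(compiled_cot)
```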
The accuracy is 77.0, and the output is as follows:
We can see that the accuracy increased from 70.0 to 77.0. This is a significant improvement. From my previous blog, we know that BootstrapFewShotWithOptuna only achieved 73.0. Therefore, MIPRO is clearly the superior optimizer. So, what about MIPRO V2?
Optimization with MIPRO V2 and Evaluation
MIPRO V2 is an upgrade from MIPRO V1, offering more intelligence and cost-efficiency. Here are some key hyperparameters to consider:
- `num_candidates`: The number of instructions and few-shot examples to generate and optimize over.
- `num_batches`: The number of optimization batches to run.
Let's proceed with the following code:
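A sketch of the MIPRO V2 call, mirroring the V1 setup; the demo limits and `init_temperature` are my assumptions, `num_batches` matches the hyperparameter described above, and keyword arguments may vary by DSPy version:

```python
from dspy.teleprompt import MIPROv2

teleprompter_v2 = MIPROv2(
    metric=answer_match,
    prompt_model=prompt_model,
    task_model=gpt4o_mini,
    num_candidates=5,
    init_temperature=0.5,          # assumption
)

compiled_cot_v2 = teleprompter_v2.compile(
    CoT(),
    trainset=trainset,
    num_batches=10,                # assumption: number of optimization batches
    max_bootstrapped_demos=3,      # assumption
    max_labeled_demos=3,           # assumption
    requires_permission_to_run=False,
)
```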
This code will print the optimization process. Since the output is lengthy, I won't display it here for simplicity's sake; please refer to the code on Colab for the full output.
Now, let's evaluate the result optimized by MIPRO V2.
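As with V1, the evaluation is a one-liner using the harness defined earlier:

```python
# Score the MIPRO V2-optimized program on the same dev set
evaluate(compiled_cot_v2)
```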
The accuracy is 78.0, even higher than MIPRO V1's, and the output is as follows:
Conclusion
In this blog, I explored the newly released optimizers in DSPy, MIPRO V1 and MIPRO V2, and demonstrated their effectiveness as optimization methods. GPT-4o-mini achieved a score of 78.0 on MMLU-Pro with MIPRO V2, potentially surpassing GPT-4's performance, though it's important to note that this was based on a test of only 100 questions. MIPRO V1 also showed significant improvement, achieving a score of 77.0.
These results highlight the potential of advanced optimizers like MIPRO V1 and MIPRO V2 for enhancing the performance of language models, even smaller ones like GPT-4o-mini. The significant improvements observed suggest that these optimizers could be valuable tools for researchers and practitioners looking to maximize the capabilities of their language models. However, further testing on larger datasets would be necessary to confirm these promising initial findings.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/gpt-4o-mini-dspy-mipro
- Copyright: All articles in this blog, except where specially stated, are licensed under BY-NC-SA. Please credit the source!