Abstract
This post builds on my previous blog about GPT-4o-mini's performance on MMLU-Pro using BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna. In this continuation, I examine the newly introduced optimizers, MIPRO and MIPRO V2, to assess their optimization capabilities and the performance gains they may bring to GPT-4o-mini. Some of the code is reused from my previous blog, and the full code is available on Colab.
Setup
To use DSPy effectively, make sure you have your OPENAI_API_KEY ready, then install the DSPy package as follows:
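The original snippet isn't embedded in this export, so here is a minimal sketch of the setup, assuming the `dspy.OpenAI` client from the DSPy version available at the time of writing; the placeholder key is yours to replace:

```python
# Install DSPy first (published on PyPI as "dspy-ai"):
#   pip install dspy-ai

import os
import dspy

# Make the OpenAI key available to DSPy's client
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key

# Configure GPT-4o-mini as the default task model
gpt4o_mini = dspy.OpenAI(model="gpt-4o-mini", max_tokens=1000)
dspy.settings.configure(lm=gpt4o_mini)
```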
With these steps complete, you'll be ready to use the advanced optimization features DSPy offers.
Loading Dataset
For simplicity, I've chosen MMLU-Pro as the dataset for this test. Let's load the dataset and examine it.
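A minimal sketch of the loading step, assuming the Hugging Face `TIGER-Lab/MMLU-Pro` dataset; the 100/100 train/dev split sizes below are my assumption:

```python
import random
from datasets import load_dataset
import dspy

# Load the MMLU-Pro test split from the Hugging Face Hub
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(dataset)

# Wrap each row as a DSPy example; field names follow the MMLU-Pro schema
examples = [
    dspy.Example(
        question=row["question"],
        options=row["options"],
        answer=row["answer"],
    ).with_inputs("question", "options")
    for row in dataset
]

# Take a small random subset for training and evaluation
random.seed(42)
random.shuffle(examples)
trainset, devset = examples[:100], examples[100:200]
```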
When you execute this code, it will output the following:
Now, let's take a look at the dataset.
This code will display the structure of the first data sample in the trainset:
Answer the Question with CoT
Now, let's define the Chain of Thought (CoT) class and establish our evaluation metric.
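A sketch of how this can look; the module and metric names (`CoT`, `answer_match`) are mine, and the metric assumes the answer field holds the option letter:

```python
from dspy.evaluate import Evaluate

class CoT(dspy.Module):
    """Answer multiple-choice questions with chain-of-thought reasoning."""

    def __init__(self):
        super().__init__()
        # ChainOfThought inserts a reasoning step before producing the answer
        self.generate_answer = dspy.ChainOfThought("question, options -> answer")

    def forward(self, question, options):
        return self.generate_answer(question=question, options=options)

def answer_match(example, pred, trace=None):
    # Exact match on the answer letter (e.g. "A", "B", ...)
    return example.answer.strip() == pred.answer.strip()

# Evaluation harness over the dev set
evaluate = Evaluate(devset=devset, metric=answer_match,
                    num_threads=8, display_progress=True, display_table=5)
evaluate(CoT())
```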
In this section of code, I've established the CoT class, defined an evaluation metric, and set up the evaluation process. These components work together to assess GPT-4o-mini's performance in answering questions using the chain of thought approach.
The final result is 70.0, which shows that GPT-4o-mini achieves 70.0% accuracy on this MMLU-Pro subset.
What is MIPRO?
Before executing the MIPRO code, let's introduce the MIPRO optimizer. MIPRO is a sophisticated tool designed to optimize prompts for language model programs. It enhances these programs' performance by refining both the instructions and the few-shot demonstrations used in the prompts. This dual optimization ensures that the prompts are more effective and tailored to the language model programs' specific tasks.
Although MIPRO V2 is the newly released optimizer, I'll first explore MIPRO V1 (simply referred to as MIPRO). There are several important arguments you should know when using MIPRO:
- `prompt_model` is responsible for generating and refining the prompts. It creates new instructions and few-shot examples for use in the pipeline, essentially crafting the prompts that guide the task model.
- `task_model` is the model that actually performs the tasks. It uses the prompts generated by the `prompt_model` to execute tasks such as answering questions or generating text.
- `num_candidates` is the number of new prompts generated during the optimization process. It determines how many candidate prompts will be evaluated to find the best-performing one based on the specified metric.
Optimization with MIPRO V1 and Evaluation
First, we need to set the `prompt_model`. Since this model is responsible for refining the prompts, I've chosen chatgpt-4o-latest as the `prompt_model`. Then, I set `num_candidates=5`. The code is as follows:
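A sketch of the MIPRO compile call under these settings; the trial count and demo limits below are my assumptions, and the exact keyword arguments may differ across DSPy releases:

```python
from dspy.teleprompt import MIPRO

# chatgpt-4o-latest writes the prompts; GPT-4o-mini executes the task
prompt_model = dspy.OpenAI(model="chatgpt-4o-latest", max_tokens=1000)

teleprompter = MIPRO(
    metric=answer_match,
    prompt_model=prompt_model,
    task_model=gpt4o_mini,
    num_candidates=5,
)

compiled_cot = teleprompter.compile(
    CoT(),
    trainset=trainset,
    num_trials=10,                 # assumption: number of optimization trials
    max_bootstrapped_demos=3,      # assumption
    max_labeled_demos=3,           # assumption
    eval_kwargs=dict(num_threads=8, display_progress=True),
)
```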
After executing this code, we'll observe the optimization process. It appears as follows:
Let’s evaluate the optimized result.
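Reusing the evaluation harness sketched earlier, this is simply:

```python
# Score the MIPRO-optimized program on the same dev set
evaluate(compiled_cot)
```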
The accuracy is 77.0, and the output is as follows:
We can see that the accuracy increased from 70.0 to 77.0. This is a significant improvement. From my previous blog, we know that BootstrapFewShotWithOptuna only achieved 73.0. Therefore, MIPRO is clearly the superior optimizer. So, what about MIPRO V2?
Optimization with MIPRO V2 and Evaluation
MIPRO V2 is an upgrade from MIPRO V1, offering more intelligence and cost-efficiency. Here are some key hyperparameters to consider:
- `num_candidates`: The number of instructions and few-shot examples to generate and optimize over.
- `num_batches`: The number of optimization batches to run.
Let's proceed with the following code:
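A sketch of the MIPRO V2 call, mirroring the V1 setup; the demo limits and `init_temperature` are my assumptions, `num_batches` matches the hyperparameter described above, and keyword arguments may vary by DSPy version:

```python
from dspy.teleprompt import MIPROv2

teleprompter_v2 = MIPROv2(
    metric=answer_match,
    prompt_model=prompt_model,
    task_model=gpt4o_mini,
    num_candidates=5,
    init_temperature=0.5,          # assumption
)

compiled_cot_v2 = teleprompter_v2.compile(
    CoT(),
    trainset=trainset,
    num_batches=10,                # assumption: number of optimization batches
    max_bootstrapped_demos=3,      # assumption
    max_labeled_demos=3,           # assumption
    requires_permission_to_run=False,
)
```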
This code will print the optimization process. Since the output is lengthy, I won't display it here for simplicity's sake; please refer to the code on Colab for the full output.
Now, let's evaluate the result optimized by MIPRO V2.
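As with V1, the evaluation is a one-liner using the harness defined earlier:

```python
# Score the MIPRO V2-optimized program on the same dev set
evaluate(compiled_cot_v2)
```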
The accuracy is 78.0, even higher than MIPRO V1's, and the output is as follows:
Conclusion
In this blog, I explored the newly released optimizers in DSPy, MIPRO V1 and MIPRO V2, and demonstrated their effectiveness as optimization methods. GPT-4o-mini achieved a score of 78.0 on MMLU-Pro with MIPRO V2, potentially surpassing GPT-4's performance, though it's important to note that this was based on a test of only 100 questions. MIPRO V1 also showed significant improvement, achieving a score of 77.0.
These results highlight the potential of advanced optimizers like MIPRO V1 and MIPRO V2 for enhancing the performance of language models, even smaller ones like GPT-4o-mini. The significant improvements observed suggest that these optimizers could be valuable tools for researchers and practitioners looking to maximize the capabilities of their language models. However, further testing on larger datasets would be necessary to confirm these promising initial findings.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/gpt-4o-mini-dspy-mipro
- Copyright: All articles in this blog, except where specially stated, are licensed under BY-NC-SA. Please credit the source!