Table of Contents
- What is LLM-as-a-Judge
- LLM-as-a-Judge Pros & Cons
- How to solve these limitations?
- Agreement Between LLM-as-a-Judge and Humans
- Application of LLM-as-a-Judge
  - Length-Controlled AlpacaEval
  - LLM-as-a-Judge in Instruction Data Synthesis
- Some tips
- What is VLM-as-a-Judge
- VLM-as-a-Judge Pipeline
- VLM-as-a-Judge Pros & Cons
- Agreement Between VLM-as-a-Judge and Humans
- Conclusion
- References
With the rapid development of LLMs, the community needs an efficient and accurate way to evaluate LLM performance automatically, as human annotation is tedious and time-consuming. LLM-as-a-Judge has emerged as a practical solution to this need.
In this blog, I will discuss this method, covering its pros and cons and the pitfalls to avoid, based on several papers. The technique gained traction after the release of GPT-4, the first model intelligent enough to evaluate other models' outputs. The method is also effective in the multimodal field: numerous studies validate the use of Vision-Language Models as judges (VLM-as-a-Judge).
So, let's explore LLM-as-a-Judge first.
What is LLM-as-a-Judge
In simple terms, this concept involves using models to evaluate the performance of other models in tasks instead of relying on human labor. As model intelligence rapidly develops, we see models like GPT-4, which are highly aligned with human values, performing various tasks such as generation, summarization, writing, Q&A, and more. But how do these models handle evaluation? Chiang et al. (2023) demonstrate the potential of using LLMs in evaluation tasks. The core process is shown below:
The process mirrors how human evaluators assess tasks and is quite straightforward. We prompt the LLM to evaluate various attributes, each rated on a 5-point scale. This point-wise scoring is one of the most commonly used evaluation methods. We focus on key dimensions such as coherence, relevance, and truthfulness, prompting the LLM to assess each attribute, where 1 point indicates a poor outcome and 5 points indicate the best outcome. Typically, we allow LLMs to explain their scores to help identify any errors in the evaluation process.
Another example of point-wise scoring, from Lin et al. (2023), is shown below:
It employs a single prompt to effectively assess open-domain conversations based on several predefined key attributes using LLMs.
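To make point-wise scoring concrete, here is a minimal sketch of how such a prompt can be assembled. The attributes, scale wording, and output format are illustrative assumptions, not the exact prompts from Chiang et al. (2023) or Lin et al. (2023).

```python
# A minimal point-wise scoring prompt builder. The attributes, scale, and
# wording below are illustrative assumptions, not the papers' exact prompts.
POINTWISE_TEMPLATE = """You are an impartial judge. Rate the response below on
each attribute from 1 (poor) to 5 (excellent), then briefly explain each score.

Attributes: {attributes}

[Question]
{question}

[Response]
{response}

Answer with one line per attribute: "<attribute>: <score> - <explanation>"."""

def build_pointwise_prompt(question: str, response: str,
                           attributes=("coherence", "relevance", "truthfulness")) -> str:
    """Fill the template; send the result to whichever judge LLM you use."""
    return POINTWISE_TEMPLATE.format(
        attributes=", ".join(attributes), question=question, response=response)
```

Asking for an explanation alongside each score, as in the template above, is what makes errors in the judge's reasoning easy to spot later.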
The other commonly used evaluation method is pairwise comparison. Zheng et al. (2023) used this approach to systematically study LLM-as-a-Judge on MT-Bench and Chatbot Arena, two benchmarks they developed. Let me introduce pairwise comparison first: the LLM judge is given a question and two answers and is tasked with determining which one is better or declaring a tie. Zheng et al. (2023) also present two other LLM-as-a-Judge variations. One is point-wise scoring, referred to as single answer grading in the paper; the other is reference-guided grading, where the LLM judge is also given a reference answer to reduce potential bias during judgment. The prompts for these three methods are shown below:
There are notable findings regarding pairwise comparison from Chen et al. (2023). Their detailed analysis indicates that pairwise comparison using ChatGPT did not produce satisfactory results. Some experimental results are shown below:
They also examined if the unexpected results were due to poorly designed prompts by evaluating alternative prompts. However, changing the prompts did not enhance performance; in fact, it worsened it. See below:
Chen et al. (2023) suggest that the low quality of candidate texts is the primary reason for ChatGPT's inconsistent performance with pairwise comparison.
LLM-as-a-Judge Pros & Cons
Pros. The benefits of LLM-as-a-Judge are clear: scalability, explainability, and minimal human involvement. It enables rapid iteration in practice and simplifies the establishment of benchmarks for model performance evaluation. Additionally, it enhances interpretability, as the LLM judge can articulate the reasons behind its scores or preferences.
Cons. There are several limitations highlighted by Zheng et al. (2023): (i) Position bias, (ii) Verbosity bias, (iii) Self-enhancement bias, and (iv) limited capability in grading math and reasoning questions.
Let’s first examine position bias. This occurs when the LLM judge shows a preference for responses based on their position. Figure 6 provides an example of position bias.
As shown, GPT-4, as an LLM judge, demonstrates a preference for the first response. This limitation is not unique to GPT-4; many other advanced LLMs exhibit similar biases. See below:
As observed, only GPT-4 shows consistency in over 60% of cases. Claude-v1 demonstrates a strong inclination toward the first answer in over 75% of cases and also shows fluctuations when renaming the assistants in the default prompt.
The second limitation is verbosity bias: LLM judges tend to favor longer, more verbose responses, even when they are ambiguous or inaccurate, so longer answers generally receive higher scores. To test this bias, Zheng et al. (2023) designed a “repetitive list” attack that makes answers from MT-Bench unnecessarily longer. The results are shown below:
“Failure rate” refers to the rate at which an LLM judge fails to choose the correct answer. As shown, all LLMs are prone to verbosity bias, though GPT-4 resists it best with an 8.7% failure rate; Claude-v1 and GPT-3.5 exhibit failure rates above 90% under this attack.
Self-enhancement bias is another significant limitation. It refers to the tendency of LLM judges to prefer answers they have generated. The experiment demonstrating this can be observed below:
It is evident that some LLMs show a preference for certain models when compared to human judgments. For instance, GPT-4 rates itself 10% higher, while Claude-v1 rates itself 25% higher. Additionally, these models also show favoritism towards other models.
Limited capability in grading math and reasoning questions is an interesting bias found by researchers, especially given that LLMs can solve many difficult reasoning and math problems, sometimes ones humans cannot. Below is an example demonstrating this bias.
As Figure 10 shows, GPT-4 can answer the question correctly when asked separately, yet it makes the wrong judgment when asked which of two responses is correct.
How to solve these limitations?
Given that we are aware of the limitations of LLM-as-a-Judge, how can we address them?
There are several methods to address these limitations: (i) Swapping positions; (ii) Few-shot judge; (iii) Chain of thought and reference-guided judge.
Swapping positions is a straightforward method to tackle position bias. It involves calling an LLM judge twice by switching the order of two answers, and only considers an answer a win if it is selected both times. Inconsistent results are deemed a tie.
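A minimal sketch of this logic, assuming a hypothetical `judge(question, first_answer, second_answer)` callable that returns "A" (first answer wins), "B" (second answer wins), or "tie":

```python
# Position-swapping sketch. `judge` is a hypothetical callable that returns
# "A" (first answer wins), "B" (second answer wins), or "tie".
def swap_consistent_verdict(judge, question, answer_1, answer_2):
    first_pass = judge(question, answer_1, answer_2)   # answer_1 shown first
    second_pass = judge(question, answer_2, answer_1)  # order swapped
    # Map the swapped verdict back to the original answer labels.
    second_pass = {"A": "B", "B": "A", "tie": "tie"}[second_pass]
    # Count a win only when both orders agree; otherwise declare a tie.
    return first_pass if first_pass == second_pass else "tie"
```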
Few-shot judge is a highly effective prompt technique, also used in the LLM-as-a-judge framework. It helps enhance the consistency of judgments.
Chain-of-thought and reference-guided judge. Chain-of-thought is typically used to enhance reasoning capabilities and can be applied in the LLM-as-a-judge to improve grading of math and reasoning problems. However, it's important to note that Chain-of-thought is not a perfect solution for this limitation.
However, even with the CoT prompt, we find that in many cases LLM makes exactly the same mistake as the given answers in its problem-solving process. —from Zheng et al., 2023
Thus, the reference-guided method may be more effective in addressing this limitation. The experimental results are shown below:
As shown, with the reference-guided approach, the failure rate drops from 70% to 15%, while CoT still has a 30% failure rate. This indicates that reference-guided is the better option.
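For reference, here is a sketch of what a reference-guided pairwise prompt can look like; the wording is illustrative, not the exact prompt from Zheng et al. (2023).

```python
# Reference-guided pairwise prompt sketch (illustrative wording, not the
# paper's exact prompt).
REFERENCE_GUIDED_TEMPLATE = """You are an impartial judge. Use the reference
answer below as the ground truth when comparing the two assistant answers.

[Question]
{question}

[Reference Answer]
{reference}

[Assistant A]
{answer_a}

[Assistant B]
{answer_b}

First check each answer against the reference step by step, then output your
verdict on the last line as exactly one of: "A", "B", or "tie"."""
```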
Fine-tuning a personal judge model. Generally, using GPT-4 as the LLM judge is effective due to its intelligence and alignment with human values. However, costs escalate as the number of evaluations grows, and fine-tuning a smaller model can be a cost-effective alternative. Many open-source LLMs are available, such as Qwen, Yi, and others. Zheng et al. (2023) use Vicuna, a fine-tuned LLaMA-based model, as their personal LLM judge. How does it perform? Let’s look at the results:
It is clear that after fine-tuning, consistency significantly improves from 16.2% to 65%, and position bias decreases from 53.8% to 27.5%. In short, a fine-tuned small model shows great potential to replace a costly closed-source model. Wang et al. (2023) conducted similar research: they developed an LLM judge, PandaLM, tuned from LLaMA and trained to identify the superior model among several LLMs. The results are shown in Figure 13.
The performance of PandaLM-70B exceeds that of GPT-4, showing that fine-tuning open-source models is an effective way to lower costs and improve evaluation performance while addressing the limitations discussed above.
Agreement Between LLM-as-a-Judge and Humans
Understanding the alignment between LLM and human judgments is crucial for validating the effectiveness of LLM-as-a-Judge. Zheng et al. (2023) also designed a study of the agreement between GPT-4 and humans, showing that GPT-4 achieves high agreement with human experts. The results can be seen below:
The agreement rate under setup S2 (without tie) between GPT-4 and humans is 85%, surpassing the 81% agreement rate among humans themselves. This indicates that GPT-4’s judgments closely align with the majority of human opinions.
Zhou et al. (2023) discovered that instruction-tuning on a smaller, high-quality dataset can outperform many larger models. They also demonstrated that LLM-as-a-Judge closely aligns with human evaluation. The experimental results are shown below:
However, Bavaresco et al. (2024) introduced JUDGE-BENCH to highlight the notable differences in agreement between LLM-as-a-Judge and human judges. They observed that while certain LLMs align closely with human judgments on some datasets, there are tasks where LLMs do not perform as well. Detailed results are presented in Figure 16, Figure 17 and Figure 18. See below:
As illustrated in Figure 16, GPT-4o ranks first in several evaluation scenarios, although it underperforms in some cases. The open models Llama3-70B and Mixtral-8x22B are relatively close in performance and even surpass GPT-4o in certain scenarios.
In Figure 17, all models show high correlations with the results annotated by non-experts, with GPT-4o performing best overall.
According to Figure 18, GPT-4o and Gemini-1.5 achieve the highest scores in acceptability and verbosity evaluations, while Mixtral-8x22B and Mixtral-8x7B exhibit the strongest correlations for coherence and consistency.
It is also important to note that all models align better with human judgments when evaluating human language compared to machine-generated text, for both categorical and graded data. Refer to Figure 19 below:
Finally, the authors highlight that although GPT-4 shows strong judging performance, there are still instances where other models perform better, and evidence remains limited that state-of-the-art large language models are ready to replace human judges.
Application of LLM-as-a-Judge
There is other significant research applying LLM-as-a-Judge. Below, I introduce a few key studies.
Length-Controlled AlpacaEval
AlpacaEval is a tool that uses 805 fixed instructions to evaluate models based on their responses, similar to user interactions on the Alpaca web demo. It employs a GPT-4 Turbo-based evaluator to compare responses and determine the likelihood of preferring the model being tested. The win rate is the expected probability that the auto-evaluator prefers the evaluated model's responses on the 805 instructions, reflecting the chatbot's performance.
However, Alpaca Eval often favors models that provide longer answers (Verbosity bias), which can lead to biased results. To address this, researchers have developed a length-controlled version of AlpacaEval (Dubois et al., 2024). They use a Generalized Linear Model (GLM) to tackle this issue.
Specifically, they consider three attributes that affect the judgment:
- Model identity
- Length of output
- Instruction difficulty
Once the model is trained, it filters out the influence of terms believed to be spurious correlations with output quality, leaving only the true quality score. In this context, the length of output is removed from the regression, and the win rate is computed as usual, resulting in a length-controlled AlpacaEval score.
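To make the idea concrete, here is a simplified sketch of the regress-then-zero-out step, assuming a one-hot model-identity encoding, a tanh-squashed length difference, and a scalar difficulty feature. This illustrates the idea only; it is not the exact GLM parameterization from Dubois et al. (2024).

```python
# Simplified length-controlled win-rate sketch: fit a logistic GLM on
# (model identity, length term, difficulty), then recompute the win rate
# with the length column zeroed out. Feature encoding is an assumption,
# not the paper's exact parameterization.
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_controlled_win_rate(model_onehot, length_delta, difficulty,
                               judge_prefers_model):
    # model_onehot:        (n, k) indicators for the evaluated model
    # length_delta:        (n,)  output-length difference vs. the baseline
    # difficulty:          (n,)  per-instruction difficulty estimate
    # judge_prefers_model: (n,)  binary judge preferences (1 = preferred)
    X = np.column_stack([model_onehot, np.tanh(length_delta), difficulty])
    glm = LogisticRegression().fit(X, judge_prefers_model)
    # Zero the length column so its coefficient no longer contributes,
    # then average the predicted preference probabilities.
    X_controlled = X.copy()
    X_controlled[:, model_onehot.shape[1]] = 0.0
    return glm.predict_proba(X_controlled)[:, 1].mean()
```

The key design choice is that the length term is kept during fitting, so it absorbs the verbosity effect, and only dropped at prediction time.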
From Figure 20, we can observe that AlpacaEval is highly sensitive to length: the baseline model (gpt4_1106_preview) fluctuates from 22.9% to 64.3% when the verbosity instruction in the prompt is varied. In the length-controlled version, sensitivity to length is much lower, with gpt4_1106_preview fluctuating only from 41.9% to 51.6%.
LLM-as-a-Judge in Instruction Data Synthesis
Luo et al. (2024) borrow the idea from MT-Bench and Chatbot Arena (Zheng et al., 2023) and show that LLM-as-a-Judge can be used not only in LLM evaluation benchmarks but also in data generation, to evaluate instruction data quality. An overview of a running example is as follows:
A judge model is equipped with dialogue history, user instructions, and responses from two LLMs to assess which response is better. It provides scores for each response, along with detailed explanations that focus on factors such as relevance, coherence, and factual accuracy. Each response receives an overall score on a scale from 1 to 10, where a higher score indicates better performance. To reduce position bias, Luo et al. (2024) use a two-stage setup, alternating the positions of the two responses.
Using LLM-as-a-judge to evaluate data quality is quite common. The recent technical report on Llama3 (Llama3 Team, 2024) from Meta demonstrates how they use this method to filter out low-quality data and improve model performance. They use the Llama3 checkpoint to rate each sample on a three-point scale for general English data (Accuracy, Instruction Following, and Tone) and a two-point scale for coding data (Bug Identification and User Intention).
Xu et al. (2023) also used LLM-as-a-Judge extensively in their experiments. They demonstrated a method for creating large amounts of instruction data with varying levels of complexity using LLMs instead of humans, and used ChatGPT to filter out low-quality data. The prompt they used is as follows:
Here are two Instructions to ChatGPT AI, do you think they are equal to each other, which meet the following requirements: 1. They have same constraints and requirments. 2. They have same depth and breadth of the inquiry. The First Prompt: <Here is first instruction.> The Second Prompt: <Here is second instruction.> Your Judgement (Just answer: Equal or Not Equal. No need to explain the reason.):
They also used GPT-4 as an LLM judge to evaluate WizardLM-7B. The results can be seen below:
WizardLM outperforms Alpaca-7B and Vicuna-7B on the Evol-Instruct test set by 6.2% and 5.8%, respectively, and shows comparable performance to Vicuna-7B on the Vicuna test set. As shown in Figure 20(c), WizardLM surpasses Vicuna across difficulty levels and exceeds Alpaca on both easy and hard skills, reaching almost 88% of ChatGPT's capacity on hard skills.
Some tips
So, what should we do in practice when using LLM-as-a-Judge?
- First, decide which LLM to use for making judgments. Closed-source models like GPT-4 (or GPT-4o) are effective but costly. Alternatives include open-source models like Qwen2-72B-Instruct and Llama-3.1-70B (or even Llama-3.1-405B).
- Second, choosing the right judgment method is critical. Pairwise comparison and point-wise scoring are both good options but differ in application. Select the method that best reflects the model’s performance for your business.
- Third, employ strategies to mitigate potential limitations, such as Chain of Thought (CoT), using different LLMs for evaluation, and swapping the positions of the responses.
- Finally, check the agreement between human judgment and LLM-as-a-Judge to ensure the outcome aligns with your expectations; a minimal sketch of such a check follows this list.
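Here is a minimal sketch of such an agreement check, assuming categorical verdicts ("A", "B", "tie") from both humans and the judge; the report format is my own.

```python
# A quick agreement check between human and LLM-judge labels ("A"/"B"/"tie").
# Plain agreement rate plus Cohen's kappa (chance-corrected) is a good start.
from sklearn.metrics import cohen_kappa_score

def agreement_report(human_labels, judge_labels):
    rate = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {"agreement_rate": rate, "cohen_kappa": kappa}
```

If the agreement rate on a small human-labeled sample is far below the ~80% human-human agreement reported by Zheng et al. (2023), revisit your prompt or judge model before scaling up.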
Now it is time to delve deeper into Vision-Language Models (VLMs), also known as Multimodal Large Language Models (MLLMs).
What is VLM-as-a-Judge
Inspired by LLM-as-a-Judge, the VLM-as-a-Judge approach follows a similar methodology: a highly capable Vision-Language Model (VLM) evaluates responses generated by other models, leveraging its advanced capabilities to make the evaluation process more accurate and comprehensive.
VLM-as-a-Judge Pipeline
Bai et al. (2023) developed a diverse visual dialogue evaluation dataset to demonstrate that GPT-4 can effectively evaluate the quality of VLM responses. The evaluation process is outlined below:
We observe that detailed image descriptions are obtained through manual annotation and inspection. These descriptions, along with questions, are input into GPT-4 (text-only) to generate reference answers. Meanwhile, various VLMs use visual signals and questions to generate answers directly. The generated answers, reference answers, questions, and detailed descriptions are all evaluated by GPT-4. The final scores are averaged and used to rank the models, representing their overall performance. The authors consider the usefulness, relevance, and accuracy of the answers as the key attributes for evaluation.
The primary function of this pipeline is to transform information from various modalities into text through detailed annotations. This enables advanced language models to independently evaluate dialogue quality, shifting the role from VLM-as-a-judge to LLM-as-a-judge.
Below is an example demonstrating how GPT-4 evaluates responses and identifies hallucination situations in context.
From this example, we can see that this approach is essentially the same as LLM-as-a-Judge with point-wise scoring; the principles and processes overlap so closely that the two can be treated as equivalent methods in practice.
Chen et al. (2024) presented a comparable VLM-as-a-Judge pipeline, as illustrated below:
The process described is similar to Figure 23, with three key differences.
- First, Chen et al. (2024) refer to this pipeline as MLLM-as-a-Judge, which appears to be the same as VLM-as-a-Judge.
- Second, Chen et al. (2024) also conducted a pairwise comparison and batch ranking within the pipeline.
- Third, the authors did not convert image information to text and did not use LLMs as evaluators; they used only multimodal models throughout the pipeline.
This process includes three steps:
- Image-Instruction Pair Collection: Chen et al. (2024) created a curated dataset of 4,414 image-text pairs from various downstream task datasets.
- MLLM Response Collection: The authors used six popular MLLMs - GPT-4V, Gemini, LLaVA, Qwen-VL-Max, LLaVA-1.6-34b, and CogVLM - to generate responses based on the image-instruction pairs.
- Comparison with Human Annotations: The authors employed the VLM-as-a-Judge method and evaluated the agreement between VLM judgments and human annotations.
Similar to LLM-as-a-Judge, VLM-as-a-Judge utilizes two judgment methods: Scoring Evaluation and Pair Comparison. For clarity, we will refer to these as pointwise scoring and pairwise comparison. Additionally, there is another method called batch ranking that is not used in LLM-as-a-Judge. This method systematically arranges the responses in descending order of quality based on a given instruction, without allowing for any ties.
The prompts for the three methods are as follows:
You may be interested in understanding which method performs better in real-world scenarios. Let's explore the advantages and disadvantages of VLM-as-a-Judge to evaluate each method’s performance, much like our previous discussion on LLM-as-a-Judge.
VLM-as-a-Judge Pros & Cons
The limitations present in LLM-as-a-Judge also need to be carefully considered in VLM-as-a-Judge. To mitigate the influence of answer positioning, Bai et al. (2023) conduct a second scoring round by swapping the positions of answers and then compute the average of the two scores obtained.
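Here is a minimal sketch of that swap-and-average step, assuming a hypothetical `score_pair(question, first, second)` callable that returns the two scores in presentation order; it illustrates the idea, not the authors' exact implementation.

```python
# Swap-and-average sketch in the spirit of Bai et al. (2023). `score_pair`
# is a hypothetical callable returning (score_of_first, score_of_second).
def debiased_scores(score_pair, question, answer_a, answer_b):
    a_first, b_second = score_pair(question, answer_a, answer_b)
    b_first, a_second = score_pair(question, answer_b, answer_a)  # swapped
    # Average each answer's score across the two presentation orders.
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```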
To ensure reliability as a judge, Chen et al. (2024) conducted thorough tests. They examined the consistency comparison between GPT-4V and Gemini to identify any potential bias in VLM-as-a-Judge. The results are shown below:
As shown in Figure 27, GPT-4V outperforms Gemini in all tasks. Specifically, in pair-wise comparison, GPT-4V achieves a higher consistency of 0.675. However, it struggles to maintain the same levels of consistency in point-wise scoring and batch ranking.
From Figure 28, there is an interesting finding: CoT may be less a method for aligning more closely with human judgments than a way to reduce hallucination; it even reduces judging performance on many datasets. Zheng et al. (2023) report a similar finding for LLM-as-a-Judge.
Other biases and hallucinations also exist in VLM-as-a-Judge.
Egocentric bias is the same as self-enhancement bias in LLM-as-a-Judge: models tend to assign higher scores to their own responses and lower scores to others'. Figure 26 demonstrates this.
GPT-4V exhibits a slight degree of self-preference, aligning its judgments with its predefined ethical guidelines. For example, GPT-4V consistently emphasizes privacy preservation, leading to higher scores for privacy-related questions based on its own metrics. This limitation cannot be addressed through prompt engineering to ensure neutrality, as the model still relies on judgment criteria set during post-alignment training.
Position bias and length bias are common limitations in both LLM-as-a-Judge and VLM-as-a-Judge. To mitigate position bias, introducing multiple examples in the prompt can be effective. Regarding length bias, Figure 30 shows that both GPT-4V and Gemini tend to assign higher scores to longer content.
Batch ranking also exhibits a higher frequency of hallucinations than pairwise comparison and point-wise scoring.
Agreement Between VLM-as-a-Judge and Humans
Chen et al. (2024) demonstrate that judgments made by GPT-4V are the closest to human annotations across all settings. The experimental results are below:
As shown in Figure 32, VLM-as-a-judge performs better in pairwise comparison but underperforms in pointwise scoring and batch ranking. In pointwise scoring, GPT-4V shows the highest similarity to human scoring with a similarity score of 0.49. In contrast, Gemini achieves only 0.304, with LLaVA and CogVLM scoring even lower. To further investigate this result, the authors present the distribution of the score result density, which is illustrated as follows:
As shown in Figure 33 (right), Gemini, LLaVA, and CogVLM tend to give scores around 4 points, rarely awarding 1 or 2 points. This might be due to an imbalance in positive and negative judging instructions in their training data. GPT-4V’s scores are more evenly distributed and align closely with human preferences.
In pairwise comparison, depicted in Figure 33 (Left) and Figure 32, GPT-4V clearly outperforms other VLMs, indicating strong alignment with human preferences.
In batch ranking, GPT-4V is more closely aligned with human ranking results. However, there is still significant room for improvement in this area for all VLMs.
Chen et al. (2024) showed the agreement between VLM judgments and human annotations for GPT-4V, Gemini, and other public models. On the other hand, Lee et al. (2024) fine-tuned a personal model, PROMETHEUS-VISION, for VLM judgments and demonstrated its correlation with human annotations. The correlation results are as follows:
As demonstrated through extensive testing, PROMETHEUS-VISION slightly surpasses GPT-3.5-TURBO and PROMETHEUS 13B in correlation on VISIT-BENCH, but it still falls short of the more advanced GPT-4 and GPT-4V. PROMETHEUS-VISION shows promise, yet there remains significant room for improvement before it reaches their level of performance.
Therefore, similar to LLM-as-a-Judge, fine-tuning a personalized model is an alternative to relying on GPT-4V for VLM judgments. Unlike LLM-as-a-Judge, however, where many capable open-source models are available, GPT-4V is currently the only model reliable enough for VLM-as-a-Judge; other vision models still lag behind.
Conclusion
In this blog, I explored the LLM-as-a-Judge and VLM-as-a-Judge evaluation methods as discussed in several recent papers. The two methods are quite similar and share several common limitations that need to be addressed. Despite this, GPT-4 from OpenAI remains a promising choice as an evaluator, and in real-world business scenarios both LLM-as-a-Judge and VLM-as-a-Judge can be used effectively.
For LLM-as-a-Judge, I recommend using point-wise scoring and pair-wise comparison. For VLM-as-a-Judge, pair-wise comparison seems to be the better option. Regardless of the method you choose, it is important to check the agreement between human judgment and model judgments to ensure the outcome aligns with your expectations.
References
- Chiang, Lee, et al. “Can Large Language Models Be an Alternative to Human Evaluations?” arXiv preprint arXiv:2305.01937 (2023)
- Lin, Chen, et al. “LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models.” arXiv preprint arXiv:2305.13711 (2023)
- Zheng, Chiang, et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” arXiv preprint arXiv:2306.05685 (2023)
- Zhou, Liu, et al. “LIMA: Less Is More for Alignment.” arXiv preprint arXiv:2305.11206 (2023)
- Wang, Yu, et al. “PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.” arXiv preprint arXiv:2306.05087 (2023)
- Bavaresco, Bernardi, et al. “LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.” arXiv preprint arXiv:2406.18403 (2024)
- Dubois, Galambosi, et al. “Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.” arXiv preprint arXiv:2404.04475 (2024)
- Luo, Sun, et al. “Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena.” arXiv preprint arXiv:2407.10627 (2024)
- Xu, Sun, et al. “WizardLM: Empowering Large Language Models to Follow Complex Instructions.” arXiv preprint arXiv:2304.12244 (2023)
- Llama 3 Team, et al. “The Llama 3 Herd of Models.” https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ (2024)
- Bai, Yang, et al. “TouchStone: Evaluating Vision-Language Models by Language Models.” arXiv preprint arXiv:2308.16890 (2023)
- Chen, Chen, et al. “MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.” arXiv preprint arXiv:2402.04788 (2024)
- Lee, Kim, et al. “Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation.” arXiv preprint arXiv:2401.06591 (2024)
- Chen, Wang, et al. “Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study.” arXiv preprint arXiv:2304.00723 (2023)
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/LLM-as-aJudge
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!