
Job

In July, I helped my team build a private LLM benchmark that is both confidential and tailored to our specific needs. We built it because many public benchmarks have become contaminated as LLMs continue to develop, with test data leaking into training sets. Various LLMs claim to surpass GPT-4 in areas like reasoning, but I haven't seen that level of performance in practice. Constructing the test set was challenging, so I drew inspiration from how many classical test sets are constructed. For evaluation, I read numerous papers and learned more about LLM-as-a-Judge.

Personal

This month, I ran some interesting experiments with Midjourney, TextGrad, Dify, and DSPy, and wrote several blog posts to document them.
On a personal note, I started preparing for the PTE exam, which I plan to take on August 8. A high score is not easy to achieve, but I hope to do well.