
Job

In July, I helped my team build a private LLM benchmark that is confidential and tailored to our specific needs. We built it because many public benchmarks have become contaminated as LLMs continue to develop. Although various LLMs are claimed to surpass GPT-4 in areas like reasoning, I haven't seen that level of performance in practice. Constructing the test set was challenging, so I drew inspiration from how many classical test sets are constructed. For evaluation, I read numerous papers and learned more about LLM-as-a-Judge.

Personal

This month, I conducted some interesting experiments with Midjourney, TextGrad, Dify, and DSPy. I've written several blog posts to document these experiences.
On a personal note, I started preparing for the PTE exam, which I plan to take on August 8. Achieving a high score is not easy, but I hope to do well.
Chengsheng Deng