📬Sep 1, Recap for August

In August, I focused on fine-tuning the Qwen2-7B model and evaluating it on our private benchmark of more than 200 question-answer pairs. I also evaluated several large language models (LLMs), including GPT-4, Gemini 1.5-Pro, and Llama 3-405B, on this benchmark to compare their capabilities in areas such as reasoning, coding, and commonsense knowledge.
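For reference, here is a minimal sketch of how such a benchmark run might look, assuming an OpenAI-compatible endpoint for each model; the file name, model id, and containment-based scoring rule are illustrative, not the actual setup used for the private benchmark:

```python
# Minimal sketch of scoring a model on a private Q&A benchmark.
# The file name, model id, and scoring rule are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # works with any OpenAI-compatible endpoint (e.g. a local vLLM server)

def answer(model: str, question: str) -> str:
    """Ask one benchmark question and return the model's raw answer text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def score(model: str, path: str = "benchmark.jsonl") -> float:
    """Fraction of questions where the reference answer appears in the reply."""
    records = [json.loads(line) for line in open(path)]
    hits = sum(
        r["answer"].lower() in answer(model, r["question"]).lower()
        for r in records
    )
    return hits / len(records)

print(score("gpt-4"))  # repeat for each candidate model
```
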
Aug 26, Flux + LoRA

Aug 21, GPT-4o-mini with DSPy MIPRO on MMLU-Pro

This post builds on my previous post about GPT-4o-mini's performance on MMLU-Pro using BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna. In this continuation, I examine the newly introduced optimizers, MIPRO and MIPROv2, to assess their optimization capabilities and the performance gains they may bring to GPT-4o-mini.
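As a rough illustration of what a MIPROv2 run can look like, here is a minimal sketch using the newer dspy.LM interface; the signature string, metric, and optimizer settings are assumptions, and the exact arguments vary across DSPy versions:

```python
# Minimal sketch of compiling a program with MIPROv2 in DSPy.
# Signature, metric, and optimizer settings are illustrative only.
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", max_tokens=512))

# Tiny placeholder trainset; in practice, build dspy.Example objects from MMLU-Pro.
trainset = [
    dspy.Example(
        question="Which planet is known as the Red Planet?",
        choices="A) Venus  B) Mars  C) Jupiter  D) Saturn",
        answer="B",
    ).with_inputs("question", "choices"),
    # ... more examples
]

# A simple chain-of-thought program: question + choices -> answer letter.
program = dspy.ChainOfThought("question, choices -> answer")

def exact_match(example, prediction, trace=None):
    # Score 1 only when the predicted answer letter matches the gold label.
    return example.answer.strip().upper() == prediction.answer.strip().upper()

optimizer = MIPROv2(metric=exact_match, auto="light")
optimized = optimizer.compile(
    program,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
```
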
August 19, Summarize Web Page Content with Claude3

August 17, Instruction Data Generation

More researchers are recognizing the importance of instruction data in the Supervised Fine-Tuning (SFT) stage. In June, I wrote a blog post about data generation, but it was somewhat superficial, and many new methods have emerged since then. Here, I cover more of the papers I have read on instruction data generation and selection.
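To make the idea concrete, here is a minimal self-instruct-style sketch: a few seed instructions are fed to an LLM to generate new ones, which are then loosely deduplicated. The prompt wording, model name, and similarity filter are assumptions for illustration and do not reproduce any specific paper's method:

```python
# Self-instruct-style sketch: grow an instruction pool by prompting an LLM
# with seed instructions, then reject near-duplicates. All details are illustrative.
import json
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()
seeds = [
    "Explain the difference between supervised and unsupervised learning.",
    "Write a Python function that reverses a linked list.",
]

def generate_candidates(examples, n=5):
    # Ask the model for new, diverse instructions given the current pool.
    prompt = (
        "Here are some task instructions:\n"
        + "\n".join(f"- {s}" for s in examples)
        + f"\nWrite {n} new, diverse task instructions, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]

def is_novel(candidate, pool, threshold=0.7):
    # Reject candidates that look too similar to anything already in the pool.
    return all(SequenceMatcher(None, candidate, s).ratio() < threshold for s in pool)

pool = list(seeds)
for candidate in generate_candidates(pool):
    if is_novel(candidate, pool):
        pool.append(candidate)

print(json.dumps(pool, indent=2))
```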

August 1, Recap for July

In July, I helped my team build a confidential LLM benchmark tailored to our needs, since public benchmarks suffer from contamination. Despite various claims, I have not seen other LLMs surpass GPT-4 in practice. Constructing the test set was challenging, and along the way I learned about LLM-as-a-Judge for evaluation. On the personal side, I experimented with Midjourney, TextGrad, Dify, and DSPy, and documented my experiences in blog posts. I also started preparing for the PTE exam, aiming for a high score on August 8.
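For context, here is a minimal sketch of the LLM-as-a-Judge pattern, where a strong model grades a candidate answer against a reference; the judge prompt and the 1-5 scale are assumptions, not the rubric used in our benchmark:

```python
# LLM-as-a-Judge sketch: ask a strong model to grade a candidate answer
# against a reference on a 1-5 scale. Prompt and scale are illustrative.
import re
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Give a score from 1 (wrong) to 5 (fully correct). Reply with the score only."""

def judge(question: str, reference: str, candidate: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # a strong model acting as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 1

print(judge("What is 2 + 2?", "4", "The answer is 4."))
```
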
July 31, LLM/VLM-as-a-Judge

July 23, DSPy with GPT-4o-mini on MMLU-Pro

DSPy is an optimization framework that improves the prompts sent to, and thus the responses from, models like GPT-4o-mini. This post showcases the framework and demonstrates how its optimizers can improve this cost-effective model. MMLU-Pro is a more challenging dataset with harder questions and an expanded set of answer choices. The evaluation metric simply checks whether the model's response matches the true answer.
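Here is a minimal sketch of such a match-the-true-answer metric, wired into DSPy's Evaluate helper; the example data and the exact matching rule are illustrative rather than the post's actual code:

```python
# Minimal sketch of an answer-matching metric and a DSPy evaluation run.
# Example data and matching rule are assumptions for illustration.
import dspy
from dspy.evaluate import Evaluate

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

devset = [
    dspy.Example(
        question="Which gas makes up most of Earth's atmosphere?",
        choices="A) Oxygen  B) Nitrogen  C) Argon  D) Carbon dioxide",
        answer="B",
    ).with_inputs("question", "choices"),
    # ... remaining MMLU-Pro examples
]

def answer_match(example, prediction, trace=None):
    # Correct only when the predicted option letter equals the gold letter.
    return prediction.answer.strip().upper().startswith(example.answer.strip().upper())

program = dspy.Predict("question, choices -> answer")
evaluate = Evaluate(devset=devset, metric=answer_match, display_progress=True)
evaluate(program)
```
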
July 23, Test with Chameleon From Meta

July 16, LLMs Evals Thoughts

Evaluating LLMs is important for understanding their abilities and for solving real business problems. A good evaluation requires sufficient high-quality data samples, clear judging criteria, meaningful evaluation tasks, and regularly refreshed private benchmarks. The evaluation process should also adapt as LLMs continue to develop.

July 5, LLMs Evaluation Benchmarks

As the capabilities of Large Language Models (LLMs) continue to evolve, many traditional evaluation benchmarks may require updates. With the rapid progress of these models, researchers are increasingly introducing new evaluation datasets, yet the specific dimensions each dataset assesses are often unclear. In this post, I explore a series of commonly referenced evaluation datasets and highlight the particular aspects of model capability they were designed to measure, though I cannot cover every available dataset.

🧡July 7, Weekend with Midjourney

Midjourney provides a platform for exploring different artistic styles and techniques. Whether you're a seasoned artist or a beginner, the tool offers a wide array of options to experiment with and refine your artistic vision. Users can blend various elements, adjust parameters, and see real-time changes, giving them a unique and interactive experience.