In August, I focused on fine-tuning the Qwen2-7b model and evaluating its performance on our private benchmark of more than 200 question-answer pairs. I also evaluated various large language models (LLMs) such as GPT-4, Gemini 1.5-Pro, and Llama 3-405b on this benchmark to compare their capabilities in areas such as reasoning, coding, and commonsense knowledge.
In July, I helped my team build a confidential LLM benchmark tailored to our needs, because public benchmarks suffer from contamination. Despite claims to the contrary, I haven't seen any LLM surpass GPT-4 in practice. Constructing the test set was challenging, and I learned about LLM-as-a-Judge for evaluation. On the personal side, I experimented with Midjourney, TextGrad, Dify, and DSPy, documenting my experiences in blog posts. I also started preparing for the PTE exam, aiming for a high score on August 8.
This month has been emotionally intense, marked by a series of intriguing and unfortunate events. Many instances sparked curiosity and inspiration, while others sadly brought about sorrow and anger. It's truly been a month full of diverse experiences.
This month at work, I mainly focused on completing a few tasks. I explored more possibilities for using Prompt Chain, including using it to write stories, and found that it can generate a pretty good story.