ChengshengDeng

This post is a bilingual, side-by-side version of the official LangChain blog post “The rise of context engineering”.

The Rise of Context Engineering

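This week MiniMax followed the example of OpenAI and DeepSeek and ran a five-day launch week, releasing a lot of substantial material.
I used the weekend to go back over what exactly they shipped during the week.
Let's get right to it!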

MiniMax Launch Week Recap

In this note, I’ll document some critical implementation details about the LightRAG framework that I discovered while deploying it in production.

Mar 24 Notes on LightRAG

I recently read a detailed and comprehensive paper on preparing high-quality datasets, and I’m summarizing key points here for future reference.

Mar 10, Note on BIG-MATH

Alibaba has officially released the production version of their QwQ-32B model.

Mar 6, Note on QwQ-32B

This post reflects on the past two months and captures my thoughts about life and the rapidly evolving AI landscape.

The First Pages of 2025 - My January & February Story

This post is about Grok 3, the new model from xAI.

Feb 20, Notes on Grok3

In this blog, I will explore the Policy Gradient algorithm, a fundamental approach in reinforcement learning.

Feb 8, Notes on Policy Gradient 

Yesterday, I saw an interesting tweet from @Mahesh. His team introduced Bespoke-Stratos-32B, a model distilled from DeepSeek-R1 using Berkeley NovaSky's Sky-T1 recipe. I quickly read their blog post and reviewed Berkeley's recipe to take some notes.  

Jan 23, Notes on Bespoke and NovaSky 

DeepSeek-AI has open-sourced their deep-thinking model, R1. Having read the paper and tested it myself, I'll share my notes about this new model. 

Jan 21, Notes on DeepSeek-R1   

In this blog, I will explore two famous reinforcement learning algorithms: SARSA and Q-Learning. 

Jan 21, Notes on Sarsa & Q-Learning 

Black Forest Labs has announced the launch of the FLUX Pro Finetuning API, bringing unprecedented customization capabilities to the flagship FLUX Pro model. 

Jan 19, Notes on FLUX Pro Finetuning API

Today is January 2, 2025, and I'm writing this recap for December—or more precisely, for all of 2024. 

Jan 2, Recap for December  

Exploration of three key reinforcement learning algorithms: Dynamic Programming (DP) for optimal policies in MDPs, Monte Carlo methods for learning from complete episodes without a model, and Temporal Difference (TD) learning for efficient updates from incomplete episodes using bootstrapping. Each method has unique characteristics and trade-offs essential for understanding advanced concepts in reinforcement learning.

Dec 16, Notes on DP, Monte Carlo, TD in Reinforcement Learning    
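
As a rough illustration of the bootstrapping idea mentioned in the summary above, here is a minimal TD(0) value-update sketch. The function name `td0_update`, the state labels, and the default `alpha` and `gamma` values are assumptions made for this example and are not taken from the post itself.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.99, terminal=False):
    """Apply one TD(0) update to values[state] and return the TD error."""
    # Bootstrapping: use the current estimate of the next state's value
    # instead of waiting for the full return of the episode.
    bootstrap = 0.0 if terminal else values[next_state]
    td_error = reward + gamma * bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical single transition: s0 -> s1 with reward 1.0.
values = defaultdict(float)
error = td0_update(values, "s0", reward=1.0, next_state="s1")
print(values["s0"], error)
```

Because the update only needs one transition, it can be applied during an episode, which is the contrast with Monte Carlo methods that wait for the episode to finish.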

This brief blog post covers key concepts in Reinforcement Learning. Understanding these fundamentals is essential for mastering the field. 

Dec 13, Notes on Basic Concepts about Reinforcement Learning  

Google has announced Gemini 2.0, launching Gemini Flash 2.0 as its first model in this new series.

Dec 12, Notes on Gemini-Flash 2.0

In this blog, I will show some test examples of the full o1 model and compare it with other deep thinking models. Let's start! 

Dec 6, Some Tests on o1  

This document summarizes my experiences in both work and personal life during November.

Dec 4, Recap for November 

Since OpenAI released its "o1-series" model, several teams have developed their own approaches to "deep thinking" models. DeepSeek introduced their o1-like model, DeepSeek-R1-Lite, while Qwen released QwQ-32B-Preview, and Intern launched Intern Thinker. 

Nov 29, Notes on “Deep Thinking Model”  

While this isn't the first blog about DSPy, I've noticed recent updates to the DSPy documentation and GitHub repository, including a new optimization method called BootstrapFinetune.

Nov 26, Explore DSPy on BootstrapFinetune

This isn't my first blog post on DSPy—I've written several before. However, I've noticed some recent updates to DSPy, and I'd rather not consult the documentation every time I want to build programs. So, I plan to jot down some basic DSPy concepts in this post. Additionally, I intend to use this document as external knowledge for GPT or Claude.

Nov 20, Notes on DSPy  

This document summarizes my experiences in both work and personal life during October.

Nov 17, Recap for October  

In this blog, I share notes on an intriguing paper I recently read:

Nov 15, Notes on OPENCODER

Contextual Retrieval, a method proposed by Anthropic, significantly enhances the retrieval step in RAG systems.

Nov 6, Notes on Contextual Retrieval

The blog introduces a novel method for evaluating LLM performance by having them play the Snake game, assessing their decision-making, planning, and strategy skills. The experiment tested several models, revealing that o1-mini performed best with a score of 11, while Claude models outperformed GPT models. The findings suggest that reinforcement learning significantly enhances LLMs' capabilities in dynamic decision-making tasks. Although preliminary, this approach highlights the potential of game-based assessments for deeper insights into LLM competencies, with recommendations for further testing across more models and scenarios.

Oct 30, LLMs cannot Play the Snake Game  

The blog discusses LIGHTRAG, an innovative framework for Retrieval-Augmented Generation (RAG) systems that enhances performance by incorporating graph structures and dual-level retrieval processes. It outlines the challenges faced by traditional RAG systems, such as speed, quality, and understanding limitations, and explains how LightRAG addresses these issues through efficient text indexing and retrieval methods. The framework allows for both specific and abstract queries, improving the ability to handle complex questions and providing tailored responses using a general-purpose LLM.

Oct 18, Notes on LIGHTRAG

The blog discusses two contrasting papers on large language models (LLMs): one proposes a "Re-Reading" method to enhance reasoning capabilities, showing consistent improvements in performance, while the other, GSM-Symbolic, critiques LLMs' reasoning abilities, revealing significant performance variance and limitations in mathematical reasoning. The author concludes that it's too early to declare LLMs incapable of reasoning, suggesting that current limitations may evolve.

Oct 12, Notes on Re-Reading & GSM-Symbolic 

This document summarizes my experiences in both work and personal life during September.

Oct 11, Recap for September

Google has announced significant updates to their production-ready Gemini models: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.

Sep 25, Notes on Gemini models

A Markov Decision Process (MDP) is a mathematical framework used for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. 

Sep 23, Markov Decision Process   

This blog post will explain the derivation of the Bellman Equation, a fundamental concept in dynamic programming used to solve optimization problems.

Sep 19, Bellman Equation

The Qwen Team has released the new Qwen2.5 series models, potentially the largest open-source release in history.

Sep 19, Notes on Qwen2.5

Bayes’ theorem is a fundamental concept in probability theory that describes how to update our beliefs about an event based on new evidence. 

Sep 18, Bayes’ Theorem
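
For reference, the update described in the summary above is usually written as follows, where $P(A)$ is the prior belief in event $A$, $P(B \mid A)$ is the likelihood of the evidence $B$ under $A$, and $P(A \mid B)$ is the updated (posterior) belief. The symbols $A$ and $B$ are chosen for this note and may differ from the notation used in the post.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```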

OpenAI has introduced its new o1 series models, which are large language models trained utilizing reinforcement learning techniques to enhance complex reasoning capabilities.

Sep 13, Notes on OpenAI o1 series models

This blog post offers a personal evaluation of two recently released language models: DeepSeek-V2.5 and Reflection-70b. 

Sep 9, Test DeepSeek-V2.5 and Reflection-70b

In this blog, I will share some notes and thoughts about learning the Anthropic Prompt Tutorial. Here is the link to the tutorial.

Sep 3, Notes on Anthropic Prompt Tutorial

In August, I focused on fine-tuning the Qwen2-7b model and evaluating its performance on our private benchmark consisting of over 200 questions and answers. I evaluated various large language models (LLMs) like GPT-4, Gemini 1.5-Pro, and Llama 3-405b on this benchmark to compare their capabilities in areas such as reasoning, coding, and commonsense.

Sep 1, Recap for August

In this blog post, I'll demonstrate how to use LoRA to train a model that generates images in your unique personal style. It's surprisingly simple! 

Aug 26, Flux + LoRA 

This post builds upon my previous blog post on GPT-4o-mini's performance on MMLU-Pro using BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna. In this continuation, I examine the newly introduced optimizers, MIPRO and MIPROv2, to assess how much they can improve GPT-4o-mini's performance.

Aug 21, GPT-4o-mini with DSPy MIPRO on MMLU-Pro 

This concise tutorial, sourced from Anthropic's official GitHub, will guide you on using Claude3 to summarize web page content. Unlike the official tutorial, this one utilizes the model claude-3-5-sonnet-20240620 and uses content from my personal web page as an example to send to the LLM.

August 19, Summarize Web Page Content with Claude3

More researchers are recognizing the significance of instruction data during the Supervised Fine-Tuning (SFT) stage. In June, I wrote a blog about data generation, but I believe it was somewhat superficial and insufficient. Since then, many new methods have emerged. Therefore, I aim to cover more papers I've read to discuss instruction data generation and selection.  

August 17, Instruction Data Generation  

In July, I helped my team build a confidential LLM benchmark tailored to our needs due to contamination in public benchmarks. Despite claims, I haven't seen LLMs surpass GPT-4 in practice. Constructing the test set was challenging, and I learned about LLM-as-a-Judge for evaluation. Personally, I experimented with Midjourney, TextGrad, Dify, and DSPy, documenting my experiences in blog posts. Additionally, I started preparing for the PTE exam, aiming for a high score on August 8.

August 1, Recap for July 

With the rapid development of LLMs, the community requires an efficient and accurate method to automatically evaluate LLM performance, as human annotation is tedious and time-consuming. LLM-as-a-Judge is now an optimized solution for this need.

July 31, LLM/VLM-as-a-Judge

DSPy is an optimization framework that enhances prompts and responses from models like GPT-4o-mini. This post showcases the framework and demonstrates how to use its optimizers to improve this cost-effective model. MMLU-Pro is an advanced dataset with more complex questions and more answer choices. The evaluation metric checks whether the model's response matches the true answer.

July 23, DSPy with GPT-4o-mini on MMLU-Pro
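
The kind of answer-matching metric described in the summary above could look roughly like the sketch below. It follows the `metric(example, pred, trace=None)` calling convention that DSPy optimizers expect, but the function names, the `answer` field, and the normalization rule are assumptions made for illustration and not the code from the post. `SimpleNamespace` stands in for DSPy's example and prediction objects so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def normalize(text):
    """Strip whitespace and ignore case so 'c ' and 'C' compare equal."""
    return text.strip().upper()


def exact_match_metric(example, pred, trace=None):
    """Return True when the predicted answer matches the gold answer."""
    return normalize(example.answer) == normalize(pred.answer)


# Hypothetical multiple-choice item with gold answer "C".
example = SimpleNamespace(question="placeholder question", answer="C")
print(exact_match_metric(example, SimpleNamespace(answer=" c ")))
print(exact_match_metric(example, SimpleNamespace(answer="B")))
```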

In this short blog, I will test Chameleon, the newest multimodal model from Meta. The baseline models I will choose are GPT-4o, Gemini-1.5-pro, Yi-vision and Yi-Vision-with-TextGrad.  

July 23, Test with Chameleon From Meta 

Evaluating LLMs is important for understanding their abilities and solving real business problems. A good evaluation requires sufficient and high-quality data samples, clear judging criteria, meaningful evaluation tasks, and frequent private benchmarks. The process should adapt to the development of LLMs over time.

July 16, LLMs Evals Thoughts

As the capabilities of Large Language Models (LLMs) continue to evolve, many traditional evaluation benchmarks may require updates. With the rapid progress of these models, researchers are increasingly introducing new evaluation datasets. However, the specific dimensions these datasets assess in the models are often unclear. In this blog, I will explore a series of commonly referenced evaluation datasets and highlight the particular aspects of model capabilities they were designed to assess even though I may not cover all available datasets. 

July 5, LLMs Evaluation Benchmarks   

Midjourney provides a platform for exploring different artistic styles and techniques. Whether you're a seasoned artist or a beginner, the tool offers a wide array of options to experiment with and refine your artistic vision. Users can blend various elements, adjust parameters, and see real-time changes, giving them a unique and interactive experience.

July 7, Weekend with Midjourney 

DSPy is a framework developed by Stanford for programming with Large Language Models (LLMs) and automatically optimizing their prompts and weights. DSPy can enhance the reliability of any model, whether it's GPT-4, LLaMA3 or Mistral, for any task you require.

June 30, DSPy

Inspired by Nezhurina et al. (2024), I use similar questions to evaluate the reasoning capabilities of various leading language models, so this blog will resemble a test report. The test is very subjective, so if the outcome does not meet your expectations, just take it in stride.

June 29, Alice in Wonderland Test

This month has been emotionally intense, marked by a series of intriguing and unfortunate events. Many instances sparked curiosity and inspiration, while others sadly brought about sorrow and anger. It's truly been a month full of diverse experiences.

June 27, Recap for June

TextGrad is an innovative autograd engine tailored for textual gradients. As a robust framework, it facilitates automatic differentiation via text and implements backpropagation using feedback provided by advanced Large Language Models (LLMs), firmly anchored in the gradient metaphor.

June 26, TextGrad 

Many studies have shown that fine-tuning can bring out large language models' ability to follow instructions and generalize to more tasks. However, relying only on manually written instruction data consumes a lot of human effort, and the quantity is limited. Therefore, it is essential to explore automatic methods for generating instruction data.

June 5, Synthetic Instruction Data Generation 

I recently read the official documentation of CrewAI and AutoGen, and I think agents are very cool.

June 4, Pieces of Thoughts in June

This month at work I mainly focused on a few tasks. I explored more possibilities for Prompt Chain, including using it to write stories, and found that it can generate a pretty good story.

May 30, Recap for May

Prompting has always been a topic of controversy. While some consider it insignificant and lacking in technical substance, others regard it as the crux of effectively utilizing large language models. Learning how to write prompts can unlock the vast potential of these models.

May 24, Prompt Engineering    

The infinite possibilities of AI and Mathematics converge with the boundless creativity of art