
1-Introduction

More researchers are recognizing the significance of instruction data during the Supervised Fine-Tuning (SFT) stage. In June, I wrote a blog about data generation, but I believe it was somewhat superficial and insufficient. Since then, many new methods have emerged. Therefore, I aim to cover more papers I've read to discuss instruction data generation and selection.
There are generally two ways to obtain instruction data: through human annotation or by using automatically generated data with LLMs. However, traditional annotation can be quite costly. As a result, many researchers are focusing on methods to automatically create high-quality synthetic instruction data.
In this blog, I will discuss automated methods for generating instruction data.

2-Seed Data and Prompt Engineering

2-1-Self-instruct

Most research on synthetic instruction data generation combines seed data with prompt engineering. Researchers prepare the seed data and use prompt techniques with LLMs to generate more instruction data.
According to my limited exploration, Wang et al. (2022) first presented the semi-automated process, named self-instruct, for instruction-tuning a pretrained LLM. The process is as follows:
Figure1. A high-level overview of SELF-INSTRUCT (Wang et al., 2022)
As shown, there are four steps in this process.
  • Creating task instructions. Self-instruct creates new instructions from a limited set of initial human-written instructions through a bootstrapping method. The authors compile a task pool containing 175 tasks, each with one instruction and one instance. At each step, they select 8 task instructions from this pool to use as in-context examples. To maintain diversity, 6 of the 8 instructions come from human-written tasks, while 2 come from model-generated tasks from previous steps. The prompt simply lists the sampled instructions as in-context examples and asks the model to continue with a new task (see the sketch after this list).
  • Identifying whether the instruction describes a classification task. The LLM is prompted in a few-shot manner to determine whether the generated instruction is a classification task. The authors include 12 classification instructions and 19 non-classification instructions from the seed tasks as examples in the prompt.
  • Generating instances using input-first or output-first methods. There are two ways to generate instances. The first is the Input-first Approach. Here, the authors ask the LLM to first come up with the input fields based on the instruction, and then produce the corresponding output. However, they find that this method can sometimes generate inputs biased toward one label, especially for classification tasks. Therefore, they choose the second approach for classification tasks, which is the Output-first Approach. In this method, they first generate the possible class labels and then condition the input generation on each class label. The authors apply the output-first approach for classification tasks and the input-first approach for non-classification tasks.
  • Filtering out low-quality data. Ensuring the quality of the dataset is crucial. The authors assess the similarity between newly generated instructions and those in the task pool using ROUGE-L as the metric. Additionally, they exclude instructions containing specific keywords that LLMs cannot handle.
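To make the bootstrapping loop above concrete, here is a minimal sketch in Python. It assumes a hypothetical call_llm helper for querying the generator model and uses the open-source rouge-score package for the ROUGE-L filter; the prompt wording and the 0.7 similarity threshold follow my reading of the paper rather than the authors' released code.

```python
import random
from rouge_score import rouge_scorer  # pip install rouge-score

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to the generator LLM and return its completion."""
    raise NotImplementedError

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate: str, task_pool: list[str], max_rouge_l: float = 0.7) -> bool:
    """Keep a new instruction only if its ROUGE-L overlap with every pooled instruction is low."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < max_rouge_l
        for existing in task_pool
    )

def bootstrap_step(human_tasks: list[str], model_tasks: list[str]) -> None:
    """One self-instruct step: 6 human-written + 2 model-generated instructions as in-context examples."""
    examples = random.sample(human_tasks, 6)
    examples += random.sample(model_tasks, min(2, len(model_tasks)))
    random.shuffle(examples)
    prompt = "Come up with a series of tasks:\n" + "\n".join(
        f"Task {i + 1}: {t}" for i, t in enumerate(examples)
    ) + f"\nTask {len(examples) + 1}:"
    candidate = call_llm(prompt).strip()
    if candidate and is_novel(candidate, human_tasks + model_tasks):
        model_tasks.append(candidate)
```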
Using self-instruct, the authors generate a dataset of 52K instructions with GPT-3; the same recipe was later used to build the well-known “Alpaca” dataset with text-davinci-003. The following is an examination of the dataset:
Figure3. Dataset Overview (Wang et al., 2022)
As noted, most of the generated instructions are meaningful, although the instances may contain some noise. They still offer valuable guidance for training models to follow instructions.

2-2-Evol-Instruct

Xu et al. (2023) introduced a new method called "Evol-Instruct" for generating complex instruction data, building on this "Alpaca" dataset. The pipeline of Evol-Instruct is shown below:
Figure4. Overview of Evol-Instruct (Xu et al., 2023)
The pipeline of the instruction evolution includes three steps:
  • Instruction Evolution. This stage focuses on using LLMs to refine instructions, making them more detailed and challenging with precise prompts. It also involves creating entirely new instructions that are complex and unique. There are two types of instruction evolution: in-depth evolving and in-breadth evolving. In-depth evolving enhances instructions by making them more complex and difficult through five types of prompts: adding constraints, deepening, concretizing, increasing reasoning steps, and complicating input. The main goal of in-depth evolving’s prompt is: “Your objective is to rewrite a given prompt into a more complex version to make AI systems like ChatGPT and GPT-4 find it a bit harder to handle. However, the rewritten prompt must still be reasonable, understood, and responded to by humans.” Here are examples of the four types of prompt templates, excluding the complicating input prompt for simplicity.
    • Figure5. In-depth evolving prompt template (Xu et al., 2023)
      In-breadth evolving aims to enhance topic coverage, skill coverage, and overall dataset diversity, since many open-domain instruction datasets are small in scale and lack topic and skill diversity. The prompt designed to generate a completely new, more long-tailed instruction based on a given instruction is as follows:
      I want you act as a Prompt Creator.
Your goal is to draw inspiration from the #Given Prompt# to create a brand new prompt.
This new prompt should belong to the same domain as the #Given Prompt# but be even more rare.
The LENGTH and difficulty level of the #Created Prompt# should be similar to that of the #Given Prompt#.
The #Created Prompt# must be reasonable and must be understood and responded by humans.
‘#Given Prompt#’, ‘#Created Prompt#’, ‘given prompt’ and ‘created prompt’ are not allowed to appear in
#Created Prompt#.
#Given Prompt#:
<Here is instruction.>
#Created Prompt#:
  • Response Generation. In this stage, we use the same LLM to generate responses for the evolved instructions. The generation prompt is straightforward: “<Here is instruction>”.
  • Elimination Evolution. This stage filters out instructions whose evolution has failed. Four scenarios are treated as failures in the evolution process (a sketch of these checks follows this list):
    • The revised instruction does not offer additional information compared to the original. The authors utilize ChatGPT to make the decision, as shown in the following prompt:
        Here are two Instructions to ChatGPT AI, do you think they are equal to each other, which meet the following requirements:
        1. They have same constraints and requirments.
        2. They have same depth and breadth of the inquiry.
        The First Prompt: <Here is first instruction.>
        The Second Prompt: <Here is second instruction.>
        Your Judgement (Just answer: Equal or Not Equal. No need to explain the reason.):
  • The updated instruction complicates the LLM's ability to generate a response. The authors exclude responses that contain "sorry" and are relatively short.
  • The response generated by the LLM consists only of punctuation and stop words.
  • The updated instruction clearly copies some words from the evolving prompt, such as “given prompt”, “rewritten prompt”, “#Rewritten Prompt#”, etc.
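These elimination checks are straightforward to approximate in code. Below is a minimal sketch, not the authors' implementation: llm_judge stands in for the ChatGPT equality prompt shown above, the stop-word list is only an illustrative subset, and the 80-word cutoff for "relatively short" responses is an assumption on my part.

```python
import string

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}  # illustrative subset
EVOLVING_PROMPT_WORDS = ["given prompt", "rewritten prompt", "#rewritten prompt#"]

def only_punct_or_stopwords(text: str) -> bool:
    """True if the text contains nothing beyond punctuation and stop words."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    return all(w == "" or w in STOPWORDS for w in words)

def failed_evolution(original: str, evolved: str, response: str, llm_judge) -> bool:
    """Apply the four elimination criteria; llm_judge returns 'Equal' or 'Not Equal'."""
    # 1. The evolved instruction adds no information over the original.
    if "not equal" not in llm_judge(original, evolved).lower():
        return True
    # 2. The evolved instruction is too hard for the LLM: an apologetic, short response.
    if "sorry" in response.lower() and len(response.split()) < 80:
        return True
    # 3. The response consists only of punctuation and stop words.
    if only_punct_or_stopwords(response):
        return True
    # 4. The evolved instruction copies words from the evolving prompt itself.
    if any(w in evolved.lower() for w in EVOLVING_PROMPT_WORDS):
        return True
    return False
```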

2-3-Instruction Back-Translation

Occasionally, you may have access to a large corpus but only a limited amount of seed data. How can you leverage these resources to generate a higher quality instruction dataset? Li et al. (2023) introduced a method known as instruction back-translation. The workflow of this method is as follows:
Figure6. A workflow of instruction back-translation (Li et al., 2023)
As noted, this process involves two core steps. First, we need to prepare: seed data, unlabelled data (e.g., the web corpus), and a base model.
  • Self-augmentation. In this step, a backward model is trained on the seed data to predict instructions from outputs; it is then used to generate candidate instructions for the unlabelled data, creating candidate training pairs for instruction tuning.
  • Self-curation. At this stage, the model trained with the seed data is used to select high-quality data. This involves prompting the trained model to rate the quality of a candidate pair on a 5-point scale. The specific prompt used is shown in Figure 7, and a small sketch of this filtering step follows it. The process is iterative, allowing an improved intermediate instruction-following model to enhance data selection for fine-tuning in subsequent iterations.
    • Figure7. A prompt used in the self-curation step to evaluate the quality of a candidate pair (Li et al., 2023)
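As a rough illustration of self-curation, the sketch below scores each candidate (instruction, output) pair with the intermediate model and keeps only highly rated ones. The rating-prompt wording and the "Score: N" output format are my own simplification (the actual prompt is the one in Figure 7), call_llm is a hypothetical helper, and the threshold of 4 mirrors the idea of keeping only top-rated pairs.

```python
import re

def rate_pair(instruction: str, output: str, call_llm) -> int:
    """Ask the seed-trained model to rate a candidate pair on a 1-5 scale."""
    prompt = (
        "Rate the following instruction-answer pair from 1 to 5 for how useful it would be "
        "as a fine-tuning example. Reply in the form 'Score: <1-5>'.\n\n"
        f"Instruction: {instruction}\nAnswer: {output}"
    )
    match = re.search(r"Score:\s*([1-5])", call_llm(prompt))
    return int(match.group(1)) if match else 0

def self_curate(candidates, call_llm, min_score=4):
    """Keep only candidates rated at least min_score; rerun with the improved model each iteration."""
    return [c for c in candidates
            if rate_pair(c["instruction"], c["output"], call_llm) >= min_score]
```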
Zheng et al. (2024) also utilize the concept of "instruction back-translation" to create an advanced data generation pipeline. They develop a more sophisticated process to enhance and refine data. The process is as follows:
Figure8. Answer Polish for Instruction Back-Translation (Zheng et al., 2024)
It looks similar to the method presented by Li et al. (2023) but has some differences. Let’s explore their methods. There are two main steps in this process.
  • Supervised Fine-Tuning (SFT) with High-Quality Seed Data: This step applies SFT to the base model with high-quality seed data to create two models: the label model, used to annotate primary data, and the primary chat model, used to improve data quality.
  • Quality Assessment and Refinement in Primary Chat: The primary chat evaluates and refines the label model’s output. This iterative process generates a substantial amount of high-quality data, which is crucial for the primary chat’s further training. It ultimately results in a high-performance final chat model, extensively trained with superior data.
Zheng et al. (2024) also test two different filtering methods for selecting high-quality data, since not all candidate labeled data are of high quality. Each filtering method has its own strengths and weaknesses:
  • Comprehensive Scoring of Labeled Data: This approach evaluates the entire labeled dataset using a combined score that covers both instructions and outputs. Unfortunately, good outputs can be discarded because of poor instructions from the label model, and vice versa, leading to unnecessary exclusions. This results in inconsistent data quality, which can adversely affect further training.
  • Focused Scoring of the Instruction Component: This technique evaluates only the instruction part (i.e., the instruction produced by the label model). High-scoring instructions are selected, and then the output part of the chosen data is refined. However, because the outputs themselves are not evaluated, suitable instructions can end up paired with unsuitable outputs. To address this, the primary chat model can be used to evaluate and refine both the instructions and the outputs, ensuring they align effectively. The scoring and refinement prompts can be found in the paper.
Nguyen et al. (2024) also utilize the back-translation concept to develop a data generation pipeline, but they introduce some enhancements. The following is an overview of the pipeline:
Figure10. Overview of the pipeline (Nguyen et al., 2024)
As illustrated in Figure 10, there are three core steps in this pipeline: (1) Backtranslation, (2) Filtering, and (3) Rewriting. The first two steps, (1) and (2), are based on the work of Li et al. (2023), while the third step, (3), is similar to Zheng et al. (2024). The authors prompt an aligned LLM, Llama2-70B-chat to rewrite the response to improve its quality. The full rewriting prompt is below:
Figure11. Rewriting Prompt (Nguyen et al., 2024)
The authors also design experiments to evaluate the quality of the rewritten data. They begin by using the MAUVE score to quantify the distributional differences among three sets of responses: initial web-scraped responses, rewritten responses, and responses distilled from Llama2-70B-Chat. Originally, MAUVE was designed to measure the gap between machine-generated and human-generated texts. A higher MAUVE score indicates greater similarity in text distributions. The results are shown in Figure 12:
Figure12. Rewrite responses performance (Nguyen et al., 2024)
As observed, the rewritten responses share some similarities with the distilled responses, yet there remains a significant gap between them. This indicates that the rewriting process is quite different from distillation. The authors also compare the empirical performance of fine-tuning on rewritten data versus distilled data. For the distilled data, they use 25.6K instructions randomly sampled from their filtered backtranslated dataset and let the Llama-2-70B-chat model answer directly. For the rewritten data, they use the same 25.6K instructions and prompt the Llama-2-70B-chat model to rewrite the corresponding web-scraped responses. The results are as follows:
Figure13. Performance of fine-tuning Llama-2-70B (Nguyen et al., 2024)
Fine-tuning a Llama-2-70B model on distilled responses results in a lower win rate compared to fine-tuning on rewritten texts. This indicates that the rewriting process enhances the overall quality of response data, not just extracting existing knowledge from the LLM.
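For readers who want to run this kind of comparison themselves, the open-source mauve-text package implements the MAUVE metric. The sketch below is a minimal usage example with placeholder response lists; the featurizer model and maximum text length are illustrative defaults, not necessarily the settings the authors used.

```python
# pip install mauve-text
import mauve

# Placeholders: in practice these are the rewritten and distilled responses
# produced for the same set of instructions.
rewritten_responses = ["response text ...", "another response ..."]
distilled_responses = ["response text ...", "another response ..."]

# MAUVE embeds both sets with an LM featurizer and compares the two distributions;
# a higher score means the distributions are more similar.
out = mauve.compute_mauve(
    p_text=rewritten_responses,
    q_text=distilled_responses,
    featurize_model_name="gpt2",  # assumed default featurizer
    max_text_length=512,
    verbose=False,
)
print(out.mauve)
```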
The FANNO framework, designed by Zhu et al. (2024), is another instruction data generation pipeline similar to the methods mentioned above. There are three pivotal steps: document pre-screen, instruction generation, and response generation. The following figure shows how it works:
Figure14. Overview of the FANNO framework (Zhu et al., 2024)
Document Pre-Screen. This stage involves segmentation, deduplication, and length-based filtering. The authors also use a teacher LLM and a detection algorithm to improve accuracy and diversity; the LLM-based filter handles ambiguous content, privacy issues, and advertisements.
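A rough sketch of this pre-screening stage might look as follows. The segmentation rule, word-count bounds, and the optional llm_filter callback are all assumptions for illustration, not FANNO's actual implementation.

```python
def pre_screen(documents, min_words=30, max_words=512, llm_filter=None):
    """Segment raw documents, drop duplicates, and length-filter the resulting chunks."""
    chunks = []
    for doc in documents:
        # Naive segmentation by blank lines; a real pipeline would use smarter chunking.
        chunks.extend(p.strip() for p in doc.split("\n\n") if p.strip())

    seen, kept = set(), []
    for chunk in chunks:
        if not (min_words <= len(chunk.split()) <= max_words):
            continue                      # length-based filtering
        key = " ".join(chunk.lower().split())
        if key in seen:
            continue                      # exact-duplicate removal
        seen.add(key)
        # Optional teacher-LLM filter for ads, privacy issues, and ambiguous content.
        if llm_filter is not None and not llm_filter(chunk):
            continue
        kept.append(chunk)
    return kept
```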
Instruction Generation. This stage consists of two phases: seed instruction generation and instruction augmentation. During the seed instruction generation phase, a variety of initial seed instructions are produced. Diversity is emphasized from two angles: task types and difficulty levels, with corresponding tags manually created by the authors. An LLM-based filter is then employed to ensure the quality of the seed instruction data. The prompts are shown below:
Figure15. Prompts for instruction generation filter (Zhu et al., 2024)
For the instruction augmentation step, the authors recognize that the diversity of instructions in the seed pool is inherently limited. Therefore, they have designed a new prompt template called Think Different, as shown below:
Figure16. Prompts for Think Differently (Zhu et al., 2024)
This prompt template guides the teacher LLM to produce high-quality instructions that match the quality of the example while varying in format (such as task types and questioning styles). Additionally, a document is included in this template to ensure the generated instructions align with or build upon the content of the document.
Response Generation. In this stage, the response to each instruction is created by prompting the teacher LLM, either with an empty context or a retrieved document. The authors suggest using retrieval augmented generation (RAG) to include the relevant document, providing extra information for response creation. Subsequently, the LLM itself is used to choose the highest quality response. The prompt templates for response generation and selection are as follows:
Figure17. Question, Document to Answer (Zhu et al., 2024)
Figure18. Question to Answer (Zhu et al., 2024)
Figure19. Prompt for Faithfulness Evaluation (Zhu et al., 2024)

2-4-Only Prompt Techniques

Most of the time, we lack seed data or a substantial web corpus and rely solely on prompts for data synthesis. So, how do we handle this situation?
Chen et al. (2024) provide an answer. They design a series of prompts to generate millions of instructions, considering various prompt types in increasing complexity. They rigorously analyze the diversity these prompts produce. Below are the prompts they designed for their study.
Figure20. A study of Generator Prompts (Chen et al., 2024)
The static prompt tends to produce many identical outputs. To address this, the authors prepare a list of topics and then use the static-conditional prompt, which is conditioned on a random topic. To further enhance randomness and prevent the model from collapsing onto a single mode for each topic, they use the generator-conditional prompt, also conditioned on a random topic. However, the generator-conditional prompt limits the range of possible topics, so they introduce the generator-nested prompt to ensure randomness. This prompt, however, has a potential drawback: the model sees the selected indices before generating the list, which may influence the order of the listed items. As a final step, they design the generator-uniform prompt.
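As an illustration of the generator-conditional idea, the sketch below fills a topic-conditioned meta-prompt (modeled on the example meta-prompts quoted further below) and parses the "Question:"/"Answer:" output. The topic list and the call_llm helper are placeholders of my own.

```python
import random
import re

TOPICS = ["Crime and deviance", "High School European History", "Evolution"]  # illustrative list

META_PROMPT = (
    "List 40 subtopics in the domain of {topic}. State subtopic {i}. Then write a question "
    "that is not about subtopic {i}, but can only be answered with expertise in subtopic {i}, "
    "and then write the answer. Both the question and answer should be long. The name of the "
    "subtopic should not appear in the question. Begin your questions with \"Question:\" and "
    "your answer with \"Answer:\"."
)

def generate_qa(call_llm):
    """Fill the generator-conditional meta-prompt with a random topic and subtopic index."""
    prompt = META_PROMPT.format(topic=random.choice(TOPICS), i=random.randint(1, 40))
    reply = call_llm(prompt)
    match = re.search(r"Question:(.*?)Answer:(.*)", reply, flags=re.DOTALL)
    if match is None:
        return None  # malformed generation; skip this sample
    return {"instruction": match.group(1).strip(), "response": match.group(2).strip()}
```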
Here are some examples of data generated by these prompts with Gemini 1.0 Pro.
Figure21. An example of the data (Chen et al., 2024)
Figure22. An example of the data (Chen et al., 2024)
Figure23. An example of the data (Chen et al., 2024)
Figure21 is based on the topic "Crime and deviance," which was randomly selected during generation. The meta-prompt used to create this figure is as follows:
List 40 subtopics in the domain of Crime and deviance. State subtopic 14. Then write a question that is not about subtopic 14, but can only be answered with expertise in subtopic 14, and then write the answer. Both the question and answer should be long. The name of the subtopic should not appear in the question. Begin your questions with "Question:" and your answer with "Answer:".
Figure22 and Figure23 are generated with meta-prompts that target topics found in MMLU. Please note that while these questions pertain to MMLU topics, they are not necessarily formatted in the MMLU style and should not be considered representative of MMLU questions. The meta-prompts used to generate these two questions are as follows:
List 40 subtopics in the domain of High School European History. State subtopic 25. Then write a question that is not about subtopic 25, but can only be answered with expertise in subtopic 25, and then write the answer. Both the question and answer should be long. The name of the subtopic should not appear in the question. Begin your questions with "Question:" and your answer with "Answer:". Be creative and don’t ask the first thing you think of.
List 40 subtopics in the domain of Evolution. Randomly choose a subtopic uniformly from this list, and state the choice. Then write a long complex multiple-choice question that is not about the subtopic, but can only be answered with expertise in the subtopic. The question should end with a list of choices. Then write the answer, followed by an explanation of your choice. The name of the subtopic should not appear in the question. Begin your questions with "Question:" and your answer with "Answer:". Don’t ask the first thing you think of.
Chan et al. (2024) present a different approach from that of Chen et al. (2024): they combine prompting with 1,000,000,000 personas to generate a substantial instruction dataset.
There are two scalable approaches to create diverse personas for constructing a Persona Hub from extensive web data: Text-to-Persona and Persona-to-Persona.
Text-to-Persona. The authors prompt the LLM to produce detailed persona descriptions. The granularity of these descriptions can be controlled through the prompt and influenced by the input texts. Here is an example:
Figure24. An example of Text-to-Persona (Chan et al., 2024)
Figure25. An example of Text-to-Persona (Chan et al., 2024)
As noted, when the input text is detailed and specific, the resulting persona description will also be precise and thorough.
Persona-to-Persona. This method is viewed as a supplement to Text-to-Persona. Even though Text-to-Persona is a highly scalable method that can synthesize personas covering almost every aspect, it may still miss personas with low visibility on the web. Persona-to-Persona therefore derives new personas through interpersonal relationships with those obtained via Text-to-Persona. The following is an example:
Figure26. An example of Persona-to-Persona (Chan et al., 2024)
As shown above, the persona about a child can be derived from the persona of a nurse at a children’s hospital.
Before creating persona-driven synthetic data, the authors deduplicate the personas in two ways (a MinHash-based sketch follows this list):
  • MinHash-based Deduplication
  • Embedding-based Deduplication.
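Here is a minimal MinHash-based deduplication sketch using the datasketch library. The 0.9 similarity threshold and word-level shingling are illustrative choices and may differ from the paper's exact settings.

```python
# pip install datasketch
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from the persona's lower-cased word set."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

def dedup_personas(personas, threshold=0.9, num_perm=128):
    """Drop personas whose estimated Jaccard similarity with an already-kept one exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, persona in enumerate(personas):
        signature = minhash_of(persona, num_perm)
        if lsh.query(signature):          # a near-duplicate persona is already indexed
            continue
        lsh.insert(f"persona-{i}", signature)
        kept.append(persona)
    return kept
```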
After deduplication, creating persona-driven synthetic data becomes straightforward. The authors use three prompting methods, as shown in Figure 27:
Figure27. 0-shot, few-shot and persona-enhanced few-shot prompting methods (Chan et al., 2024)
  • Zero-shot prompting leverages the model's creativity by not using any existing examples, allowing for unrestricted innovation (a minimal sketch follows this list).
  • Few-shot prompting ensures the synthesized data meets requirements by providing a few illustrative examples.
  • Persona-enhanced few-shot prompting effectively enhances the model's persona-driven data synthesis capabilities, though it requires identifying the corresponding persona for each example in the few-shot prompt in advance.
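As a toy illustration of the zero-shot variant, the sketch below simply pairs a persona description with a task description; the exact wording of the paper's templates differs, so treat both strings as assumptions.

```python
def persona_zero_shot_prompt(persona: str, task: str = "Create a challenging math problem.") -> str:
    """Zero-shot, persona-driven prompt: no examples, just the task plus the persona."""
    return f"{task}\n\nAssume you are the following person while writing it:\n{persona}"

# Example with a persona that could come from Text-to-Persona (hypothetical):
print(persona_zero_shot_prompt("A pediatric nurse who volunteers at a children's coding club"))
```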
As this work focuses on generating new synthetic data rather than synthesizing solutions, the authors use gpt-4o to generate solutions for the created problems.
The two methods presented by Chen et al. (2024) and Chan et al. (2024) both involve prompting LLMs with specific inputs. In contrast, Xu et al. (2024) introduce an innovative method, MAGPIE, to generate high-quality instruction data by prompting aligned LLMs without any input.
MAGPIE consists of two steps:
  • Instruction Generation. This step generates an instruction for each piece of instruction data. Using an open-weight aligned LLM (such as Llama-3-70B-Instruct), MAGPIE crafts an input query consisting only of the pre-query portion of the LLM's predefined instruction template (for example, the tokens that open a user turn). This query specifies the role of the instruction provider (e.g., user) without giving any actual instruction. Because the auto-regressive LLM has been fine-tuned on instruction data formatted with this template, it generates an instruction on its own when MAGPIE's crafted query is fed in. MAGPIE stops the instruction generation once the LLM produces an end-of-sequence token. Sending the crafted query to the LLM multiple times produces a set of instructions (see the sketch after this list).
  • Response Generation. The aim of this step is to create responses based on the instructions from Step 1. MAGPIE forwards these instructions to the LLM to produce the relevant responses.
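Here is a rough sketch of both steps with Hugging Face transformers, assuming an open-weight Llama-3-style instruct model (Meta-Llama-3-8B-Instruct is used purely as an example). The pre-query string reflects my understanding of the Llama-3 chat template rather than MAGPIE's released code, and the sampling parameters are illustrative.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any open-weight aligned model with a chat template
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template: it opens a user turn but contains no actual instruction,
# so the aligned model completes it with a plausible user instruction on its own.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def generate_instruction(max_new_tokens=128):
    inputs = tok(PRE_QUERY, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=1.0,
                         eos_token_id=tok.convert_tokens_to_ids("<|eot_id|>"))
    # Keep only the newly generated tokens, i.e. the instruction itself.
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def generate_response(instruction, max_new_tokens=512):
    prompt = tok.apply_chat_template([{"role": "user", "content": instruction}],
                                     tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

instruction = generate_instruction()
print({"instruction": instruction, "response": generate_response(instruction)})
```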
The MAGPIE workflow is outlined below:
Figure28. MAGPIE workflow (Xu et al., 2024)
One key advantage of MAGPIE is its ease of extension for generating multi-turn instruction datasets and preference datasets. Additionally, it allows us to specify the task requested by the instructions. The full prompt for building the instructions of multi-turn datasets is as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant. The user will engage in a multi-round conversation with you,
asking initial questions and following up with additional related questions. Your goal is
to provide thorough, relevant and insightful responses to help the user with their
queries.<|eot_id|><|start_header_id|>user<|end_header_id|>
{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{response}<|eot_id|><|start_header_id|>user<|end_header_id|>
For the preference datasets, MAGPIE integrates responses generated by the instruct model with those from the base model. A preference dataset can be created by designating the response from the instruct model as the preferred response, and the response from the base model as the less preferred one.
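Assembling the preference records is then mostly bookkeeping. Here is a minimal sketch assuming the common prompt/chosen/rejected schema used by DPO-style trainers; the paper's exact field names may differ.

```python
def build_preference_pair(instruction: str, instruct_model_answer: str, base_model_answer: str) -> dict:
    """MAGPIE-style preference example: the aligned model's answer is preferred ('chosen'),
    the base model's answer for the same instruction is dispreferred ('rejected')."""
    return {
        "prompt": instruction,
        "chosen": instruct_model_answer,
        "rejected": base_model_answer,
    }
```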

3-Conclusion

In this blog, I explore some new methods of synthetic data generation that are straightforward to understand and implement. I find these methods particularly useful for my own fine-tuning work. While this blog doesn't cover every method available, I encourage you to share your insights and comments. I look forward to reading them.
Synthetic data generation is crucial during the post-training stage for training an LLM, as demonstrated by models like Qwen2 and Llama-3. Therefore, continue to explore more possibilities for generating data!

4-References

  1. Wang, Kordi, et al. “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” arXiv preprint arXiv:2212.10560 (2022)
  2. Xu, Sun, et al. “WizardLM: Empowering Large Language Models to Follow Complex Instructions.” arXiv preprint arXiv:2304.12244 (2023)
  3. Li, Yu, et al. “Self-Alignment with Instruction Backtranslation.” arXiv preprint arXiv:2308.06259 (2023)
  4. Zheng, Guo, et al. “Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation.” arXiv preprint arXiv:2401.06477 (2024)
  5. Nguyen, Li, et al. “Better Alignment with Instruction Back-and-Forth Translation.” arXiv preprint arXiv:2408.04614 (2024)
  6. Chen, Qadri, et al. “GenQA: Generating Millions of Instructions from a Handful of Prompts.” arXiv preprint arXiv:2406.10323 (2024)
  7. Chan, Wang, et al. “Scaling Synthetic Data Creation with 1,000,000,000 Personas.” arXiv preprint arXiv:2406.20094 (2024)
  8. Xu, Jiang, et al. “Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing.” arXiv preprint arXiv:2406.08464 (2024)
  9. Zhu, Su, et al. “FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only.” arXiv preprint arXiv:2408.01323 (2024)