Table of Contents
- 1-Self-Instruct
- 2-Instruction Backtranslation
- 3-Wizard LM
- 4-GenQA
- 5-Nemotron-340B-Technical Report
- 6-Summary
Many studies have shown that fine-tuning on instruction data can stimulate a large language model's ability to follow instructions and generalize to more tasks. However, relying only on manually handwritten instruction data consumes a lot of human effort, and the quantity is limited. Therefore, it is essential to explore automatic methods for generating instruction data.
In this blog, I will review some of the main methods for synthesizing data.
1-Self-Instruct
Self-Instruct (Wang et al. 2022) is a semi-automated process for data generation. The process is as follows:
- Instruction Generation In the first step, we prepare a seed dataset comprising 175 tasks, with each task consisting of one instruction and one instance. This forms our task pool. We then sample eight tasks for few-shot examples to prompt the language model to generate instructions. These samples include six tasks that are human-written and two tasks derived from model-generated tasks in previous steps.
- Classification Task Identification In the second step, we must determine whether the generated instructions correspond to a classification task. To do this, we prompt the language model using 12 classification tasks and 19 non-classification tasks as few-shot examples.
- Instance Generation In the third step, there are two ways to generate instances: input-first and output-first.
- Input-first: Based on the instructions we provide, we can prompt a language model to generate the input field first, followed by the corresponding output. This approach closely mirrors how the model responds to instructions and inputs.
- Output-first: This method is designed for classification tasks. Unlike the input-first approach, which can bias the generated inputs toward a single label, the output-first approach first generates the potential class labels, then creates a corresponding input for each.
- Filtering and Postprocessing The final step involves significant post-processing of the generated data, and there are several strategies for handling this.
- A new instruction is accepted into the task pool only if its ROUGE-L similarity to every existing instruction is less than 0.7.
- Instructions containing certain keywords, such as 'images' or 'pictures', which language models cannot process, are excluded.
- Duplicate and otherwise problematic instances are filtered out.
The workflow follows an iterative bootstrapping algorithm, starting with a seed dataset of 175 cases. After several iterations, this method ultimately yields a dataset containing 52k instructions.
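The bootstrapping loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_fn` stands in for the few-shot LLM call, and the ROUGE-L score is approximated with Python's `difflib` rather than an exact longest-common-subsequence computation.

```python
import random
from difflib import SequenceMatcher

def rouge_l(a: str, b: str) -> float:
    # Approximate ROUGE-L F1: SequenceMatcher's matching blocks form a
    # common token subsequence (not guaranteed to be the longest one).
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = sum(m.size for m in SequenceMatcher(None, ta, tb).get_matching_blocks())
    p, r = lcs / len(tb), lcs / len(ta)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def self_instruct_step(task_pool, generate_fn, n_human=6, n_model=2, threshold=0.7):
    """One bootstrapping iteration: sample demos, generate, filter, pool."""
    human = [t for t in task_pool if t["source"] == "human"]
    model = [t for t in task_pool if t["source"] == "model"]
    demos = (random.sample(human, min(n_human, len(human)))
             + random.sample(model, min(n_model, len(model))))
    for new_inst in generate_fn(demos):
        # Keyword filter: skip instructions a text-only LLM cannot act on.
        if any(k in new_inst.lower() for k in ("image", "picture", "graph", "file")):
            continue
        # Novelty filter: accept only if ROUGE-L < threshold vs. every pooled task.
        if all(rouge_l(new_inst, t["instruction"]) < threshold for t in task_pool):
            task_pool.append({"instruction": new_inst, "source": "model"})
    return task_pool
```

In the real pipeline `generate_fn` would prompt the LLM with the sampled demonstrations; here it can be any callable returning candidate instruction strings.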
2-Instruction Backtranslation
"Instruction Backtranslation" (Li et al. 2023) is designed to construct a high-quality instruction-following model. While it primarily appears to be a model-building method, it also involves a unique approach to data generation. The process is as follows:
There are three steps in instruction back-translation, including the initialization stage.
- Initialization We should prepare a base model, such as LLaMA, along with some seed data (instruction and output pairs), and a large unlabelled corpus.
- Self-Augmentation During this stage, we fine-tune the base model using the seed data with the pairs reversed, i.e., on (output, instruction) pairs. This allows the model to predict an instruction from an output. We can then use this fine-tuned backward model to generate candidate instructions for the unlabelled data.
- Self-Curation In this step, the model selects high-quality data on its own. For the first iteration, there is an evaluator model, $M_0$, fine-tuned from the seed data. It scores each piece of generated augmented data to produce a score, $a_i$, on a 5-point scale. We then select the data with a score of $a_i = 5$ to form the augmented dataset $\mathcal{A}_k^{(1)}$. This dataset, along with the seed data, is used to fine-tune the base model and create a new model, $M_1$.
We repeat the self-curation process with $M_1$ and select high-quality data to form a new dataset $\mathcal{A}_k^{(2)}$. For the second iteration, we use the seed data and $\mathcal{A}_k^{(2)}$ to fine-tune $M_1$ and obtain a new model, $M_2$.
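The augment-then-curate loop can be sketched as below. Both callables are illustrative stand-ins: `predict_instruction` plays the backward model trained on (output, instruction) pairs, and `score_fn` plays the current model rating a candidate pair on a 1-5 quality scale.

```python
def instruction_backtranslation(seed, corpus, predict_instruction, score_fn, iterations=2):
    """Sketch of the self-augmentation + self-curation loop.

    seed: list of (instruction, output) pairs.
    corpus: list of unlabelled output texts.
    """
    # Self-augmentation: predict an instruction for every unlabelled document.
    augmented = [(predict_instruction(doc), doc) for doc in corpus]
    train = list(seed)
    for _ in range(iterations):
        # Self-curation: keep only the pairs the current model rates 5/5.
        curated = [pair for pair in augmented if score_fn(pair) == 5]
        train = list(seed) + curated
        # In the real pipeline, the next model M_{k+1} is fine-tuned on
        # `train` here, and score_fn is refreshed before the next round.
    return train
```

The fine-tuning step itself is elided; this sketch only shows how the curated set grows out of the seed data and the scored candidates.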
Instruction Backtranslation (Li et al. 2023) shows that good performance can be achieved with only two iterations, compared with other models.
3-Wizard LM
Wizard LM (Xu et al. 2023) proposes a new method named Evol-Instruct. It creates a large amount of instruction data at varying complexity levels using an LLM instead of human input. The process is as follows:
There are two main options for Evol-Instruct, which are In-Depth Evolving and In-Breadth Evolving.
- In-Depth Evolving For In-Depth evolution, the core strategy is to prompt the Large Language Model (LLM) to generate complex instructions. There are five types of prompts to accomplish this: adding constraints, deepening, concretizing, increasing reasoning steps, and complicating the input.
- In-Breadth Evolving In-Breadth Evolution aims to cover topic diversity, meaning it focuses on the overall diversity of the dataset.
There are also situations where instructions fail to evolve, which we need to eliminate.
- The evolved instruction merely copies words from the evolving prompt, such as "given prompt" or "rewritten prompt".
- Generated instructions can make it challenging for the LLM to respond, especially when the response includes words like "sorry" or when the responses are overly brief.
- The response generated by the LLM contains only stop words and punctuation.
- The updated instructions do not exhibit any enhancements or variations in comparison to the original instructions.
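The evolution step and the four elimination checks can be sketched as below. The prompt templates are paraphrases of the five In-Depth operations, not the paper's exact wording, and `llm` is a stand-in for the model call.

```python
import random

# Paraphrased templates for the five In-Depth Evolving operations.
IN_DEPTH_OPS = {
    "add_constraints": "Add one more constraint or requirement to: {instruction}",
    "deepening": "Increase the depth and breadth of: {instruction}",
    "concretizing": "Replace general concepts with more specific ones in: {instruction}",
    "increase_reasoning": "Rewrite so it explicitly requires multi-step reasoning: {instruction}",
    "complicate_input": "Add a more complex input (table, code, data) to: {instruction}",
}

def evolve(instruction, llm):
    """Pick one In-Depth operation at random and ask the LLM to apply it."""
    _, template = random.choice(list(IN_DEPTH_OPS.items()))
    return llm(template.format(instruction=instruction))

def evolution_failed(original, evolved, response):
    """Elimination checks mirroring the failure cases listed above."""
    if evolved.strip().lower() == original.strip().lower():
        return True   # no enhancement or variation over the original
    if "sorry" in response.lower() or len(response.split()) < 5:
        return True   # the LLM struggled to respond, or replied too briefly
    if not any(ch.isalnum() for ch in response):
        return True   # response is only stop words / punctuation
    return False
```

Failed evolutions are discarded and the original instruction is kept for the next round.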
The experiments show that instructions generated by Evol-Instruct are superior to human-created ones.
4-GenQA
GenQA (Chen et al. 2024) is an innovative approach to synthetic data generation: a large synthetic dataset generated from a single prompt. The unique aspect of this method is that one carefully written prompt instructs a large language model (LLM) to produce diverse, high-quality instruction data on its own. A primary benefit of GenQA is that a broad range of instructions can be created with minimal human effort.
The single prompt to generate data is as follows:
List 60 topics that you can answer questions about. Choose a topic uniformly from this list, and state it. Then write 60 subtopics about the chosen topic. Then choose a subtopic uniformly from this list, and state it. Then write a question that is not about the subtopic, but can only be answered with expertise in the subtopic. Then write the answer. Both the question and answer should be long. The name of the subtopic should not appear in the question, and none of the words in subtopic should be reused in the question. Begin your questions with "Question:" and your answer with "Answer:". Be creative.
This prompting strategy maximizes the randomness and diversity of the outputs generated by large language models (LLMs).
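Because the model tags its output with "Question:" and "Answer:", extracting a training pair is a small parsing step. A minimal sketch follows; the regular expressions reflect my assumption about the completion layout, not anything specified by GenQA itself.

```python
import re

def parse_genqa(completion: str):
    """Extract the Question/Answer pair from a GenQA-style completion.

    Returns a dict with 'question' and 'answer', or None if either
    tag is missing from the completion.
    """
    q = re.search(r"Question:\s*(.+?)\s*Answer:", completion, re.DOTALL)
    a = re.search(r"Answer:\s*(.+)\s*$", completion, re.DOTALL)
    if not (q and a):
        return None
    return {"question": q.group(1).strip(), "answer": a.group(1).strip()}
```

Completions where the model ignored the tagging format simply return None and can be dropped.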
5-Nemotron-340B-Technical Report
Nemotron (Nvidia, 2024) is a 340B-parameter model released by Nvidia. Over 98% of the data used for supervised fine-tuning and preference fine-tuning was synthesized. The workflow is as follows:
- Synthetic single-turn prompts The single-turn prompts cover many topics such as open Q&A, writing, closed Q&A, math&coding and so on. The pipelines are as follows:
- Synthetic instruction-following prompts To ensure and improve the model’s instruction-following capabilities, it is important to generate both single-turn and multi-turn instruction-following prompts.
- Synthetic two-turn prompts This part typically involves preference fine-tuning to enhance the model's multi-turn conversation skills. The format is
User: XXX; Assistant: XXX; User: XXX
- Synthetic Dialogue Generation This section aims to improve the model's ability to handle multi-turn conversations. Each dialogue involves three turns. The generator model alternates between simulating the roles of the Assistant and the User through iterative role-playing. Providing the model with explicit prompts that define distinct user personalities is essential to encourage desired behavior during user turns.
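The iterative role-playing described above can be sketched as one generator model alternating between roles. Here `llm` and its parameters (`role`, `persona`, `history`) are hypothetical stand-ins for the real model call; the persona prompt is only attached on user turns, as the report's approach suggests.

```python
def synthesize_dialogue(persona, opening_prompt, llm, turns=3):
    """Build a multi-turn dialogue by alternating one model between roles.

    persona: description injected on user turns to steer user behavior.
    llm: callable(role, persona, history) -> next utterance (stand-in).
    """
    messages = [{"role": "user", "content": opening_prompt}]
    for _ in range(turns):
        # Generator plays the Assistant, conditioned on the history so far.
        reply = llm(role="assistant", persona=None, history=messages)
        messages.append({"role": "assistant", "content": reply})
        # Generator then plays the User, with an explicit persona prompt.
        follow_up = llm(role="user", persona=persona, history=messages)
        messages.append({"role": "user", "content": follow_up})
    return messages
```

Each loop iteration adds one assistant turn and one user turn, so three iterations yield the three-turn dialogues described above.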
6-Summary
In this blog, I discussed several important methods for generating synthetic data. However, I did not cover how to assess the quality of this data; the quality of instruction data, which I will address later, is also crucial in the fine-tuning stage. Currently and for the foreseeable future, synthetic data remains a significant area of study in artificial intelligence.
- Author:Chengsheng Deng
- URL:https://chengshengddeng.com/article/synthetic-instruction-data-generation
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!