
Introduction

 
In this blog, I share notes on an intriguing paper I recently read: OPENCODER: THE OPEN COOKBOOK FOR TOP-TIER CODE LARGE LANGUAGE MODELS.
This comprehensive guide teaches readers how to train a Large Language Model (LLM) from scratch—not just for coding models, but for general models as well.
 

Pretraining Data

 
Pretraining data is a crucial component in training LLMs. The authors of this paper detail their strategies for collecting and processing data in both the pretraining and annealing stages.
The first dataset they discuss is RefineCode.

RefineCode

 
The authors introduce RefineCode, a high-quality dataset of 960 billion tokens spanning 607 programming languages. It comprises two main parts: raw code and code-related web data. The raw code data primarily comes from GitHub repositories up to November 2023, with additional non-GitHub data from The Stack V2. The code-related web data is mainly sourced from web corpora.
 
RAW CODE
To ensure the high quality of the raw code data, the authors develop a detailed five-step pipeline to process it.
  • Preprocessing: The authors exclude files exceeding 8MB in size, which are predominantly non-text files. They then restrict the selection to file types related to programming languages based on file extensions, filtering out those with low capacity or quality.
  • Deduplication: The authors apply two deduplication methods to eliminate repetitive content and ensure data diversity (a minimal sketch of both appears after this list):
    • Exact Deduplication: Due to forking and copy-pasting within the codebase, nearly 75% of files are duplicates. The authors compute the SHA256 hash value for each document. Files with identical hash values are compared, retaining only the code files with the highest star count and the latest commit time.
    • Fuzzy Deduplication: Following the general data pipeline's fuzzy deduplication setting, the authors split the raw text into 5-gram pieces and calculate 2,048 MinHash functions. They then use Locality-Sensitive Hashing (LSH), setting 16 bands and 128 rows, to retain distinct files with the highest stars and latest commit time. This process removes 6% of the file volume.
  • Transformation: Removing a file outright works when only a handful of files fail to meet the criteria, but some issues appear across so many files that wholesale removal would be too destructive. Instead, the authors apply two transformations: first, they strip copyright notices from the initial code comments; second, they replace personally identifiable information with placeholders such as “<name>” and “<password>”.
  • Filtering: The authors consider three guidelines when designing filters (an illustrative rule sketch appears after this list):
      1. Filter out files with poor self-containment
      2. Filter out files with poor or minimal logical structure
      3. Remove files that deviate significantly from standard formatting
      Based on these guidelines, they develop three categories of filtering rules:
    • Natural Language Filtering Rules: These rules apply to all text and code files, filtering data based on common properties such as file size, number of lines, and other general metrics.
    • General Code Filtering Rules: These rules apply to all code files, filtering data based on the number of variables, average function length, and other common coding features.
    • Language-Specific Filtering Rules: These rules are tailored to the characteristics of specific programming languages, such as the frequency of "pass" statements in Python.
  • Data Sampling: In this step, the authors strive to preserve the original data distribution as much as possible to maximize the utilization of their clean, high-quality dataset. They downsample certain high-resource programming languages before pretraining. This process ultimately yields about 730B tokens for the pretraining stage.
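To make the deduplication step described above more concrete, here is a minimal sketch of exact and fuzzy deduplication. It is not the authors' code: the CodeFile record, its field names, and the use of the datasketch library are my own assumptions; only the parameters (SHA-256 hashing, 5-gram shingles, 2,048 MinHash permutations, and LSH with 16 bands and 128 rows) come from the paper.

```python
# Minimal sketch of exact + fuzzy deduplication. CodeFile and its fields are
# hypothetical; only the parameters (SHA-256 hashing, 5-gram shingles,
# 2,048 MinHash permutations, 16 bands x 128 rows) follow the paper.
import hashlib
from dataclasses import dataclass

from datasketch import MinHash, MinHashLSH


@dataclass
class CodeFile:
    path: str
    content: str
    stars: int
    commit_time: int  # e.g. a Unix timestamp


def exact_dedup(files: list[CodeFile]) -> list[CodeFile]:
    """Keep one file per SHA-256 hash: most stars first, then latest commit."""
    best: dict[str, CodeFile] = {}
    for f in files:
        key = hashlib.sha256(f.content.encode("utf-8")).hexdigest()
        cur = best.get(key)
        if cur is None or (f.stars, f.commit_time) > (cur.stars, cur.commit_time):
            best[key] = f
    return list(best.values())


def fuzzy_dedup(files: list[CodeFile]) -> list[CodeFile]:
    """Drop near-duplicates using MinHash LSH over 5-gram shingles."""
    lsh = MinHashLSH(num_perm=2048, params=(16, 128))  # 16 bands x 128 rows
    kept: list[CodeFile] = []
    # Visit higher-priority files first so they are the copies that survive.
    for i, f in enumerate(sorted(files, key=lambda x: (x.stars, x.commit_time), reverse=True)):
        tokens = f.content.split()
        shingles = {" ".join(tokens[j:j + 5]) for j in range(max(1, len(tokens) - 4))}
        m = MinHash(num_perm=2048)
        for s in shingles:
            m.update(s.encode("utf-8"))
        if lsh.query(m):  # a near-duplicate has already been kept
            continue
        lsh.insert(str(i), m)
        kept.append(f)
    return kept


# Usage: exact deduplication first, then fuzzy deduplication on the survivors.
# deduped = fuzzy_dedup(exact_dedup(files))
```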
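For the filtering rules described above, the following illustration shows what the three rule categories might look like in code. Every threshold and check here is an invented example; the paper only names the kinds of signals it uses (file size, line counts, number of variables, average function length, and language-specific patterns such as Python pass statements).

```python
# Illustrative only (not the paper's actual rules): simple heuristics in the
# spirit of the three filtering-rule categories. All thresholds are invented.
def passes_natural_language_rules(text: str, max_bytes: int = 1_000_000) -> bool:
    """Generic checks that apply to any text or code file (size, line shape)."""
    lines = text.splitlines()
    if not lines or len(text.encode("utf-8")) > max_bytes:
        return False
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    return 1 <= avg_line_len <= 200  # reject empty files and wall-of-text blobs


def passes_general_code_rules(code: str) -> bool:
    """Checks shared by all programming languages (basic structure, signal)."""
    lines = [line for line in code.splitlines() if line.strip()]
    if len(lines) < 3:  # too little logical structure to be useful
        return False
    alpha_ratio = sum(c.isalnum() for c in code) / max(1, len(code))
    return alpha_ratio > 0.25  # mostly encoded data or minified junk otherwise


def passes_python_specific_rules(code: str) -> bool:
    """Example of a language-specific rule: stub-heavy Python files."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    pass_ratio = sum(line == "pass" for line in lines) / max(1, len(lines))
    return pass_ratio < 0.2
```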
CODE-RELATED WEB DATA
The authors gather high-quality code-related data from the Common Crawl dataset. They begin by annotating 500,000 pieces of high-quality code-like data from CommonCrawl using the Autonomous Data Selection method. This annotated data serves as seed data for training FastText and forms the initial code seed corpus.
They then design a processing pipeline for code-related web data, comprising four main components:
  • FastText Model Training: They apply a BPE (Byte Pair Encoding) tokenizer to segment the corpus and then use the open-source FastText framework for model training (a minimal sketch of this recall setup appears below).
  • Recall from Common Crawl: They perform recall on Common Crawl to generate the code-related web corpus.
  • Code-related Domain Discovery: They identify domains in which a large share of pages were recalled as code-related.
  • URL Annotation: Within these domains, they manually annotate URLs associated with code content.
They apply the same pipeline to FineWeb, Skypile, and the web part of AutoMathText, producing 330GB of code-related web data in total. Additionally, they collect 178GB of code-related textual data from GitHub.
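As a rough illustration of the FastText recall step, the sketch below trains a supervised FastText classifier on the seed corpus and uses it to keep likely code-related pages. The file name, labels, hyperparameters, and threshold are assumptions for the example; the paper's BPE pre-tokenization step is omitted here.

```python
# Sketch of the FastText-based recall step (assumed file name, labels, and
# hyperparameters; the paper's BPE pre-tokenization is omitted for brevity).
import fasttext

# "seed_corpus.txt" is a hypothetical file with one labeled example per line, e.g.
#   __label__code    how to reverse a linked list in C ...
#   __label__other   celebrity gossip article text ...
model = fasttext.train_supervised(input="seed_corpus.txt", dim=128, epoch=5, wordNgrams=2)


def is_code_related(page_text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier labels the page as code-related."""
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__code" and float(probs[0]) >= threshold


# Recall: keep only the Common Crawl pages the classifier flags.
# code_corpus = [page for page in common_crawl_pages if is_code_related(page)]
```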
 
The composition of the data sources is as follows:
[Figure: composition of the RefineCode data sources]

ANNEALING DATA

 
The annealing stage serves as a bridge between the general pretraining stage and the supervised fine-tuning (SFT) stage. The data in this stage is crucial, and it's essential to maintain a data distribution similar to the pretraining phase. The categories of data in the annealing stage are as follows:
  • Algorithmic Corpus: Algorithmic code files exhibit strong logic and minimal dependence on external files, demonstrating excellent self-containment. They also represent independent tasks commonly encountered in real-world interactive scenarios. Therefore, the authors sample a certain proportion of the original pretraining data containing keywords such as “leetcode”, “def solution”, or “class solution” to create this corpus (a minimal keyword-matching sketch appears after this list).
  • Synthetic Data: Recognizing the importance of synthetic data, the authors select the Algorithmic Corpus as the seed because it encompasses a wide range of algorithmic logic.
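Here is a minimal sketch of the keyword-based sampling for the Algorithmic Corpus. The keywords come from the paper; the sampling rate and the random-shuffle downsampling are my own illustration.

```python
# Sketch: select algorithmic files by keyword, then keep a proportion of them.
# The keywords follow the paper; the 25% sampling rate is an arbitrary example.
import random

KEYWORDS = ("leetcode", "def solution", "class solution")


def looks_algorithmic(code: str) -> bool:
    lowered = code.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def sample_algorithmic_corpus(files: list[str], rate: float = 0.25, seed: int = 0) -> list[str]:
    candidates = [f for f in files if looks_algorithmic(f)]
    random.Random(seed).shuffle(candidates)
    return candidates[: int(len(candidates) * rate)]
```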
 

POST-TRAINING

 
Post-training has become increasingly important, with numerous studies now focusing on this stage. The authors begin by discussing the composition of post-training data.

DATA COMPOSITION

 
Open-source Training Data The authors collect various open-source instruction corpora from websites, including Evol-Instruct, Infinity-Instruct, and McEval. They use an LLM to perform binary classification, extracting code-specific segments. Additionally, they sample real user queries from WildChat and Code-290k-ShareGPT, using an LLM to extract code-related dialogue histories and clean the data. For low-quality responses, they employ a robust LLM to regenerate the content, enhancing overall data quality.
Educational Instruction Synthesis The authors use code snippets from real-world sources as seed data to synthesize question-answer pairs. This approach ensures diverse and rich instruction-tuning datasets.
Package-related Instruction Synthesis The authors address the issue of outdated package usage in pre-training data undermining model performance. To mitigate the impact of outdated programming syntax and obsolete external library interfaces, they synthesize a tool usage instruction tuning dataset using up-to-date external library documentation.
Large-scale Diverse Instruction Synthesis To ensure diversity, the authors create a large-scale instruction data synthesis framework. The framework includes four key components (a hedged end-to-end sketch follows this list):
  • An LLM cleans irrelevant context (such as web advertisements) and selects useful data as seeds for further question generation.
  • A task specification module defines programming languages, difficulty levels, and coding task types. The prompt engineering component uses a template-based system to generate diverse, contextually rich prompts, incorporating real-world scenarios and software development best practices.
  • A more advanced LLM with additional parameters generates both the questions and corresponding answers.
  • Another LLM refines the responses by adding code comments and more detailed explanations.
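To make the shape of this framework concrete, here is a hedged sketch of the four components chained together. The call_llm helper, the model names, and the prompt wording are entirely hypothetical placeholders; the paper does not specify them.

```python
# Hypothetical sketch of the four-component synthesis pipeline. call_llm(), the
# model names, and the prompt wording are placeholders, not the authors' setup.
from dataclasses import dataclass


@dataclass
class InstructionPair:
    question: str
    answer: str


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for an LLM API call; wire this to whichever backend you use."""
    raise NotImplementedError


def synthesize(seed_document: str, language: str = "Python", difficulty: str = "medium") -> InstructionPair:
    # 1) Clean irrelevant context (ads, navigation text) and keep the useful seed.
    seed = call_llm("cleaner-llm", f"Remove irrelevant content and keep the code-related core:\n{seed_document}")

    # 2) Task specification + templated prompt: language, difficulty, task type, realistic scenario.
    prompt = (
        f"Write a {difficulty} {language} coding task grounded in a realistic "
        f"software-development scenario, based on this snippet:\n{seed}"
    )

    # 3) A stronger LLM generates the question and its answer.
    question = call_llm("strong-llm", prompt)
    answer = call_llm("strong-llm", f"Solve the following task with working {language} code:\n{question}")

    # 4) Another LLM refines the answer with comments and a fuller explanation.
    refined = call_llm("refiner-llm", f"Add code comments and a detailed explanation:\n{answer}")

    return InstructionPair(question=question, answer=refined)
```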
 

TWO-STAGE INSTRUCTION-TUNING

The authors implement a two-stage instruction fine-tuning process to ensure that the resulting code LLM excels in both theoretical knowledge and practical coding tasks. This approach aims to meet the needs of developers, beginners, and professionals alike.
 
The first stage focuses on synthesizing question-answer (QA) pairs related to theoretical computer science. This ensures that the model can respond with greater precision to questions about concepts such as binary search trees, dynamic programming, and other fundamental topics.
 
The second stage concentrates on practical coding tasks. Here, the authors use high-quality code from GitHub to create a dataset that improves the model's ability to generate and work with code. By fine-tuning on this high-quality data, the model is exposed to real-world examples of well-formatted code. This enhances its ability to generate code that is both syntactically and semantically correct.
 

Conclusion

 
In this blog, I've shared my notes on the paper "OPENCODER: THE OPEN COOKBOOK FOR TOP-TIER CODE LARGE LANGUAGE MODELS." This isn't just an excellent paper—it's a comprehensive technical guide on training a Coder LLM. The paper delves into numerous training details and model architecture specifics, making it a must-read for anyone interested in the field!
 