Abstract
I recently read a detailed paper on preparing high-quality datasets (the BIG-MATH paper discussed below), and I’m summarizing its key points here for future reference.
Although the paper specifically addresses datasets for reinforcement learning, the underlying principles apply broadly to other specialized datasets.
Current math datasets face two primary issues:
  • Human-written datasets are high-quality but limited in quantity.
  • Machine-generated datasets are abundant but often lack quality assurance.
To address these issues, the authors developed a rigorous process for cleaning and curating datasets using strict filtering criteria.

Data Collection

The authors collected data from three widely-used datasets:
  • HARP
  • Omni-MATH
  • NuminaMath
Additionally, they incorporated synthetically generated data. The original and filtered quantities of data from each source are summarized below:
[Figure: original and filtered problem counts for each data source]
Each dataset required a tailored collection strategy:
  • HARP: Only the “short answer” subset was selected, excluding questions requiring proofs or multiple-choice answers.
  • Omni-MATH: This dataset includes 4,500 Olympiad-level problems sourced from 39 competition websites, with professional annotators and verifiers ensuring solution quality.
  • NuminaMath: Several subsets, including synthetic_math, synthetic_amc, and MATH, were excluded due to uncertain quality.

Data Cleaning and Filtering

To ensure dataset quality, the authors employed a human-in-the-loop methodology, iteratively refining filters through human verification and annotation.
  • HARP: Problems containing figures in the Asymptote vector graphics language (identified by “[asy]”) were removed, totaling 625 problems, as these require visual interpretation.
  • Omni-MATH: Problems containing personal information were manually revised (45 total). Additionally, problems without extractable solutions or containing competition scoring details were removed.
  • NuminaMath: MinHashLSH filtering was used for deduplication, with similarity thresholds set at 0.6 or 0.7 (a minimal sketch of this step follows the list). Problems lacking explicitly boxed answers (“\boxed{}” in LaTeX) were filtered out. For the aops_forum subset, 2,535 problems containing extraneous information (e.g., submission year, scoring details) were cleaned using regular expressions.
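As a rough illustration of the MinHashLSH deduplication step, here is a minimal sketch using the datasketch library. The threshold mirrors the 0.6/0.7 values reported in the paper, but the whitespace shingling, helper names, and other parameters are my own assumptions, not the authors’ code.
```python
# Hedged sketch of MinHashLSH near-duplicate filtering (datasketch library).
# Shingling by whitespace tokens and num_perm=128 are illustrative choices.
from datasketch import MinHash, MinHashLSH

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash the words of a problem statement into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate(problems: list[str], threshold: float = 0.7) -> list[str]:
    """Keep only the first problem from each group of near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(problems):
        sig = build_minhash(text)
        if not lsh.query(sig):  # no previously kept problem is similar enough
            lsh.insert(f"problem-{i}", sig)
            kept.append(text)
    return kept
```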
After initial filtering, 463,426 problems remained. However, the authors determined further refinement was necessary for reinforcement learning suitability.

Additional Filtering Steps

To convert the raw dataset into a suitable format for training math reasoning models with reinforcement learning, the authors applied additional filters:

Deduplication and Decontamination

A straightforward deduplication step removed exact duplicates through string matching. To ensure diversity, semantic duplicates were identified and removed using the SemDeDup algorithm and embeddings from the sentence-transformers/all-MiniLM-L6-v2 model, with a cosine similarity threshold of 0.5.
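A simplified sketch of this semantic deduplication step is shown below. It embeds problems with all-MiniLM-L6-v2 and drops any problem whose cosine similarity to an already-kept one exceeds 0.5; the full SemDeDup algorithm first clusters the embeddings to avoid pairwise comparison, which this sketch omits.
```python
# Simplified semantic deduplication in the spirit of SemDeDup.
# The greedy pairwise comparison below is an illustrative stand-in for the
# cluster-then-compare procedure used by the actual algorithm.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_dedup(problems: list[str], threshold: float = 0.5) -> list[str]:
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalized embeddings make the dot product equal to cosine similarity.
    embeddings = model.encode(problems, normalize_embeddings=True)
    kept_idx, kept_vecs = [], []
    for i, vec in enumerate(embeddings):
        if kept_vecs and np.max(np.stack(kept_vecs) @ vec) > threshold:
            continue  # too similar to a problem we already kept
        kept_idx.append(i)
        kept_vecs.append(vec)
    return [problems[i] for i in kept_idx]
```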

Ensuring Solvability

Several methods were employed to ensure problems were solvable:
  • Language Filter: Non-English problems were removed using FastText language identification.
  • Hyperlink Detection: Problems containing hyperlinks were removed, since hyperlinks typically point to external resources needed to solve the problem (a combined sketch of the language and hyperlink filters follows this list).
  • Model Solve Rate: Language models (Llama-3.1-8B and Llama-3.1-405B) were used to verify problem-answer pairs. Problems solved correctly by either model were retained. The authors acknowledged potential limitations due to model familiarity with the data or common mistakes, suggesting stronger math-specific models could mitigate these issues.
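Here is a combined sketch of the language and hyperlink filters, assuming FastText’s pretrained lid.176.bin language-identification model and a simple URL regex; the file path, confidence cutoff, and pattern are illustrative assumptions rather than the paper’s exact setup.
```python
# Hedged sketch of two solvability pre-filters: FastText language ID and a
# hyperlink check. Keep a problem only if it is English and has no URLs.
import re
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # assumed local model path
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)

def keep_problem(text: str, min_confidence: float = 0.8) -> bool:
    """Return True if the problem passes the language and hyperlink filters."""
    if URL_PATTERN.search(text):
        return False
    # fasttext's predict() rejects newlines, so flatten the text first.
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_confidence
```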

Ensuring Open-Endedness

Multiple-choice questions were filtered out using two methods:
  • Regular Expression Filters: Simple regex filters identified answer options labeled alphabetically (A, B, C, D) or numerically (1, 2, 3, 4); an illustrative pattern appears after this list.
  • Model-based Filters: Iterative prompt development with in-context examples was used with Llama-3.1-70B to identify and remove multiple-choice questions.
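An illustrative version of the first filter might look like the pattern below; the authors’ actual regular expressions are not given in my notes, so this is only a sketch of the idea.
```python
# Illustrative regex check for multiple-choice problems with lettered or
# numbered answer options, e.g. "(A) ... (B) ... (C) ... (D) ...".
import re

MC_LETTERS = re.compile(r"\(\s*A\s*\).*\(\s*B\s*\).*\(\s*C\s*\).*\(\s*D\s*\)", re.DOTALL)
MC_NUMBERS = re.compile(r"\(\s*1\s*\).*\(\s*2\s*\).*\(\s*3\s*\).*\(\s*4\s*\)", re.DOTALL)

def looks_multiple_choice(problem: str) -> bool:
    return bool(MC_LETTERS.search(problem) or MC_NUMBERS.search(problem))
```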

Ensuring Unique Verifiability

Verifiable answers are crucial for reinforcement learning. The authors implemented three sub-steps:
  • Answer Filter: Problems whose solutions lacked an explicitly boxed answer were removed (a sketch of this check follows the list).
  • Multi-part Question Filter: Multi-part questions, challenging to verify, were filtered using a dual approach combining regex and model-based methods.
  • Proof Filter: Proof-based questions were excluded due to verification complexity, reserved for future dataset versions.
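For the answer filter, a minimal sketch of extracting a single boxed answer is shown below; the brace handling is simplified (one level of nesting) and is my own approximation, not the paper’s extraction code.
```python
# Hedged sketch of the answer filter: keep a problem only if its solution
# contains exactly one \boxed{...} expression that can be extracted.
import re

BOXED = re.compile(r"\\boxed\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}")

def extract_boxed_answer(solution: str) -> str | None:
    """Return the boxed answer if there is exactly one, else None."""
    matches = BOXED.findall(solution)
    return matches[0] if len(matches) == 1 else None

assert extract_boxed_answer(r"Therefore the answer is \boxed{42}.") == "42"
assert extract_boxed_answer("No boxed answer here.") is None
```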

Reformulation Strategy

Many valuable human-written problems were originally multiple-choice. To retain these, the authors developed a reformulation strategy:
  1. Extract key information.
  2. Reformulate problems as open-ended.
  3. Evaluate and verify reformulations.
The reformulation process is illustrated below:
[Figure: workflow for reformulating multiple-choice problems into open-ended ones]
Post-reformulation, additional filters ensured problems were solvable by either Llama-3.1-8B or Llama-3.1-405B.
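To make the three steps concrete, here is an illustrative prompt skeleton for the reformulation step. The paper uses iteratively developed prompts with in-context examples and specific Llama models; the wording below is my own sketch, not the authors’ prompt.
```python
# Illustrative prompt skeleton for turning a multiple-choice problem into an
# open-ended one. `llm_call` stands in for any function that queries a model
# (e.g., a Llama-3.1 endpoint) and returns its text response.
REFORMULATE_PROMPT = """You are given a multiple-choice math problem.
1. Extract the information needed to state the problem without answer choices.
2. Rewrite it as an open-ended question with a single, uniquely verifiable answer.
3. Return only the rewritten problem.

Problem:
{problem}
"""

def reformulate(problem: str, llm_call) -> str:
    return llm_call(REFORMULATE_PROMPT.format(problem=problem))
```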

Key Takeaways and Implications

The BIG-MATH paper outlines a comprehensive and rigorous methodology for creating high-quality datasets tailored for mathematical reasoning tasks. Its systematic approach to data cleaning and filtering is broadly applicable to other specialized domains.
Key insights from the paper include:
  1. Quality over Quantity: Prioritizing dataset quality through meticulous filtering and verification.
  2. Human-in-the-Loop Validation: Iterative human verification ensures datasets meet practical standards.
  3. Diverse Problem Types: Curating a balanced representation of mathematical concepts and difficulty levels.
  4. Verifiability: Emphasizing uniquely verifiable answers enhances suitability for reinforcement learning.
This work sets a benchmark for specialized dataset creation, highlighting the necessity of combining automated processes with human expertise. The methodology serves as a valuable reference for researchers addressing similar challenges in other domains.
The innovative reformulation of multiple-choice questions into open-ended formats demonstrates how valuable human-generated content can be preserved and adapted for specific training requirements. This approach holds significant potential for other fields where open-ended responses are preferred over multiple-choice formats.