
Introduction

In this post, I share notes on two intriguing papers I recently read: "Re-Reading Improves Reasoning in Large Language Models" and "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models". The two make an interesting contrast: one explores a method to enhance LLMs' reasoning capabilities, while the other argues that LLMs may not be performing genuine reasoning at all.

Papers

Re-Reading Improves Reasoning in Large Language Models

Research Objective
Existing research on reasoning largely focuses on thought-eliciting prompting strategies that shape the output phase, such as Chain of Thought (CoT); few studies concentrate on the input phase. The authors note that most LLMs are decoder-only with unidirectional attention, so when the question is encoded each token can only attend to the tokens before it, which potentially impairs the global understanding of the question. Drawing inspiration from cognitive science studies showing that humans tend to re-read questions during learning and problem-solving to enhance comprehension, the authors apply the same idea to LLMs and call the approach "Re-Reading" the question as input (RE2).
Methodology
The method is straightforward: the input question is stated once and then repeated, with the second occurrence introduced by a short re-reading instruction.
Here, {Input Query} is a placeholder for the input query. This re-reading prompt can be combined with other prompting techniques such as few-shot prompting, self-consistency, CoT, and more.
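To make the template concrete, below is a minimal sketch of how an RE2 prompt can be assembled. The re-reading line follows the pattern described in the paper ("Read the question again: ..."); the helper function, the example question, and the optional CoT trigger are my own illustrative choices rather than code from the paper.

```python
def build_re2_prompt(input_query: str, use_cot: bool = True) -> str:
    """Assemble an RE2-style prompt: state the question, then repeat it.

    The trailing CoT trigger is optional and only illustrates how RE2
    composes with other prompting strategies.
    """
    prompt = (
        f"Q: {input_query}\n"
        f"Read the question again: {input_query}\n"
        "A:"
    )
    if use_cot:
        prompt += " Let's think step by step."
    return prompt


print(build_re2_prompt(
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
))
```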
Key Findings
The authors present the following results from arithmetic reasoning benchmarks:
[Figure: Results on arithmetic reasoning benchmarks; the footnote marker denotes that Vanilla is even superior to CoT prompting.]
Results on commonsense and symbolic reasoning benchmarks are as follows:
[Figure: Results on commonsense and symbolic reasoning benchmarks; the footnote marker denotes that Vanilla is even superior to CoT prompting.]
  • In almost all scenarios, LLMs with RE2 achieve consistent improvements across both LLMs (davinci-003 and ChatGPT) and prompting methods (Vanilla and CoT).
  • When using Vanilla+RE2 on ChatGPT, some exceptions occur with the AQUA and MultiArith datasets. The authors suggest this could be due to ChatGPT's exposure to these datasets with CoT outputs during instruction fine-tuning. As a result, ChatGPT produces CoT-like output even under the vanilla prompt setting, and on these datasets the vanilla prompt even outperforms the explicit CoT setting.
The authors also demonstrate compatibility with few-shot prompting and self-consistency. They present evaluation results on arithmetic reasoning benchmarks under a few-shot setting:
[Figure: Results on arithmetic reasoning benchmarks under the few-shot setting.]
  • As shown, incorporating the re-reading mechanism consistently enhances the performance of both prompting methods.
The authors then present evaluation results of re-reading with self-consistency:
[Figure: Results of re-reading combined with self-consistency.]
  • The re-reading mechanism enhances performance in most scenarios, showcasing its compatibility with the self-consistency approach.
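To show how RE2 composes with self-consistency, here is a minimal sketch: sample several CoT completions for the same re-read prompt, extract a final answer from each, and take a majority vote. The `sample_completion` callable and the naive answer extraction below are my own placeholders, not an API from the paper.

```python
import re
from collections import Counter
from typing import Callable, Optional


def self_consistency_answer(
    prompt: str,
    sample_completion: Callable[[str], str],  # placeholder LLM call, temperature > 0
    n_samples: int = 10,
) -> Optional[str]:
    """Sample several completions for the same prompt and majority-vote
    on the final numeric answer, as in self-consistency."""
    answers = []
    for _ in range(n_samples):
        completion = sample_completion(prompt)
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)  # naive answer extraction
        if numbers:
            answers.append(numbers[-1])
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```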
Limitations
  • The authors acknowledge that their work primarily consists of empirical studies with extensive experiments, lacking substantial theoretical analyses.
  • RE2 increases the input length, leading to a slight reduction in efficiency for longer questions during inference.
  • This work focuses exclusively on the impact of RE2 within the reasoning domain. Future research will explore its application in additional contexts such as multi-turn dialogue and multi-modal reasoning.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Research Objective
LLMs demonstrate potential in solving complex reasoning tasks, particularly in coding and mathematics. However, the question of whether current LLMs are genuinely capable of true logical reasoning remains an important research focus.
Contributions
  • The authors introduce GSM-Symbolic, which generates diverse variants of GSM8K questions using symbolic templates. This approach enables a more nuanced and reliable evaluation of LLMs' performance across various setups, moving beyond single-point accuracy metrics.
  • They question the reliability of currently reported results on GSM8K, demonstrating that LLMs' performance can be viewed as a distribution with unwarranted variance across different instantiations of the same question.
  • The research reveals that LLMs are highly sensitive to changes in numerical values.
  • To further probe the reasoning abilities of LLMs, the authors introduce the GSM-NoOp dataset.
Methodology
Template Generation
The authors generate symbolic templates from a specific example in the GSM8K test set. They provide the following illustrative example:
[Figure: An illustrative example of generating a symbolic template from a GSM8K question.]
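As a rough sketch of the template idea: each GSM8K question becomes a template with placeholders for names and numbers, plus simple conditions that keep every sampled variant solvable, and many concrete variants are then drawn from it. The question text below loosely follows the illustrative example in the paper's figure, while the variable ranges, names, and helper code are my own assumptions.

```python
import random

TEMPLATE = (
    "When {name} watches her nephew, she gets out a variety of toys for him. "
    "The bag of building blocks has {x} blocks in it. The bin of stuffed animals "
    "has {y} stuffed animals inside. The tower of stacking rings has {z} rings on it. "
    "{name} recently bought a tube of bouncy balls, bringing her total number of toys "
    "up to {total}. How many bouncy balls came in the tube?"
)


def sample_instance(rng: random.Random) -> dict:
    """Sample one concrete variant; the condition on `total` keeps the answer positive."""
    name = rng.choice(["Sophie", "Ava", "Mia", "Emma"])
    x, y, z = (rng.randint(5, 100) for _ in range(3))
    total = rng.randint(x + y + z + 1, x + y + z + 100)
    return {
        "question": TEMPLATE.format(name=name, x=x, y=y, z=z, total=total),
        "answer": total - x - y - z,
    }


# For illustration only: in the paper, many such templates are instantiated
# to build 50 datasets of 100 examples each.
rng = random.Random(0)
variants = [sample_instance(rng) for _ in range(50)]
```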
Experimental Setup
The authors evaluate more than 20 open-source models of various sizes, ranging from 2B to 27B parameters, and include state-of-the-art closed models such as GPT-4o-mini, GPT-4o, o1-mini, and o1-preview.
For their experiments, the authors create 50 datasets, each containing 100 examples. They adopt a standard evaluation approach used for GSM8K and other mathematical benchmarks, which involves Chain-of-Thought (CoT) prompting with 8 examples and greedy decoding. Interestingly, they note that the number of examples in the prompt doesn't significantly affect the results.
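A minimal sketch of that evaluation protocol, under my reading of the setup: for each generated dataset, prompt the model with 8 CoT exemplars plus the target question, decode greedily, grade the final answer, and record accuracy per dataset, so that performance becomes a distribution rather than a single number. The `greedy_generate` callable and the crude answer check are placeholders, not the authors' harness.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List


def evaluate_dataset(dataset: List[Dict], cot_exemplars: str,
                     greedy_generate: Callable[[str], str]) -> float:
    """Accuracy on one 100-example dataset under 8-shot CoT with greedy decoding."""
    correct = 0
    for item in dataset:
        prompt = f"{cot_exemplars}\nQ: {item['question']}\nA:"
        completion = greedy_generate(prompt)
        # Crude grading: the gold answer appears on the completion's last line.
        last_line = (completion.strip().splitlines() or [""])[-1]
        correct += str(item["answer"]) in last_line
    return correct / len(dataset)


def performance_distribution(datasets, cot_exemplars, greedy_generate):
    """Accuracy mean and standard deviation across all dataset variants."""
    accs = [evaluate_dataset(d, cot_exemplars, greedy_generate) for d in datasets]
    return mean(accs), stdev(accs)
```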
Experiments & Key Findings
The first experiment evaluates the performance of several state-of-the-art models on GSM-Symbolic. The results are as follows:
[Figure: Performance distribution of state-of-the-art models across GSM-Symbolic variants.]
  • As shown, all models exhibit significant variance across different sets. The authors find that this variation persists even when only changing names and values in the questions while keeping the overall reasoning steps needed to solve a question the same.
  • Another noteworthy point is that the performance on the original 100 GSM8K examples used as templates often deviates by more than one standard deviation from the center of the GSM-Symbolic performance distribution.
The next experiment investigates the factors that contribute to the performance variation of the models.
The authors first examine the type of change to understand the difference between altering names versus changing numbers. Here are the results:
[Figure: Performance variation when changing only names versus changing numbers.]
  • The figure reveals lower variance when changing names compared to numbers, although performance variation persists in both cases. This demonstrates the fragility of state-of-the-art LLMs' reasoning capabilities. Such a level of variability would be unexpected from a grade-school student with genuine mathematical understanding.
The authors then study the impact of question difficulty on model performance. They generate several new templates from GSM-Symbolic by removing or adding one or two clauses to the questions and conduct the experiment. The results are as follows:
[Figure: Modifying the difficulty level of GSM-Symbolic by changing the number of clauses. As the difficulty increases from GSM-M1 → GSM-Symb → GSM-P1 → GSM-P2, the distribution of performance shifts to the left (i.e., accuracy decreases) and the variance increases.]
  • As we can see, the rate of accuracy drop increases as the difficulty increases. This suggests that the models are not performing formal reasoning, since the number of required reasoning steps increases linearly, but the rate of accuracy drop appears to be faster.
The authors also introduce GSM-NoOp to challenge the reasoning capabilities of language models. The performance of models drops significantly on GSM-NoOp. The following figures show examples from the GSM-NoOp dataset and the experiment results.
[Figures: an example question from the GSM-NoOp dataset, and results (a)-(c) showing the performance drop on GSM-NoOp under different sources of 8-shot examples.]
As shown in (a), there's a catastrophic performance decline across all tested models, even with stronger models such as o1-preview. To understand this performance drop, the authors conduct another experiment where they change the source of the 8-shot examples and report the results in (b) and (c).
For NoOp-Symb, they use GSM-Symbolic shots of the same question, and for NoOp-NoOp, they use GSM-NoOp shots of different questions. Overall, they demonstrate that while some models' performance improves due to changing the source of 8-shot examples, there are still instances where performance decreases. Notably, the models don't perform nearly as well on GSM-NoOp as they do on GSM8K and GSM-Symbolic.
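To illustrate the GSM-NoOp construction: a clause that sounds relevant but does not affect the answer is appended to the question, and a model that is actually reasoning should return the same answer with or without it. The harness below is my own illustration, not code from the paper, and the example question is paraphrased from the one commonly discussed alongside the paper.

```python
from typing import Callable


def add_noop_clause(question: str, noop_clause: str) -> str:
    """Append a seemingly relevant but inconsequential statement to a question."""
    return f"{question} {noop_clause}"


def answer_is_robust(question: str, noop_clause: str, expected: str,
                     solve: Callable[[str], str]) -> bool:
    """A model that reasons correctly should give the same answer either way."""
    return (expected in solve(question)
            and expected in solve(add_noop_clause(question, noop_clause)))


# Illustrative usage; `my_model` stands for whatever LLM call you use.
question = ("Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday "
            "he picks double the number he picked on Friday. How many kiwis does he have?")
noop = "Five of Sunday's kiwis were a bit smaller than average."
# answer_is_robust(question, noop, expected="190", solve=my_model)
```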

Conclusion

I've made some notes on two interesting papers. For more details, please read the original papers. In my personal opinion, it's premature to conclude that LLMs fail at reasoning tasks. The current limitations may not persist in the future. 🫡🫡 
 