
Unlocking Deeper Logic in Language Models with Dynamic Rewards

TLDR: A new research paper introduces DynamicReasoningEfficiencyReward (DRER), a reinforcement learning framework that enhances large language models’ (LLMs) reasoning by rewarding the quality and efficiency of their Chain-of-Thought (CoT) steps, not just final answers. Coupled with LogicTree, a novel deductive reasoning dataset, DRER significantly boosts LLM accuracy, logical consistency, and efficiency, outperforming larger models on complex logical tasks and showing generalization to other benchmarks.

Large language models (LLMs) have made incredible strides in understanding and generating human-like text. However, when it comes to complex reasoning, especially formal logic, there’s still a significant challenge. Many current methods use reinforcement learning (RL) to improve LLMs’ reasoning, but they often fall short by only rewarding the final answer, not the quality of the thought process that leads to it. This can result in what researchers call “decorative” chains of thought – steps that look like reasoning but don’t actually help the model arrive at the correct conclusion.

A new research paper, titled “Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL,” introduces an innovative framework called DynamicReasoningEfficiencyReward (DRER) to address these limitations. The authors, Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, and Honggang Zhang from Beijing University of Posts and Telecommunications, propose a system that not only rewards correct answers but also the effectiveness and efficiency of the reasoning steps themselves.

Dynamic Reasoning Efficiency Reward (DRER)

DRER is designed as a plug-and-play RL reward framework that refines how LLMs learn to reason. It has two main components (a rough code sketch of how they might combine appears after the list):

  • Reasoning Quality Reward: This part of DRER gives specific credit to reasoning chains (Chain-of-Thought or CoT) that genuinely increase the likelihood of the model predicting the correct answer. Essentially, it encourages the model to generate CoT tokens that are truly beneficial, rather than just filler.

  • Dynamic Length Advantage: To prevent models from generating overly long or short responses, this mechanism adjusts the reward based on the length of the generated answer. If a response’s length deviates too much from an ideal range (determined from a validation set), its advantage is reduced, helping to stabilize training and promote concise, yet complete, reasoning.
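The paper's exact reward formulation isn't reproduced in this article, but a minimal Python sketch can illustrate the intuition: credit the chain-of-thought by how much it raises the model's probability of the correct answer, then damp the advantage when the response length drifts outside a target band estimated from validation data. Every name and constant below (reasoning_quality_reward, target_len_range, the exponential decay, and so on) is an illustrative assumption, not the authors' implementation.

    import math

    def reasoning_quality_reward(logp_answer_with_cot, logp_answer_without_cot):
        # Credit the CoT by how much it raises the log-probability of the
        # correct answer (illustrative, not the paper's exact formula).
        return logp_answer_with_cot - logp_answer_without_cot

    def dynamic_length_advantage(advantage, response_len, target_len_range, decay=0.01):
        # Shrink the advantage exponentially as the response length drifts
        # outside an ideal range estimated from a validation set.
        lo, hi = target_len_range
        if lo <= response_len <= hi:
            return advantage
        deviation = lo - response_len if response_len < lo else response_len - hi
        return advantage * math.exp(-decay * deviation)

    # Toy usage: the CoT adds 1.2 nats of answer confidence, but the response
    # runs 120 tokens past the ideal upper bound, so the advantage is damped.
    adv = reasoning_quality_reward(-0.4, -1.6)            # 1.2
    adv = dynamic_length_advantage(adv, 520, (200, 400))  # ~0.36
    print(round(adv, 3))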

Introducing LogicTree: A New Benchmark for Deductive Reasoning

To rigorously test and train models with DRER, the researchers also developed a new dataset called LogicTree. Unlike many existing datasets that mix logical reasoning with mathematical problems or real-world knowledge, LogicTree focuses purely on formal deductive reasoning. It’s built around seven fundamental inference rules from mathematical logic, forming a nested binary reasoning tree where difficulty can be precisely controlled.

LogicTree is unique because it is deliberately stripped of semantic content, so models can't rely on prior knowledge or common sense to solve problems. It also includes features like extracting intermediate conclusions as separate questions to assess the completeness of reasoning, and a logical consistency metric to check whether models apply the same logical principles across different natural-language phrasings.
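To make that design concrete, here is a toy, made-up example of what a LogicTree-style item could look like based on the description above: premises are meaningless symbols, the answer follows by chaining modus ponens to a controllable depth, and every intermediate conclusion can be turned into its own question to test completeness. The symbols and format are invented for illustration and are not the dataset's actual schema.

    # Toy illustration only -- not the dataset's actual schema.
    premises = {"A0"}                                           # the single given fact
    implications = [("A0", "A1"), ("A1", "A2"), ("A2", "A3")]   # A0->A1->A2->A3, depth 3

    derived = set(premises)
    intermediate_questions = []
    for antecedent, consequent in implications:
        if antecedent in derived:          # modus ponens: from X and X -> Y, infer Y
            derived.add(consequent)
            intermediate_questions.append(f"Does {consequent} follow from the premises?")

    print(sorted(derived))                 # ['A0', 'A1', 'A2', 'A3']
    print(intermediate_questions)          # each step doubles as a completeness check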

Impressive Results and Generalization

The experiments conducted using a Qwen2.5-7B-Instruct-1M model trained with DRER on LogicTree showed remarkable improvements. After just 400 training steps, the model’s accuracy on LogicTree soared from an initial 7% to nearly 60%. This performance is comparable to, and in some aspects surpasses, much larger and more advanced models like GPT-o3-mini, which achieved only 18% average accuracy on the same benchmark.

Crucially, the DRER-trained model maintained a 31% accuracy rate even on the most challenging problems with a reasoning depth of 8, where many other state-of-the-art models struggled significantly, often scoring close to zero. The average confidence of CoT-augmented answers also increased by 30%, and the model became more efficient, using fewer tokens per problem while achieving higher accuracy.

Beyond LogicTree, the model also demonstrated a modest ability to generalize its enhanced logical reasoning skills to other benchmarks, including ZebraLogic, ProntoQA, AIME24, and MMLU-redux, indicating that the training fostered broader logical capabilities.


Looking Ahead

The DRER framework and LogicTree dataset offer a promising path toward making LLMs more reliable and interpretable in their reasoning. By directly integrating reasoning quality into the learning objective, models can develop more consistent and accurate chains of thought. While the current work focuses on deductive reasoning and a specific model size, future research aims to expand DRER to higher-order logic, explore more cost-effective reward approximations, and incorporate human feedback to further refine reasoning-aligned training.

For more technical details, you can refer to the full research paper here.

Ananya Rao (http://edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
