
Unlocking Deeper Logic in Language Models with Dynamic Rewards

TLDR: A new research paper introduces DynamicReasoningEfficiencyReward (DRER), a reinforcement learning framework that enhances large language models’ (LLMs) reasoning by rewarding the quality and efficiency of their Chain-of-Thought (CoT) steps, not just final answers. Coupled with LogicTree, a novel deductive reasoning dataset, DRER significantly boosts LLM accuracy, logical consistency, and efficiency, outperforming larger models on complex logical tasks and showing generalization to other benchmarks.

Large language models (LLMs) have made incredible strides in understanding and generating human-like text. However, when it comes to complex reasoning, especially formal logic, there’s still a significant challenge. Many current methods use reinforcement learning (RL) to improve LLMs’ reasoning, but they often fall short by only rewarding the final answer, not the quality of the thought process that leads to it. This can result in what researchers call “decorative” chains of thought – steps that look like reasoning but don’t actually help the model arrive at the correct conclusion.

A new research paper, titled “Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL,” introduces an innovative framework called DynamicReasoningEfficiencyReward (DRER) to address these limitations. The authors, Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, and Honggang Zhang from Beijing University of Posts and Telecommunications, propose a system that not only rewards correct answers but also the effectiveness and efficiency of the reasoning steps themselves.

Dynamic Reasoning Efficiency Reward (DRER)

DRER is designed as a plug-and-play RL reward framework that refines how LLMs learn to reason. It has two main components (a rough code sketch of how they might combine appears after the list):

  • Reasoning Quality Reward: This part of DRER gives specific credit to reasoning chains (Chain-of-Thought or CoT) that genuinely increase the likelihood of the model predicting the correct answer. Essentially, it encourages the model to generate CoT tokens that are truly beneficial, rather than just filler.

  • Dynamic Length Advantage: To prevent models from generating overly long or short responses, this mechanism adjusts the reward based on the length of the generated answer. If a response’s length deviates too much from an ideal range (determined from a validation set), its advantage is reduced, helping to stabilize training and promote concise, yet complete, reasoning.
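The paper's exact reward formulation isn't reproduced in this article, but a minimal Python sketch can illustrate the intuition: credit the chain-of-thought by how much it raises the model's probability of the correct answer, then damp the advantage when the response length drifts outside a target band estimated from validation data. Every name and constant below (reasoning_quality_reward, target_len_range, the exponential decay, and so on) is an illustrative assumption, not the authors' implementation.

    import math

    def reasoning_quality_reward(logp_answer_with_cot, logp_answer_without_cot):
        # Credit the CoT by how much it raises the log-probability of the
        # correct answer (illustrative, not the paper's exact formula).
        return logp_answer_with_cot - logp_answer_without_cot

    def dynamic_length_advantage(advantage, response_len, target_len_range, decay=0.01):
        # Shrink the advantage exponentially as the response length drifts
        # outside an ideal range estimated from a validation set.
        lo, hi = target_len_range
        if lo <= response_len <= hi:
            return advantage
        deviation = lo - response_len if response_len < lo else response_len - hi
        return advantage * math.exp(-decay * deviation)

    # Toy usage: the CoT adds 1.2 nats of answer confidence, but the response
    # runs 120 tokens past the ideal upper bound, so the advantage is damped.
    adv = reasoning_quality_reward(-0.4, -1.6)            # 1.2
    adv = dynamic_length_advantage(adv, 520, (200, 400))  # ~0.36
    print(round(adv, 3))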

Introducing LogicTree: A New Benchmark for Deductive Reasoning

To rigorously test and train models with DRER, the researchers also developed a new dataset called LogicTree. Unlike many existing datasets that mix logical reasoning with mathematical problems or real-world knowledge, LogicTree focuses purely on formal deductive reasoning. It’s built around seven fundamental inference rules from mathematical logic, forming a nested binary reasoning tree where difficulty can be precisely controlled.

LogicTree is unique because it is deliberately stripped of semantic content, so models can't rely on prior knowledge or common sense to solve problems. It also includes features like extracting intermediate conclusions as separate questions to assess the completeness of reasoning, and a logical consistency metric to check whether models apply the same logical principles across different natural-language phrasings.
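To make that design concrete, here is a toy, made-up example of what a LogicTree-style item could look like based on the description above: premises are meaningless symbols, the answer follows by chaining modus ponens to a controllable depth, and every intermediate conclusion can be turned into its own question to test completeness. The symbols and format are invented for illustration and are not the dataset's actual schema.

    # Toy illustration only -- not the dataset's actual schema.
    premises = {"A0"}                                           # the single given fact
    implications = [("A0", "A1"), ("A1", "A2"), ("A2", "A3")]   # A0->A1->A2->A3, depth 3

    derived = set(premises)
    intermediate_questions = []
    for antecedent, consequent in implications:
        if antecedent in derived:          # modus ponens: from X and X -> Y, infer Y
            derived.add(consequent)
            intermediate_questions.append(f"Does {consequent} follow from the premises?")

    print(sorted(derived))                 # ['A0', 'A1', 'A2', 'A3']
    print(intermediate_questions)          # each step doubles as a completeness check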

Impressive Results and Generalization

The experiments conducted using a Qwen2.5-7B-Instruct-1M model trained with DRER on LogicTree showed remarkable improvements. After just 400 training steps, the model’s accuracy on LogicTree soared from an initial 7% to nearly 60%. This performance is comparable to, and in some aspects surpasses, much larger and more advanced models like GPT-o3-mini, which achieved only 18% average accuracy on the same benchmark.

Crucially, the DRER-trained model maintained a 31% accuracy rate even on the most challenging problems with a reasoning depth of 8, where many other state-of-the-art models struggled significantly, often scoring close to zero. The average confidence of CoT-augmented answers also increased by 30%, and the model became more efficient, using fewer tokens per problem while achieving higher accuracy.

Beyond LogicTree, the model also demonstrated a modest ability to generalize its enhanced logical reasoning skills to other benchmarks, including ZebraLogic, ProntoQA, AIME24, and MMLU-redux, indicating that the training fostered broader logical capabilities.


Looking Ahead

The DRER framework and LogicTree dataset offer a promising path toward making LLMs more reliable and interpretable in their reasoning. By directly integrating reasoning quality into the learning objective, models can develop more consistent and accurate chains of thought. While the current work focuses on deductive reasoning and a specific model size, future research aims to expand DRER to higher-order logic, explore more cost-effective reward approximations, and incorporate human feedback to further refine reasoning-aligned training.

For more technical details, you can refer to the full research paper here.

Ananya Rao (http://edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
