TLDR: A new research paper introduces Stochastic Reward Machines (SRMs) and the SRMI algorithm, which enable reinforcement learning agents to effectively learn in environments with noisy and non-Markovian (history-dependent) reward functions. Unlike traditional reward machines that assume deterministic rewards, SRMs model rewards as probability distributions. The SRMI algorithm learns these SRMs efficiently using constraint solving, outperforming existing methods by directly addressing reward stochasticity without requiring costly trajectory re-sampling.
Reinforcement Learning (RL) has shown remarkable success in teaching agents to perform complex tasks. At its core, RL relies on a reward function that guides an agent’s learning process. Traditionally, these reward functions are assumed to be ‘Markovian,’ meaning the reward an agent receives depends only on its immediate state and action. However, many real-world tasks involve rewards that are ‘non-Markovian’ – they depend on a sequence of actions or events over time, not just the last step.
The Challenge of Non-Markovian Rewards
Consider a task where an agent needs to collect specific items in a certain order before delivering them to a marketplace. The final reward for delivery might depend on whether the items were collected correctly and in sequence. This is a classic non-Markovian scenario. To address this, ‘reward machines’ (RMs) were introduced. These are automaton-like structures that augment the environment’s state space, effectively remembering the temporal sequence of events and turning a non-Markovian reward into a Markovian one, allowing standard RL algorithms to be used more effectively.
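To make this concrete, here is a minimal sketch (in Python) of a reward machine for the collect-then-deliver task above. The event labels and reward values are illustrative assumptions, not details from the paper.

```python
# Minimal reward machine sketch for the "collect items, then deliver" task
# described above. States remember how far along the sequence the agent is;
# transitions fire on high-level events and emit deterministic rewards.
# The event labels and reward values are illustrative, not from the paper.

class RewardMachine:
    def __init__(self):
        # (current_state, event) -> (next_state, reward)
        self.delta = {
            ("u0", "got_item_A"): ("u1", 0.0),
            ("u1", "got_item_B"): ("u2", 0.0),
            ("u2", "delivered"):  ("u_done", 1.0),
        }
        self.state = "u0"

    def step(self, event):
        """Advance on an observed event; events with no transition self-loop."""
        next_state, reward = self.delta.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

# The agent learns over pairs (environment state, machine state), which makes
# the reward Markovian again from the agent's point of view.
rm = RewardMachine()
for event in ["got_item_A", "got_item_B", "delivered"]:
    print(rm.state, "--", event, "->", rm.step(event))
```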
The Problem with Noisy Rewards
While reward machines have been a powerful tool, they operate under an idealized assumption: rewards are deterministic and free of noise. In reality, rewards are often stochastic or noisy. For instance, in a ‘mining’ task, the value of ore might fluctuate due to purity or market prices. When faced with such noisy rewards, existing algorithms for inferring reward machines falter. They either fail to find a consistent reward machine or create excessively large, complex machines that ‘overfit’ the noise, making them impractical and difficult to interpret.
Introducing Stochastic Reward Machines (SRMs)
To overcome this significant limitation, researchers Jan Corazza, Ivan Gavran, and Daniel Neider have introduced a novel concept: ‘Stochastic Reward Machines’ (SRMs). Unlike traditional RMs that output a single real number as a reward, SRMs are designed to output ‘probability distributions’ over possible rewards. This fundamental change allows SRMs to naturally capture and model the inherent noise and stochasticity present in many real-world reward functions.
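As a rough sketch of the difference, the example below replaces the single reward on each transition with a reward distribution. Modeling each distribution as a mean plus a uniform noise bound is an illustrative assumption, not the paper's exact formalization.

```python
import random

# Sketch of a stochastic reward machine: each transition outputs a reward
# *distribution* rather than a single number. Here a distribution is modeled
# as (mean, noise_bound) with uniform noise -- one simple illustrative choice,
# not the paper's exact formalization.

class StochasticRewardMachine:
    def __init__(self):
        # (current_state, event) -> (next_state, mean_reward, noise_bound)
        self.delta = {
            ("u0", "mined_ore"): ("u1", 0.0, 0.0),
            ("u1", "sold_ore"):  ("u_done", 10.0, 2.0),  # ore price fluctuates
        }
        self.state = "u0"

    def step(self, event):
        next_state, mean, eps = self.delta.get(
            (self.state, event), (self.state, 0.0, 0.0)
        )
        self.state = next_state
        # Sample a reward from the transition's distribution.
        return mean + random.uniform(-eps, eps)

srm = StochasticRewardMachine()
print([round(srm.step(e), 2) for e in ["mined_ore", "sold_ore"]])
```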
The SRMI Algorithm: Learning from Noise
Alongside SRMs, the team developed a new algorithm called ‘Stochastic Reward Machine Inference’ (SRMI). This algorithm is designed to learn minimal SRMs from the explorations of an RL agent. SRMI works by iteratively refining its hypothesis about the underlying SRM. When the agent encounters a ‘counterexample’ – a sequence of actions and rewards that contradicts the current SRM hypothesis – SRMI updates its understanding. This update can involve either adjusting the parameters of the existing SRM or, if necessary, inferring an entirely new SRM structure using a sophisticated constraint-solving approach.
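At a high level, the learning loop can be pictured as the sketch below. The helper callbacks (run_rl_episode, is_consistent, adjust_parameters, infer_new_srm) are hypothetical stand-ins for the paper's components, so this is only an outline of the counterexample-driven structure, not the authors' implementation.

```python
# Simplified sketch of a counterexample-driven inference loop in the spirit
# of SRMI. All callbacks are hypothetical stand-ins for components described
# in the paper (RL episodes, consistency checking, parameter adjustment,
# constraint-based inference); this is not the authors' implementation.

def learn_srm(env, epsilon, max_episodes, initial_srm,
              run_rl_episode, is_consistent,
              adjust_parameters, infer_new_srm):
    hypothesis = initial_srm            # e.g. a trivial one-state SRM
    counterexamples = []

    for _ in range(max_episodes):
        # Run RL on the product of the environment and the current hypothesis.
        trace = run_rl_episode(env, hypothesis)

        # A counterexample is a trace whose observed rewards cannot be
        # explained by the hypothesis within the noise bound epsilon.
        if not is_consistent(hypothesis, trace, epsilon):
            counterexamples.append(trace)

            # First try to repair the hypothesis by adjusting its reward
            # distributions; only if that fails, solve constraints for a
            # new (minimal) machine structure.
            repaired = adjust_parameters(hypothesis, counterexamples, epsilon)
            if repaired is not None:
                hypothesis = repaired
            else:
                hypothesis = infer_new_srm(counterexamples, epsilon)

    return hypothesis
```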
A key advantage of SRMI over previous methods is its efficiency, particularly compared with a natural ‘baseline’ approach. The baseline tries to average out reward noise by re-sampling the same trajectories many times, which is costly and sometimes impossible in dynamic environments. SRMI, by contrast, builds ‘epsilon-consistency’ (a measure of how well the observed rewards fit the hypothesized reward distributions within a given noise bound) directly into its constraint-solving step, eliminating the need for extensive replaying of trajectories.
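One way to picture epsilon-consistency is the check below: a trace is epsilon-consistent with a hypothesis if every observed reward lies within epsilon of the reward the machine predicts at that step. The trace format and the SRM encoding here are assumptions made for illustration.

```python
# Illustrative epsilon-consistency check: a trace of (event, observed_reward)
# pairs is consistent with an SRM hypothesis if every observed reward lies
# within epsilon of the mean reward the machine assigns to that transition.
# The trace format and the SRM encoding are assumptions made for illustration.

def epsilon_consistent(srm_means, trace, epsilon):
    """srm_means maps (state, event) -> (next_state, mean_reward)."""
    state = "u0"
    for event, observed_reward in trace:
        next_state, mean = srm_means.get((state, event), (state, 0.0))
        if abs(observed_reward - mean) > epsilon:
            return False  # this trace is a counterexample to the hypothesis
        state = next_state
    return True

# Example: ore sells for around 10 with noise bound 2.
means = {("u0", "mined_ore"): ("u1", 0.0), ("u1", "sold_ore"): ("u2", 10.0)}
print(epsilon_consistent(means, [("mined_ore", 0.0), ("sold_ore", 11.3)], 2.0))  # True
print(epsilon_consistent(means, [("mined_ore", 0.0), ("sold_ore", 14.0)], 2.0))  # False
```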
Demonstrated Effectiveness
The effectiveness of SRMI was demonstrated through two case studies: a ‘Mining’ environment and a ‘Harvest’ environment. In both scenarios, SRMI significantly outperformed existing methods that do not explicitly deal with noise, as well as the baseline approach. Notably, SRMI also showed no performance penalty when applied to non-stochastic environments, indicating its robustness and versatility.
This research marks a significant step forward in reinforcement learning, providing a robust framework for agents to learn in environments where rewards are not only complex and history-dependent but also inherently noisy. By explicitly modeling stochasticity, SRMs and the SRMI algorithm pave the way for more practical and interpretable RL systems in real-world applications. You can read the full research paper here: Reinforcement Learning with Stochastic Reward Machines.