
Define Reward Signals

To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment. This signal measures the performance of the agent with respect to the task goals. In other words, for a given observation (state), the reward measures the effectiveness of taking a particular action. During training, an agent updates its policy based on the rewards received for different state-action combinations. For more information on the different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.

In general, you provide a positive reward to encourage certain agent actions and a negative reward (penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.

For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer training episodes while heavily discouraging episodes that fail. For an example that uses this approach, see Train DQN Agent to Balance Cart-Pole System.
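A minimal sketch of this strategy, written in MATLAB with a hypothetical isDone flag that is true when the task fails and purely illustrative reward values, might look like this:

    % Survival-style reward: small bonus per successful step, large penalty on failure.
    if isDone
        reward = -10;   % episode ended in failure
    else
        reward = 1;     % task performed successfully for another time step
    end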

If your reward function incorporates multiple signals, such as position, velocity, and control effort, you must consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
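For example, one way to keep any single term from dominating is to normalize each signal by its expected range before weighting it. The following sketch assumes position error x, velocity v, and control effort u are available from the environment; the ranges and weights are illustrative only:

    % Illustrative weighted reward combining position, velocity, and control effort.
    % The ranges and weights are assumptions you would tune for your system.
    xMax = 1;  vMax = 5;  uMax = 10;          % assumed signal ranges for normalization
    rPosition = -(x/xMax)^2;                  % x: position error
    rVelocity = -(v/vMax)^2;                  % v: velocity
    rEffort   = -(u/uMax)^2;                  % u: control effort
    reward = 1.0*rPosition + 0.1*rVelocity + 0.01*rEffort;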

You can specify either continuous or discrete reward signals. In either case, the reward signal must provide rich information when the action and observation signals change.

For applications where control system specifications like cost functions and constraints are already available, you can also generate reward functions from those specifications.

Continuous Rewards

A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.

An example of a continuous reward is the quadratic regulator (QR) cost function, where the long-term reward can be expressed as

J_i = -\left( s_\tau^T Q_\tau s_\tau + \sum_{j=i}^{\tau-1} \left( s_j^T Q_j s_j + a_j^T R_j a_j + 2 s_j^T N_j a_j \right) \right)

Here, Qτ, Q, R, and N are the weight matrices. Qτ is the terminal weight matrix, applied only at the end of the episode. Also, s is the observation vector, a is the action vector, and τ is the terminal iteration of the episode. The instantaneous reward for this cost function is

r_i = -\left( s_i^T Q_i s_i + a_i^T R_i a_i + 2 s_i^T N_i a_i \right)

This QR reward structure encourages driving s to zero with minimal action effort. A QR-based reward structure is a good choice for regulation or stationary point problems, such as pendulum swing-up or regulating the position of the double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Train DDPG Agent to Control Double Integrator System.

Smooth continuous rewards, such as the QR regulator, are good for fine-tuning parameters and can provide policies similar to optimal controllers (LQR/MPC).
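As a sketch, you can compute the instantaneous QR reward directly from the observation and action vectors. The weight matrices below are illustrative placeholders for a two-state, one-action system:

    % Illustrative QR reward. s is the observation column vector and a is the action.
    Q = diag([10 1]);   % penalize state deviation
    R = 0.1;            % penalize action effort
    N = zeros(2,1);     % no cross-term in this example
    reward = -(s'*Q*s + a'*R*a + 2*s'*N*a);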

Discrete Rewards

A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment. For example, an agent might receive a positive reward when it exceeds some target value or a penalty when it violates some performance constraint.

While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.
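For example, a sketch of such event-based terms, with illustrative thresholds and a hypothetical scalar position x and target position xTarget, is:

    % Event-based reward terms. Logical expressions evaluate to 0 or 1.
    nearTarget  = abs(x - xTarget) < 0.1;   % region-based bonus near the target
    outOfBounds = abs(x) > 10;              % penalty region to avoid
    reward = 10*nearTarget - 100*outOfBounds;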

Mixed Rewards

In many cases, providing a mixed reward signal that has a combination of continuous and discrete reward components is beneficial. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Flying Robot, the reward function has three components: r1, r2, and r3.

r_1 = 10 \left( \left( x_t^2 + y_t^2 + \theta_t^2 \right) < 0.5 \right)
r_2 = -100 \left( |x_t| \geq 20 \;\lor\; |y_t| \geq 20 \right)
r_3 = -\left( 0.2 \left( R_{t-1} + L_{t-1} \right)^2 + 0.3 \left( R_{t-1} - L_{t-1} \right)^2 + 0.03 x_t^2 + 0.03 y_t^2 + 0.02 \theta_t^2 \right)
r = r_1 + r_2 + r_3

Here:

  • r1 is a region-based continuous reward that applies only near the target location of the robot.

  • r2 is a discrete signal that provides a large penalty when the robot moves far from the target location.

  • r3 is a continuous QR penalty that applies for all robot states.
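A sketch implementing this mixed reward in MATLAB might look like the following, where xt, yt, and thetat are the robot pose at time t and Rtm1 and Ltm1 stand for the previous right and left thrust actions (the variable names are assumptions):

    % Mixed reward: region-based bonus (r1), discrete penalty (r2), QR penalty (r3).
    r1 = 10*((xt^2 + yt^2 + thetat^2) < 0.5);
    r2 = -100*(abs(xt) >= 20 || abs(yt) >= 20);
    r3 = -(0.2*(Rtm1 + Ltm1)^2 + 0.3*(Rtm1 - Ltm1)^2 + ...
           0.03*xt^2 + 0.03*yt^2 + 0.02*thetat^2);
    reward = r1 + r2 + r3;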

Reward Generation from Control Specifications

For applications where a working control system already exists, specifications such as cost functions or constraints might already be available. In these cases, you can use generateRewardFunction to generate a reward function, coded in MATLAB®, that can be used as a starting point for reward design. This function allows you to generate rewards from:

  • Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc (Model Predictive Control Toolbox) controller object. This feature requires Model Predictive Control Toolbox™ software.

  • Performance constraints defined in Simulink® Design Optimization™ model verification blocks.

In both cases, when constraints are violated, a negative reward is calculated using penalty functions such as exteriorPenalty (default), hyperbolicPenalty, or barrierPenalty.

Starting from the generated reward function, you can tune the cost and penalty weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.
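For example, a minimal sketch of this workflow, assuming Model Predictive Control Toolbox, an illustrative plant, and example penalty bounds (the generated file is what you would then edit and tune), might look like:

    % Generate a starting-point reward function from an MPC controller's
    % cost and constraint specifications.
    plant = tf(1,[1 1 1]);          % example plant model
    mpcobj = mpc(plant, 0.1);       % example controller with 0.1 s sample time
    generateRewardFunction(mpcobj)  % creates a MATLAB reward function you can edit

    % Penalty functions such as exteriorPenalty return a nonnegative value that
    % grows as the signal x leaves the interval [xmin, xmax]; negate it to use
    % it as a reward term. The bounds and method here are illustrative.
    x = 1.5;
    penalty = exteriorPenalty(x, -1, 1, "quadratic");
    reward = -penalty;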
