
Define Reward Signals

To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment. This signal measures the performance of the agent with respect to the task goals. In other words, for a given observation (state), the reward measures the effectiveness of taking a particular action. During training, an agent updates its policy based on the rewards received for different state-action combinations. For more information on the different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.

In general, you provide positive rewards to encourage certain agent actions and negative rewards (penalties) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.

For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer episodes while heavily discouraging episodes that fail. For an example that uses this approach, see Train DQN Agent to Balance Cart-Pole System.
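A minimal MATLAB sketch of this strategy is shown below; the angle threshold, step reward, and failure penalty are illustrative assumptions, not the values used in the cart-pole example.

% Illustrative per-step reward for a balancing task (threshold and values are assumed)
angleLimit = 0.26;            % rad, hypothetical failure threshold
isDone = abs(theta) > angleLimit;
if ~isDone
    reward = 1;               % small positive reward for every successful step
else
    reward = -100;            % large penalty when the episode fails
end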

If your reward function incorporates multiple signals, such as position, velocity, and control effort, you must consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
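For instance, a position error measured in meters and a control effort measured in newtons can differ by orders of magnitude. One way to balance such terms is to weight each contribution, as in this sketch; the weights and signal names are assumptions for illustration.

% Illustrative weighted combination of reward components (weights are assumed)
wPos    = 1;                  % position error contributes at full scale
wVel    = 0.1;                % velocity contributes less
wEffort = 0.01;               % control effort contributes least
reward  = -(wPos*posError^2 + wVel*velocity^2 + wEffort*controlEffort^2);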

You can specify either continuous or discrete reward signals. In either case, the reward signal must provide rich information as the action and observation signals change.

For applications where control system specifications such as cost functions and constraints are already available, you can also generate reward functions from those specifications.

Continuous Rewards

A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.

One example of a continuous reward is the quadratic regulator (QR) cost function, for which the long-term reward can be expressed as

$$J_i = -\left( s_{\tau}^{T} Q_{\tau} s_{\tau} + \sum_{j=i}^{\tau} \left( s_{j}^{T} Q_{j} s_{j} + a_{j}^{T} R_{j} a_{j} + 2\, s_{j}^{T} N_{j} a_{j} \right) \right)$$

Here, Qτ, Q, R, and N are the weight matrices. Qτ is the terminal weight matrix, applied only at the end of the episode. Also, s is the observation vector, a is the action vector, and τ is the terminal iteration of the episode. The instantaneous reward for this cost function is

$$r_i = -\left( s_{i}^{T} Q_{i} s_{i} + a_{i}^{T} R_{i} a_{i} + 2\, s_{i}^{T} N_{i} a_{i} \right)$$

This QR reward structure encourages driving s to zero with minimal action effort. QR-based reward structures are good choices for regulator or stationary-point problems, such as swinging up a pendulum or regulating the position of a double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Train DDPG Agent to Control Double Integrator System.

Smooth continuous rewards, such as the QR regulator, are good for fine-tuning parameters and can provide policies similar to optimal controllers (LQR/MPC).
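A minimal MATLAB sketch of the instantaneous QR reward above, assuming the observation vector s and action vector a are already defined; the weight values are illustrative assumptions.

% Instantaneous QR reward: r = -(s'*Q*s + a'*R*a + 2*s'*N*a)
Q = diag([1 0.1]);            % observation weights (assumed values)
R = 0.01;                     % action weight (assumed value)
N = zeros(2,1);               % cross-term weights (assumed values)
reward = -(s.'*Q*s + a.'*R*a + 2*s.'*N*a);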

Discrete Rewards

A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment—for example, when an agent receives a positive reward when it exceeds some target value or a penalty when it violates some performance constraint.

While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.
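A region-based bonus and penalty of this kind can be sketched as follows; the target location, region radius, and reward magnitudes are illustrative assumptions.

% Illustrative region-based reward terms (target, radius, and magnitudes are assumed)
reward = 0;
if norm(position - target) < 0.5
    reward = reward + 10;     % fixed bonus inside the target region
end
if any(abs(position) >= 20)
    reward = reward - 100;    % fixed penalty for entering a forbidden region
end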

Mixed Rewards

In many cases, providing a mixed reward signal that has a combination of continuous and discrete reward components is beneficial. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Flying Robot, the reward function has three components: r1, r2, and r3.

$$
\begin{aligned}
r_1 &= 10\left( \left( x_t^2 + y_t^2 + \theta_t^2 \right) < 0.5 \right) \\
r_2 &= -100\left( |x_t| \ge 20 \;\text{ or }\; |y_t| \ge 20 \right) \\
r_3 &= -\left( 0.2\left( R_{t-1} + L_{t-1} \right)^2 + 0.3\left( R_{t-1} - L_{t-1} \right)^2 + 0.03\, x_t^2 + 0.03\, y_t^2 + 0.02\, \theta_t^2 \right) \\
r &= r_1 + r_2 + r_3
\end{aligned}
$$

Here:

  • r1 is a region-based continuous reward that applies only near the target location of the robot.

  • r2 is a region-based penalty that applies when the robot moves far away from the target location.

  • r3 is a continuous QR penalty that applies to all robot states. A MATLAB sketch of this mixed reward follows the list.
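Translated directly into MATLAB, the mixed reward above can be sketched as follows; the variable names x, y, theta, Rprev, and Lprev are assumptions for illustration, not the names used in the example model.

% Mixed reward for the flying robot (variable names assumed for illustration)
r1 = 10*((x^2 + y^2 + theta^2) < 0.5);                     % region-based bonus near the target
r2 = -100*(abs(x) >= 20 || abs(y) >= 20);                  % penalty far from the target
r3 = -(0.2*(Rprev + Lprev)^2 + 0.3*(Rprev - Lprev)^2 ...
      + 0.03*x^2 + 0.03*y^2 + 0.02*theta^2);               % continuous QR penalty on all states
reward = r1 + r2 + r3;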

Reward Generation from Control Specifications

For applications where a working control system already exists, specifications such as a cost function or constraints might already be available. In these cases, you can use generateRewardFunction to generate a reward function, coded in MATLAB®, that can be used as a starting point for reward design. This function allows you to generate rewards from:

  • Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc (Model Predictive Control Toolbox) controller object. This feature requires Model Predictive Control Toolbox™ software.

  • Performance constraints defined in Simulink® Design Optimization™ model verification blocks.

In both cases, when constraints are violated, a negative reward is calculated using a penalty function, such as the exteriorPenalty (default), hyperbolicPenalty, or barrierPenalty functions.

Starting from the generated reward function, you can tune the cost and penalty weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.
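For example, starting from an existing MPC controller, the call below generates a reward function file that you can then edit and tune. This is a sketch assuming a controller object named mpcobj already exists; the optional arguments available depend on your toolbox version.

% Generate a reward function from MPC cost and constraint specifications
% (requires Model Predictive Control Toolbox; mpcobj is an existing mpc object)
generateRewardFunction(mpcobj)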

See Also

Functions

Related Topics