This example shows how to train multiple agents to collaboratively perform path-following control (PFC) for a vehicle. The goal of PFC is to make the ego vehicle travel at a set velocity while maintaining a safe distance from a lead car by controlling longitudinal acceleration and braking, and also to keep the vehicle travelling along the centerline of its lane by controlling the front steering angle. For more information on PFC, see Path Following Control System (//www.tianjin-qmedu.com/fr/fr/help/mpc/ref/pathfollowingcontrolsystem.html). An example that trains a single reinforcement learning agent to perform PFC is shown in Train DDPG Agent for Path-Following Control (//www.tianjin-qmedu.com/fr/fr/help/reinforcement-learning/ug/train-ddpg-agent-for-path-following-control.html).
Overview
In this example, you train two reinforcement learning agents: RL Agent1 provides continuous acceleration values for the longitudinal control loop, and RL Agent2 provides discrete steering angle values for the lateral control loop. The trained agents perform PFC through cooperative behavior and achieve satisfactory results.

Create Environment

The environment for this example includes a simple bicycle model for the ego car and a simple longitudinal model for the lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration and braking, while also keeping the ego car travelling along the centerline of its lane by controlling the front steering angle.
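The vehicle dynamics themselves are implemented inside the Simulink model. As a rough, illustrative sketch only (the state names, wheelbase value, and discrete-time form below are assumptions, not the shipped implementation), the two models are conceptually similar to the following.

% Illustrative sketch only: a kinematic bicycle model for the ego car and a
% point-mass longitudinal model for the lead car. All names and values are
% assumptions; the example's actual dynamics live inside the Simulink model.
function [egoNext,leadNext] = stepVehicles(ego,lead,accelEgo,steer,accelLead,Ts)
L = 3;  % assumed wheelbase (m)

% Ego car: kinematic bicycle model with states [x; y; heading; speed].
egoNext.x   = ego.x + Ts*ego.v*cos(ego.psi);
egoNext.y   = ego.y + Ts*ego.v*sin(ego.psi);
egoNext.psi = ego.psi + Ts*ego.v/L*tan(steer);
egoNext.v   = ego.v + Ts*accelEgo;

% Lead car: simple longitudinal (double-integrator) model.
leadNext.x = lead.x + Ts*lead.v;
leadNext.v = lead.v + Ts*accelLead;
end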
Load the environment parameters.

multiAgentPFCParams
Open the Simulink model.

mdl = "rlMultiAgentPFC";
open_system(mdl)

In this model, the two reinforcement learning agents (RL Agent1 and RL Agent2) provide the longitudinal acceleration and steering angle signals, respectively. The simulation terminates when any of the following conditions occur.

- |e_1| > 1 (the magnitude of the lateral deviation exceeds 1)
- V_ego < 0.5 (the longitudinal velocity of the ego car drops below 0.5)
- D_rel < 0 (the distance between the ego and lead car is below zero)

For the longitudinal controller (RL Agent1):

- The reference velocity for the ego car, V_ref, is determined from the safe distance: when the relative distance to the lead car is less than the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity; otherwise, it tracks the driver-set velocity.
- The observations from the environment contain the longitudinal measurements: the velocity error e_V = V_ref - V_ego, its integral, and the ego car longitudinal velocity.
- The action signal consists of continuous acceleration values between -3 and 2 m/s^2.
- The reward r_t, provided at every time step t, is

  r_t = -(100 e_V^2 + 500 u_{t-1}^2) × 1e-3 - 10 F_t + 2 M_t

  Here, u_{t-1} is the acceleration input from the previous time step, F_t = 1 if the simulation is terminated (otherwise F_t = 0), and M_t = 1 if e_V^2 < 1 (otherwise M_t = 0).

For the lateral controller (RL Agent2):

- The observations from the environment contain the lateral measurements: the lateral deviation e_1, the relative yaw angle e_2, their derivatives, and their integrals.
- The action signal consists of discrete steering angle values from -15 degrees (-0.2618 rad) to 15 degrees (0.2618 rad) in steps of 1 degree (0.0175 rad).
- The reward r_t, provided at every time step t, is

  r_t = -(100 e_1^2 + 500 u_{t-1}^2) × 1e-2 - 10 F_t + 2 H_t

  Here, u_{t-1} is the steering input from the previous time step, F_t = 1 if the simulation is terminated (otherwise F_t = 0), and H_t = 1 if e_1^2 < 0.01 (otherwise H_t = 0).

The logical terms in the reward functions (F_t, M_t, and H_t) penalize the agents if the simulation terminates early, while encouraging them to keep both the velocity error and the lateral deviation small.
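The following script-style sketch restates the reference-velocity logic and both reward expressions above in plain MATLAB. The parameter names and values (tgap, Ddefault, vSet) and the sample measurements are placeholder assumptions; in the example, these quantities are computed inside the Simulink model.

% Sketch of the reference-velocity logic and both reward signals.
tgap = 1.4; Ddefault = 10; vSet = 28;     % assumed parameter values
vEgo = 25; vLead = 24; Drel = 30;         % sample measurements

% Reference velocity: track the lead car when closer than the safe distance.
Dsafe = tgap*vEgo + Ddefault;
if Drel < Dsafe
    vRef = min(vLead,vSet);   % track the slower of lead and set velocity
else
    vRef = vSet;              % track the driver-set velocity
end

% Longitudinal reward (RL Agent1).
eV = vRef - vEgo; uAccel = 0.5; F = 0;    % velocity error, last action, termination flag
M  = double(eV^2 < 1);
r1 = -(100*eV^2 + 500*uAccel^2)*1e-3 - 10*F + 2*M;

% Lateral reward (RL Agent2).
e1 = 0.05; uSteer = 0.01;                 % lateral deviation, last steering action
H  = double(e1^2 < 0.01);
r2 = -(100*e1^2 + 500*uSteer^2)*1e-2 - 10*F + 2*H;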
Create the observation and action specifications for the longitudinal control loop.

obsInfo1 = rlNumericSpec([3 1]);
actInfo1 = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
Create the observation and action specifications for the lateral control loop.

obsInfo2 = rlNumericSpec([6 1]);
actInfo2 = rlFiniteSetSpec((-15:15)*pi/180);
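As a quick check (not part of the original example), you can confirm that this specification defines 31 discrete steering angles spanning -15 to 15 degrees.

% Inspect the discrete lateral action space.
numel(actInfo2.Elements)                                    % 31 actions
rad2deg([min(actInfo2.Elements) max(actInfo2.Elements)])    % -15 and 15 degrees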
Combine the observation and action specifications as a cell array.

obsInfo = {obsInfo1,obsInfo2};
actInfo = {actInfo1,actInfo2};
Create a Simulink environment interface, specifying the block paths for both agent blocks. The order of the block paths must match the order of the observation and action specification cell arrays.

blks = mdl + ["/RL Agent1", "/RL Agent2"];
env = rlSimulinkEnv(mdl,blks,obsInfo,actInfo);
Specify a reset function for the environment using the ResetFcn property. The pfcResetFcn function randomly sets the initial poses of the lead and ego vehicles at the beginning of every episode during training.

env.ResetFcn = @pfcResetFcn;
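The pfcResetFcn helper is provided with the example files. A minimal sketch of what such a reset function can look like, with assumed randomization ranges, is shown below.

function in = pfcResetFcn(in)
% Sketch of a reset function: randomize the initial conditions of the lead
% and ego vehicles on each training episode. The ranges are assumptions.
in = setVariable(in,'x0_lead',40 + randi(60,1,1));    % lead-car position (m)
in = setVariable(in,'e1_initial',0.5*(-1 + 2*rand));  % lateral deviation (m)
in = setVariable(in,'e2_initial',0.1*(-1 + 2*rand));  % relative yaw angle (rad)
end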
Create Agents

For this example you create two reinforcement learning agents. First, fix the random seed for reproducibility.
rng(0)

Both agents operate at the same sample time in this example. Set the sample time value, in seconds.
Ts = 0.1;
Longitudinal Control
The agent for the longitudinal control loop is a DDPG agent. A DDPG agent approximates the long-term reward, given observations and actions, using a critic value function representation, and selects actions using an actor policy representation. For more information on creating deep neural network value function and policy representations, see Create Policy and Value Function Representations (//www.tianjin-qmedu.com/fr/fr/help/reinforcement-learning/ug/create-policy-and-value-function-representations.html).

Use the createACCAgent helper function to create a DDPG agent for longitudinal control.

agent1 = createACCAgent(obsInfo1,actInfo1,Ts);
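createACCAgent is a helper function provided with the example. As a hedged sketch only, a comparable helper could construct a default-network DDPG agent as follows; the option values are assumptions, not the shipped settings.

function agent = createACCAgent(obsInfo,actInfo,Ts)
% Sketch: build a DDPG agent with default actor and critic networks.
agent = rlDDPGAgent(obsInfo,actInfo);
agent.AgentOptions.SampleTime = Ts;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.MiniBatchSize = 64;
agent.AgentOptions.ExperienceBufferLength = 1e6;
end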
Lateral Control
The agent for the lateral control loop is a DQN agent. A DQN agent approximates the long-term reward, given observations and actions, using a critic value function representation.

Use the createLKAAgent helper function to create a DQN agent for lateral control.

agent2 = createLKAAgent(obsInfo2,actInfo2,Ts);
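createLKAAgent is likewise provided with the example. A comparable sketch for the DQN agent, again with assumed option values:

function agent = createLKAAgent(obsInfo,actInfo,Ts)
% Sketch: build a DQN agent with a default critic network.
agent = rlDQNAgent(obsInfo,actInfo);
agent.AgentOptions.SampleTime = Ts;
agent.AgentOptions.UseDoubleDQN = true;
agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;
end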
Train Agents
Specify the training options. For this example, use the following options.

- Run the training for at most 5000 episodes, with each episode lasting at most maxsteps time steps.
- Display the training progress in the Episode Manager dialog box (set the Plots option).
- Stop training the DDPG and DQN agents when their average rewards exceed 480 and 1195, respectively. When one agent reaches its stop criterion, it simulates its own policy without learning while the other agent continues training.

Tf = 60; % simulation time
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',[480,1195]);
Train the agents using the train function (//www.tianjin-qmedu.com/fr/fr/help/reinforcement-learning/ref/rl.agent.rlqagent.train.html). By default, this example loads pretrained agents (doTraining is set to false). To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agents.
    trainingStats = train([agent1,agent2],env,trainingOpts);
else
    % Load pretrained agents for the example.
    load('rlPFCAgents.mat')
end

The following figure shows a snapshot of the training progress for the two agents.
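If you train the agents yourself, train returns one result per agent, so you can also plot each agent's average-reward curve afterwards (a usage sketch; the available result fields depend on your release).

% Plot the average-reward curves for both agents after training.
figure
plot(trainingStats(1).EpisodeIndex,trainingStats(1).AverageReward)
hold on
plot(trainingStats(2).EpisodeIndex,trainingStats(2).AverageReward)
legend('RL Agent1 (DDPG)','RL Agent2 (DQN)')
xlabel('Episode'); ylabel('Average reward')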
Simulate Agents
To validate the performance of the trained agents, simulate the agents within the Simulink environment by uncommenting the following commands. For more information on agent simulation, see rlSimulationOptions (//www.tianjin-qmedu.com/fr/fr/help/reinforcement-learning/ref/rlsimulationoptions.html) and sim (//www.tianjin-qmedu.com/fr/fr/help/reinforcement-learning/ref/rl.env.abstractenv.sim.html).

% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,[agent1,agent2],simOptions);

To demonstrate the trained agents with deterministic initial conditions, simulate the model in Simulink.

e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)

The following plots show the results when the lead car is 70 m ahead of the ego car at the beginning of simulation.

The lead car changes speed from 24 m/s to 30 m/s periodically (top-right plot). The ego car maintains a safe distance throughout the simulation (bottom-right plot).

From 0 to 30 seconds, the ego car tracks the set velocity (top-right plot) and experiences some acceleration (top-left plot). After that, the acceleration is reduced to 0.

The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is greatly reduced within 1 second and remains below 0.1 m thereafter.
See Also