This example shows how to convert the PI controller in the <code class="literal">watertank</code> Simulink® model to a reinforcement learning deep deterministic policy gradient (DDPG) agent. For an example that trains a DDPG agent in MATLAB®, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ug/train-ddpg-agent-to-balance-double-integrator-system.html" class="a">Train DDPG Agent to Control Double Integrator System</a>.</p>
The original model for this example is the water tank model. The goal is to control the level of the water in the tank. For more information about the water tank model, see <a href="//www.tianjin-qmedu.com/nl/help/slcontrol/gs/watertank-simulink-model.html" class="a">watertank Simulink Model</a> <span>(Simulink Control Design)</span>.</p>
Modify the original model by making the following changes:</p>
Delete the PID Controller.</p> Insert the RL Agent block.</p> Connect the observation vector <span class="inlineequation">[∫e dt, e, h]</span>, where <span class="inlineequation">h</span> is the height of the tank, <span class="inlineequation">e = r - h</span>, and <span class="inlineequation">r</span> is the reference height.</p> Set up the reward <span class="inlineequation">reward = 10(|e| < 0.1) - 1(|e| >= 0.1) - 100(h <= 0 || h >= 20)</span>.</p> Configure the termination signal so that the simulation stops if <span class="inlineequation">h <= 0</span> or <span class="inlineequation">h >= 20</span>.</p> The resulting model is <code class="literal">rlwatertank.slx</code>. For more information on this model and the changes, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ug/create-simulink-environments-for-reinforcement-learning.html" class="a">Create Simulink Environments for Reinforcement Learning</a>.</p>
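The reward and termination rules above can be sketched in Python; this is a minimal illustration, the function name and scalar signature are hypothetical, and the coefficients follow the reward formula stated above:

```python
def step_signals(h, r):
    """Reward and stop signals for the water tank, as described above.

    h is the measured tank height and r is the reference height.
    """
    e = r - h  # tracking error
    # reward = 10 when |e| < 0.1, otherwise -1, with an extra -100
    # penalty when the height leaves the allowed band
    out_of_band = h <= 0 or h >= 20
    reward = (10 if abs(e) < 0.1 else -1) - (100 if out_of_band else 0)
    is_done = out_of_band  # the episode terminates outside 0 < h < 20
    return reward, is_done
```

For example, a height of 10.05 against a reference of 10 earns the +10 tracking bonus, while any height outside the 0 to 20 band earns the -100 penalty and ends the episode.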
Creating an environment model includes defining the following:</p>
Action and observation signals that the agent uses to interact with the environment. For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rl.util.rlnumericspec.html" class="a"><code class="literal">rlNumericSpec</code></a> and <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rl.util.rlfinitesetspec.html" class="a"><code class="literal">rlFiniteSetSpec</code></a>.</p> Reward signal that the agent uses to measure its success. For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ug/define-reward-signals.html" class="a">Define Reward Signals</a>.</p>
Build the environment interface object.</p>
Set a custom reset function that randomizes the reference values for the model.</p>
Specify the simulation time <code class="literal">Tf</code> and the agent sample time <code class="literal">Ts</code> in seconds.</p>
Fix the random generator seed for reproducibility.</p>
Given observations and actions, a DDPG agent approximates the long-term reward using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the observation and the action, and one output. For more information on creating a deep neural network value function representation, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ug/create-policy-and-value-function-representations.html" class="a">Create Policy and Value Function Representations</a>.</p>
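As a cross-check on the two-input, one-output shape described above, here is a minimal NumPy sketch (not the toolbox implementation) of a critic whose observation and action paths are merged by elementwise addition before a scalar output. The layer sizes mirror the example's network; the random weights are placeholders and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def make_critic():
    # state path: 3 -> 50 -> ReLU -> 25; action path: 1 -> 25;
    # merged by addition, then ReLU and a single output unit
    W = {
        "state_fc1": rng.standard_normal((50, 3)) * 0.1,
        "state_fc2": rng.standard_normal((25, 50)) * 0.1,
        "action_fc1": rng.standard_normal((25, 1)) * 0.1,
        "out": rng.standard_normal((1, 25)) * 0.1,
    }
    def q_value(obs, act):
        s = W["state_fc2"] @ relu(W["state_fc1"] @ obs)  # state path
        a = W["action_fc1"] @ act                        # action path
        return (W["out"] @ relu(s + a)).item()           # addition layer -> ReLU -> scalar Q
    return q_value

q = make_critic()
```

Calling `q` with a 3-element observation and a 1-element action returns a single scalar Q-value estimate.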
View the critic network configuration.</p>
Given observations, a DDPG agent decides which action to take using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.</p>
To train the agent, first specify the training options. For this example, use the following options:</p>
Run each training for at most <code class="literal">5000</code> episodes. Specify that each episode lasts for at most <code class="literal">ceil(Tf/Ts)</code> (that is, <code class="literal">200</code>) time steps.</p> Display the training progress in the Episode Manager dialog box (set the <code class="literal">Plots</code> option) and disable the command-line display (set the <code class="literal">Verbose</code> option to <code class="literal">false</code>).</p> Stop training when the agent receives an average cumulative reward greater than <code class="literal">800</code> over <code class="literal">20</code> consecutive episodes. At that point, the agent can control the level of water in the tank.</p> Validate the learned agent against the model by simulation.</p>
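The stop-training rule in the bullets above (average cumulative reward greater than 800 over 20 consecutive episodes) can be sketched as follows; the helper name and list-based interface are illustrative, not part of the toolbox API:

```python
def should_stop(episode_rewards, window=20, target=800):
    """True once the mean reward over the last `window` episodes
    exceeds `target`, mirroring the ScoreAveragingWindowLength and
    StopTrainingValue training options."""
    if len(episode_rewards) < window:
        return False  # not enough episodes yet to fill the window
    return sum(episode_rewards[-window:]) / window > target
```

Only the most recent window of episodes matters, so early poor episodes do not prevent stopping once performance improves.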
open_system(<span style="color:#A020F0">'rlwatertank'</span>)</pre>
Create Environment Interface</h3>
Define the observation specification <code class="literal">obsInfo</code> and the action specification <code class="literal">actInfo</code>.</p>
obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0]',...
    'UpperLimit',[inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);</pre>
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',...
    obsInfo,actInfo);</pre>
env.ResetFcn = @(in)localResetFcn(in);</pre>
Ts = 1.0;
Tf = 200;</pre>
rng(0)</pre>
Create DDPG Agent</h3>
statePath = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(25,'Name','CriticStateFC2')];
actionPath = [
    featureInputLayer(numActions,'Normalization','none','Name','Action')
    fullyConnectedLayer(25,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');</pre>
figure
plot(criticNetwork)</pre>
Specify options for the critic representation using <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rlrepresentationoptions.html" class="a"><code class="literal">rlRepresentationOptions</code></a>.</p>
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);</pre>
Create the critic representation using the specified deep neural network and options. You must also specify the action and observation specifications for the critic, which you obtain from the environment interface. For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rlqvaluerepresentation.html" class="a"><code class="literal">rlQValueRepresentation</code></a>.</p>
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},criticOpts);</pre>
Construct the actor in a similar manner to the critic. For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rldeterministicactorrepresentation.html" class="a"><code class="literal">rlDeterministicActorRepresentation</code></a>.</p>
actorNetwork = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(3,'Name','actorFC')
    tanhLayer('Name','actorTanh')
    fullyConnectedLayer(numActions,'Name','Action')];

actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);</pre>
To create the DDPG agent, first specify the DDPG agent options using <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rlddpgagentoptions.html" class="a"><code class="literal">rlDDPGAgentOptions</code></a>.</p>
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1.0,...
    'MiniBatchSize',64,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;</pre>
Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rlddpgagent.html" class="a"><code class="literal">rlDDPGAgent</code></a>.</p>
agent = rlDDPGAgent(actor,critic,agentOpts);</pre>
Train Agent</h3>
For more information, see <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rltrainingoptions.html" class="a"><code class="literal">rlTrainingOptions</code></a>.</p>
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',800);</pre>
Train the agent using the <a href="//www.tianjin-qmedu.com/nl/help/reinforcement-learning/ref/rl.agent.rlqagent.train.html" class="a"><code class="literal">train</code></a> function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting <code class="literal">doTraining</code> to <code class="literal">false</code>. To train the agent yourself, set <code class="literal">doTraining</code> to <code class="literal">true</code>.</p>
doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('WaterTankDDPG.mat','agent')
end</pre>
Validate Trained Agent</h3>
simOpts = rlSimulationOptions('MaxSteps',maxsteps,'StopOnError','on');
experiences = sim(env,agent,simOpts);</pre>
Local Function</h3>
function in = localResetFcn(in)

% randomize reference signal
blk = sprintf('rlwatertank/Desired \nWater Level');
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,'Value',num2str(h));

% randomize initial height
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
blk = 'rlwatertank/Water-Tank System/H';
in = setBlockParameter(in,blk,'InitialCondition',num2str(h));

end</pre>
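The randomization pattern in <code class="literal">localResetFcn</code> above, drawing h = 3*randn + 10 and redrawing while the value falls outside the open interval (0, 20), is plain rejection sampling. A Python sketch for comparison (the function name is illustrative):

```python
import random

def sample_height(rng=random):
    """Draw h = 3*randn + 10, redrawing while h is outside the open
    interval (0, 20), as the while-loops in localResetFcn do."""
    h = 3 * rng.gauss(0, 1) + 10
    while h <= 0 or h >= 20:
        h = 3 * rng.gauss(0, 1) + 10
    return h

heights = [sample_height() for _ in range(1000)]
```

Every accepted sample is guaranteed to lie strictly between 0 and 20, matching the environment's termination band.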