
rlTRPOAgent

Trust region policy optimization (TRPO) reinforcement learning agent

Description

Trust region policy optimization (TRPO) is a model-free, online, on-policy, policy gradient reinforcement learning method. This algorithm prevents the significant performance drops that can occur with standard policy gradient methods by keeping the updated policy within a trust region close to the current policy. The action space can be either discrete or continuous.
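As a hedged sketch using standard TRPO notation (the advantage estimate A, the previous policy parameters θ_old, and the trust-region bound δ are not defined on this page), each policy update approximately solves the constrained problem:

$$
\max_{\theta}\; \mathbb{E}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\middle\|\,\pi_{\theta}(\cdot \mid s)\right) \right] \le \delta
$$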

For more information on TRPO agents, see Trust Region Policy Optimization Agents. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

Creation

Description

Create Agent from Observation and Action Specifications

example

agent = rlTRPOAgent(observationInfo,actionInfo) creates a trust region policy optimization (TRPO) agent for an environment with the given observation and action specifications, using default initialization options. The actor and critic in the agent use default deep neural networks built from the observation specification observationInfo and the action specification actionInfo. The ObservationInfo and ActionInfo properties of agent are set to the observationInfo and actionInfo input arguments, respectively.
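For instance, a minimal sketch of this syntax, assuming one of the predefined environments used later on this page:

% obtain specifications from a predefined environment (see the Examples section)
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% create a TRPO agent with default networks and default options
agent = rlTRPOAgent(obsInfo,actInfo);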

example

agent = rlTRPOAgent(observationInfo,actionInfo,initOpts) creates a TRPO agent for an environment with the given observation and action specifications. The agent uses default networks configured using options specified in the initOpts object. TRPO agents do not support recurrent neural networks. For more information on the initialization options, see rlAgentInitializationOptions.

Create Agent from Actor and Critic

example

agent = rlTRPOAgent(actor,critic) creates a TRPO agent with the specified actor and critic, using the default options for the agent.

Specify Agent Options

example

agent = rlTRPOAgent(___,agentOptions) creates a TRPO agent and sets the AgentOptions property to the agentOptions input argument. Use this syntax after any of the input arguments in the previous syntaxes.
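A minimal sketch of this syntax, assuming obsInfo and actInfo already exist (the DiscountFactor value here is only illustrative):

% create an options object, then pass it as the last input argument
agentOpts = rlTRPOAgentOptions('DiscountFactor',0.99);
agent = rlTRPOAgent(obsInfo,actInfo,agentOpts);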

Input Arguments


Agent initialization options, specified as an rlAgentInitializationOptions object.

TRPO agents do not support recurrent neural networks. Therefore, initOpts.UseRNN must be false.
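A minimal sketch of a valid initialization options object; UseRNN is false by default, so only nonrecurrent defaults are requested here (the NumHiddenUnit value is only illustrative):

initOpts = rlAgentInitializationOptions('NumHiddenUnit',64);
initOpts.UseRNN    % returns 0 (false), as required for TRPO agents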

Actor that implements the policy, specified as an rlDiscreteCategoricalActor or rlContinuousGaussianActor function approximator object. For more information on creating actor approximators, see Create Policies and Value Functions.

Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For more information on creating critic approximators, see Create Policies and Value Functions.

Properties


Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data type, and names of the observation signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data type, and names of the action signals.

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.

For a continuous action space, you must specify actionInfo as an rlNumericSpec object.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

Agent options, specified as an rlTRPOAgentOptions object.

Option to use the exploration policy when selecting actions, specified as one of the following logical values (a short usage sketch follows this list).

  • true — Use the base agent exploration policy when selecting actions.

  • false — Use the base agent greedy policy when selecting actions.
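A minimal sketch of toggling this property on an existing agent, assuming agent and obsInfo are already defined as in the Examples section:

% exploratory (stochastic) action selection is the default for TRPO agents
agent.UseExplorationPolicy = false;                  % switch to the greedy policy
act = getAction(agent,{rand(obsInfo.Dimension)});    % deterministic action
agent.UseExplorationPolicy = true;                   % restore exploration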

Sample time of the agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink® environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.

Within a MATLAB® environment, the agent is executed every time the environment advances. In this case, SampleTime is the time interval between consecutive elements in the output experience returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in the returned output experience reflects the timing of the event that triggers the agent execution.
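A minimal sketch showing how the sample time is set through the agent options and then read back from the agent, assuming obsInfo and actInfo already exist (the 0.1 s value is only illustrative):

agentOpts = rlTRPOAgentOptions('SampleTime',0.1);   % seconds of simulation time
agent = rlTRPOAgent(obsInfo,actInfo,agentOpts);
agent.SampleTime                                    % matches agentOpts.SampleTime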

Object Functions

train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within a specified environment
getAction Obtain action from agent or actor given environment observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Create function that evaluates trained policy of reinforcement learning agent

Examples


Create an environment with a discrete action space, and obtain its observation and action specifications. For this example, load the environment used in the example Create Agent Using Deep Network Designer and Train Using Image Observations. This environment has two observations: a 50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar with five possible elements (a torque of -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

% load predefined environment
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain the observation and action specifications for this environment.

obsInfo = getObservationInfo(env); actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. You can ensure reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TRPO agent from the environment observation and action specifications.

agent = rlTRPOAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
ans = 1x1 cell array {[-2]}

You can now test and train the agent within the environment.

Create an environment with a continuous action space, and obtain its observation and action specifications. For this example, load the environment used in the example Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation. This environment has two observations: a 50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar representing a torque ranging continuously from -2 to 2 Nm.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

Obtain the observation and action specifications for this environment.

obsInfo = getObservationInfo(env); actInfo = getActionInfo(env);

Create an agent initialization options object, specifying that each hidden fully connected layer in the network must have 128 neurons.

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

The agent creation function initializes the actor and critic networks randomly. You can ensure reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TRPO agent from the environment observation and action specifications using the specified initialization options.

agent = rlTRPOAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent)); criticNet = getModel(getCritic(agent));

You can verify that the networks have 128 units in their hidden fully connected layers. For example, display the layers of the critic network.

criticNet.Layers
ans =
  11x1 Layer array with layers:

     1   'concat'         Concatenation     Concatenation of 2 inputs along dimension 1
     2   'relu_body'      ReLU              ReLU
     3   'fc_body'        Fully Connected   128 fully connected layer
     4   'body_output'    ReLU              ReLU
     5   'input_1'        Image Input       50x50x1 images
     6   'conv_1'         Convolution       64 3x3x1 convolutions with stride [1 1] and padding [0 0 0 0]
     7   'relu_input_1'   ReLU              ReLU
     8   'fc_1'           Fully Connected   128 fully connected layer
     9   'input_2'        Feature Input     1 features
    10   'fc_2'           Fully Connected   128 fully connected layer
    11   'output'         Fully Connected   1 fully connected layer

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
ans = 1x1 cell array {[0.9228]}

You can now test and train the agent within the environment.

Create an environment interface, and obtain its observation and action specifications.

env = rlPredefinedEnv("CartPole-Discrete");obsInfo = getObservationInfo(env); actInfo = getActionInfo(env);

Create a deep neural network to be used as approximation model within the critic. For TRPO agents, the critic estimates a value function, therefore it must take the observation signal as input and return a scalar value.

criticNetwork = [
    featureInputLayer(prod(obsInfo.Dimension), ...
        'Normalization','none','Name','state')
    fullyConnectedLayer(1,'Name','CriticFC')];

Create the critic using criticNetwork. TRPO agents use an rlValueFunction object to implement the critic.

critic = rlValueFunction(criticNetwork,obsInfo);

Set some training options for the critic.

criticOpts = rlOptimizerOptions( ...
    'LearnRate',8e-3,'GradientThreshold',1);

Create a deep neural network to be used as approximation model within the actor. For TRPO agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In this case the network must take the observation signal as input and return a probability for each action. Therefore the output layer must have as many elements as the number of possible actions.

actorNetwork = [
    featureInputLayer(prod(obsInfo.Dimension), ...
        'Normalization','none','Name','state')
    fullyConnectedLayer(numel(actInfo.Elements), ...
        'Name','action')];

Create the actor using actorNetwork. TRPO agents use an rlDiscreteCategoricalActor object to implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);

Specify agent options, and create a TRPO agent using the environment, actor, and critic.

agentOpts = rlTRPOAgentOptions( ...
    'ExperienceHorizon',1024, ...
    'DiscountFactor',0.95, ...
    'CriticOptimizerOptions',criticOpts);
agent = rlTRPOAgent(actor,critic,agentOpts)
agent = rlTRPOAgent with properties: AgentOptions: [1x1 rl.option.rlTRPOAgentOptions] UseExplorationPolicy: 1 ObservationInfo: [1x1 rl.util.rlNumericSpec] ActionInfo: [1x1 rl.util.rlFiniteSetSpec] SampleTime: 1

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})
ans = 1x1 cell array {[-10]}

You can now test and train the agent within the environment.

Create an environment with a continuous action space, and obtain its observation and action specifications. For this example, load the double integrator continuous action space environment used in the example Train DDPG Agent to Control Double Integrator System. The observation from the environment is a vector containing the position and velocity of a mass. The action is a scalar representing a force applied to the mass, ranging continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");obsInfo = getObservationInfo(env)
obsInfo = rlNumericSpec with properties: LowerLimit: -Inf UpperLimit: Inf Name: "states" Description: "x, dx" Dimension: [2 1] DataType: "double"
actInfo = getActionInfo(env)
actInfo = rlNumericSpec with properties: LowerLimit: -Inf UpperLimit: Inf Name: "force" Description: [0x0 string] Dimension: [1 1] DataType: "double"

Since the action must be contained in a limited range, set the upper and lower limit of the action signal accordingly, so you can easily retrieve them when building the actor network.

actInfo.LowerLimit=-2; actInfo.UpperLimit=2;

The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the seed of the random generator.

rng(0)

Create a deep neural network to be used as approximation model within the critic. For TRPO agents, the critic estimates a value function, therefore it must take the observation signal as input and return a scalar value.

criticNet = [
    featureInputLayer(prod(obsInfo.Dimension), ...
        'Normalization','none','Name','state')
    fullyConnectedLayer(10,'Name','fc_in')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','out')];

Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the critic.

critic = rlValueFunction(criticNet,obsInfo);

Set some training options for the critic.

criticOpts = rlOptimizerOptions( ...
    'LearnRate',8e-3,'GradientThreshold',1);

Create a deep neural network to be used as approximation model within the actor. For TRPO agents, the actor executes a stochastic policy, which for continuous action spaces is implemented by a continuous Gaussian actor. In this case the network must take the observation signal as input and return both a mean value and a standard deviation value for each action. Therefore it must have two output layers (one for the mean values, the other for the standard deviation values), each having as many elements as the dimension of the action space.

Note that standard deviations must be nonnegative and mean values must fall within the range of the action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling layer, to scale the mean values to the output range.

% input path layer
inPath = [
    featureInputLayer(prod(obsInfo.Dimension), ...
        'Normalization','none','Name','state')
    fullyConnectedLayer(10,'Name','ip_fc')
    reluLayer('Name','ip_relu')
    fullyConnectedLayer(1,'Name','ip_out')];

% path layers for mean value
meanPath = [
    fullyConnectedLayer(15,'Name','mp_fc1')
    reluLayer('Name','mp_relu')
    fullyConnectedLayer(1,'Name','mp_fc2')
    tanhLayer('Name','tanh')
    scalingLayer('Name','mp_out', ...
        'Scale',actInfo.UpperLimit)];   % range: (-2N,2N)

% path layers for standard deviation
sdevPath = [
    fullyConnectedLayer(15,'Name','vp_fc1')
    reluLayer('Name','vp_relu')
    fullyConnectedLayer(1,'Name','vp_fc2')
    softplusLayer('Name','vp_out')];    % range: (0,+Inf)

% add layers to layerGraph network object
actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);

% connect layers
actorNet = connectLayers(actorNet,'ip_out','mp_fc1/in');
actorNet = connectLayers(actorNet,'ip_out','vp_fc1/in');

% plot network
plot(actorNet)

Figure contains an axes object. The axes object contains an object of type graphplot.

Create the actor using actorNet. TRPO agents use an rlContinuousGaussianActor object to implement the actor for continuous action spaces.

actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...
    'ActionMeanOutputNames','mp_out', ...
    'ActionStandardDeviationOutputNames','vp_out', ...
    'ObservationInputNames','state');

Specify agent options, and create a TRPO agent using the actor, critic and agent options.

agentOpts = rlTRPOAgentOptions( ...
    'ExperienceHorizon',1024, ...
    'DiscountFactor',0.95, ...
    'CriticOptimizerOptions',criticOpts);
agent = rlTRPOAgent(actor,critic,agentOpts)
agent = rlTRPOAgent with properties: AgentOptions: [1x1 rl.option.rlTRPOAgentOptions] UseExplorationPolicy: 1 ObservationInfo: [1x1 rl.util.rlNumericSpec] ActionInfo: [1x1 rl.util.rlNumericSpec] SampleTime: 1

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(2,1)})
ans = 1x1 cell array {[0.6668]}

You can now test and train the agent within the environment.

Tips

  • For continuous action spaces, this agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment (a short sketch follows this list).

  • Tuning the learning rate of the actor network is necessary for PPO agents, but is not necessary for TRPO agents.

  • For high-dimensional observations, such as for images, it is recommended to use PPO, SAC, or TD3 agents.
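Regarding the first tip, a minimal sketch of one way to enforce the action limits inside a custom environment step function; the variables action and actInfo here are hypothetical stand-ins for your own environment code:

% clip the received action to the range defined by the action specification
boundedAction = max(min(action,actInfo.UpperLimit),actInfo.LowerLimit);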

Version History

Introduced in R2021b