Deep Deterministic Policy Gradient Agents

The deep deterministic policy gradient (DDPG) algorithm is a model-free, online, off-policy reinforcement learning method. A DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy that maximizes the expected cumulative long-term reward.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

DDPG agents can be trained in environments with the following observation and action spaces.

Observation Space         Action Space
Continuous or discrete    Continuous

DDPG agents use the following actor and critic.

Critic: Q-value function critic Q(S,A), which you create using rlQValueFunction.

Actor: Deterministic policy actor π(S), which you create using rlContinuousDeterministicActor.

During training, a DDPG agent:

  • Updates the actor and critic properties at each time step during learning.

  • Stores past experiences using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.

  • Perturbs the action chosen by the policy using a stochastic noise model at each training step.
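
For example, the following sketch adjusts the exploration noise through the NoiseOptions property of an rlDDPGAgentOptions object. The values are illustrative, and the noise property names differ between toolbox releases (older releases use variance-based names):

    agentOpts = rlDDPGAgentOptions;
    agentOpts.NoiseOptions.StandardDeviation = 0.3;           % scale of the Ornstein-Uhlenbeck noise
    agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5; % gradually reduce exploration
    agentOpts.NoiseOptions.MeanAttractionConstant = 0.15;     % pull of the noise toward its mean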

Actor and Critic Functions

To estimate the policy and value function, a DDPG agent maintains four function approximators:

  • Actor π(S;θ) — The actor, with parameters θ, takes observation S and returns the corresponding action that maximizes the long-term reward.

  • Target actor πt(S;θt) — To improve the stability of the optimization, the agent periodically updates the target actor parameters θt using the latest actor parameter values.

  • Critic Q(S,A;ϕ) — The critic, with parameters ϕ, takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.

  • Target critic Qt(S,A;ϕt) — To improve the stability of the optimization, the agent periodically updates the target critic parameters ϕt using the latest critic parameter values.

Both Q(S,A;ϕ) and Qt(S,A;ϕt) have the same structure and parameterization, and both π(S;θ) and πt(S;θt) have the same structure and parameterization.

For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.

During training, the agent tunes the parameter values in θ. After training, the parameters remain at their tuned values and the trained actor function approximator is stored in π(S).
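
For example, after training you can extract the actor from the agent and inspect its tuned parameters. The following sketch assumes that agent is a trained rlDDPGAgent object and that obsInfo is the observation specification of its environment:

    actor = getActor(agent);                           % deterministic policy approximator π(S)
    params = getLearnableParameters(actor);            % tuned parameter values θ
    act = getAction(actor,{rand(obsInfo.Dimension)});  % evaluate the policy for a random observation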

Agent Creation

You can create and train DDPG agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.

At the command line, you can create a DDPG agent with a default actor and critic based on the observation and action specifications from the environment. To do so, perform the following steps (a minimal example follows the list).

  1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.

  2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.

  3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.

  4. If needed, specify agent options using an rlDDPGAgentOptions object.

  5. Create the agent using an rlDDPGAgent object.
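
The following sketch walks through these steps. The predefined double-integrator environment and the option values are placeholders; substitute your own environment and settings.

    % Example environment with continuous observations and actions (placeholder choice).
    env = rlPredefinedEnv("DoubleIntegrator-Continuous");

    % Steps 1-2: get observation and action specifications.
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Step 3 (optional): initialization options for the default networks.
    initOpts = rlAgentInitializationOptions("NumHiddenUnit",128);

    % Step 4 (optional): agent options.
    agentOpts = rlDDPGAgentOptions("SampleTime",0.1);

    % Step 5: create the agent with a default actor and critic.
    agent = rlDDPGAgent(obsInfo,actInfo,initOpts,agentOpts);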

Alternatively, you can create actor and critic objects and use them to create your agent (see the sketch after these steps). In this case, ensure that the input and output dimensions of the actor and critic match the corresponding action and observation specifications of the environment.

  1. Create an actor using an rlContinuousDeterministicActor object.

  2. Create a critic using an rlQValueFunction object.

  3. Specify agent options using an rlDDPGAgentOptions object.

  4. Create the agent using an rlDDPGAgent object.
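
The following sketch illustrates these steps. It assumes env is an existing environment with continuous observations and actions; the layer sizes and layer names are arbitrary illustrative choices, and recent toolbox releases may prefer dlnetwork objects over layer graphs.

    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Actor network: maps observations to continuous actions. A scalingLayer
    % may be needed to map the tanh output range onto your action limits.
    actorNet = [featureInputLayer(obsInfo.Dimension(1))
                fullyConnectedLayer(64)
                reluLayer
                fullyConnectedLayer(actInfo.Dimension(1))
                tanhLayer];
    actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

    % Critic network: observation and action paths merged into a scalar Q-value.
    obsPath = [featureInputLayer(obsInfo.Dimension(1),"Name","obsIn")
               fullyConnectedLayer(64)
               reluLayer("Name","obsOut")];
    actPath = [featureInputLayer(actInfo.Dimension(1),"Name","actIn")
               fullyConnectedLayer(64,"Name","actOut")];
    common  = [additionLayer(2,"Name","add")
               reluLayer
               fullyConnectedLayer(1)];
    criticNet = layerGraph(obsPath);
    criticNet = addLayers(criticNet,actPath);
    criticNet = addLayers(criticNet,common);
    criticNet = connectLayers(criticNet,"obsOut","add/in1");
    criticNet = connectLayers(criticNet,"actOut","add/in2");
    critic = rlQValueFunction(criticNet,obsInfo,actInfo, ...
        "ObservationInputNames","obsIn","ActionInputNames","actIn");

    % Specify agent options and create the agent from the actor and critic.
    agentOpts = rlDDPGAgentOptions;
    agent = rlDDPGAgent(actor,critic,agentOpts);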

For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.

Training Algorithm

DDPG agents use the following training algorithm, in which they update their actor and critic models at each time step [1]. To configure the training algorithm, specify options using an rlDDPGAgentOptions object.

  • Initialize the critic Q(S,A;ϕ) with random parameter values ϕ, and initialize the target critic parameters ϕt with the same values: ϕt = ϕ.

  • Initialize the actor π(S;θ) with random parameter values θ, and initialize the target actor parameters θt with the same values: θt = θ.

  • For each training time step:

    1. For the current observation S, select action A = π(S;θ) + N, where N is stochastic noise from the noise model. To configure the noise model, use the NoiseOptions option.

    2. Execute action A. Observe the reward R and next observation S'.

    3. Store the experience (S,A,R,S') in the experience buffer. The length of the experience buffer is specified in the ExperienceBufferLength property of the rlDDPGAgentOptions object.

    4. Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the MiniBatchSize property of the rlDDPGAgentOptions object.

    5. If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to

      y_i = R_i + \gamma \, Q_t\left(S_i', \pi_t(S_i'; \theta_t); \phi_t\right)

      The value function target is the sum of the experience reward Ri and the discounted future reward. To specify the discount factor γ, use the DiscountFactor option.

      To compute the cumulative reward, the agent first computes a next action by passing the next observation S'i from the sampled experience to the target actor. The agent finds the cumulative reward by passing the next action to the target critic.

    6. Update the critic parameters by minimizing the loss L across all sampled experiences.

      L = \frac{1}{M} \sum_{i=1}^{M} \left( y_i - Q(S_i, A_i; \phi) \right)^2

    7. Update the actor parameters using the following sampled policy gradient to maximize the expected discounted reward.

      \nabla_\theta J \approx \frac{1}{M} \sum_{i=1}^{M} G_{ai} \, G_{\pi i}

      G_{ai} = \nabla_A Q(S_i, A; \phi), \quad \text{where } A = \pi(S_i; \theta)

      G_{\pi i} = \nabla_\theta \pi(S_i; \theta)

      Here, Gai is the gradient of the critic output with respect to the action computed by the actor network, and Gπi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation Si.

    8. Update the target actor and critic parameters depending on the target update method. For more information, see Target Update Methods.

For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer you specify using the rlOptimizerOptions object assigned to the CriticOptimizerOptions and ActorOptimizerOptions properties.
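
For example, the following sketch sets the buffer, mini-batch, discount, and optimizer settings referenced above (all values are illustrative, and property availability can vary by toolbox release):

    agentOpts = rlDDPGAgentOptions( ...
        "ExperienceBufferLength",1e6, ...   % capacity of the circular experience buffer
        "MiniBatchSize",128, ...            % M, the number of sampled experiences per update
        "DiscountFactor",0.99);             % γ in the value function target
    agentOpts.CriticOptimizerOptions = rlOptimizerOptions("LearnRate",1e-3);
    agentOpts.ActorOptimizerOptions  = rlOptimizerOptions("LearnRate",1e-4);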

Target Update Methods

DDPG agents update their target actor and critic parameters using one of the following target update methods.

  • Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.

    \phi_t = \tau \phi + (1 - \tau) \phi_t \quad \text{(critic parameters)}

    \theta_t = \tau \theta + (1 - \tau) \theta_t \quad \text{(actor parameters)}

  • Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.

  • Periodic smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an rlDDPGAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

Update Method         TargetUpdateFrequency    TargetSmoothFactor
Smoothing (default)   1                        Less than 1
Periodic              Greater than 1           1
Periodic smoothing    Greater than 1           Less than 1
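
For example, a sketch configuring periodic target updates every 4 learning steps (the period is an illustrative choice):

    agentOpts = rlDDPGAgentOptions( ...
        "TargetUpdateFrequency",4, ...   % update the targets every 4 steps
        "TargetSmoothFactor",1);         % no smoothing (pure periodic update)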

References

[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.” ArXiv:1509.02971 [cs, stat], September 9, 2015. https://arxiv.org/abs/1509.02971.

See Also


Related Topics