Q-Learning Agents

The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-learning agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

Q-learning agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Discrete
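
For example, observation and action specifications for a discrete environment might be created as in the following sketch; the element values (four states, four actions) are illustrative assumptions and are not tied to any particular environment.

    % Discrete observation space with four possible states (illustrative values)
    obsInfo = rlFiniteSetSpec([1 2 3 4]);

    % A continuous observation space would instead use a numeric specification,
    % for example a 4-element observation vector:
    % obsInfo = rlNumericSpec([4 1]);

    % Discrete action space with four possible actions (illustrative values)
    actInfo = rlFiniteSetSpec([1 2 3 4]);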

Q agents use the following critic representation.

Critic: Q-value function critic Q(S,A), which you create using rlQValueRepresentation

Actor: Q agents do not use an actor.

During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, it selects an action greedily with respect to the value function with probability 1-ϵ. This greedy action is the action for which the value function is greatest.
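
The selection rule itself can be summarized in a few lines of plain MATLAB. The function below only illustrates the rule described above and is not the toolbox implementation; qRow stands for the row of Q-values associated with the current observation.

    % Illustrative epsilon-greedy selection (not toolbox code).
    % qRow: vector of Q-values for the current observation, one entry per action.
    function a = selectEpsilonGreedyAction(qRow, epsilon)
        if rand < epsilon
            a = randi(numel(qRow));     % explore: pick a random action index
        else
            [~, a] = max(qRow);         % exploit: pick the greedy action index
        end
    end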

Critic Function

To estimate the value function, a Q-learning agent maintains a critic Q(S,A), which is a table or function approximator. The critic takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.

For more information on creating critics for value function approximation, see Create Policy and Value Function Representations.
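
For a discrete observation and action space such as the one sketched earlier, a table-based critic can be built from an rlTable object. The following is a minimal sketch; the LearnRate value is an illustrative assumption.

    % Table-based critic Q(S,A); obsInfo and actInfo are the specification
    % objects from the earlier sketch.
    qTable = rlTable(obsInfo, actInfo);
    critic = rlQValueRepresentation(qTable, obsInfo, actInfo, ...
        rlRepresentationOptions('LearnRate', 1));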

When training is complete, the trained value function approximator is stored in critic Q(S,A).

Agent Creation

To create a Q-learning agent (a worked sketch follows these steps):

  1. Create a critic using an rlQValueRepresentation object.

  2. Specify agent options using an rlQAgentOptions object.

  3. Create the agent using an rlQAgent object.
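
Putting the steps together, agent creation might look like the following sketch, which reuses the critic from the previous example; the exploration and discount values are illustrative assumptions, not recommendations.

    % 1. The critic was created in the previous example.

    % 2. Specify agent options (illustrative values).
    agentOpts = rlQAgentOptions;
    agentOpts.EpsilonGreedyExploration.Epsilon      = 0.9;
    agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
    agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.01;
    agentOpts.DiscountFactor = 0.99;

    % 3. Create the Q-learning agent from the critic and the options.
    agent = rlQAgent(critic, agentOpts);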

Training Algorithm

Q-learning agents use the following training algorithm. To configure the training algorithm, specify options using an rlQAgentOptions object. A plain-MATLAB sketch of the full loop follows the steps below.

  • Initialize the critic Q(S,A) with random values.

  • For each training episode:

    1. Set the initial observation S.

    2. Repeat the following for each step of the episode until S is a terminal state.

      1. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.

        A = argmax_A Q(S,A)

        To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.

      2. Execute action A. Observe the reward R and the next observation S'.

      3. If S' is a terminal state, set the value function target y to R. Otherwise, set it to

        y = R + γ max_A Q(S',A)

        To set the discount factor γ, use the DiscountFactor option.

      4. Compute the critic parameter update.

        ΔQ = y - Q(S,A)

      5. Update the critic using the learning rate α.

        Q(S,A) = Q(S,A) + α ΔQ

        Specify the learning rate when you create the critic representation by setting the LearnRate option in the rlRepresentationOptions object.

      6. Set the observation S to S'.
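
The loop above can be mirrored in a few dozen lines of plain MATLAB on a toy problem. The sketch below runs on a hypothetical 5-state chain with two actions; all names and numeric values are illustrative assumptions, and the code is independent of the toolbox.

    % Plain-MATLAB sketch of the tabular Q-learning loop on a toy 5-state chain.
    numStates  = 5;          % states 1..5, state 5 is terminal
    numActions = 2;          % 1 = move left, 2 = move right
    Q = rand(numStates, numActions);   % initialize critic Q(S,A) with random values

    alpha   = 0.1;           % learning rate
    gamma   = 0.99;          % discount factor
    epsilon = 0.1;           % exploration probability

    for episode = 1:500
        S = 1;                                   % initial observation
        while S ~= numStates                     % repeat until S is terminal
            % Epsilon-greedy action selection
            if rand < epsilon
                A = randi(numActions);
            else
                [~, A] = max(Q(S, :));
            end

            % Toy dynamics: move left or right along the chain,
            % reward 1 for reaching the terminal state, 0 otherwise
            if A == 2
                Snext = min(S + 1, numStates);
            else
                Snext = max(S - 1, 1);
            end
            R = double(Snext == numStates);

            % Value function target and critic update
            if Snext == numStates
                y = R;
            else
                y = R + gamma * max(Q(Snext, :));
            end
            Q(S, A) = Q(S, A) + alpha * (y - Q(S, A));

            S = Snext;                           % set observation S to S'
        end
    end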
