Article

# Understanding Deep Q-Learning

In this article, we discuss a common approach to reinforcement learning, which concerns controlling systems that change over time and broadly includes applications such as self-driving cars, robotics, and bots for games.

## Introduction

In any game, the player's goal is to maximize their score, which we refer to in this guide as the *reward*. To maximize reward, the player must be able to refine its decision-making abilities. Formally, a decision is the
process of looking at the game, or observing the game's state, and picking an action. Our decision-making function is called a *policy*. A policy accepts a state as input and "decides" on an action:

$$\pi: \text{state} \rightarrow \text{action}$$

This is the basis of reinforcement learning, a subfield within control theory, which concerns controlling systems that change over time and broadly includes applications such as self-driving cars, robotics, and bots for games.

## Q-Learning

To build such a function, we will start with a specific set of algorithms in reinforcement learning called *q-learning algorithms*. Consider the initial state of a game, which we will call `state0`

: your spaceship and
the aliens are all in their starting positions. Assume we have access to a magical "Q-table", which tells us how much reward each action will earn:

state | action | reward |
---|---|---|

state0 | shoot | 10 |

state0 | right | 3 |

state0 | left | 3 |

The `shoot`

action will maximize your reward, as it results in the reward with the highest value (10). As you can see, a Q-table provides a straightforward way to make decisions, based on the observed state:

$$\pi(\text{state}) = \text{look at Q-table, pick action with greatest reward}$$

However, most games have too many states to list in a table. Then, the q-learning agent learns a *Q-function* instead of a Q-table. We use this Q-function like we used the Q-table above. Rewriting the table as a function, we have
the following:

```
Q(state0, shoot) = 10
Q(state0, right) = 3
Q(state0, left) = 3
```

Given a particular state, it's easy for us to make a decision: we simply look at each possible action and its reward, then take the action that corresponds with the largest expected reward. More succinctly, we take the following:

$$\pi(\text{state}) = \text{argmax}_{\text{action}} Q(\text{state}, \text{action})$$

This function $\pi$ satisfies the requirements of a decision-making function: given a state in the game, it decides on an action. However, this solution depends on knowing `Q(state, action)`

for every state and action. To
estimate `Q(state, action)`

, use the following intuition: Say we observe an agent for 3 timesteps.

- The agent plays
`(state0, shoot)`

and obtains a reward of 10. - It repeats
`(state0, shoot)`

and obtains 15. - It repeats once more and obtains 5.
- Our goal is to produce an estimate of the "true reward",
`Q(state0, shoot)`

, using the three rewards--10, 15, 5. One natural estimate of`Q(state0, shoot)`

is then the average of all 3 values, making`Q(state0, shoot) = 10`

.

The average is reasonable, but our goal is to obtain a running estimate of `Q(state, action)`

. In short, a proper running estimate leads to more efficient training: A running estimate of `Q(state, action)`

will
yield a better agent during training. A better agent during training will lead to exploration concentrated around promising actions.

However, this running estimate does not have access to all rewards. Instead, it only has access to one reward value at a time. We will modify the algorithm to iteratively update a running estimate.

- Initialize
`Q(state0, shoot) = 0`

. - As before, observe the agent play
`(state0, shoot)`

for a reward of 10. - Now, take the average between the old estimate 0 and the current reward 10 for the new Q-value,
`Q(state0, shoot) = 5`

. - Again, observe the agent play
`(state0, shoot)`

for a reward of 15. - Now, take the average between 5 and 15 to obtain
`Q(state0, shoot) = 10`

.

This running average forms the basis of our approach to estimating Q-values, `Q(state, action)`

. Rewritten formally, the 5 steps above are based on the following equation.

$$Q(\text{state}, \text{action}) \leftarrow 0.5 Q(\text{state}, \text{action}) + 0.5 \text{reward}$$

Further modify this algorithm: Instead of taking an average between the reward and current Q-value, `Q(state, action)`

, take a weighted average. Above, effectively replace 0.5 with a parameter $\alpha$ that controls learning
rate. A higher value of $\alpha$ means Q-values depend more heavily on recent rewards.

$$Q(\text{state}, \text{action}) \leftarrow (1-\alpha) Q(\text{state}, \text{action}) + \alpha \text{reward}$$

To simplify the discussion below, define an additional variable to obtain an equivalent formulation of the above equation:

$$Q_\text{target} = \text{reward}$$ $$Q(\text{state}, \text{action}) \leftarrow (1-\alpha) Q(\text{state}, \text{action}) + \alpha Q_\text{target}$$

Consider a game with delayed rewards; for example, in SpaceInvaders, the player is rewarded when the alien is blown up and not when the player shoots. However, the player shooting is the true impetus for a reward. Somehow, the
Q-function must assign `(state0, shoot)`

a positive reward.

To propagate rewards through time, modify the definition of `Qtarget`

; we consider `Qtarget`

to be the reward for `(state, action)`

*and* the best possible reward for a second action ```
(state',
action')
```

. This is effectively a two-step lookahead. Modify the definition of `Qtarget`

above to obtain:

$$Q_\text{target} = \text{reward} + \max_\text{action'} Q(\text{state}', \text{action}')$$

Finally, if the goal state yields reward 100, neighboring states should not also incur a reward of 100. Neighbors of neighboring states should likewise not incur a reward of 100. Instead, neighboring states should incur a reward of 90,
and neighbors of neighboring states should incur a reward of 81 etc. In this way, the agent is incentivized to move closer and closer to the goal state. To effect this, we add a *discount factor* $\gamma$ between 0 and 1. In the
example in the previous sentence, $\gamma = 0.9$.

$$Q_\text{target} = \text{reward} + \gamma \max_\text{action'} Q(\text{state}', \text{action}')$$

To summarize, we will introduce some simplifying notation. For our current state, $s$, we pick an action, $a$. Once we have the reward and next state, we can update the Q-value associated with our current state and action. We will make the following abbreviations:

- $s$: the state at current time step
- $a$: the action taken at current time step
- $r$: the reward for current time step
- $s'$: the new state for next time step, given that we took action $a$
- $a'$: all possible actions
- $\alpha$: the learning rate
- $\gamma$: the discount factor, how much reward "degrades" as we propagate it

We'll start by taking the reward at this time step and then add a discounted best-next-state reward so that rewards can propagate through time, as discussed previously. The sum of these two values is our target q-value. In one sense, this is our "true" reward.

$$Q_{target} = r + \gamma \max_{a'} Q(s', a')$$

We then take a weighted average of both our old and new values when updating our Q-values.

$$Q(s, a) \leftarrow (1-\alpha)Q(s, a) + \alpha Q_{target}$$

## Deep Q-Learning

In *Deep* Reinforcement Learning, we use a neural network to approximate the Q function. Recall the update rule described earlier:

$$Q(s, a) \leftarrow (1-\alpha)Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a'))$$

In order to update the Q function, we need a neural network that approximates Q-values for our current state $s$, $Q(s, a)$. One way to accomplish this is to have a neural network accept both state and action as inputs and then output the Q-value. However, evaluating $max_{a'} Q(s', a')$ is CPU-intensive: if we have $n_a$ actions, we will need to run the neural network $n_a$ times. Instead, we create a neural network that accepts only the state, and outputs a vector of $n$ Q-values, one for each available action. Now, evaluating $max_{a'} Q(s', a')$ is much more efficient: we only need to evaluate the neural network once.

To do this, we'll apply one-hot encoding. *One-hot encoding* is a method of translating categories into vectors: say there are 3 classes. The first class is denoted using a vector `[1, 0, 0]`

, the second class
using `[0, 1, 0]`

, and the third using `[0, 0, 1]`

. We can encode the state in a vector such as these. If you'd like to learn more about one-hot encoding, we encourage you to read Adversarial
Examples in Computer Vision: How to Build then Fool an Emotion-Based Dog Filter.

However, another question remains: How will we obtain labels we can use to train the q-learning network? We will pass the current state into our neural network to obtain a vector of Q-values for each action. We'll take an action according to this by taking the action that maximizes our Q-value. Then, $Q_{target} = r + \gamma \max_{a'} Q(s', a')$. This value $Q_{target}$ is then the label we'll use to train the Q-network with.

Say we tuned the previous q-learning algorithm's model complexity and sample complexity perfectly, regardless of whether we picked a neural network or least squares method. As it turns out, this naive q-learning agent still performs poorly on more complex games, even with an exceedingly high number of training episodes. We will discuss two techniques that can improve performance. The original deep q-learning network (DQN) paper by DeepMind recognized two issues

- Correlated states: Take the state of our game at time 0, which we will call $s_0$. Say we update $Q(s_0, \cdot)$, according to the rules we derived above. Now, take the state at time 1, which we call $s_1$. We also update $Q(s_1,
\cdot)$ according to rules above. Note that the game's state at time 0 is very similar to its state at time 1. In SpaceInvaders, the aliens may have moved by one pixel each, for example. Said more succinctly, $s_0$ and $s_1$ are
very similar. We also expect $Q(s_0, \cdot)$ and $Q(s_1, \cdot)$ to be very similar, so updating one affects the other. This leads to fluctuating $Q$ values, as an update to $Q(s_0, \cdot)$ may in fact counter the update to $Q(s_1,
\cdot)$. More formally, $s_0$ and $s_1$ are
*correlated*. Since the $Q$ function is deterministic, $Q(s_1, \cdot)$ is correlated with $Q(s_0, \cdot)$. - Q-function instability: Recall that the $Q$ function is both the model we train and the source of our labels. Say that our labels are randomly-selected values that truly represent a
*distribution*$\mathcal{L}$. Every time we update $Q$, we change $\mathcal{L}$. This means that our model is trying to learn a moving target $\mathcal{L}$; this is an issue, as the models we use assume a fixed distribution.

The authors of this paper then contributed two solutions to combat correlated states and an unstable Q-function:

- To combat correlated states, we keep a
*replay buffer*, a list of states. Each time step, we add the game state that we observe to this replay buffer. We also randomly sample a subset of states from this list, and train on those states. - First, Mnih et. al duplicated $Q(s, a)$. One is called $Q_{current}(s, a)$, which is the Q-function you update. You need another Q-function for successor states, $Q_{target}(s', a')$, which you won't update. Recall $Q_{target}(s', a')$ is used to generate your labels. By separating $Q_{current}$ from $Q_{target}$ and fixing the latter, you fix the distribution your labels are sampled from. Then, your deep learning model can spend a short period learning this distribution. After a period of time, you then re-duplicate $Q_{current}$ for a new $Q_{target}$.

With these insights, deep q-learning saw improvements in many orders of magnitude on several benchmarks. How meaningful these benchmarks are and thus the improvements on these benchmarks are a debated topic.