Deep Q-Learning is regular Q-Learning with a slight twist. The best definition of Q-Learning I've found is at StudyWolf, which has four articles on reinforcement learning. I'd suggest reading all of them; they're easy to understand and very hands-on. For those who don't want to read them, or who are already familiar with reinforcement learning, the basic definition is:
Q-Learning is a lookup table: every state (s) the agent encounters is stored in this table. When the agent receives a reward (r), the table is updated, setting the value of the action (a) taken in that state to the reward. Then, as is common in reinforcement learning, the algorithm works backward and updates the previous state/action values with the discounted reward (q[s-1] = r * discount). Note: if you're like me and learn from code, I've written quick code at the bottom of this page to show a basic Q-learning model.
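To make the lookup-table idea concrete, here's a minimal sketch of tabular Q-learning with an ε-greedy action chooser. The state names, `NUM_ACTIONS`, and hyperparameter values are illustrative assumptions, not taken from any particular implementation:

```python
import random
from collections import defaultdict

NUM_ACTIONS = 4          # assumed action count for illustration
DISCOUNT = 0.9
LEARNING_RATE = 0.1

# The Q "table": maps a state to a list of action values, defaulting to zeros.
q_table = defaultdict(lambda: [0.0] * NUM_ACTIONS)

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    values = q_table[state]
    return values.index(max(values))

def update(state, action, reward, next_state):
    """One-step Q-learning update:
    move Q(s, a) toward r + discount * max_a' Q(s', a')."""
    best_next = max(q_table[next_state])
    target = reward + DISCOUNT * best_next
    q_table[state][action] += LEARNING_RATE * (target - q_table[state][action])
```

Running `update` repeatedly along an episode is what propagates a reward backward: the state just before the reward picks up a discounted share of it, the state before that picks up a share of *that*, and so on.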
DQN changes the basic algorithm a little. Instead of keeping a table of states, we treat the network as the table. At each timestep we ask the network to compute values for all actions, and the agent chooses the action with the highest value, according to an ε-greedy policy: with probability ε it selects a random action instead of the highest-valued one. When it comes time to train, the only signal the network is given is a positive or negative reward; it isn't fed manually discounted values as in temporal difference learning. For all the states without rewards, the network is asked to predict the reward values for the next state (S_t+1); those predictions are multiplied by the discount and used as training targets. Code for this is on GitHub in my python-dqn project under handlers.experienceHandler.train_exp(). So the signal actually given to the network is only the positive or negative reward, nothing else.
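The target-building step above can be sketched as a standalone function. This is a hedged, simplified version of what a DQN training step does, not the exact code from python-dqn; the function name, the array layout, and the `DISCOUNT` value are my own assumptions:

```python
import numpy as np

DISCOUNT = 0.95  # assumed discount factor for illustration

def build_targets(q_current, q_next, actions, rewards, terminals):
    """Build DQN training targets for one batch.

    q_current: (batch, num_actions) Q-values the network predicts for S_t
    q_next:    (batch, num_actions) Q-values the network predicts for S_t+1
    actions:   (batch,) action taken at each step
    rewards:   (batch,) reward received (DeepMind clipped these to -1, 0, +1)
    terminals: (batch,) True where the episode ended at this step
    """
    # Start from the network's own predictions so that only the taken
    # action's value produces a training error.
    targets = q_current.copy()
    # Terminal steps train on the raw reward alone; non-terminal steps
    # bootstrap from the discounted maximum predicted next-state value.
    bootstrapped = rewards + DISCOUNT * q_next.max(axis=1) * (~terminals)
    targets[np.arange(len(actions)), actions] = bootstrapped
    return targets
```

Note that the only external signal entering this computation is `rewards`; everything else is the network's own (initially wrong) prediction, which is why the values take so long to become accurate.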
If you have a lot of time to train, this algorithm will hopefully converge on a good solution, memorizing when it is rewarded and inferring which actions will get it rewarded in the future.
The issue is the "a lot of time to train" part. DeepMind ran 5,000,000 updates to get the results in their paper. With one update every 4 frames, that's 20,000,000 frames!
My implementation is slow, at about 20 FPS on a GTX 970 with an i7 (cuda_convnet on Windows is still a problem I need to fix, which should give about a 20% speedup). That rate is measured while the agent is losing most games quickly, so the average FPS over training would be lower. So after roughly 23 days I could get an agent that plays Breakout better than a human. David Silver said it took them days to train, so my implementation may not be that bad.
The real question is: what are we missing? Why does it take so long? In the next post I'll talk about some changes I believe will improve training, and explore these issues in further detail.