Monday, September 21, 2015

Analyzing Deep Q-Learning

Deep Q-Learning is regular Q-Learning with a slight twist. The best explanation of Q-Learning is at StudyWolf, which has four articles on reinforcement learning. I would suggest reading all of them; they are easy to understand and very hands-on. For those who don't want to read them, or who are already familiar with reinforcement learning, the basic definition is:
Q-Learning maintains a lookup table: every state (s) the agent encounters is stored in it. When the agent receives a reward (r), the table is updated: the value of the action (a) taken in that state is set to the reward. Then, as is common in reinforcement learning, the algorithm goes backward and updates the previous state/action values with the discounted reward (q[s-1] = r * discount). Note: if you're like me and learn from code, I've written a quick example at the bottom of this page showing a basic Q-Learning model.
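Here's the update described above in miniature (the state names and the two-action layout are made up purely for illustration):

```python
# A two-state, two-action Q-table. The agent takes action 1 in "s0",
# then action 0 in "s1" and receives a reward of 1.
discount = 0.9
qtable = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}

qtable["s1"][0] = 1.0                          # rewarded action stores r
qtable["s0"][1] = qtable["s1"][0] * discount   # previous step gets r * discount

print(qtable["s0"][1])  # 0.9
```

The discounted value keeps flowing backward like this on later visits, so states far from the reward eventually learn small but nonzero values.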

DQN changes the basic algorithm a little. Instead of keeping a table of states, we treat the network as the table. At each timestep we ask the network to compute values for all actions, and the agent chooses the action with the highest value, following an e-greedy policy: with probability epsilon it instead selects a random action. When it comes time to train, the only signal the network is given is the positive or negative reward; it is not fed manually discounted values as in temporal difference learning. For all states that don't have rewards, the network is asked to predict the values of the next state (S t+1); these predictions are multiplied by the discount and used as training targets. Code for this is on GitHub in my python-dqn project under handlers.experienceHandler.train_exp(). So the only signal ever actually given to the network is the positive or negative reward, nothing else.
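Here's a rough sketch of that selection and target computation. This is not my actual implementation: the network is replaced by a random stand-in, and the batch size, action count, and constants are just for illustration.

```python
import numpy as np

def q_values(states):
    """Stand-in for the network's forward pass: batch x 4 action values."""
    return np.random.rand(len(states), 4)

discount = 0.95
epsilon = 0.1

# e-greedy action selection for a single state
qs = q_values([0])[0]
if np.random.rand() < epsilon:
    action = np.random.randint(len(qs))  # explore: random action
else:
    action = int(np.argmax(qs))          # exploit: highest valued action

# Training targets for a batch of 3 transitions: states with a reward
# use the raw reward; the rest use the discounted max predicted value
# of the next state (S t+1).
rewards = np.array([0.0, 1.0, 0.0])  # the only real signal ever given
next_q = q_values([1, 2, 3])
targets = np.where(rewards != 0, rewards, discount * next_q.max(axis=1))
```

Note the targets for non-reward states are built entirely from the network's own predictions, which is part of why convergence takes so long.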

If you have a lot of time to train, this algorithm will hopefully converge on a good solution: memorizing when it will be rewarded and inferring which actions will lead to rewards in the future.

The issue is the 'a lot of time to train' part. DeepMind ran 5,000,000 updates to get the results in their paper. With one update every 4 frames, that's 20,000,000 frames!
My implementation is slow, running at about 20 FPS on a GTX 970 with an i7 (measured while the agent is losing most games quickly, so the average FPS would be lower); getting cuda_convnet working on Windows is still a problem I need to fix, which should give about a 20% speedup. So after roughly 23 days I could get an agent that plays Breakout better than a human. David Silver said it took them days to train, so my implementation may not be that bad.

The real question is: what are we missing? Why does it take so long? In the next post I'll discuss some ways I believe training can be improved and explore these issues in more detail.

import numpy as np
from collections import defaultdict

DISCOUNT = 0.9
N_ACTIONS = 4  # set this to the world's action count

# unseen states start with all-zero action values
qtable = defaultdict(lambda: np.zeros(N_ACTIONS))
prev = None  # (state, action) from the previous timestep

def update_q(state, action, rew):
    qtable[state][action] = rew
    if prev is not None:  # back up the discounted reward
        qtable[prev[0]][prev[1]] = rew * DISCOUNT

while world.is_not_goal_state():
    state = world.get_state()
    # the line below does not take into account an e-greedy policy
    action = np.argmax(qtable[state])
    world.take_action(action)
    rew = world.get_reward()
    if rew != 0:
        update_q(state, action, rew)
    prev = (state, action)
