Skip to main content

Hopping Robots and Reinforcement Learning

submit to reddit

All too often, when we deal with data the outcome needed is a strategy or an algorithm itself. To arrive at that strategy we may have historic data or some model on how entities in system respond to various situations. In this write up, I'll go over the method of reinforcement learning. The general idea behind reinforcement learning is to come up with a strategy to maximize some measurable goal. For example, if you are modelling a robot that learns to navigate around obstacles, you want the learning process to come back with a strategy that minimizes collisions (say) with other entities in the environment.

Pattern Recognition and Machine Learning (Information Science and Statistics)

For the sake of simplicity, lets assume the following scenario. A robot is placed (at random) on flat plank of wood which has some sticky glue in the center. To its left there is a hole which damages the robot a bit and to its right is a reward which is its destination, as shown in the figure below

The robot has the following capabilities
  • It can sense what state \({S_1,S_2,...,S_7}\) it is in
  • It can detect a reward or damage that happens to it while in a particular state
  • It can move one space left or right
  • It can hop a space on to the right
If the robot ends up in either of the terminal states of \(S_1\) or \(S_7\) it is taken and placed in another state at random. State \(S_1\) does damage to the robot while state \(S_7\) rewards it. \(S_4\) is not exactly a damage causing state but it is an avoidable state and we want to learn that over time. All other states are neither harmful nor rewarding. The robot's goal is to "learn" to get to state \(S_7\) in as few steps as possible.

In order to get reinforcement learning to work, you need to know what the reward values are for each of the states the robot can be in. In this particular example, we will assume the following reward structure for each of the states.
Note, the numbers are fairly arbitrary. In addition to this we need a function or a table, mapping out actions/states pairs leading to new states. Given the robot's movement description above, we can use a table as follows
Zero is being used to designate the scenario when the robot is reset to a random state. With the above two sets of information we are good to start the learning process.

Reinforcement learning works in the following way. You maintain a matrix of values for each state/action pair. This is the reward that can be attained by arriving at a particular state by taking a particular action. But the trick is to also account for what possible future state you can get to, given that you arrive at a given state. For example it may be beneficial to be in a certain state \(X\), but all downstream states from \(X\) may be bad to be in. You want to avoid such states. This way, the algorithm tries to ensure that future favourable and achievable states are taken into account. The algorithm does not immediately update itself based on whatever it learns, it preserves old values and learns gradually. The above methodology can be stated as follows.

If \(Q^{*}(s_t,a_t)\) represents the pay off received by being in state then

Q^{*}(s_t,a_t) = R(s_{t+1},s_t) + \alpha max_{a_t}Q^{*}(s_{t+1},a_{t+1})

To slow down learning a bit, we stick with whatever prior estimate of \(Q^{*}(s_t,a_t)\) we have by a fraction of \(\beta\) as shown below

Q^{*}_{new}(s_t,a_t) = \beta Q^{*}_{prior}(s_t,a_t) + (1 - \beta)Q^{*}(s_t,a_t)
That's it! We now let the system take on various initial states, and let the device play around moving over to different states while we constantly update our \(Q\) matrix. After several iterations, the \(Q\) matrix will end with some values which reflect what strategy to take.

To give a swirl, here is an R code that walks through the entire process for this particular example

A thing to note in the code is how the reward function is encoded. There is a penalty imposed on moving from a higher state to a lower state. This is the simple way to ensure that whatever strategy it comes up with does not involve  going backwards from rightward positional gains that have been made. If you try it without the penalty, you will see cases where the strategy does not care going left or right in some states. The above code when run, creates the following output (screen grab below)

The rows represent the states, the columns represent the three possible actions move-left, move-right and hop-right that the robot can take. Notice what it's trying to say:
  • State 1, do nothing 
  • State 2, move right, but don't hop
  • State 3, don't move right, hop
  • State 4, move right or hop
  • State 5, move right (small chance), hop (definitely)
  • State 6, move right, move left (small chance)
This is exactly what you would expect and want! More on this subject can be read from the following books
Reinforcement Learning: State-of-the-Art (Adaptation, Learning, and Optimization)
Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)
If you are looking to buy some books on probability theory here is a good list to own.


Popular posts from this blog

The Best Books to Learn Probability

If you are looking to buy some books in probability here are some of the best books to learn the art of Probability

The Probability Tutoring Book: An Intuitive Course for Engineers and Scientists (and Everyone Else!)
A good book for graduate level classes: has some practice problems in them which is a good thing. But that doesn't make this book any less of buy for the beginner.

An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Edition
This is a two volume book and the first volume is what will likely interest a beginner because it covers discrete probability. The book tends to treat probability as a theory on its own

Discovering Statistics Using R
This is a good book if you are new to statistics & probability while simultaneously getting started with a programming language. The book supports R and is written in a casual humorous way making it an easy read. Great for beginners. Some of the data on the companion website could be missing.

Fifty Challenging Probl…

The Best Books for Time Series Analysis

If you are looking to learn time series analysis, the following are some of the best books in time series analysis.

Introductory Time Series with R (Use R!)
This is good book to get one started on time series. A nice aspect of this book is that it has examples in R and some of the data is part of standard R packages which makes good introductory material for learning the R language too. That said this is not exactly a graduate level book, and some of the data links in the book may not be valid.

A great book if you are in an economics stream or want to get into it. The nice thing in the book is it tries to bring out a oneness in all the methods used. Econ majors need to be up-to speed on the grounding mathematics for time series analysis to use this book. Outside of those prerequisites, this is one of the best books on econometrics and time series analysis.

Pattern Recognition and Machine Learning (Information Science and Statistics)
This is excelle…

Drawing Aces from a Deck

Q: What is the average number of cards you need to draw from a well shuffled deck of cards before you get an Ace?

The Moscow Puzzles: 359 Mathematical Recreations (Dover Recreational Math)

A:There is a lot of discussion on the web on using hypergeometric distributions for solving these kind of problems. The hypergeometric distribution is just a big word for something fairly simple. Here is the wikipedia explanation for it (link).
But an easier "smarter" way to solve this puzzle (along with ones that fit this framework) is to work with expectations and indicator random variables. Indicator random variables are like on/off switches. They are 1 under certain conditions, 0 otherwise.

Let us assume that the deck of cards (52 total) is as follows, it has 4 aces and all others are labelled X. Let \(Z_i\) represent the indicator variable for a card in position \(i\) with value set as 1 when all aces are in behind it, 0 otherwise.

The total number of draws will then be

$$ N = \sum_{i=1…