Sunday, May 15, 2016

The James-Stein Estimator

This write-up is about an estimator. A statistical estimator is used when you already have a model in mind and data at hand, and want to estimate the parameters the model needs. For example, suppose you want to predict how many runs a batter will make in a game given recent history \(x_1,x_2,x_3,\ldots\). If we assume (we are making a choice of model now) that the scores come from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then the probability density function for a given value \(x\) is

\[ f(x \mid \mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \]

The likelihood that a series of points \(x_1,x_2,\ldots,x_n\) comes from such a distribution can be expressed as

\[ L(\mu,\sigma \mid x_1,\ldots,x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}} \]

Next is basic calculus. Taking the logarithm on both sides and setting the partial derivative w.r.t. \(\mu\) to zero yields (omitting the algebra)

\[ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

To verify, you also need to check that the second derivative is negative, to ensure that what you have found is indeed a maximum.

So a simple estimator would be to use the average runs scored over the past \(n\) days. The average is the result of the "maximum likelihood" approach briefly described above. What if you had 5 players whose averages you wanted to estimate? Intuition would tell you to simply compute the average for each. Wrong! The James-Stein approach provides a way to pool the data such that the overall error made in estimation is minimized.

Specifically, I'll demonstrate the James-Stein estimator. It is a surprising result on estimating such parameters, discovered by Charles Stein and later formalized together with Willard James. What makes it stunning is the rather counter-intuitive result it demonstrates: if you want to simultaneously estimate a set of independent parameters from data, the maximum likelihood approach is optimal only when you have fewer than three parameters to estimate. If you have three or more, you are better off using the James-Stein estimator. There is an obvious thought experiment here. If you really wanted just one parameter, could you not simply add an unrelated set of numbers, improve the efficiency of the estimation, and then use only the parameter you are interested in? There is a flaw in that reasoning. The estimator works by minimizing the overall error in estimation, so it may make bigger errors on some of the variables and smaller ones on others. Overall, though, it will do better than an individual maximum likelihood estimate on each variable. So you should use it only if you don't mind slightly bigger errors on some variables and smaller ones on others, in exchange for a lower overall error.

To see this work, I'll step through the actual algorithm in R. For more details you can refer to the Wikipedia entry here. The general approach is the following.

  • \(\hat\theta_{js}\) is the James-Stein estimate of the parameters we are interested in computing.
  • \(y\) is the set of values observed.
  • \(\hat{\sigma}^{2}\) is the estimated variance for all parameters.
  • \(v\) is the mean of all parameters.
then the James-Stein estimator, for \(m\) parameters, is given by

\[ \hat{\theta}_{js} = v + \Big(1 - \frac{(m-2)\,\hat{\sigma}^{2}}{\sum_{i}(y_i - v)^{2}}\Big)\,(y - v) \]

# The data.table and ggplot2 packages are used below
library(data.table)
library(ggplot2)
# n.rows is the number of samples we have
n.rows = 100
# n.params is the number of parameters
# we want to estimate
n.params = 30
# wins will hold the number of times
# the JS estimator beats the MLE estimate
wins   = c()
msejs  = c()
msemle = c()
for(iter in 1:1000){
    # Create a sample of parameters
    # They have a range of 20-25
    x.act  = sample(20:25,size=n.params,replace=TRUE)
    # Now create a normal distribution for each of the parameters
    # uncomment below if you want to test it for a poisson distribution
    # m = mapply(rpois,mean=x.act,MoreArgs=list(n = n.rows))
    m = mapply(rnorm,mean=x.act,MoreArgs=list(n = n.rows,sd=10))
    # Find the global mean
    mbar = mean(colMeans(m))
    # Find the column means
    mu0   = colMeans(m)
    # Find global variance
    s2 = var(as.vector(m))/(n.rows*n.params)
    # Compute the adjustment value
    cval    = 1 - ((n.params - 2)*s2/sum((mu0 - mbar)^2))
    # Compute the JS estimate for your parameters
    jsest = mbar + cval *(mu0 - mbar)
    z = data.table(
        actual = x.act,
        mle.est = mu0,
        js.est = jsest)
    # Check to see if the JS estimate is better than MLE
    z[,counter := ifelse(abs(js.est - actual) < abs(mle.est - actual),1,0)]
    # In case you want to see what the numbers are for the
    # difference between absolute and actual estimates for
    # JS and MLE
    z[,jserr  := abs(js.est - actual)]
    z[,mleerr := abs(mle.est - actual)]
    # Record the wins for this iteration of the simulation
    # Repeat.
    wins[iter]   = sum(z$counter)
    msejs[iter]  = sum(z$jserr)
    msemle[iter] = sum(z$mleerr)
}
# What is the distribution of wins across the simulation runs?
quantile(wins, probs = seq(0.1,0.9,by=0.1))
z = data.frame(wins = wins)
p = ggplot(z,aes(wins)) +
    geom_histogram(fill='lightblue') +
    theme_bw() +
    ggtitle('Of the 30 parameters we wish to estimate:\n how many of them have estimates closer to the actual using the James-Stein estimator than the MLE?') +
    ylab('') +
    xlab('') +
    geom_vline(xintercept = mean(wins), linetype = 'longdash', colour = 'blue') +
    annotate('text', x = 18, y = 150, label = 'Mean -->')
print(p)

The above R code runs a simulation of sorts. It starts by making some random parameters you would want to estimate and simulates normally distributed data from them. Next, it uses the James-Stein estimator to estimate the very parameters it started off with, using the data. Finally, it compares and records how often the James-Stein estimate was better (closer to the actual value) than the MLE estimate. The results from the simulation are shown below.

For kicks, you can try it for a Poisson distribution too; here is what the distribution looks like.

If you are interested in learning more about probability, here is a list of books that are good to buy.

Friday, November 28, 2014

Bayesian Cognitive Bias

I chanced on this excellent puzzle on the net that tends to reveal a cognitive bias in our heads against Bayesian reasoning. The puzzle statement is quite simple

You are given four cards. Each card has a letter on one side and a number on the other. You are told the statement "If there is a D on one side, there is a 5 on the other side". Which two cards would you flip over to validate the statement?

The original article is here, think hard before you click through for an answer :)

Tuesday, September 2, 2014

Maximizing Chances in an Unfair Game


Q: You are about to play a game wherein you flip a biased coin. The coin falls heads with probability \(p\) and tails with probability \(1 - p\), where \(p \le \frac{1}{2}\). You are forced to play by selecting heads, so the game is biased against you. For every toss you make, your opponent gets to toss too. The winner of this game is the one who wins the most tosses. You, however, get to choose the number of rounds that get played. Can you ever hope to win?


A: At first look, it might appear that the odds are stacked against you, since you are forced to play by choosing heads. You would think that your chances of winning decrease as you play more and more. But, surprisingly, there is a way to choose the optimal number of tosses (remember, you get to choose the number of times this game is played). To see how, let's crank out some numbers. If you get to toss the coin \(n\) times, then the total number of coin tosses you and your opponent make is \(2n\). Out of the \(2n\) tosses, the probability that exactly \(y\) come up heads is

\[ P(y) = {2n \choose y} p^{y}(1 - p)^{2n - y} \]

In order to win, the value of \(y\) should run from \(n + 1\) to \(2n\), and the overall probability of winning works out to

\[ P(\text{Win}) = \sum_{y = n + 1}^{2n}{2n \choose y} p^{y}(1 - p)^{2n - y} \]
We can work out the probability of winning for various values of \(p\) and \(n\) and chart the results. Here is the R code that does it.
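
The sketch below captures that computation; the grid of \(p\) and \(n\) values is an arbitrary choice, and the variable names z and z1 follow the description in the next paragraph.

library(data.table)
library(ggplot2)
# Probability of winning a majority of the 2n tosses when P(heads) = p
win.prob = function(n, p) 1 - pbinom(n, size = 2 * n, prob = p)
# z holds the win probability over a grid of bias values and round counts
z = CJ(p = seq(0.30, 0.49, by = 0.01), n = 1:200)
z[, prob.win := win.prob(n, p)]
# z1 holds, for each bias value, the number of rounds that maximizes the win probability
z1 = z[, .(best.n = n[which.max(prob.win)],
           best.prob = max(prob.win)), by = p]
# Win probability versus number of rounds for a few bias values
p1 = ggplot(z[round(p, 2) %in% c(0.35, 0.40, 0.45, 0.48)],
            aes(n, prob.win, colour = factor(p))) +
    geom_line() +
    theme_bw() +
    xlab('Number of rounds n') +
    ylab('Probability of winning')
print(p1)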

The code runs pretty quickly and uses the data.table package. All the processed data is contained in variables z and z1. They are plotted using the ggplot package to generate the following charts for the strategy.

The first chart shows the variation of the probability of winning by the number of games played for various probability bias values.

The next chart shows the optimal number of games to play for a given bias probability value.

Some good books to own for learning probability are listed here.
Yet another fascinating area of probability is Monte Carlo methods. Here is a list of good books to own to learn Monte Carlo methods.

Tuesday, August 12, 2014

The Best Books for Monte Carlo Methods

The following are some of the best books to own to learn Monte Carlo methods for sampling and estimation problems

Monte Carlo Methods in Statistics (Springer)
This is a good book which discusses Bayesian methods from both a practical and a theoretical point of view, with the accompanying integrations. The explanations given are fairly comprehensive. There are also a fair number of examples in this text. Overall, this is an excellent book to own if you want to understand Monte Carlo sampling methods and algorithms at an intermediate to graduate level.

Explorations in Monte Carlo Methods (Undergraduate Texts in Mathematics)
This is a good book to own to get you started on Monte Carlo methods. It starts with fairly simple and basic examples and illustrations, and the mathematics used is also fairly basic. Buy this if you are at an undergraduate level and want to get into using Monte Carlo methods but have only a basic knowledge of statistics and probability.

Monte Carlo Methods in Financial Engineering (Stochastic Modelling and Applied Probability) (v. 53)
Another excellent book to own if you are curious to learn about the methods of Monte Carlo in the finance industry. Some really nice areas covered in the book include variance reduction techniques, diffusion equations, change point detection, option pricing methods etc. Ideal for students of financial engineering or those wanting to break into it. The book does tend to overrate MC methods (well, it's a book on MC!).

Monte Carlo Simulation and Resampling Methods for Social Science
This book gives a good introduction and goes over some basic probability theory, statistics and distributions before it hops on to the Monte Carlo methods. This makes it a good introductory book for sampling methods. Recommended for undergraduates with minimal statistical background.

Simulation and Monte Carlo Method

An excellent book to own at the intermediate to graduate level. The text provides a good course in simulation and Monte Carlo methods. Some interesting topics covered in the text include rare-event simulation. The book assumes you have a background in statistics and probability theory.

Sunday, July 6, 2014

Embarrassing Questions, German Tanks and Estimations


Q: You are conducting a survey and want to ask an embarrassing yes/no question to subjects. The subjects wouldn't answer that embarrassing question honestly unless they are guaranteed complete anonymity. How would you conduct the survey?


A: One way to do this is to hand the subject a fair coin and ask them to toss it in private. If it comes up heads, they answer the question truthfully; otherwise they toss the coin a second time and record that result (heads = yes, tails = no). With some simple algebra you can estimate the proportion of subjects who would have answered the question with a yes.

Assume the total population surveyed is \(X\). Let \(Y\) subjects have answered with a "yes". Let \(p\) be the sought-after proportion. The tree diagram below shows the flow of responses.

The total expected number of "yes" responses can be estimated as

\[ \frac{pX}{2} + \frac{X}{4} = Y \]

which on simplification yields

\[ p = \Big(\frac{4Y}{X} - 1\Big)\frac{1}{2} \]
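
As a quick sanity check, here is a small simulation of the scheme; the population size of 10,000 subjects and the true proportion of 0.3 are made-up numbers.

set.seed(1)
X = 10000                              # number of subjects surveyed (assumed)
p.true = 0.3                           # true proportion of "yes" (assumed, unknown in practice)
truth       = rbinom(X, 1, p.true)     # each subject's honest answer
first.toss  = rbinom(X, 1, 0.5)        # 1 = heads: answer honestly
second.toss = rbinom(X, 1, 0.5)        # 1 = heads: recorded as a "yes"
response = ifelse(first.toss == 1, truth, second.toss)
Y = sum(response)                      # observed number of "yes" responses
(4 * Y / X - 1) / 2                    # estimate of p, close to p.true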


Q: A bag contains an unknown number of tiles numbered in serial order \(1,2,3,...,n\). You draw \(k\) tiles from the bag without replacement and find the maximum number etched on them to be \(m\). What is your estimate of the number of tiles in the bag?

A: This and problems like it are called German Tank problems. During WW-II, German tanks were numbered in sequential order as they were manufactured. Allied forces needed an estimate of how many tanks were deployed, and all they had were a handful of captured tanks and the serial numbers painted on them. Using this, statisticians estimated the actual number of tanks to be far lower than what intelligence estimates had led them to believe. So how does it work?

Let us assume we draw a sample of size \(k\) from a population of \(N\) tiles, and the maximum in that sample is \(m\). The probability that the sample maximum equals \(m\) is

\[ P(\text{Sample Max} = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

The \(-1\) terms appear because the maximum itself is set aside, leaving \(m - 1\) values to choose the remaining \(k - 1\) from. The expected value of the sample maximum is thus

\[ E(\text{Maximum}) = \sum_{m=k}^{N} m\,\frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

Note, we run the above summation from \(k\) to \(N\) because the sample maximum has to be at least \(k\), so the probability is \(0\) for \(m < k\). After a series of algebraic manipulations ( ref ) the above works out to \(\frac{k(N+1)}{k+1}\); equating it to the observed maximum \(m\) and solving for \(N\) gives the estimate

\[ \hat{N} = m\Big( 1 + \frac{1}{k}\Big) - 1 \]

which is quite an ingenious and simple way to estimate a population size given serial number ordering.
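
A quick simulation (with a made-up population of 500 tanks and samples of 10 serial numbers) shows the estimate lands close to the true count on average.

set.seed(2)
N = 500                                   # true number of tanks (assumed)
k = 10                                    # number of captured serial numbers (assumed)
est = replicate(5000, {
    m = max(sample(1:N, k))               # sample maximum from k captured tanks
    m * (1 + 1 / k) - 1                   # the estimator derived above
})
mean(est)                                 # averages out close to N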

If you are looking to buy some books on probability theory here is a good list.

Sunday, June 1, 2014

The Chakravala Algorithm in R


A class of problems that has piqued the interest of mathematicians across millennia is Diophantine equations. Diophantine equations are polynomial equations in several variables for which integer solutions are sought. A special case of Diophantine equations is Pell's equation. The name is a bit of a misnomer, as Euler mistakenly attributed it to the mathematician John Pell. The problem seeks integer solutions to the polynomial

\[ x^{2} - Dy^{2} = 1 \]
Several ancient mathematicians attempted to study and find generic solutions to Pell's equation. The best known algorithm is the Chakravala algorithm, described by Bhaskara circa 1114 AD. Bhaskara implicitly credits Brahmagupta (circa 598 AD) for its initial discovery, though some credit it to Jayadeva too. Several Sanskrit words used to describe the algorithm appear to have changed over the roughly 500 years between the two, implying other contributors along the way. The Chakravala technique is simple, and implementing it in any programming language should be a breeze (credit citation).


The method works as follows. Find a trivial solution to the equation; \(x=1, y=0\) works every time. Next, initialize a pair of parameters \([p_i,k_i]\), where \(i\) is an iteration count. \(p_i\) is updated to \(p_{i+1}\) if the following two criteria are satisfied:
  • \((p_i + p_{i+1}) \bmod k_{i} = 0\), i.e. \(p_i + p_{i+1}\) is divisible by \(k_i\)
  • \(| p_{i+1}^{2} - d |\) is minimized
After updating \(p_{i+1}\), \(k_{i+1}\) is found by evaluating

\[ k_{i+1} = \frac{p_{i+1}^{2} - d}{k_{i}} \]

and the next pair of values for \([x,y]\) is computed as

\[ x_{i+1} = \frac{p_{i+1}x_i + dy_{i}}{|k_{i}|} \]

\[ y_{i+1} = \frac{p_{i+1}y_i + x_{i}}{|k_{i}|} \]

The algorithm also has an easy way to check when a valid solution has been found: it accepts a pair \([x_i,y_i]\) only when \(k_{i} = 1\).

A screen grab of the entire algorithm done in R is shown below.
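
The listing below is a minimal sketch of the method just described; the search range for \(p_{i+1}\) and the iteration cap are arbitrary implementation choices.

# Solve x^2 - d*y^2 = 1 for a non-square d using the Chakravala method
chakravala = function(d, max.iter = 100) {
    x = 1; y = 0          # the trivial solution
    k = 1; p = 1          # initial [p, k]; k = x^2 - d*y^2 = 1
    for (i in seq_len(max.iter)) {
        # pick the next p: (p + p.next) divisible by k, |p.next^2 - d| minimal
        p.next = NA
        best   = Inf
        for (cand in 1:(ceiling(sqrt(d)) + abs(k))) {
            if ((p + cand) %% abs(k) == 0 && abs(cand^2 - d) < best) {
                best   = abs(cand^2 - d)
                p.next = cand
            }
        }
        # update x, y and k using the composition rules above
        x.new = (p.next * x + d * y) / abs(k)
        y.new = (p.next * y + x) / abs(k)
        k     = (p.next^2 - d) / k
        x = x.new; y = y.new; p = p.next
        if (k == 1) return(c(x = x, y = y))   # accept only when k equals 1
    }
    NA
}
chakravala(61)    # x = 1766319049, y = 226153980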

A Related Puzzle:
A drawer has \(x\) black socks and \(y\) white socks. You draw two socks consecutively and they are both black. You repeat this several times (replacing the socks each time) and find that you get a pair of black socks with probability \(\frac{1}{2}\). You know that there are no more than 30 socks in the drawer in total. How many black and white socks are there?

The probability that you would draw two black socks in a row is

\[ P = \frac{x}{x+y}\times\frac{x - 1}{x+y - 1} = \frac{1}{2} \]

Simplifying and solving for \(x\) yields

\[ x^{2} - (2y + 1)x + y - y^2 = 0 \]

which on further simplification gives

\[ x = \frac{2y + 1 \pm \sqrt{(2y+1)^2 +4(y^2 - y)}}{2} \]

We can ignore the root with the negative sign, as it would yield a non-positive value for \(x\), which is impossible. The positive root of the quadratic equation yields

\[ x = \frac{2y + 1 + \sqrt{8y^2 + 1}}{2} \]

For \(x\) to be an integer, the term \(\sqrt{8y^2 + 1}\) has to be an odd integer, \(z\) (say). We can now write it out as

\[ z = \sqrt{8y^2 + 1} \]

\[ z^{2} - 8y^2 = 1 \]
This is Pell's equation (or Vargaprakriti in Sanskrit).
As we know that there are no more than 30 socks in the drawer, we can quickly work our way to the two admissible solutions to the problem: \(\{x,y\} = \{3,1\}\) and \(\{15,6\}\).
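
A quick brute-force check over all drawer compositions with at most 30 socks confirms these are the only two solutions.

sol = expand.grid(x = 2:30, y = 0:28)                  # x black, y white socks
sol = subset(sol, x + y <= 30 &
                  2 * x * (x - 1) == (x + y) * (x + y - 1))
sol    # x = 3, y = 1  and  x = 15, y = 6
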
If you are looking to buy books on probability theory here is a good list to own.
If you are looking to buy books on time series analysis here is an excellent list to own.

Wednesday, May 7, 2014

Hopping Robots and Reinforcement Learning


All too often when we deal with data, the outcome needed is a strategy or an algorithm itself. To arrive at that strategy we may have historic data, or some model of how entities in the system respond to various situations. In this write-up, I'll go over the method of reinforcement learning. The general idea behind reinforcement learning is to come up with a strategy that maximizes some measurable goal. For example, if you are modelling a robot that learns to navigate around obstacles, you want the learning process to come back with a strategy that minimizes collisions (say) with other entities in the environment.


For the sake of simplicity, let's assume the following scenario. A robot is placed (at random) on a flat plank of wood which has some sticky glue in the center. To its left there is a hole which damages the robot a bit, and to its right is a reward which is its destination, as shown in the figure below.

The robot has the following capabilities
  • It can sense which of the states \(S_1,S_2,\ldots,S_7\) it is in
  • It can detect a reward or damage that happens to it while in a particular state
  • It can move one space left or right
  • It can hop a space on to the right
If the robot ends up in either of the terminal states \(S_1\) or \(S_7\), it is picked up and placed in another state at random. State \(S_1\) damages the robot while state \(S_7\) rewards it. \(S_4\) is not exactly a damage-causing state, but it is a state we want the robot to learn to avoid over time. All other states are neither harmful nor rewarding. The robot's goal is to "learn" to get to state \(S_7\) in as few steps as possible.

In order to get reinforcement learning to work, you need to know what the reward value is for each of the states the robot can be in. In this particular example, we will assume the following reward structure for each of the states.
Note, the numbers are fairly arbitrary. In addition to this we need a function, or a table, mapping each state/action pair to the new state it leads to. Given the robot's movement description above, we can use a table as follows.
Zero is used to designate the scenario where the robot is reset to a random state. With the above two sets of information we are good to start the learning process.

Reinforcement learning works in the following way. You maintain a matrix of values for each state/action pair: the reward that can be attained by arriving at a particular state through a particular action. The trick is to also account for the possible future states you can reach, given that you arrive at a particular state. For example, it may be beneficial to be in a certain state \(X\), but if all downstream states from \(X\) are bad to be in, you want to avoid such states. This way, the algorithm tries to ensure that future favourable and achievable states are taken into account. The algorithm also does not immediately overwrite itself with whatever it learns; it preserves old values and learns gradually. The above methodology can be stated as follows.

If \(Q^{*}(s_t,a_t)\) represents the payoff received by being in state \(s_t\) and taking action \(a_t\), then

\[ Q^{*}(s_t,a_t) = R(s_{t+1},s_t) + \alpha \max_{a_{t+1}}Q^{*}(s_{t+1},a_{t+1}) \]

To slow down learning a bit, we hold on to whatever prior estimate of \(Q^{*}(s_t,a_t)\) we have with a weight of \(\beta\), as shown below

\[ Q^{*}_{new}(s_t,a_t) = \beta\, Q^{*}_{prior}(s_t,a_t) + (1 - \beta)\,Q^{*}(s_t,a_t) \]
That's it! We now let the system take on various initial states and let the device move around to different states while we constantly update our \(Q\) matrix. After several iterations, the \(Q\) matrix will end up with values that reflect what strategy to take.

To give it a whirl, here is R code that walks through the entire process for this particular example.
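
The listing below is a sketch; the per-state rewards, the left-move penalty and the values of \(\alpha\) and \(\beta\) are assumptions made for illustration, while the update itself follows the two equations above.

set.seed(3)
n.states = 7
actions  = c('left', 'right', 'hop')      # move left, move right, hop right
# Assumed rewards per state: S1 damages, S7 rewards, S4 (the glue) mildly penalised
state.reward = c(-1, 0, 0, -0.5, 0, 0, 1)
# Transition function: the state reached by taking action a from state s
next.state = function(s, a) {
    s.new = switch(a, left = s - 1, right = s + 1, hop = s + 2)
    min(max(s.new, 1), n.states)
}
# Reward for landing in s.new coming from s, with an assumed penalty for moving left
reward = function(s.new, s) state.reward[s.new] - ifelse(s.new < s, 0.5, 0)
alpha = 0.8     # weight on the best achievable future value
beta  = 0.9     # how much of the prior Q estimate is retained
Q = matrix(0, nrow = n.states, ncol = length(actions),
           dimnames = list(paste0('S', 1:n.states), actions))
for (iter in 1:5000) {
    s = sample(2:6, 1)                    # drop the robot in a random non-terminal state
    while (!(s %in% c(1, 7))) {           # states 1 and 7 reset the robot
        a = sample(actions, 1)            # explore by picking a random action
        s.new = next.state(s, a)
        q.new = reward(s.new, s) + alpha * max(Q[s.new, ])
        Q[s, a] = beta * Q[s, a] + (1 - beta) * q.new
        s = s.new
    }
}
round(Q, 2)     # rows are states, columns the preferred actions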

One thing to note in the code is how the reward function is encoded. There is a penalty imposed on moving from a higher state to a lower state. This is a simple way to ensure that whatever strategy the learner comes up with does not involve giving back the rightward positional gains that have already been made. If you try it without the penalty, you will see cases where the strategy does not care whether it goes left or right in some states. The above code, when run, creates the following output (screen grab below).

The rows represent the states, the columns represent the three possible actions move-left, move-right and hop-right that the robot can take. Notice what it's trying to say:
  • State 1, do nothing 
  • State 2, move right, but don't hop
  • State 3, don't move right, hop
  • State 4, move right or hop
  • State 5, move right (small chance), hop (definitely)
  • State 6, move right, move left (small chance)
This is exactly what you would expect and want! More on this subject can be read from the following books
Reinforcement Learning: State-of-the-Art (Adaptation, Learning, and Optimization)
Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)
If you are looking to buy some books on probability theory here is a good list to own.