Wednesday, October 26, 2016

The Magic of Numba and Python

Python is a great programming language. Its primary merits are readability and the wealth of packages available online. However, because it is interpreted, execution speed is always a concern. Here is a simple example of a piece of Python code that adds up pairs of numbers from a grid.


def overlap_pp(x,y):
    count = 0
    for i in range(x):
        for j in range(y):
            count += i + j
    return count

for _ in range(1000):
    q = overlap_pp(500,500)

If you time the above script on the command line with the time command, you should see numbers like the ones below.

time ./

real 0m22.379s
user 0m22.363s
sys 0m0.407s

The process, running on one processor, took about 22 seconds to complete the run. Now let's do something seemingly magical: add a decorator. Decorators are Python-speak for a mechanism that wraps functions. The decorator we are going to add is '@jit'. The code now looks like this:


from numba import jit

@jit
def overlap_pp(x,y):
    count = 0
    for i in range(x):
        for j in range(y):
            count += i + j
    return count

for _ in range(1000):
    q = overlap_pp(500,500)

Now let's time this as before.

time ./

real 0m0.420s
user 0m0.370s
sys 0m0.443s

Magic! The same code now runs roughly 53x faster! This happens because the just-in-time (JIT) compiler in the numba package compiles the function to machine code on the fly. The timing is close to what you would expect from equivalent code written in C or Fortran. In fact, you may be surprised that in this particular example the numba version runs faster than the C version, likely because the C program was compiled without optimization flags. The C code is shown below.
#include <stdio.h>

int overlap_pp(int x, int y) {
  int i, j, q = 0;
  for (i = 0; i < x; i++) {
    for (j = 0; j < y; j++) {
      q += i + j;
    }
  }
  return q;
}

int main(int argc, char** argv) {
  int i, q;
  for (i = 0; i < 1000; i++) {
    q = overlap_pp(500, 500);
  }
  return 0;
}
The timing results after compiling the above code with 'gcc t.c -o t':

time ./t

real 0m0.774s
user 0m0.774s
sys 0m0.000s

Unfortunately there isn't a clear description online of how to get Numba installed on Ubuntu 14.04. After some hacking and poking around Stack Overflow, the following worked for me.

Step 1:
Numba uses LLVM. You need to remove all llvm, llvm-config and llvmlite installations already on your system. Numba needs llvm-3.8 or later to work properly.

Step 2:
Follow the instructions in this Stack Overflow thread, but when you get to the install step, run the following:
 sudo apt-get install zlib1g zlib1g-dev libedit2 libedit-dev llvm-3.8 llvm-3.8-dev
Note the change in version number to 3.8.

Step 3:
Next, install llvmlite. You also want to create a symbolic link in '/usr/bin/' pointing to the 3.8 version of llvm-config:
"sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config"
This step is important because the numba installation uses llvm-config. After this you can install Numba directly with pip.

If you are interested in learning more about Python programming, data science and probability, here is a list of books that are worthwhile to buy.

Sunday, May 15, 2016

The James-Stein Estimator

This write up is about an estimator. A statistical estimator is used when you already have a model in mind, data at hand, and want to estimate some parameters needed for the model. For example, you want to predict how many runs a batter would make in a game given recent history \(x_1,x_2,x_3,\ldots\). If we assume (we are making a choice of a model now) that the scores come from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then the probability density function for a given value \(x\) is
f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}

The likelihood that a series of points \(x_1,x_2,x_3,\ldots,x_n\) comes from such a distribution can be expressed as
L(\mu,\sigma) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_i - \mu)^{2}}{2\sigma^{2}}}

Next is basic calculus. Taking the logarithm on both sides and setting the partial derivative w.r.t. \(\mu\) to zero yields (excluding the algebra)
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i

To verify, you should also check that the second derivative is negative, to ensure it is indeed a maximum you have found.

So a simple estimator would be the average runs scored over the past \(n\) games; the average is the result of the "maximum likelihood" approach briefly described above. What if you had 5 players whose averages you wanted to estimate? Intuition tells you to simply compute the average for each. Wrong! The James-Stein approach provides a way to pool the data so that the overall estimation error is minimized.

Specifically, I'll demonstrate the James-Stein estimator, a surprising result discovered by Charles Stein and later formalized with Willard James. What makes it stunning is how counter-intuitive it is: if you want to simultaneously estimate a set of independent parameters from data, the maximum likelihood approach is optimal only when there are fewer than three parameters. With three or more, you are better off using the James-Stein estimator. An obvious thought experiment suggests itself: if you really wanted just one parameter, couldn't you throw in some unrelated variables, improve the estimation efficiency, and then keep only the parameter you care about? Here is the catch: the estimator works by minimizing the overall estimation error, so it may err more on some variables and less on others, while still beating individual maximum likelihood estimates in aggregate. Use it when you don't mind slightly bigger errors on some variables, and smaller on others, in exchange for a lower overall error.

To see this work, I'll step through the actual algorithm in R. For more details you can refer to the Wikipedia entry here. The general approach is the following. Let

  • \(\hat\theta_{js}\) be the James-Stein estimate of the parameters we are interested in computing,
  • \(y\) be the set of observed (maximum likelihood) estimates,
  • \(\hat{\sigma}^{2}\) be the estimated variance of those estimates,
  • \(v\) be the mean over all parameters,
and \(k\) be the number of parameters. Then the James-Stein estimator is given by
\hat\theta_{js} = v + \left(1 - \frac{(k - 2)\hat{\sigma}^{2}}{\sum_{i}(y_i - v)^{2}}\right)(y - v)

library(data.table)
library(ggplot2)

# n.rows is the number of samples we have
n.rows = 100
# n.params is the number of parameters
# we want to estimate
n.params = 30
# wins will hold the number of times
# the JS estimator beats the MLE estimate
wins   = c()
msejs  = c()
msemle = c()
for(iter in 1:1000){
    # Create a sample of parameters
    # They have a range of 20-25
    x.act  = sample(20:25,size=n.params,replace=TRUE)
    # Now create a normal distribution for each of the parameters
    # uncomment below if you want to test it for a poisson distribution
    # m = mapply(rpois,lambda=x.act,MoreArgs=list(n = n.rows))
    m = mapply(rnorm,mean=x.act,MoreArgs=list(n = n.rows,sd=10))
    # Find the global mean
    mbar = mean(colMeans(m))
    # Find the column means
    mu0   = colMeans(m)
    # Find global variance
    s2 = var(as.vector(m))/(n.rows*n.params)
    # Compute the adjustment value
    cval    = 1 - ((n.params - 2)*s2/sum((mu0 - mbar)^2))
    # Compute the JS estimate for your parameters
    jsest = mbar + cval*(mu0 - mbar)
    z = data.table(
        actual = x.act,
        mle.est = mu0,
        js.est = jsest)
    # Check to see if the JS estimate is better than MLE
    z[,counter := ifelse(abs(js.est - actual) < abs(mle.est - actual),1,0)]
    # In case you want to see what the numbers are for the
    # absolute errors of the JS and MLE estimates
    z[,jserr  := abs(js.est - actual)]
    z[,mleerr := abs(mle.est - actual)]
    # Record the wins for this iteration of the simulation
    # Repeat.
    wins[iter]   = sum(z$counter)
    msejs[iter]  = sum(z$jserr)
    msemle[iter] = sum(z$mleerr)
}
# What is the distribution of the wins?
quantile(wins,prob = seq(0.1,0.9,by=0.1))
z = data.frame(wins = wins)
p = ggplot(z,aes(wins)) +
    geom_histogram(fill='light blue') +
    theme_bw() +
    ggtitle('Of the 30 parameters we wish to estimate:\n how many have estimates closer to the actual using the James-Stein estimator than the MLE?') +
    ylab('') +
    xlab('') +
    geom_vline(xintercept=mean(wins),linetype='longdash',colour='blue') +
    annotate('text',x=18,y=150,label='Mean -->')
print(p)

The above R code runs a simulation of sorts. It starts by making some random parameters to estimate and simulates normally distributed data around them. Next, it computes the James-Stein estimates of those very parameters from the data. Finally, it records how often the James-Stein estimate was closer to the truth than the MLE estimate. The results from the simulation are shown below.

For kicks, you can try it for a Poisson distribution too, here is what the distribution looks like.

If you are interested in learning more about probability, here is a list of books that are good to buy.

Friday, November 28, 2014

Bayesian Cognitive Bias

I chanced upon this excellent puzzle online; it tends to reveal a cognitive bias in our heads against Bayesian reasoning. The puzzle statement is quite simple:

You are given four cards. Each card has a letter on one side and a number on the other. You are told the statement "If there is a D on one side, there is a 5 on the other side". Which two cards would you flip over to validate the statement?

The original article is here, think hard before you click through for an answer :)

Tuesday, September 2, 2014

Maximizing Chances in an Unfair Game


Q: You are about to play a game wherein you flip a biased coin. The coin falls heads with probability \(p\) and tails with probability \(1 - p\), where \(p \le \frac{1}{2}\). You are forced to play by selecting heads, so the game is biased against you. For every toss you make, your opponent gets to toss too. The winner is the one who wins the most tosses. You, however, get to choose the number of rounds that get played. Can you ever hope to win?


A: At first look it might appear that the odds are stacked against you, since you are forced to play heads. You would think your chances of winning decrease as you play more and more. But, surprisingly, there is a way to choose the optimal number of tosses (remember, you get to choose how many rounds get played). To see how, let's crank out some numbers. If you get to toss the coin \(n\) times, then the total number of coin tosses between you and your opponent is \(2n\). Out of the \(2n\) tosses, the probability that exactly \(y\) of them come up heads is
P(Y = y) = {2n  \choose y} p^{y}(1 - p)^{2n - y}
In order to win, the value of \(y\) should run from \(n + 1\) to \(2n\) and the overall probability works out to
P(\text{Win}) = \sum_{y = n + 1}^{2n}{2n  \choose y} p^{y}(1 - p)^{2n - y}
We can work out the probability of winning for various values of \(p\) and \(n\) and chart them. Here is the R code that does it.
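The R code in the original post is embedded as an image and not reproduced here. As a stand-in, here is a minimal Python sketch of the same computation; the function names `win_probability` and `best_n` are my own, not from the post.

```python
from math import comb

def win_probability(p, n):
    """Chance of winning a majority of the 2n tosses when each toss
    comes up in your favour with probability p."""
    return sum(comb(2 * n, y) * p ** y * (1 - p) ** (2 * n - y)
               for y in range(n + 1, 2 * n + 1))

def best_n(p, max_n=200):
    """The number of rounds n (searched up to max_n) that maximizes
    the chance of winning for a given bias p."""
    return max(range(1, max_n + 1), key=lambda n: win_probability(p, n))
```

For a bias of, say, \(p = 0.45\), the win probability first rises with \(n\) and then falls away, so an interior optimum exists even though the probability never reaches \(\frac{1}{2}\).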

The code runs quickly and uses the data.table package. All the processed data are contained in the variables z and z1, which are plotted with the ggplot package to generate the following charts for the strategy.

The first chart shows the variation of the probability of winning by the number of games played for various probability bias values.

The next chart shows the optimal number of games to play for a given bias probability value.

Some good books for learning probability are listed here.
Yet another fascinating area of probability is Monte Carlo methods. Here is a list of good books to own to learn Monte Carlo methods.

Tuesday, August 12, 2014

The Best Books for Monte Carlo Methods

The following are some of the best books to own to learn Monte Carlo methods for sampling and estimation problems

Monte Carlo Methods in Statistics (Springer)
This is a good book that discusses Bayesian methods from both a practical and a theoretical point of view, working through the integrations. The explanations are fairly comprehensive, and there is a fair number of examples in the text. Overall, this is an excellent book to own if you want to understand Monte Carlo sampling methods and algorithms at an intermediate to graduate level.

Explorations in Monte Carlo Methods (Undergraduate Texts in Mathematics)
This is a good book to own to get started on Monte Carlo methods. It opens with fairly simple, basic examples and illustrations, and the mathematics used is also fairly basic. Buy this if you are at an undergraduate level and want to get into Monte Carlo methods with only a basic knowledge of statistics and probability.

Monte Carlo Methods in Financial Engineering (Stochastic Modelling and Applied Probability) (v. 53)
Another excellent book to own if you are curious to learn about Monte Carlo methods in the finance industry. Some really nice areas covered in the book include variance reduction techniques, diffusion equations, change-point detection, option pricing methods, etc. Ideal for students of financial engineering or those wanting to break into it. The book tends to overrate MC methods (well, it's a book on MC!).

Monte Carlo Simulation and Resampling Methods for Social Science
This book gives a good introduction and goes over some basic probability theory, statistics and distributions before it hops on to the Monte Carlo methods. This makes it a good introductory book for sampling methods. Recommended for undergraduates with minimal statistical background.

Simulation and the Monte Carlo Method

An excellent book to own at the intermediate to graduate level. The text provides a good course in simulation and Monte Carlo methods. Some interesting topics covered in the text include rare-event simulation. The book assumes you have a background in statistics and probability theory.

Sunday, July 6, 2014

Embarrassing Questions, German Tanks and Estimations


Q: You are conducting a survey and want to ask subjects an embarrassing yes/no question. The subjects won't answer the embarrassing question honestly unless they are guaranteed complete anonymity. How would you conduct the survey?


A: One way to do this is to hand the subject a fair coin and ask them to toss it in private. If it comes up heads, they answer the question truthfully; otherwise they toss the coin a second time and report that result (heads = yes, tails = no). With some simple algebra you can then estimate the proportion of subjects whose true answer is yes.

Assume the total population surveyed is \(X\), and that \(Y\) subjects answered with a "yes". Let \(p\) be the sought-after proportion. The tree diagram below shows the subject flow.

The total expected number of "yes" responses can be estimated as
\frac{pX}{2} + \frac{X}{4} = Y
which on simplification yields
p = \big(\frac{4Y}{X} - 1\big)\frac{1}{2}
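A quick simulation (a sketch of my own, not from the post; the function name `estimate_proportion` is hypothetical) confirms that this formula recovers the hidden proportion:

```python
import random

def estimate_proportion(true_p, subjects=100_000, seed=7):
    """Simulate the coin-flip survey and recover the hidden proportion.

    Each subject flips a fair coin in private: heads -> answer the
    question truthfully; tails -> flip again and report heads as 'yes'
    and tails as 'no'.
    """
    rng = random.Random(seed)
    yes = 0
    for _ in range(subjects):
        if rng.random() < 0.5:              # first flip came up heads
            yes += rng.random() < true_p    # truthful answer
        else:                               # tails: report the second flip
            yes += rng.random() < 0.5
    # Invert Y = X * (p/2 + 1/4) to get p = (4Y/X - 1) / 2
    return (4 * yes / subjects - 1) / 2
```

With a true proportion of 0.3, the recovered estimate lands close to 0.3 despite no individual answer being identifiable.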


Q: A bag contains an unknown number of tiles numbered in serial order \(1,2,3,\ldots,n\). You draw \(k\) tiles from the bag without replacement and find the maximum number etched on them to be \(m\). What is your estimate of the number of tiles in the bag?

A: This and problems like it are called German tank problems. During WWII, German tanks were given sequential serial numbers as they were manufactured. The Allied forces needed an estimate of how many tanks had been deployed, and all they had were a handful of captured tanks and the serial numbers painted on them. Using this, statisticians estimated the actual number of tanks to be far lower than what intelligence estimates would have them believe. So how does it work?

Let us assume we draw a sample of size \(k\) and that the maximum in that sample is \(m\). If the population maximum is \(N\), the probability that the sample maximum equals \(m\) is
P(\text{Sample Max} = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
The \(-1\) terms figure because the maximum itself is taken out of the sample, leaving \(k - 1\) values to choose from the \(m - 1\) below it. The expected value of the sample maximum is thus
E(\text{Maximum}) = \sum_{m=k}^{N}m\frac{\binom{m-1}{k-1}}{\binom{N}{k}}
Note that the summation runs from \(k\) to \(N\) because the sample maximum has to be at least \(k\). After a series of algebraic manipulations ( ref ) this expectation works out to \(\frac{k(N+1)}{k+1}\); equating it to the observed maximum \(m\) and solving for \(N\) gives the estimate
\hat{N} = m\big(1 + \frac{1}{k}\big) - 1
which is quite an ingenious and simple way to estimate a population size given serial number ordering.
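As a sanity check, here is a short simulation of the estimator (my own sketch, not part of the original post; `tank_estimate` is a hypothetical name):

```python
import random

def tank_estimate(serials):
    """German tank estimate: m(1 + 1/k) - 1, with m the sample maximum
    and k the sample size."""
    m, k = max(serials), len(serials)
    return m * (1 + 1 / k) - 1

# Draw k serial numbers without replacement from 1..N many times and
# check that the estimates centre on the true N.
rng = random.Random(42)
N, k = 1000, 15
estimates = [tank_estimate(rng.sample(range(1, N + 1), k))
             for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
```

With \(N = 1000\) and \(k = 15\), the average estimate lands within a few units of 1000, reflecting the fact that this estimator is unbiased.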

If you are looking to buy books on probability theory, here is a good list.

Sunday, June 1, 2014

The Chakravala Algorithm in R


A class of problems that has piqued the interest of mathematicians across millennia is Diophantine equations: polynomials in multiple variables for which integer solutions are sought. A special case of Diophantine equations is Pell's equation. The name is a bit of a misnomer, as Euler mistakenly attributed it to the mathematician John Pell. The problem seeks integer solutions to the polynomial
x^{2} - Dy^{2} = 1
Several ancient mathematicians attempted to study and find general solutions to Pell's equation. The best known algorithm is the Chakravala algorithm, described by Bhaskara circa 1114 AD. Bhaskara implicitly credits Brahmagupta (circa 598 AD) with its initial discovery, though some credit it to Jayadeva too; several of the Sanskrit words used to describe the algorithm appear to have changed in the 500 years between the two, implying other contributors. The Chakravala technique is simple, and implementing it in any programming language should be a breeze.


The method works as follows. Find a trivial solution to the equation; \(x=1,y=0\) always works. Next, initialize a pair of parameters \([p_i,k_i]\), where \(i\) is an iteration count. \(p_i\) is updated to \(p_{i+1}\) such that the following two criteria are satisfied:
  • \((p_i + p_{i+1}) \bmod k_{i} = 0\), i.e. \(p_i + p_{i+1}\) is divisible by \(k_i\)
  • \(| p_{i+1}^{2} - D |\) is minimized
After updating \(p_{i+1}\), \(k_{i+1}\) is found by evaluating
k_{i+1} = \frac{p_{i+1}^{2} - D}{k_{i}}
and the next pair of values for \([x,y]\) is computed as
x_{i+1} = \frac{p_{i+1}x_i + Dy_{i}}{|k_{i}|}
y_{i+1} = \frac{p_{i+1}y_i + x_{i}}{|k_{i}|}
The algorithm also has an easy way to tell when a true solution has been found: it accepts \([x_i,y_i]\) only when \(k_{i} = 1\).

A screen grab of the entire algorithm done in R is shown below.
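The screen grab is not reproduced here. As a stand-in, the following Python sketch implements the iteration exactly as described above; the function name `chakravala` and the candidate search for \(p_{i+1}\) are my own choices.

```python
import math

def chakravala(D):
    """Find a positive integer solution (x, y) of x^2 - D*y^2 = 1."""
    if math.isqrt(D) ** 2 == D:
        raise ValueError("D must not be a perfect square")
    x, y, k, p = 1, 0, 1, 0          # trivial solution x=1, y=0
    while True:
        ak = abs(k)
        # Smallest positive p_next with (p + p_next) divisible by k
        base = (-p) % ak or ak
        # Candidates step by |k|; |p_next^2 - D| is minimized at one of
        # the two candidates bracketing sqrt(D)
        s = math.isqrt(D)
        if base > s:
            candidates = [base]
        else:
            t = (s - base) // ak
            candidates = [base + t * ak, base + (t + 1) * ak]
        p_next = min(candidates, key=lambda q: abs(q * q - D))
        # Update rules from the text; the divisions are exact
        x, y = (p_next * x + D * y) // ak, (x + p_next * y) // ak
        k = (p_next * p_next - D) // k
        p = p_next
        if k == 1:                   # a true solution has been found
            return x, y
```

For Bhaskara's celebrated case \(D = 61\), this returns the famous fundamental solution \(x = 1766319049,\ y = 226153980\).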

A Related Puzzle:
A drawer has \(x\) black socks and \(y\) white socks. You draw two socks consecutively and both are black. You repeat this several times (replacing the socks each time) and find that you get a pair of black socks with probability \(\frac{1}{2}\). You know that there are no more than 30 socks in the drawer in total. How many black and white socks are there?

The probability that you would draw two black socks in a row is
P = \frac{x}{x+y}\times\frac{x - 1}{x+y - 1} = \frac{1}{2}
Simplifying and solving for \(x\) yields
x^{2} - (2y + 1)x + y - y^2 = 0
which on further simplification gives
x = \frac{2y + 1 \pm \sqrt{(2y+1)^2 +4(y^2 - y)}}{2}
We can ignore the root with the negative sign as it would yield a negative value for \(x\) which is impossible. The positive root of the quadratic equation yields
x = \frac{2y + 1 + \sqrt{8y^2 + 1}}{2}
For \(x\) to be an integer, the term \(\sqrt{8y^2 + 1}\) has to be an odd integer number \(z\) (say). We can now write it out as
z = \sqrt{8y^2 + 1}
z^{2} - 8y^2 = 1
This is Pell's equation (or Vargaprakriti in Sanskrit).
As we know that there are no more than 30 socks in the drawer, we can quickly work our way to the two admissible solutions to the problem: \(\{3,1\}, \{15,6\}\).
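A brute-force check (a quick sketch of my own) over all sock counts up to 30 confirms these are the only solutions:

```python
from fractions import Fraction

# Enumerate every split of b black and w white socks with at most 30
# socks in total, keeping those where P(two black in a row) is exactly 1/2.
solutions = [
    (b, total - b)
    for total in range(2, 31)
    for b in range(1, total + 1)
    if Fraction(b, total) * Fraction(b - 1, total - 1) == Fraction(1, 2)
]
```

The list works out to [(3, 1), (15, 6)], matching the Pell-equation analysis; the next solution, \(\{85, 35\}\), needs 120 socks.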
If you are looking to buy books on probability theory, here is a good list to own.
If you are looking to buy books on time series analysis, here is an excellent list to own.