
Monday, January 16, 2017

The Bacteria Division Puzzle



Q. A jar contains a single bacterial cell. After every fixed interval of time the bacterium either splits into two with probability 2/5, does nothing with probability 2/5, or dies with probability 1/5. What is the probability that the line of bacteria survives and keeps building a large family tree over an extended period of time?

A. Let the required survival probability be \(p\), so that \(x = 1 - p\) is the probability that the line of bacteria eventually dies out. Condition on what happens to the very first cell. Each of the three scenarios contributes a quantum of probability towards eventual extinction:

  • the cell dies immediately, with probability \( \frac{1}{5} \);
  • it stays a single cell, with probability \( \frac{2}{5} \), and that lone cell's line must then itself die out eventually, contributing \( \frac{2}{5}x \);
  • it splits into two, with probability \( \frac{2}{5} \), and both independent lines must die out eventually, contributing \( \frac{2}{5}x^{2} \).
These three contributions must add up to the total probability of eventual extinction, which is \(x\). So the problem can be phrased recursively as

$$ x = \frac{1}{5} + \frac{2}{5}x + \frac{2}{5}x^{2} $$

This simplifies to

$$ 2x^{2} - 3x + 1 = 0 $$

which is a quadratic equation whose solutions are

$$ x = \frac{3 \pm \sqrt{9 - 8}}{4} = \frac{3 \pm 1}{4} $$

This yields \(x = \frac{1}{2}\) or \(x = 1\). The root \(x = 1\) can be ruled out: the expected number of offspring per cell is \(2 \cdot \frac{2}{5} + 1 \cdot \frac{2}{5} = \frac{6}{5} > 1\), and for such a supercritical branching process the extinction probability is the smaller root of the fixed-point equation. So the extinction probability is \(x = \frac{1}{2}\), and the probability that the bacteria keep going is \(p = 1 - x = 50\%\).

In order to test this out, it's also relatively easy to code it up in Python. The following code shows how to simulate this scenario.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


dist_estimates = []

for index in range(400):
    counter = 0
    ''' 
    Repeating 100 times for estimating the probability
    '''
    for epoch in range(100):
        num_bacteria = 1
        '''
        Looking ahead 20 generations
        '''
        for step in range(20):
            bacteria = np.zeros(int(num_bacteria))
            '''
            Step through each bacteria
            '''
            for nb_iter in range(int(num_bacteria)):
                '''
                Decide if each of the bacteria
                will die/stay/split in this time slice
                '''
                prob = np.random.uniform(0,1)
                if prob < 0.2:
                    ''' This one dies '''
                    bacteria[nb_iter] = 0
                elif prob >= 0.2 and prob <= 0.6:
                    ''' This one stays as is '''
                    bacteria[nb_iter] = 1
                else:
                    ''' This one splits into two '''
                    bacteria[nb_iter] = 2
            num_bacteria = np.sum(bacteria)
        if np.sum(bacteria) > 0:
            counter += 1
    dist_estimates.append(float(counter)/100)
'''
Visualization is easy
Use matplotlib and seaborn
'''

p = sns.distplot(dist_estimates,kde=False,rug=True).get_figure()
p.savefig('t.png')


The above code generates a histogram of the estimated probabilities. The estimates cluster around 0.5, in line with the analysis above, which validates our method.

If you are looking for books on probability, programming, or data science, a good set is listed here.


Thursday, January 5, 2017

The Forgotten Geometric Mean.

People working with data are often trying to create an index of some sort: a single number that captures a set of key business metrics. If you run a site (or an app) you might want some sort of engagement index which, if trending up, implies good things are happening, and bad things if it is trending down. The creators of such metrics (think analysts) tend to prefer a weighted arithmetic mean of the influencing factors. If the influencing factors are \(f_1, f_2, f_3\) (say) with weights \(w_1, w_2, w_3\), then the index would be computed as

$$ I = w_1 f_1 + w_2 f_2 + w_3 f_3 $$

However, what does not get factored in is the final consumers of the index (think product managers), and there could be many of them. They will invariably try to check it against something else they have handy. For example, clicks on a site may be up 20% while the index is up by just 5% (say), or vice versa. If resources are being allocated based on the movement of such an index, this invariably leads to contention over what the right weighting for each factor should be.

This is meant to be a short write-up on some really useful properties of the geometric mean. The geometric mean is not meant to replace a simple arithmetic mean based index, but it is definitely worth a thought. To illustrate, let's take a look at a simple two-feature index. If the features are \(X\) and \(Y\), the weighted arithmetic mean index can be represented as

$$ I_{a} = w_1 X + w_2 Y $$

To see how it responds to changes, let's take the derivative with respect to \(X\):

$$ \frac{\partial I_{a}}{\partial X} = w_1 $$

Clearly the derivative depends on the chosen weight. Let's see what happens when we choose the (unweighted) geometric mean instead:

$$ I_{g} = \sqrt{XY} $$

Again, to see how it responds to change, let's take the derivative:

$$ \frac{\partial I_{g}}{\partial X} = \frac{1}{2}\sqrt{\frac{Y}{X}} $$

which can be further simplified to

$$ \frac{\partial I_{g}}{\partial X} = \frac{I_{g}}{2X} $$

The result is a useful derivable condition

$$ \frac{\partial I_{g}}{I_{g}} = \frac{1}{2}\,\frac{\partial X}{X} $$

i.e. the percentage change in the index is directly proportional to the percentage change in the feature.
Note that there are no hand-chosen weights here: a five percent change in any one of the influencing factors moves the index by a proportional amount (here, about two and a half percent), regardless of the level of the other factor. Extremely useful!
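As a quick numeric check, here is a minimal R sketch of the two indices side by side; the feature values and weights below are made up purely for illustration.

# Minimal sketch: compare how a weighted arithmetic mean index and a
# geometric mean index respond when a single feature moves by 5%.
# The values of x, y and the weights are arbitrary.
x = 200; y = 50
w1 = 0.3; w2 = 0.7

am.index = function(x, y) w1 * x + w2 * y
gm.index = function(x, y) sqrt(x * y)

# Relative change in each index when x goes up by 5%
am.change = (am.index(1.05 * x, y) - am.index(x, y)) / am.index(x, y)
gm.change = (gm.index(1.05 * x, y) - gm.index(x, y)) / gm.index(x, y)

print(am.change)   # depends on the weights and on the level of y
print(gm.change)   # always ~2.5%, i.e. half of the 5% move in x

The geometric mean's response is pinned at half the feature's percentage move no matter what the other feature is doing, which is exactly the condition derived above.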

Yet another aspect consumers like to quantify is growth. If the index grew by \(x_1\) and \(x_2\) in two consecutive years, what is the average quarterly/annual growth? If we took it as the arithmetic average of \(x_1\) and \(x_2\), then the growth after two years would be estimated as

$$ \left(1 + \frac{x_1 + x_2}{2}\right)^{2} $$

Contrast that with the actual growth

$$ (1 + x_1)(1 + x_2) $$

Expanding both sides, the common terms cancel out and we are left comparing

$$ \left(\frac{x_1 + x_2}{2}\right)^{2} \quad \text{with} \quad x_1 x_2, \qquad \text{i.e.} \quad \frac{x_1 + x_2}{2} \quad \text{with} \quad \sqrt{x_1 x_2}. $$

Notice that one of them is the arithmetic mean of \(x_1\) and \(x_2\) and the other is their geometric mean. We also know, from the well-established AM-GM inequality described here, that the arithmetic mean is always at least as large as the geometric mean (with equality only when \(x_1 = x_2\)). So we would systematically overestimate the growth!
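For two non-negative numbers the inequality takes only a line to verify:

$$ \left(\sqrt{x_1} - \sqrt{x_2}\right)^{2} \ge 0 \;\;\Rightarrow\;\; x_1 + x_2 \ge 2\sqrt{x_1 x_2} \;\;\Rightarrow\;\; \frac{x_1 + x_2}{2} \ge \sqrt{x_1 x_2}. $$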

So how should we choose a value to report as the average growth rate? We are looking for a \(\beta\) in the equation below:

$$ (1 + \beta)^{2} = (1 + x_1)(1 + x_2) \quad \Rightarrow \quad \beta = \sqrt{(1 + x_1)(1 + x_2)} - 1 $$

Yet again, stating the average growth via the geometric mean gives the end user a handy metric to work with; a small numeric check of this is sketched below.
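Here is a minimal R sketch with made-up yearly growth rates, showing that the arithmetic average overstates the two-year growth while the geometric-mean rate reproduces it exactly.

# Minimal sketch with made-up yearly growth rates
x1 = 0.30   # 30% growth in year one
x2 = 0.02   # 2% growth in year two

actual.growth = (1 + x1) * (1 + x2)       # true two-year growth factor

am.rate = (x1 + x2) / 2                   # arithmetic average rate
gm.rate = sqrt((1 + x1) * (1 + x2)) - 1   # geometric-mean rate

(1 + am.rate)^2    # 1.3456, overestimates the actual growth
(1 + gm.rate)^2    # 1.326, matches the actual growth exactly
actual.growth      # 1.326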

If you are interested in learning probability, here is a set of good books to choose and buy from.

Monday, January 2, 2017

Colored Cards and Numbers Puzzle

Q: You have a set of thirty-six cards. The cards come in six colors (six cards of each color), and within each color the cards are numbered 1 to 6. You draw two cards at random. What is the probability that they differ in both color and number?

A: The first card can be drawn at random; it does not matter what its color or number is. To compute the probability that the second card differs from the first in both color and number, it helps to picture the deck as a 6 x 6 grid of colors versus numbers, with the picked card sitting in one cell. The act of picking that card brings the pool down from 36 to 35. Of the 35 remaining cards, the 5 others sharing its color and the 5 others sharing its number are ruled out, which leaves

$$ 35 - 5 - 5 = 25 $$

cards that can be picked.
This yields an overall probability of

$$ \frac{25}{35} = \frac{5}{7} \approx 0.714 $$

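A quick simulation is an easy sanity check. Here is a minimal R sketch; the grid representation of the deck and the trial count are arbitrary choices for illustration.

# Minimal sketch: draw two cards from the 36-card deck repeatedly and
# estimate the probability that they differ in both color and number.
set.seed(1)
deck = expand.grid(color = 1:6, number = 1:6)   # the 36 cards

n.trials = 20000
hits = 0
for (i in 1:n.trials) {
    draw = deck[sample(1:36, 2), ]
    if (draw$color[1] != draw$color[2] && draw$number[1] != draw$number[2]) {
        hits = hits + 1
    }
}
hits / n.trials   # should come out close to 5/7 ~ 0.714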
If you are interested in learning the art of probability, some of the best books to learn it from are listed here.

Wednesday, October 26, 2016

The Magic of Numba and Python

Python is a great programming language. Its primary merits are readability and the numerous packages available online. However, since it is an interpreted language, speed of execution is always an issue. Here is a simple example: a piece of Python code that sums pairs of indices over a grid.

#!/usr/bin/python

def overlap_pp(x,y):
    count = 0
    for i in range(x):
        for j in range(y):
            count += i + j
    return count

for _ in range(1000):
    q = overlap_pp(500,500)

If you save the above script as n.py, make it executable, and run it from the terminal with the time command, you should see numbers like the ones below.

time ./n.py

real 0m22.379s
user 0m22.363s
sys 0m0.407s

The process, running on one processor, took about 22 seconds to complete the run. Now let's do something seemingly magical: add a decorator. Decorators are Python's mechanism for wrapping one function with another. The decorator we are going to add is '@jit'. The code now looks like this:

#!/usr/bin/python

from numba import jit

@jit
def overlap_pp(x,y):
    count = 0
    for i in range(x):
        for j in range(y):
            count += i + j
    return count

for _ in range(1000):
    q = overlap_pp(500,500)

Now let's time this as before.

time ./n.py

real 0m0.420s
user 0m0.370s
sys 0m0.443s



Magic! The exact same code now runs roughly 50x faster! This happens because the just-in-time (JIT) compiler in the numba package compiles the function to machine code on the fly. The timing is almost in line with what you would expect from similar code written in C or Fortran. In fact, you may be surprised that in this particular example the numba version runs faster than the (unoptimized) C version. The C code is shown below.
#include <stdio.h>

/* Same computation as the Python version: sum i + j over an x-by-y grid */
int overlap_pp(int x, int y){
  int i, j, q = 0;
  for(i = 0; i < x; i++){
    for(j = 0; j < y; j++){
      q += i + j;
    }
  }
  return q;
}


int main(int argc, char** argv){
  int i, q;
  for(i = 0; i < 1000; i++){
    q = overlap_pp(500, 500);
  }
  return 0;
}

The timing results after compiling the above code with 'gcc t.c -o t' (no optimization flags) are:

time ./t

real 0m0.774s
user 0m0.774s
sys 0m0.000s

Unfortunately, there isn't a clear description online of how to get numba installed on Ubuntu 14.04. After some hacking and poking around Stack Overflow, the following worked for me.

Step 1:
Numba uses llvm. You need to remove all llvm, llvm-config and llvmlite installations that you already have on your system. Numba needs llvm-3.8 upwards to work properly.

Step 2:
Follow the instructions in this Stack Overflow thread, but when you get to the install step run the following:
 sudo apt-get install zlib1g zlib1g-dev libedit2 libedit-dev llvm-3.8 llvm-3.8-dev
Note the change in version number to 3.8.

Step 3:
Next, install llvmlite. You also want to create a symbolic link in '/usr/bin/' pointing to the 3.8 version of llvm-config:
"sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config"
This step is important because the numba installation uses llvm-config. After this you can install numba directly using pip.

If you are interested in learning more about Python programming, data science, and probability, here is a list of books that are worthwhile to buy.


Sunday, May 15, 2016

The James-Stein Estimator

This write-up is about an estimator. A statistical estimator is used when you already have a model in mind and data at hand, and want to estimate the parameters the model needs. For example, you want to predict how many runs a batter would make in a game given recent history \(x_1,x_2,x_3,\ldots\). If we assume (we are making a choice of a model now) that the scores come from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then the probability density function for a given value \(x\) is

$$ f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}} $$

The likelihood that a series of points \(x_1,x_2,\ldots,x_n\) comes from such a distribution can be expressed as

$$ L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^{2}}{2\sigma^{2}}} $$


Next is basic calculus. Taking the logarithm on both sides and setting the partial derivative w.r.t. \(\mu\) to zero yields (the algebra is sketched below)

$$ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i $$


To verify that this is indeed a maximum, you also need to check that the second derivative is negative.
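For completeness, the skipped algebra is the standard one:

$$ \log L(\mu, \sigma) = -n \log\left(\sqrt{2\pi}\,\sigma\right) - \sum_{i=1}^{n} \frac{(x_i - \mu)^{2}}{2\sigma^{2}} $$

$$ \frac{\partial \log L}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^{2}} = 0 \;\;\Rightarrow\;\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \frac{\partial^{2} \log L}{\partial \mu^{2}} = -\frac{n}{\sigma^{2}} < 0 $$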

So a simple estimator would be to use the average runs scored over the past \(n\) days; that average is the result of the "maximum likelihood" approach briefly described above. What if you had 5 players whose averages you wanted to estimate? Intuition would tell you to simply compute the average for each. Wrong! The James-Stein approach provides a way to pool the data so that the overall error made in estimation is minimized.

Specifically, I'll demonstrate the James-Stein estimator. It is a surprising result, discovered by Charles Stein and later formalized by Willard James, on estimating such parameters. What makes it stunning is the rather counterintuitive result it demonstrates: if you want to simultaneously estimate a set of independent parameters from data, the maximum likelihood approach is optimal only when there are fewer than three parameters to estimate; with three or more, you are better off using the James-Stein estimator. There is an obvious thought experiment you might try: if you really wanted just one parameter, could you not simply throw in an unrelated set of numbers, improve the efficiency of the estimation, and then use only the parameter you are interested in? That reasoning is flawed. The estimator works by minimizing the overall error in estimation, so it may make larger errors on some of the variables and smaller ones on others, but overall it will do better than an individual maximum likelihood estimate on each variable. So you would use it when you don't mind slightly bigger errors on some variables and smaller ones on others, in exchange for a lower overall error.

To see this work, I'll step through the actual algorithm in R. For more details you can refer to the Wikipedia entry here. The general approach is the following.

Let
  • \(\hat\theta_{js}\) be the James-Stein estimate of the parameters we are interested in computing,
  • \(y\) be the vector of observed values (here, one sample mean per parameter),
  • \(\hat{\sigma}^{2}\) be the estimated variance of those observed values,
  • \(v\) be the mean of all the observed values, and
  • \(k\) be the number of parameters;
then the James-Stein estimator is given by

$$ \hat\theta_{js} = v + \left(1 - \frac{(k - 2)\,\hat{\sigma}^{2}}{\sum_{i}\left(y_i - v\right)^{2}}\right)\left(y - v\right) $$

#!/usr/bin/Rscript
 
suppressMessages(library(data.table))
suppressMessages(library(ggplot2))
 
# n.rows is the number of samples we have
n.rows = 100
 
# n.params is the number of parameters
# we want to estimate
n.params = 30
 
# wins will hold the number of times
# the JS estimator beats the MLE estimate
wins   = c()
msejs  = c()
msemle = c()
 
for(iter in 1:1000){
    # Create a sample of parameters
    # They have a range of 20-25
    x.act  = sample(20:25,size=n.params,replace=TRUE)
 
    # Now create a normal distribution for each of the parameters
    # uncomment below if you want to test it for a poisson distribution
    # m = mapply(rpois,mean=x.act,MoreArgs=list(n = n.rows))
    m = mapply(rnorm,mean=x.act,MoreArgs=list(n = n.rows,sd=10))
 
    # Find the global mean
    mbar = mean(colMeans(m))
 
    # Find the column means
    mu0   = colMeans(m)
 
    # Find global variance
    s2 = var(as.vector(m))/(n.rows*n.params)
 
    # Compute the adjustment value
    cval    = 1 - ((n.params - 2)*s2/sum((mu0 - mbar)^2))
 
    # Compute the JS estimate for your parameters
    jsest = mbar + cval *(mu0 - mbar)
    z = data.table(
        actual = x.act,
        mle.est = mu0,
        js.est = jsest)
 
    # Check to see if the JS estimate is better than MLE
    z[,counter := ifelse(abs(js.est - actual) < abs(mle.est - actual),1,0)]
 
    # In case you want to see the absolute errors
    # (estimate vs. actual) for JS and MLE
    z[,jserr  := abs(js.est - actual)]
    z[,mleerr := abs(mle.est - actual)]
 
    # Record the wins for this iteration of the simulation
    # Repeat.
    wins[iter]   = sum(z$counter)
    msejs[iter]  = sum(z$jserr)
    msemle[iter] = sum(z$mleerr) 
}
 
# What are the mean wins? 
mean(wins)
 
# What are the distribution of the mean wins
quantile(wins,prob = seq(0.1,0.9,by=0.1))
 
z = data.frame(wins = wins)
p = ggplot(z,aes(wins)) +
    geom_histogram(fill='light blue') +
    theme_bw() +
    ggtitle('Of the 30 parameters we wish to estimate:\n how many of them have estimates closer to the actual using the James-Stein estimator than the MLE?') +
    ylab('') +
    xlab('') +
    geom_vline(aes(xintercept=mean(wins)),linetype='longdash',colour='blue') +
    annotate('text',x=18,y=150,label='Mean -->') +
    theme(legend.position='none')
 
png('xyz.png',width=800,height=500)
print(p)
dev.off()


The above R code runs a simulation of sorts. It starts by creating some random parameters you would want to estimate and simulates normally distributed data from them. Next, it uses the James-Stein estimator to estimate the very parameters it started off with, using that data. Finally, it compares and records how often the James-Stein estimate was closer to the actual value than the MLE estimate. The results from the simulation are shown below.



For kicks, you can try it with a Poisson distribution too (uncomment the rpois line in the code); here is what the distribution of wins looks like.



If you are interested in learning more about probability, here is a list of books that are good to buy.

Friday, November 28, 2014

Bayesian Cognitive Bias

I chanced upon this excellent puzzle on the net; it tends to reveal a cognitive bias in our heads against Bayesian-style reasoning. The puzzle statement is quite simple:

You are given four cards. Each card has a letter on one side and a number on the other side. You are told the statement "If there is a D on one side, there is a 5 on the other side". Which two cards would you flip over to validate the statement?






The original article is here, think hard before you click through for an answer :)

Tuesday, September 2, 2014

Maximizing Chances in an Unfair Game






Q: You are about to play a game in which you flip a biased coin. The coin falls heads with probability \(p\) and tails with probability \(1 - p\), where \(p \le \frac{1}{2}\). You are forced to play by selecting heads, so the game is biased against you. For every toss you make, your opponent gets to toss too. The winner of the game is the one who wins the most tosses. You, however, get to choose the number of rounds that get played. Can you ever hope to win?


A: At first look, it might appear that the odds are stacked against you, since you are forced to play heads. You would think that your chances of winning decrease the more you play. But, surprisingly, there is a way to choose an optimal number of tosses (remember, you get to choose how many rounds the game is played for). To see how, let's crank out some numbers. If you get to toss the coin \(n\) times, then the total number of coin tosses between you and your opponent is \(2n\). If \(y\) out of the \(2n\) tosses turn out heads, the probability of winning with exactly \(y\) heads is
$$
P(\text{y Wins}) = {2n  \choose y} p^{y}(1 - p)^{2n - y}
$$
In order to win, the value of \(y\) should run from \(n + 1\) to \(2n\) and the overall probability works out to
$$
P(\text{Win}) = \sum_{y = n + 1}^{2n}{2n  \choose y} p^{y}(1 - p)^{2n - y}
$$
We can work out the probability of winning for various values of \(p\) and \(n\) and chart the results; a sketch of R code that does this follows.


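This is a minimal sketch using data.table and ggplot2: it computes the winning probability with pbinom and stores the results in the variables z and z1 referenced below. The grid of bias values, the cap of 100 rounds, and the output file names are arbitrary choices for illustration.

#!/usr/bin/Rscript

suppressMessages(library(data.table))
suppressMessages(library(ggplot2))

# P(win) = sum_{y = n+1}^{2n} C(2n, y) p^y (1 - p)^(2n - y)
#        = 1 - pbinom(n, 2n, p)
params = CJ(p = seq(0.30, 0.48, by = 0.02), n = 1:100)

# z holds the probability of winning for every (p, n) combination
z = params[, .(p, n, win.prob = 1 - pbinom(n, 2 * n, p))]

# z1 holds, for each bias p, the number of rounds that maximizes P(win)
z1 = z[, .(best.n = n[which.max(win.prob)], best.prob = max(win.prob)), by = p]

# Probability of winning vs. number of rounds, one curve per bias value
p1 = ggplot(z, aes(x = n, y = win.prob, colour = factor(p))) +
    geom_line() + theme_bw() +
    xlab('Number of rounds n') + ylab('Probability of winning')

# Optimal number of rounds vs. bias
p2 = ggplot(z1, aes(x = p, y = best.n)) +
    geom_line() + geom_point() + theme_bw() +
    xlab('Bias p') + ylab('Optimal number of rounds')

png('win_prob.png', width = 800, height = 500); print(p1); dev.off()
png('optimal_n.png', width = 800, height = 500); print(p2); dev.off()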
The code runs pretty quickly and uses the data.table package. All the processed data is contained in variables z and z1. They are plotted using the ggplot package to generate the following charts for the strategy.

The first chart shows the variation of the probability of winning by the number of games played for various probability bias values.

The next chart shows the optimal number of games to play for a given bias probability value.

Some good books to own for learning probability are listed here.
Yet another fascinating area of probability is Monte Carlo methods; here is a list of good books to own for learning Monte Carlo methods.