
Tuesday, August 12, 2014

The Best Books for Monte Carlo Methods

The following are some of the best books to own to learn Monte Carlo methods for sampling and estimation problems.

Monte Carlo Methods in Statistics (Springer)
This is a good book that discusses Monte Carlo methods for Bayesian problems from a practical point of view as well as a theoretical one, integration included. The explanations are fairly comprehensive, and there is a fair number of examples in the text. Overall, this is an excellent book to own if you want to understand Monte Carlo sampling methods and algorithms at an intermediate to graduate level.

Explorations in Monte Carlo Methods (Undergraduate Texts in Mathematics)
This is a good book to own to get you started on Monte Carlo methods. It starts with fairly simple, basic examples and illustrations, and the mathematics used is also fairly basic. Buy this if you are at an undergraduate level and want to get into using Monte Carlo methods but have only a basic knowledge of statistics and probability.

Monte Carlo Methods in Financial Engineering (Stochastic Modelling and Applied Probability) (v. 53)
Another excellent book to own if you are curious to learn about the methods of Monte Carlo used in the finance industry. Some really nice areas covered in the book include variance reduction techniques, diffusion equations, change point detection and option pricing methods. Ideal for students of financial engineering or those wanting to break into the field. The book does tend to overrate MC methods (well, it's a book on MC!).

Monte Carlo Simulation and Resampling Methods for Social Science
This book gives a good introduction and goes over some basic probability theory, statistics and distributions before it hops on to the Monte Carlo methods. This makes it a good introductory book for sampling methods. Recommended for undergraduates with minimal statistical background.

Simulation and the Monte Carlo Method

An excellent book to own at the intermediate to graduate level. The text provides a good course in simulation and Monte Carlo methods, covering interesting topics such as rare-event simulation. The book assumes you have a background in statistics and probability theory.

Sunday, July 6, 2014

Embarrassing Questions, German Tanks and Estimations



Q: You are conducting a survey and want to ask an embarrassing yes/no question to subjects. The subjects wouldn't answer that embarrassing question honestly unless they are guaranteed complete anonymity. How would you conduct the survey?

Machine Learning: The Art and Science of Algorithms that Make Sense of Data

A: One way to do this is to hand the subject a fair coin and ask them to toss it in private. If it comes up heads, they answer the question truthfully; otherwise they toss the coin a second time and record the result of that toss (heads = yes, tails = no). With some simple algebra you can then estimate the proportion of subjects whose true answer is "yes".

Assume the total number of subjects surveyed is \(X\), and that \(Y\) subjects have answered with a "yes". Let \(p\) be the sought-after proportion. The tree diagram below shows the user flow.


The total expected number of "yes" responses can be estimated as
$$
\frac{pX}{2} + \frac{X}{4} = Y
$$
which on simplification yields
$$
p = \big(\frac{4Y}{X} - 1\big)\frac{1}{2}
$$
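A quick simulation of the scheme, assuming (purely for illustration) a true proportion of \(p = 0.3\):

```r
set.seed(3)
X <- 100000                                   # subjects surveyed
p.true <- 0.3                                 # assumed true proportion of "yes"
first.toss  <- rbinom(X, 1, 0.5)              # 1 = heads: answer truthfully
true.answer <- rbinom(X, 1, p.true)           # the subject's honest answer
second.toss <- rbinom(X, 1, 0.5)              # second toss: heads = "yes"
Y <- sum(ifelse(first.toss == 1, true.answer, second.toss))
(4 * Y / X - 1) / 2                           # estimate of p, close to 0.3
```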

Best Books on Probability

Q: A bag contains an unknown number of tiles numbered in serial order \(1,2,3,...,N\). You draw \(k\) tiles from the bag without replacement and find the maximum number etched on them to be \(m\). What is your estimate of the number of tiles in the bag?

A: This and problems like it are called German Tank problems. During WW-II, German tanks were numbered in sequential order as they were manufactured. Allied forces needed an estimate of how many tanks had been deployed, and what they had was a handful of captured tanks and the serial numbers painted on them. Using these, statisticians estimated the actual number of tanks to be far lower than what intelligence estimates had suggested. So how does it work?

Let us assume we draw a sample of size \(k\) from a population numbered \(1\) through \(N\), and the maximum in that sample is \(m\). The probability that the sample maximum equals \(m\) is
$$
P(\text{Sample Max} = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}
$$
The \(-1\) figures because once the maximum \(m\) is fixed, the remaining \(k-1\) tiles must be chosen from the \(m-1\) values below it. The expected value of the sample maximum is thus
$$
E(\text{Maximum}) = \sum_{m=k}^{N}m\frac{\binom{m-1}{k-1}}{\binom{N}{k}}
$$
Note, we run the above summation from \(k\) to \(N\) as for \(m < k\) the probability is \(0\): the sample maximum has to be at least \(k\). After a series of algebraic manipulations ( ref ) the above simplifies to
$$
E(\text{Maximum}) = \frac{k(N+1)}{k+1}
$$
Rearranging for \(N\) and substituting the observed sample maximum \(m\) for the expectation gives the estimator
$$
\hat{N} = m\big( 1 + \frac{1}{k}\big) - 1
$$
which is quite an ingenious and simple way to estimate population size given serial number ordering.
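A quick simulation of the estimator, assuming a made-up fleet of \(N = 250\) tanks and samples of \(k = 10\):

```r
set.seed(11)
N <- 250; k <- 10
N.hat <- replicate(10000, {
  m <- max(sample(1:N, k))   # largest serial number among k captured tanks
  m * (1 + 1 / k) - 1        # the estimator derived above
})
mean(N.hat)                  # close to the true N = 250
```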

If you are looking to buy some books on probability theory here is a good list.

Sunday, June 1, 2014

The Chakravala Algorithm in R



A class of problems that has piqued the interest of mathematicians across millennia is Diophantine equations: polynomial equations in multiple variables for which integer solutions are sought. A special case of Diophantine equations is Pell's equation. The name is a bit of a misnomer, as Euler mistakenly attributed it to the mathematician John Pell. The problem seeks integer solutions to the polynomial
$$
x^{2} - Dy^{2} = 1
$$
Several ancient mathematicians have attempted to study and find generic solutions to Pell's equation. The best known algorithm is the Chakravala algorithm, discovered by Bhaskara circa 1114 AD. Bhaskara implicitly credits Brahmagupta (circa 598 AD) for its initial discovery, though some credit it to Jayadeva too. Several Sanskrit words used to describe the algorithm appear to have changed in the 500 years between the two, implying other contributors. The Chakravala technique is simple, and implementing it in any programming language should be a breeze (credit citation)

Diophantine Equations (Pure & Applied Mathematics)


The method works as follows. Find a trivial solution to the equation; \(x=1,y=0\) can be used all the time. Next, initialize two parameters \([p_i,k_i]\), where \(i\) is an iteration count and \(k_i = x_i^{2} - Dy_i^{2}\). \(p_i\) is updated to \(p_{i+1}\) if the following two criteria are satisfied.
  • \(p_i + p_{i+1} \equiv 0 \pmod{k_i}\), i.e. \(p_i + p_{i+1}\) is divisible by \(k_i\)
  • \(| p_{i+1}^{2} - D |\) is minimized
After updating to \(p_{i+1}\), \(k_{i+1}\) is found by evaluating
$$
k_{i+1} = \frac{p_{i+1}^{2} - D}{k_{i}}
$$
and the next pair of values for \([x,y]\) is computed as
$$
x_{i+1} = \frac{p_{i+1}x_i + Dy_{i}}{|k_{i}|} \\
y_{i+1} = \frac{p_{i+1}y_i + x_{i}}{|k_{i}|}
$$
The algorithm also has an easy way to check that a found pair really is a solution: it only accepts pairs \((x_i, y_i)\) for which \(k_{i} = 1\).

The entire algorithm takes only a few lines of R, as shown below.
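Here is a minimal sketch of the method, following the update rules above; the function name, the starting triple \((x, y, k) = (1, 0, 1)\) with \(p = 0\), and the stopping check are my own choices, and \(D\) must not be a perfect square.

```r
chakravala <- function(D) {
  # start from the trivial solution x = 1, y = 0, so k = x^2 - D*y^2 = 1
  x <- 1; y <- 0; k <- 1; p <- 0
  repeat {
    ak <- abs(k)
    # p.new must satisfy (p + p.new) %% ak == 0; such values form an
    # arithmetic progression with step ak, so test the two closest to sqrt(D)
    r <- (-p) %% ak
    lo <- r + ak * floor((sqrt(D) - r) / ak)
    cand <- c(lo, lo + ak)
    cand <- cand[cand > 0]
    p.new <- cand[which.min(abs(cand^2 - D))]   # minimizes |p^2 - D|
    x.new <- (p.new * x + D * y) / ak
    y.new <- (p.new * y + x) / ak
    k <- (p.new^2 - D) / k
    x <- x.new; y <- y.new; p <- p.new
    if (k == 1 && y > 0) return(c(x = x, y = y))
  }
}
chakravala(61)   # returns x = 1766319049, y = 226153980
```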


A Related Puzzle:
A drawer has \(x\) black socks and \(y\) white socks. You draw two socks consecutively and they are both black. You repeat this several times (replacing the socks each time) and find that you get a pair of black socks with probability \(\frac{1}{2}\). You know that there are no more than 30 socks in the drawer in total. How many black and white socks are there?

The probability that you would draw two black socks in a row is
$$
P = \frac{x}{x+y}\times\frac{x - 1}{x+y - 1} = \frac{1}{2}
$$
Simplifying and solving for \(x\) yields
$$
x^{2} - (2y + 1)x + y - y^2 = 0
$$
which on further simplification gives
$$
x = \frac{2y + 1 \pm \sqrt{(2y+1)^2 +4(y^2 - y)}}{2}
$$
We can ignore the root with the negative sign as it would yield a negative value for \(x\) which is impossible. The positive root of the quadratic equation yields
$$
x = \frac{2y + 1 + \sqrt{8y^2 + 1}}{2}
$$
For \(x\) to be an integer, the term \(\sqrt{8y^2 + 1}\) has to be an odd integer \(z\) (say), odd so that the numerator stays even. We can now write it out as
$$
z = \sqrt{8y^2 + 1}
$$
or
$$
z^{2} - 8y^2 = 1
$$
This is Pell's equation (or Vargaprakriti in Sanskrit).
As we know that there are no more than 30 socks in the drawer, we can quickly work our way to the two admissible solutions to the problem, \(\{3,1\}\) and \(\{15,6\}\), as the brute-force check below shows.
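A quick brute-force sketch in R over all drawers with at most 30 socks in total:

```r
for (total in 2:30) {
  for (x in 1:total) {
    # P(two black in a row) = 1/2 exactly when 2x(x-1) = total(total-1)
    if (2 * x * (x - 1) == total * (total - 1))
      cat("black =", x, "white =", total - x, "\n")
  }
}
# prints black = 3, white = 1 and black = 15, white = 6
```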
If you are looking to buy books on probability theory here is a good list to own.
If you are looking to buy books on time series analysis here is an excellent list to own.

Wednesday, May 7, 2014

Hopping Robots and Reinforcement Learning



All too often when we deal with data, the outcome needed is a strategy or an algorithm itself. To arrive at that strategy we may have historic data or some model of how entities in the system respond to various situations. In this write up, I'll go over the method of reinforcement learning. The general idea behind reinforcement learning is to come up with a strategy that maximizes some measurable goal. For example, if you are modelling a robot that learns to navigate around obstacles, you want the learning process to come back with a strategy that minimizes collisions (say) with other entities in the environment.

Pattern Recognition and Machine Learning (Information Science and Statistics)

For the sake of simplicity, let's assume the following scenario. A robot is placed (at random) on a flat plank of wood which has some sticky glue in the center. To its left there is a hole which damages the robot a bit, and to its right is a reward which is its destination, as shown in the figure below.


The robot has the following capabilities
  • It can sense which of the states \(S_1,S_2,\ldots,S_7\) it is in
  • It can detect a reward or damage that happens to it while in a particular state
  • It can move one space left or right
  • It can hop a space on to the right
If the robot ends up in either of the terminal states \(S_1\) or \(S_7\), it is picked up and placed in another state at random. State \(S_1\) does damage to the robot while state \(S_7\) rewards it. \(S_4\) (the glue) is not exactly a damage-causing state, but it is a state to avoid, and we want the robot to learn that over time. All other states are neither harmful nor rewarding. The robot's goal is to "learn" to get to state \(S_7\) in as few steps as possible.

In order to get reinforcement learning to work, you need to know what the reward values are for each of the states the robot can be in. In this particular example, we will assume the following reward structure for each of the states.
Note, the numbers are fairly arbitrary. In addition to this we need a function or a table mapping action/state pairs to the resulting new states. Given the robot's movement description above, we can use a table as follows.
Zero is being used to designate the scenario when the robot is reset to a random state. With the above two sets of information we are good to start the learning process.

Reinforcement learning works in the following way. You maintain a matrix of values for each state/action pair, representing the reward attainable by arriving at a particular state through a particular action. The trick is to also account for which future states are reachable from the state you arrive in. For example, it may be beneficial to be in a certain state \(X\), but if all downstream states from \(X\) are bad to be in, you want to avoid \(X\). This way, the algorithm tries to ensure that favourable and achievable future states are taken into account. The algorithm also does not immediately update itself based on whatever it learns; it preserves old values and learns gradually. The above methodology can be stated as follows.

If \(Q^{*}(s_t,a_t)\) represents the payoff received by being in state \(s_t\) and taking action \(a_t\), then

$$
Q^{*}(s_t,a_t) = R(s_{t+1},s_t) + \alpha \max_{a_{t+1}}Q^{*}(s_{t+1},a_{t+1})
$$

To slow down learning a bit, we stick with whatever prior estimate of \(Q^{*}(s_t,a_t)\) we have to the extent of a fraction \(\beta\), as shown below

$$
Q^{*}_{new}(s_t,a_t) = \beta Q^{*}_{prior}(s_t,a_t) + (1 - \beta)Q^{*}(s_t,a_t)
$$
That's it! We now let the system take on various initial states and let the device move around between states while we constantly update our \(Q\) matrix. After several iterations, the \(Q\) matrix ends up with values that reflect the strategy to take.

To give it a whirl, here is R code that walks through the entire process for this particular example.
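What follows is a minimal sketch under stated assumptions: the reward values, the small backward-move penalty and the constants \(\alpha\) and \(\beta\) are made-up numbers, since the exact tables are not reproduced above.

```r
set.seed(42)
n.states <- 7
actions  <- c("left", "right", "hop")     # move left, move right, hop two right
reward   <- c(-1, 0, 0, -0.5, 0, 0, 1)    # assumed: S1 hurts, S4 is glue, S7 pays

next.state <- function(s, a) {
  s.new <- switch(a, left = s - 1, right = s + 1, hop = s + 2)
  if (s.new < 1 || s.new > n.states) s.new <- s   # can't move off the plank
  s.new
}

alpha <- 0.8   # weight on the best achievable downstream value
beta  <- 0.9   # fraction of the prior estimate we keep
Q <- matrix(0, nrow = n.states, ncol = length(actions),
            dimnames = list(paste0("S", 1:n.states), actions))

s <- sample(2:6, 1)
for (i in 1:5000) {
  a     <- sample(actions, 1)                   # explore uniformly at random
  s.new <- next.state(s, a)
  r     <- reward[s.new] - 0.1 * (s.new < s)    # assumed penalty for moving left
  Q[s, a] <- beta * Q[s, a] + (1 - beta) * (r + alpha * max(Q[s.new, ]))
  s <- if (s.new %in% c(1, 7)) sample(2:6, 1) else s.new   # reset at terminals
}
round(Q, 2)
```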



A thing to note in the code is how the reward function is encoded. A penalty is imposed on moving from a higher state to a lower state. This is a simple way to ensure that whatever strategy emerges does not give back the rightward positional gains that have already been made. If you try it without the penalty, you will see cases where the strategy is indifferent between going left or right in some states. The above code, when run, produces a \(Q\) matrix along the following lines.


The rows represent the states, the columns represent the three possible actions move-left, move-right and hop-right that the robot can take. Notice what it's trying to say:
  • State 1, do nothing 
  • State 2, move right, but don't hop
  • State 3, don't move right, hop
  • State 4, move right or hop
  • State 5, move right (small chance), hop (definitely)
  • State 6, move right, move left (small chance)
This is exactly what you would expect and want! More on this subject can be read from the following books
Reinforcement Learning: State-of-the-Art (Adaptation, Learning, and Optimization)
and
Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)
If you are looking to buy some books on probability theory here is a good list to own.

Thursday, April 17, 2014

The Two Strategies


Q: You are in a game where you get to toss a pair of coins once. There are two boxes (A & B), each holding a pair of coins. Box A's coins are fair; box B's coins are biased, with probabilities of heads \(0.6\) and \(0.4\) respectively. You are paid the expected number of heads you toss. Which of the boxes should you pick?

Machine Learning: The Art and Science of Algorithms that Make Sense of Data

A: The expected number of heads if you choose box A is easy to calculate as
$$
E(\text{heads}| A) = \frac{1}{2} + \frac{1}{2} = 1
$$
However the expected number of heads if you choose box B turns out to be the same
$$
E(\text{heads}| B) = \frac{4}{10} + \frac{6}{10} = 1
$$
The average yield being the same could make one think that both boxes are equivalent. However there is one difference: the variance. The variance of the distribution of a random variable \(X\) is defined as
$$
Var(X) = \sum_{i=0}^{N} (x_i - \bar{x})^{2}p_i
$$
where \(p_i\) is the probability of \(x_i\). The two tosses are independent, so the variance of the total number of heads is the sum of the individual variances, and each coin with heads probability \(p\) contributes a Bernoulli variance of \(p(1-p)\). Given this, let's compute the variance of each strategy
$$
Var(X | \text{Box=A}) = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{2} = 0.5 \\
Var(X | \text{Box=B}) = \frac{2}{5}\cdot\frac{3}{5} + \frac{3}{5}\cdot\frac{2}{5} = \frac{12}{25} = 0.48
$$
The variance for box B is slightly smaller than for box A, which makes choosing the coins in box B the better strategy: the same expected payout with a slightly tighter spread.
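A quick simulation, as a sanity check of the two means and variances:

```r
set.seed(7)
n <- 1000000
heads.A <- rbinom(n, 1, 0.5) + rbinom(n, 1, 0.5)   # two fair coins
heads.B <- rbinom(n, 1, 0.6) + rbinom(n, 1, 0.4)   # the two biased coins
c(mean(heads.A), mean(heads.B))   # both approximately 1
c(var(heads.A), var(heads.B))     # approximately 0.50 and 0.48
```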

If you are looking to buy some books on probability theory here is a good list.

Sunday, March 23, 2014

Linear Regression, Transforms and Regularization


This write up is about simple linear regression and ways to make it robust to outliers and non-linearity. The linear regression method is simple and powerful. It is powerful because it compresses a lot of information into a simple straight line; the complexity of the problem is vastly simplified. However, being so simple comes with its own set of limitations. For example, the method assumes that after a fit is made, the differences between the predicted and actual values are normally distributed. In reality, we rarely run into such ideal conditions. The data almost always shows non-normality and outliers that make fitting a straight line insufficient. However, there are some tricks you could use to make it better.

Statistics: A good book to learn statistics

As an example data set, consider the dummy data shown in the table/chart below. Notice that the value 33 is an outlier. When charted, you can see there is some non-linearity in the data too, for higher values of \(x\).
First let's tackle the non-linearity. It can be managed by applying the Box-Cox transform to the y-values, a class of power transformations. It is a useful transform for bringing about normality in a series of values that looks "curvy". The transform is
$$
\hat{y} = \frac{y^{\lambda} - 1}{\lambda}
$$
The value of \(\lambda\) needs to be chosen optimally, maximizing the log-likelihood so that the transformed series looks as much as possible like it came from a normal distribution. Most statistical tools out there do it for you; in R, the function "boxcox" in the package "MASS" does it. The following code snippet computes the optimal value of \(\lambda\) as -0.30303.
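A minimal sketch, assuming some made-up dummy data with the outlier at \(x = 4\) (the optimum of -0.30303 quoted above is for the original data, so the value here will differ):

```r
library(MASS)
x <- 1:10
y <- c(1.2, 1.8, 2.1, 33, 3.4, 4.2, 5.9, 8.8, 14.1, 24.5)  # assumed dummy data
bc <- boxcox(y ~ x, lambda = seq(-2, 2, 0.01), plotit = FALSE)
(lambda.opt <- bc$x[which.max(bc$y)])   # the lambda maximizing the log-likelihood
```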


Let's apply this transformation to the data and see how it looks on a chart.

You can see that it looks a lot better and more like a straight line, except for the outlier at \(x = 4\). The goodness of a straight line fit is measured by the fit's \(R^2\) value, which quantifies the amount of variance explained by the fit. If we fit a straight line through the original data we get an \(R^2 = 0.06804\), while the transformed data yields an \(R^2 = 0.4636\), demonstrating an improvement.

Next, let's try to manage the outlier. If you try fitting a straight line \(y = \alpha_0 + \alpha_{1} x\) through a set of points that has a few outliers, you will notice that the values of \(\alpha_{0}, \alpha_{1}\) tend to be slightly large. They end up large because the fit is trying to "reach out" and accommodate the outlier. In order to minimize the effect of the outlier, we get back into the guts of the linear regression. A linear regression typically minimizes the overall error \(e\) computed as
$$
e = \sum_{i=1}^{N}\frac{(y_{i} - \alpha_{0} - \alpha_{1}x_{i})^2}{N}
$$
where \(N\) is the number of points. We can tweak the error to be minimized as follows
$$
e = \sum_{i=1}^{N}\frac{(y_{i} - \alpha_{0} - \alpha_{1}x_{i})^2}{N} + \lambda(\alpha_{0}^2 + \alpha_{1}^2)
$$
The tweaked error equation penalizes choices of \(\alpha_{0}, \alpha_{1}\) that take large values. The optimal value of \(\lambda\) needs to be ascertained by tuning; a formulaic solution does not exist, so we use another tool in R, the function "optim". This function does basic optimization: it minimizes any function you pass it, along with required parameters, and returns the parameter values that minimize that function. Its usage isn't exactly intuitive. Most examples on the internet minimize a proper well-formed function, while most real life applications involve minimizing functions having lots of parameters and additional data. The function "optim" accepts the "..." argument, which is a means to pass arguments through to the function you want to minimize. So here is how you would do it in R, all of it.
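Below is a sketch of the full workflow under the same assumed dummy data as before; the regularization weight lam = 0.1 is also an assumed value.

```r
library(MASS)

x <- 1:10
y <- c(1.2, 1.8, 2.1, 33, 3.4, 4.2, 5.9, 8.8, 14.1, 24.5)  # assumed dummy data

# Box-Cox transform of y at the optimal lambda
bc <- boxcox(y ~ x, lambda = seq(-2, 2, 0.01), plotit = FALSE)
lambda.bc <- bc$x[which.max(bc$y)]
y.t <- (y^lambda.bc - 1) / lambda.bc
inv.bc <- function(z) (lambda.bc * z + 1)^(1 / lambda.bc)  # back-transform

# regularized squared error; optim's "..." passes x, y and lam through to it
ridge.loss <- function(par, x, y, lam) {
  mean((y - par[1] - par[2] * x)^2) + lam * sum(par^2)
}
fit.reg <- optim(c(0, 0), ridge.loss, x = x, y = y.t, lam = 0.1)

fit.lm <- lm(y ~ x)      # plain fit on the raw data
fit.t  <- lm(y.t ~ x)    # fit in the transformed domain, no regularization

plot(x, y)
abline(fit.lm, col = "grey")                                      # method 1
lines(x, inv.bc(fitted(fit.t)), col = "red")                      # method 2
lines(x, inv.bc(fit.reg$par[1] + fit.reg$par[2] * x),
      col = "blue", lty = 2)                                      # method 3
```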



The above code walks through this example by calling "optim". It finally plots the fits in the original domain using all three methods
  1. The grey line represents the fit if you simply used "lm"
  2. The red line represents the fit if you transformed the data and used "lm" in the transformed domain but without regularization. Note: you are clearly worse off.
  3. The blue dotted line shows the fit if you used transformation and regularization, clearly a much better fit



The function "optim" has lots of methods that it uses for finding the minimum value of the function. Typically, you may also want to poke around with the best value of \(\lambda\) in the minimization function to get better fits.
If you are looking to buy some books on time series analysis here is a good collection. Some good books to own for probability theory are referenced here

Sunday, March 2, 2014

The Lazy Apprentice



Q: A shopkeeper hires an apprentice for his store, which gets one customer per minute on average, arriving uniformly at random. The apprentice is expected to keep the shop open until at least 6 minutes have passed with no customer arriving. The shopkeeper suspects that the apprentice is lazy and closes the shop after a shorter wait. The apprentice claims (and the shopkeeper verifies) that the shop is open for about 2.5 hours on average. How can the shopkeeper back his suspicion?

Statistics: A good book to learn statistics

A: Per the contract, at least 6 minutes should pass without a single customer showing up before the apprentice can close the shop. To solve this, let's tackle a different problem first. Assume you have a biased coin with a probability \(p\) of landing heads. What is the expected number of tosses before you get \(n\) heads in a row? The expected number of tosses to get the first head is simple enough to calculate: it's \(\frac{1}{p}\). How about two heads in a row? We can formulate this recursively. We first need to get a head, taking \(\frac{1}{p}\) tosses on average. Following this, we toss the coin one more time for sure; with probability \(p\) we get the second head, and with probability \(1 - p\) we have to start over again. The expected number of tosses to two heads, \(y_{2}\), thus satisfies \(y_{2} = y_{1} + 1 + (1-p)y_{2}\).

Extending this out to \(n\) heads in a row: if \(y_{n}\) is the expected number of tosses to get \(n\) heads in a row, then the following state transition diagram shows how the transitions happen.

From the state \(y_{n-1}\), with probability \(1- p\) you start over again. Stated recursively
$$
y_{n} = y_{n-1} + 1 + (1-p)y_{n}\\
py_{n} = y_{n-1} + 1
$$
Using the above expression, it is easy to derive the general expression for \(y_{n}\) as follows, keeping in mind \(y_{0} = 0\)
$$
y_{1} = \frac{y_{0}}{p} + \frac{1}{p} = \frac{1}{p}\\
y_{2} = \frac{1}{p}(y_{1} + 1) = \frac{1}{p^{2}} + \frac{1}{p}\\
y_{3} = \frac{1}{p}(y_{2} + 1) = \frac{1}{p^{3}} + \frac{1}{p^{2}} + \frac{1}{p}
$$
Being the sum of a geometric series, \(y_{n}\) can be evaluated to
$$
y_{n} = \frac{1}{1-p}\big(\frac{1}{p^{n}} - 1\big)
$$
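As a quick check of the closed form, assuming a fair coin (\(p = 0.5\)) and \(n = 3\):

```r
# expected tosses to get 3 heads in a row with p = 0.5: 2 + 4 + 8 = 14
p <- 0.5; n <- 3
(1 / (1 - p)) * (1 / p^n - 1)   # 14, matching the formula

set.seed(2)
tosses.to.n.heads <- function(n, p) {
  run <- 0; tosses <- 0
  while (run < n) {
    tosses <- tosses + 1
    run <- if (runif(1) < p) run + 1 else 0   # extend or restart the streak
  }
  tosses
}
mean(replicate(20000, tosses.to.n.heads(3, 0.5)))   # close to 14
```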
Now, back to the original question. Assume the apprentice waits \(k\) minutes with no customer showing up before choosing to shut the shop. The situation is "similar" to tossing coins and waiting for a string of heads. (Note: strictly speaking, it's similar only in the limiting case when the time interval is very small.) In this case each "head" signifies the absence of a customer showing up in a given minute. As the customers arrive uniformly at random, we can assume they follow a Poisson process. The probability that \(m\) customers arrive in a one minute window, if the rate parameter is \(\lambda\) (in this case \(\lambda = 1\ \text{min}^{-1}\)), is

$$
P(m,\lambda) = \frac{(\lambda)^{m}e^{-\lambda}}{m!}
$$

When \(m = 0\) we get \(p = e^{-\lambda}\). Plugging this value back into our equation for \(y_{n}\), with \(n = k\), we get
$$
y_{k} = \frac{1}{1 - e^{-\lambda}}\big(e^{k\lambda} - 1\big)
$$
For small \(\lambda\), the factor \(\frac{1}{1 - e^{-\lambda}}\) can be approximated as \(\frac{1}{\lambda}\), yielding
$$
y_{k} = \frac{1}{\lambda}\big(e^{k\lambda} - 1\big)
$$

Note, if you plug \(k=6\) into the above equation you get \(\approx 7\) hours, whereas for \(k=5\) you get \(\approx 2.5\) hours. Due to the exponential connection between \(y_{k}\) and \(k\), the values of \(y_{k}\) are super sensitive to changes in \(k\). An average of about 2.5 hours is thus exactly what you would see if the apprentice were waiting only 5 quiet minutes instead of the agreed 6, which backs the shopkeeper's suspicion.
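A quick simulation, assuming Poisson arrivals at one customer per minute, confirms the two figures:

```r
set.seed(1)
time.open <- function(k, lambda = 1) {
  # minutes until k consecutive customer-free minutes have passed
  quiet <- 0; minutes <- 0
  while (quiet < k) {
    minutes <- minutes + 1
    quiet <- if (rpois(1, lambda) == 0) quiet + 1 else 0
  }
  minutes
}
mean(replicate(2000, time.open(6))) / 60   # roughly 7 hours
mean(replicate(2000, time.open(5))) / 60   # roughly 2.5 hours
```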

If you are looking to buy some books in probability, here are some of the best books to learn the art of probability.