Thursday, July 25, 2013

Stirling Numbers of the Second Kind: a Puzzle

Q: An urn contains 20 uniquely identifiable balls. How many draws with replacement need to be made before you are \(95\%\) sure that you have seen all of them?

A: A benign-looking question, but its closed-form solution is fairly complex and quite unlike a typical counting puzzle. First, let's learn a useful tool: Stirling numbers of the second kind. The full name matters, since there are also Stirling numbers of the first kind. A Stirling number of the second kind takes two arguments, say \((k,n)\), and is written \(S(k,n)\) or \({k\brace n }\). It counts the number of ways to partition a set of size \(k\) into \(n\) non-empty subsets. For example, the set \(\{a,b,c\}\) can be split into 2 non-empty subsets in three ways: \(\{\{a\},\{b,c\}\}\), \(\{\{b\},\{a,c\}\}\) and \(\{\{c\},\{a,b\}\}\). So we write
\[S(3,2) = 3\]
The general expression for \(S(k,n)\) is
\[S(k,n) = \frac{1}{n!}\sum_{i=0}^{n}(-1)^{i}{{n}\choose{i}}(n-i)^{k}\]
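As a quick check on the formula, here is a minimal Python sketch (Python is our choice here; the post itself only mentions R). The function name `stirling2` is hypothetical, and it evaluates the inclusion-exclusion sum with the exponent \(k\) on \((n-i)\).

```python
from math import comb, factorial


def stirling2(k: int, n: int) -> int:
    """Number of ways to partition a k-element set into n non-empty subsets."""
    # Alternating sum over i = 0..n of C(n, i) * (n - i)^k.
    total = sum((-1) ** i * comb(n, i) * (n - i) ** k for i in range(n + 1))
    # The sum counts surjections onto n labelled bins, so it is divisible by n!.
    return total // factorial(n)


print(stirling2(3, 2))  # the three ways to split {a, b, c} into two groups
```

Integer arithmetic is used throughout, so the result is exact even for large \(k\).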

The present problem is similar to starting with \(k\) balls, dropping them into \(n\) bins and evaluating the probability that all bins have at least 1 ball. Here each draw plays the role of a ball and each of the 20 balls in the urn plays the role of a bin, so \(n = 20\). If we can express this probability in terms of \(k,n\), then we can set it to \(95\%\) and solve for \(k\) (if it is feasible!).

To start with, each ball can go into any one of the \(n\) bins, so there are \(n^{k}\) ways to distribute \(k\) balls into \(n\) bins. This is the total number of ways to distribute the balls, the denominator in our probability estimate. The numerator is where things get tricky. Let's say we have just 3 balls that we want to drop into 2 bins. There are \(S(3,2) = 3\) ways to split the balls into 2 non-empty groups, and because the bins are distinguishable, each split can be assigned to the bins in \(2! = 2\) ways. That gives \(6\) of the \(2^{3} = 8\) arrangements in which no bin is empty. In general the groups can be assigned to the bins in \(n!\) ways, so the sought probability can be stated as
\[P(\text{All Balls Drawn at least once}) = \frac{n!\,{k \brace n}}{n^{k}}\]
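The probability can be evaluated exactly for any given \(k\) and \(n\). The sketch below is in Python (our choice, not the post's R) and uses a hypothetical helper `prob_all_bins_filled`. It counts the arrangements with no empty bin, \(n!\,S(k,n)\), directly by inclusion-exclusion and divides by \(n^{k}\).

```python
from fractions import Fraction
from math import comb


def prob_all_bins_filled(k: int, n: int) -> Fraction:
    """Exact probability that k balls dropped uniformly into n bins leave no bin empty."""
    # Number of arrangements with every bin occupied, equal to n! * S(k, n).
    favourable = sum((-1) ** i * comb(n, i) * (n - i) ** k for i in range(n + 1))
    return Fraction(favourable, n ** k)


print(prob_all_bins_filled(3, 2))            # 3/4, i.e. 6 of the 8 arrangements
print(float(prob_all_bins_filled(100, 20)))  # 100 draws from the 20-ball urn
```

Using `Fraction` keeps the alternating sum free of floating-point cancellation; convert to `float` only for display.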
This looks good overall as we now have a closed-form expression for the probability. We could set it to \(95\%\) and try to solve for \(k\). However, there is no simple way to invert it for \(k\) by hand, so in practice one resorts to numerical evaluation or simulation. Instead we can try something different: put an upper bound on the tail probability and work from that. For this we use the second tool: Markov's inequality. For a non-negative random variable \(X\) and any \(a > 0\), it states that
\[P(X \ge a) \le \frac{E(X)}{a}\]
where \(X\) is the number of draws needed to see every ball and \(E(X)\) is the expectation of \(X\).
The expectation is easier to compute and think of. The first pull will definitely give us one unique ball. On the second pull there is a \(\frac{19}{20}\) chance of getting another unique ball, so on average it takes \(\frac{20}{19}\) pulls to get it (the mean of a geometric distribution with success probability \(\frac{19}{20}\)). So the expected number of draws to get two unique balls is \(1 + \frac{20}{19}\). Extending this, the expected number of draws to get all 20 unique balls is
\[E(X) = 1 + \frac{20}{19} + \frac{20}{18} + \ldots + \frac{20}{1}\]
This sum is easy to evaluate numerically, for example in R.
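A minimal Python sketch of the same computation (Python is our assumption; the function name `expected_draws` is hypothetical). It adds up the terms \(\frac{n}{n-j}\) for \(j = 0, \ldots, n-1\), which for \(n = 20\) is exactly the sum above.

```python
from fractions import Fraction


def expected_draws(n: int) -> Fraction:
    """Expected number of draws with replacement needed to see all n balls."""
    # After j distinct balls have been seen, a new one shows up with probability
    # (n - j) / n, so the wait for it averages n / (n - j) draws.
    return sum(Fraction(n, n - j) for j in range(n))


print(float(expected_draws(20)))
```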

The result is \(\approx 72\). To choose \(a\) we reason as follows: if \(a\) were indeed the number of draws needed to be \(95\%\) sure that all balls have been seen, then the probability of requiring \(\ge a\) draws would be \(5\%\). Plugging that into Markov's inequality gives \(0.05 \le \frac{72}{a}\), that is
\[a \le \frac{72}{0.05} = 1440\]
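The bound is a one-line calculation. The Python sketch below (our choice of language; `markov_draw_bound` is a hypothetical name) recomputes it from the unrounded expectation, which gives a value just under 1440.

```python
import math


def markov_draw_bound(n: int = 20, confidence: float = 0.95) -> float:
    """Markov upper bound on the draws needed to see all n balls with the given confidence."""
    # E(X) = n/n + n/(n-1) + ... + n/1
    expectation = math.fsum(n / (n - j) for j in range(n))
    # P(X >= a) <= E(X) / a, with the tail probability set to 1 - confidence.
    return expectation / (1.0 - confidence)


print(round(markov_draw_bound()))
```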
Note that this is an extremely loose bound, nowhere near as tight as we would want. Tightening it is a topic for a later write-up. In the meantime, you can estimate the answer by simulation. It comes out at about 117 pulls, well above the expected value of roughly 72 and far below the Markov bound.
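One way to run such a simulation is sketched below in Python (an assumption on our side, since the post refers to R). The helper names, the trial count and the seed are all hypothetical choices. Each trial draws with replacement until all 20 balls have appeared, and the \(95\%\) point is read off the sorted results.

```python
import math
import random


def draws_to_see_all(n: int, rng: random.Random) -> int:
    """Draw with replacement from n labelled balls until every label has appeared."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


def simulated_quantile(n: int = 20, level: float = 0.95,
                       trials: int = 5000, seed: int = 1) -> int:
    """Empirical `level` quantile of the number of draws needed to see all n balls."""
    rng = random.Random(seed)
    samples = sorted(draws_to_see_all(n, rng) for _ in range(trials))
    index = min(trials - 1, math.ceil(level * trials) - 1)
    return samples[index]


print(simulated_quantile())
```

With a few thousand trials the estimate still moves by a draw or two between seeds, so increase `trials` for a steadier figure.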
If you are looking to buy some books on probability, here are some of the best for learning the art.

Fifty Challenging Problems in Probability with Solutions (Dover Books on Mathematics)
This book is a great compilation that covers quite a few puzzles. What I like about these puzzles is that they are all tractable and don't require too much advanced mathematics to solve.

Introduction to Algorithms
This is a book on algorithms, some of which are probabilistic. The book is a must-have for students, job candidates and even full-time engineers and data scientists.

Introduction to Probability Theory
Overall an excellent book to learn probability, highly recommended for undergrads and graduate students.

An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Edition
This is a two-volume book and the first volume is what will likely interest a beginner because it covers discrete probability. The book tends to treat probability as a theory on its own.

The Probability Tutoring Book: An Intuitive Course for Engineers and Scientists (and Everyone Else!)
A good book for graduate-level classes: it has some practice problems, which is a good thing. But that doesn't make this book any less of a buy for the beginner.

Introduction to Probability, 2nd Edition
A good book to own. Does not require prior knowledge of other areas, but the book is a bit low on worked out examples.

Bundle of Algorithms in Java, Third Edition, Parts 1-5: Fundamentals, Data Structures, Sorting, Searching, and Graph Algorithms (3rd Edition) (Pts. 1-5)
An excellent resource for students, engineers and even entrepreneurs if you are looking for code that you can take and implement directly on the job.

Understanding Probability: Chance Rules in Everyday Life
This is a great book to own. The second half of the book may require some knowledge of calculus. It appears to be the right mix for someone who wants to learn but doesn't want to be scared with the "lemmas"

Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
This one is a must-have if you want to learn machine learning. The book is beautifully written and ideal for the engineer or student who doesn't want to get too far into the details of a machine learning approach but wants a working knowledge of it. There are some great examples and test data in the textbook too.

Discovering Statistics Using R
This is a good book if you are new to statistics & probability while simultaneously getting started with a programming language. The book supports R and is written in a casual humorous way making it an easy read. Great for beginners. Some of the data on the companion website could be missing.

A Course in Probability Theory, Third Edition
This book covers the central limit theorem and other graduate topics in probability. You will need to brush up on some mathematics before you dive in, but most of that can be done online.

Probability and Statistics (4th Edition)
This book has been yellow-flagged with some issues, including sequencing of content that could be a problem. But otherwise it's good.
