
The Restricted Boltzmann Machine (RBM) and Dreams




This write-up is less of a puzzle and more of a tutorial. Without further ado, here you go.

The Restricted Boltzmann Machine (RBM) has become increasingly popular of late after its success in the Netflix Prize competition and other competitions. Much of the inventive work behind RBMs was done by Geoffrey Hinton, in particular the training of RBMs using an algorithm called "Contrastive Divergence" (CD), which is closely related to gradient descent. A nice consequence of CD training is the RBM's ability to "dream". Among the various machine learning methods out there, the RBM is unusual in having this generative capacity baked in.


Energy Based Models:

The RBM is an energy based model. Given training data with, say, two classes to classify, the RBM during its training phase tries to assign a lower "energy" to configurations that resemble its training examples than to ones that do not. It achieves this through an architecture of visible and hidden units, shown in the figure below.

There are also bias units, but I've excluded them from the diagram above to highlight what makes the RBM's architecture unique. Notice that the visible units have no links amongst themselves, and the hidden units likewise have no connections amongst themselves. This is where the "Restricted" in the name RBM comes from: the units are restricted in whom they may connect to. The architecture, in that sense, is fixed: each visible unit connects to every hidden unit and each hidden unit connects to every visible unit.

Now on to the next logical question: what are these units? The visible units are simply numbers (real-valued or integer); they can be represented by any vector or matrix of numbers. They are the gateway through which we feed in training examples and read out predictions, dreams and so on. The hidden units are like the neurons in a typical neural network: they aggregate the data arriving through the visible units. What about those connecting lines? They represent the weights by which each visible unit is connected to a hidden unit. The hidden units are sigmoid functions which take the form
$$
P(x) = \frac{1}{1 + e^{-x}}
$$
If \(P(x)\) is greater than \(0.5\), you can think of the unit as "lighting up" (in the code below the unit is actually compared against a uniform random draw, making the activation stochastic).
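For reference, the energy an RBM assigns to a joint configuration of visible units \(v\) and hidden units \(h\), with weights \(w\) and the bias terms \(a, b\) mentioned earlier, is
$$
E(v,h) = -\sum_{j} a_j v_j - \sum_{i} b_i h_i - \sum_{i,j} h_i w_{ij} v_j
$$
and lower-energy configurations are exactly the ones the model treats as more probable, since \(P(v,h) \propto e^{-E(v,h)}\).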

Training & Learning:

In order to train an RBM we follow the CD algorithm. Note that RBMs are known to work best with binary input features and when we want to do classification of some sort. You could still do regression or deal with real-valued features, but that is beyond the scope of this write-up; for more information you can read this excellent write-up by Geoffrey Hinton. The CD algorithm starts by initializing the weights to random values. Next we present a training example to the visible units, turning on (setting to 1) the visible units whose features are present, and compute how the hidden units light up. Each hidden unit receives a flow of values from every visible unit, aggregated by the corresponding weights. For the \(i^{th}\) hidden unit, if \(w_{ij}\) represents the weights pointing towards it, the total value coming into it would be
$$
\sum_{j=1}^{N} w_{ij} v_{j}
$$
where \(N\) is the number of visible units. This causes some of the hidden units to light up; this step is called the "positive phase" (the name itself is not important). Next, the lit-up hidden units are used to recreate the visible units. This step is crucial to understand. The first time this is done, the recreated visible units will most likely look nothing like the example you just gave it. Based on how different the recreated visible units are from the ones you fed in, the weights are adjusted. The size of the adjustment is controlled by the learning rate, typically set to some small number; the method is very similar to gradient ascent. The smaller the learning rate, the gentler the learning and the longer it takes to finish. The entire process is repeated for the next training example, and the one after that, and so on until all examples have been seen; all of this counts as one iteration. You do several hundred such iterations until the visible units generated by the RBM during the negative phase closely resemble the input ones during the positive phase.
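In symbols, this is the standard CD-1 weight update, which the code below accumulates over the training set:
$$
\Delta w_{ij} = \epsilon\left(\langle h_i v_j\rangle_{\text{data}} - \langle h_i v_j\rangle_{\text{recon}}\right)
$$
where \(\epsilon\) is the learning rate, the first average is taken with the hidden units driven by the training data (the positive phase) and the second with the hidden units driven by the reconstruction (the negative phase).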

Example

The following is an example implementation of the RBM done in R. I've tried to make it slightly more efficient by using R's built-in apply functions. This is by no means complete, but should hopefully serve as an inspiration for someone to build out a package for this method in R and put it on CRAN.

Before getting into the example implementation, here is an example dataset I have created. It has just two classes of examples: several instances of an arrow pointing up and an arrow pointing down.

An example up arrow would look like the following. Notice the ASCII art :)

0,0,0,0,1,0,0,0,0
0,0,0,1,1,1,0,0,0
0,0,1,1,1,1,1,0,0
0,1,1,1,1,1,1,1,0
1,1,1,1,1,1,1,1,1

All the rows are merged into one long row for training purposes. Similarly, a down arrow would look like the following.

1,1,1,1,1,1,1,1,1
0,1,1,1,1,1,1,1,0
0,0,1,1,1,1,1,0,0
0,0,0,1,1,1,0,0,0
0,0,0,0,1,0,0,0,0

To create a training set, simply flip a few 1s to 0s at random and create equal instances of each.
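For concreteness, here is one rough way the arrow.csv file used below could be generated. This is just a sketch: the number of examples per class (20) and the amount of noise (two flipped pixels) are arbitrary choices of mine, not something the original training run depends on.

up <- c(0,0,0,0,1,0,0,0,0,
        0,0,0,1,1,1,0,0,0,
        0,0,1,1,1,1,1,0,0,
        0,1,1,1,1,1,1,1,0,
        1,1,1,1,1,1,1,1,1)
# each row of the arrow is symmetric, so reversing the flattened vector
# flips the arrow vertically and gives the down arrow
down <- rev(up)

# flip a couple of the 1s to 0s at random to make a noisy copy
noisy <- function(v, n.flips = 2){
  ones <- which(v == 1)
  v[sample(ones, n.flips)] <- 0
  v
}

train <- rbind(t(replicate(20, noisy(up))),
               t(replicate(20, noisy(down))))
write.table(train, "arrow.csv", sep = ",", row.names = FALSE, col.names = FALSE)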

The R code for the RBM is shown below.


# Stochastic binary activation: compute the sigmoid probability and
# return 1 with that probability, 0 otherwise
act.f <- function(a){
  p = 1/(1 + exp(-a))
  if(p > runif(1,0,1)) 1 else 0
}
 
# One "dream" pass: drive the hidden units from the visible vector xx,
# then use the sampled hidden states to regenerate the visible units
dream <- function(xx,w,nc){
  a1 = w %*% xx
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=nc)
  a1 = t(w) %*% hidden.state
  visible.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)
  return(visible.state)
}
 
 
# One Contrastive Divergence (CD-1) step for a single training example xx.
# Accumulates the positive and negative phase statistics into the global
# matrices pos.e.iter and neg.e.iter (hence the <<- assignments).
update.wgt <- function(xx,w,nc){
  # positive phase: hidden states driven by the training example
  a1 = w %*% xx
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=nc)
  pos.e = hidden.state %*% t(xx)

  # negative phase: reconstruct the visible units, then drive the
  # hidden units again from the reconstruction
  a1 = t(w) %*% hidden.state
  visible.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)
  a1 = w %*% visible.state
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)
  neg.e = hidden.state %*% t(visible.state)

  pos.e.iter <<- pos.e.iter + pos.e
  neg.e.iter <<- neg.e.iter + neg.e
}
 
# read the training data and add a bias unit (a column of 1s)
x = read.table("arrow.csv",sep=",",header=F)
x = cbind(rep(1,nrow(x)),x)
x = as.matrix(x)

# random initial weights: one row per hidden unit, one column per visible unit
num.hiddenstates = 3
w = matrix(runif(ncol(x)*num.hiddenstates,-1,1),nrow=num.hiddenstates)

system.time({
  for(iter in 1:2000){
    # accumulators for the positive and negative phase statistics
    pos.e.iter = matrix(rep(0,ncol(x)*num.hiddenstates),nrow=num.hiddenstates)
    neg.e.iter = matrix(rep(0,ncol(x)*num.hiddenstates),nrow=num.hiddenstates)
    # one CD-1 pass over every training example (update.wgt works by side effect)
    apply(x,1,update.wgt,w,num.hiddenstates)
    # weight update with a learning rate of 0.01
    w = w + 0.01*(pos.e.iter - neg.e.iter)
  }
})
print(t(w))
 
 
# two noisy test cases (the leading 1 is the bias unit):
# x.test1 is a down arrow, x.test2 an up arrow
x.test1 = as.matrix(c(1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0),ncol=1)
x.test2 = as.matrix(c(1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,1),ncol=1)

# hidden-unit inputs for each test case
a1 = w %*% x.test1
a1

a1 = w %*% x.test2
a1

# "dream" twice from the up-arrow test case; drop the bias element and
# reshape the remaining 45 values into a 5 x 9 grid for display
d.vector = dream(x.test2,w,3)
d.vector = d.vector[-1]
d.state = matrix(d.vector,byrow=TRUE,nrow=5)
print(d.state)

d.vector = dream(x.test2,w,3)
d.vector = d.vector[-1]
d.state = matrix(d.vector,byrow=TRUE,nrow=5)
print(d.state)


The above code spends most of its time in the training phase. Also included are two example test cases. The number of iterations is hard-wired at 2000 for simplicity. Three hidden units are chosen, but you can change that via the variable num.hiddenstates. x.test1 and x.test2 are two column matrices holding the example cases. The hidden-node values for the two test cases are also printed out: when you run this code, notice that the pattern of values in a1 differs between the two cases. Because training involves random initialization and stochastic hidden units, you will not get exactly the same numbers every time you run the code; this is important to understand and note. A typical (and good) way to use RBMs is to feed this output into another classifier (say a neural network or a decision tree) which in turn does the prediction. Such architectures are called "deep learning" architectures.
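As a rough illustration of that last point (this is not part of the original script, just a sketch), the deterministic hidden-unit probabilities could be extracted for every training row and handed to a downstream classifier:

# hidden-unit probabilities (no sampling) for every training example
hidden.features = t(apply(x, 1, function(v) 1/(1 + exp(-(w %*% v)))))
# hidden.features has one row per example and one column per hidden unit;
# these columns could serve as inputs to, say, a decision tree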

Dreaming

The final leg of the RBM code does a cool trick: it can "dream"! This simply means the hidden units are used to regenerate the visible units. To seed the dreaming process we give it one example from the training set (in the code I actually give it a test case) and see which hidden units get activated. Then the negative phase of the CD algorithm is run to regenerate the visible units. This can be repeated as many times as one pleases. I've set it to dream up just two instances of an ASCII-art up arrow. Here is how they showed up on my machine.

[1,]    0    0    0    0    1    0    0    0    0
[2,]    0    0    0    1    1    1    0    0    0
[3,]    0    0    1    1    0    1    1    0    0
[4,]    0    1    1    1    1    1    1    1    0
[5,]    1    1    1    0    1    1    0    1    1

[1,]    0    0    0    0    1    0    0    0    0
[2,]    0    0    1    1    1    1    0    0    0
[3,]    0    0    1    0    0    1    1    0    0
[4,]    0    1    1    1    1    1    1    1    0
[5,]    1    1    1    1    1    1    1    1    1
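If you want the dream to wander further from its seed, dream() can simply be applied repeatedly, each call being one more up-and-down pass through the network. A minimal sketch (the number of passes is an arbitrary choice, and the bias element is reset to 1 between passes):

seed = x.test2
for(k in 1:5){
  seed = dream(seed, w, 3)
  seed[1] = 1   # keep the bias unit clamped to 1 between passes
}
print(matrix(seed[-1], byrow=TRUE, nrow=5))
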
Hopefully this write-up was helpful in gaining a better understanding of this nice machine learning algorithm. If you are looking to learn the art of machine learning, here are some good books to own.

Bayesian Reasoning and Machine Learning

Ensemble Methods: Foundations and Algorithms (Chapman & Hall/Crc Machine Learnig & Pattern Recognition)

Neural Networks for Pattern Recognition
A good book to understand why RBMs and neural networks are able to fit the functions we want to learn.

Recommender Systems Handbook

Collective Intelligence in Action

Comments

  1. Hi,

    Thanks for posting this code, I found it really useful. I've recoded it to not use the apply() function and let R call the functions directly with vectors, I got a 10x speedup. Some code fragments:

    boltzmann.prob <- function(a) {
    1/(1 + exp(-a))
    }

    # Forward pass, to get hidden unit activations
    hidden.prob <- as.vector( boltzmann.prob( visible.state %*% w ) )
    hidden.state <- ifelse(hidden.prob > runif(hidden.states,0,1), 1, 0)

    Anyway, I should have some code worth sharing in a day or three.

    Regards,
    Alan.


