This write-up is less of a puzzle and more of a tutorial. Without further ado, here you go.
The Restricted Boltzmann Machine (RBM) has become increasingly popular of late, after its success in the Netflix Prize competition and elsewhere. Much of the inventive work behind RBMs was done by Geoffrey Hinton, in particular the training of RBMs using an algorithm called Contrastive Divergence (CD), which is closely related to gradient-based learning. A nice consequence of CD training is that the model can "dream", i.e. generate new data that resembles its training examples. Few other machine learning methods have this generative capacity baked in so directly.
Energy Based Models:
The RBM is an energy-based model. If we have training data with two classes, the RBM during its training phase tries to assign a lower "energy" to examples of one class than to examples of the other. It achieves this by introducing an architecture of visible and hidden units, shown in the figure below.
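For reference, the energy a binary RBM assigns to a joint setting of visible units \(v\) and hidden units \(h\) is usually written as follows, where \(a_j\) and \(b_i\) are the visible and hidden biases and \(w_{ij}\) are the connection weights; lower energy corresponds to higher probability, since \(P(v,h) \propto e^{-E(v,h)}\):
$$
E(v,h) = -\sum_{j} a_{j} v_{j} - \sum_{i} b_{i} h_{i} - \sum_{i,j} h_{i} w_{ij} v_{j}
$$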
There are also bias units, but I've excluded them from the diagram above to highlight what makes RBMs unique. Notice that the visible units do not have any links amongst themselves, and neither do the hidden units. This is where the "Restricted" in the name RBM comes from: the units are restricted in whom they can connect to. The architecture, in that sense, is fixed. Each visible unit connects to all hidden units and each hidden unit connects to all visible units. Now on to the next logical question: what are these units? The visible units are simply floating-point or integer values; they can be represented by any vector or matrix of numbers. They are the gateway through which we feed in training examples and read out predictions, dreams, and so on. The hidden units are like neurons in a typical neural network: they aggregate all the data coming in through the visible units. What about those connecting lines? They represent the weights by which each visible unit is connected to each hidden unit. The hidden units are sigmoid functions which take the form
$$
P(x) = \frac{1}{1 + e^{-x}}
$$
If \(P(x)\) is \(\gt 0.5\) then you can think of these units as "lighting up".
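As a toy illustration of a single hidden unit (separate from the full code later in this post; the names sigmoid, v and w.i are mine and the numbers are made up), this is roughly what "lighting up" looks like in R:

# Toy sketch of one hidden unit; v and w.i are invented values,
# not taken from this post's dataset
sigmoid <- function(x) 1/(1 + exp(-x))

v   <- c(1, 0, 1, 1, 0)                  # visible units (0/1)
w.i <- c(0.2, -0.5, 0.7, 0.1, -0.3)      # weights into hidden unit i

p.i <- sigmoid(sum(w.i * v))             # probability the unit lights up
h.i <- ifelse(p.i > runif(1), 1, 0)      # stochastic 0/1 state of the unit
p.i; h.i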
Training & Learning:
In order to train an RBM we follow the CD algorithm. Note that RBMs are known to work best with binary input features and for classification-style tasks. You could still do regression or deal with real-valued features, but that is beyond the scope of this write-up; for more information you can read this excellent guide by Geoffrey Hinton. The CD algorithm starts by initializing the weights to random values. Next we present a training example to the visible units by turning on (setting to 1) the visible units whose features are present, and compute how the hidden units light up. Each hidden unit receives a value from every visible unit, weighted by the corresponding connection weight and summed. For the \(i^{th}\) hidden unit, if \(w_{ij}\) represents the weights pointing towards it, the total value coming into it is
$$
\sum_{j=1}^{N} w_{ij} v_{j}
$$
where \(N\) is the number of visible units. This causes some of the hidden units to light up. This is what is called the "positive phase" (the name itself doesn't matter much, it's just a label). Next, using the lit-up hidden units, the visible units are recreated; this step is crucial to understand. The first time this is done, the recreated visible units will most likely look nothing like the example you just gave it. Based on how different the recreated visible units are from the original ones, the weights are adjusted. The size of the adjustment is governed by the learning rate, typically set to some small number. This method is very similar to gradient ascent: the smaller you set the learning rate, the better (and slower) the learning. The process is repeated for the next training example, and the one after that, until all examples have been seen; all of this counts as one iteration. You run several hundred such iterations, until the visible units generated by the RBM during the negative phase closely resemble the inputs presented during the positive phase.
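To make the positive and negative phases concrete, here is a minimal sketch of a single CD-1 weight update for one training example. The names (W, v0, lr, cd1.update) are mine, and this is a simplified illustration of the idea rather than a drop-in piece of the full code further down:

sigmoid    <- function(x) 1/(1 + exp(-x))
sample.bin <- function(p) ifelse(p > runif(length(p)), 1, 0)   # stochastic 0/1 states

# W  : (number of hidden units) x (number of visible units) weight matrix
# v0 : one training example as a 0/1 vector
# lr : learning rate
cd1.update <- function(W, v0, lr = 0.01) {
  h0  <- sample.bin(sigmoid(W %*% v0))      # positive phase: hidden units driven by the data
  pos <- h0 %*% t(v0)                       # statistics measured on the data
  v1  <- sample.bin(sigmoid(t(W) %*% h0))   # recreate the visible units
  h1  <- sample.bin(sigmoid(W %*% v1))      # hidden units driven by the reconstruction
  neg <- h1 %*% t(v1)                       # negative phase statistics
  W + lr * (pos - neg)                      # nudge the weights towards the data
}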
Example
The following is an example implementation of the RBM in R. I've tried to make it slightly more efficient by using R's built-in apply functions. This is by no means complete, but it should hopefully serve as inspiration for someone to build out a package for this method in R and put it on CRAN.
Before getting into the example implementation, here is an example dataset I have created. It has just two classes of examples: several instances of an arrow pointing up and an arrow pointing down.
An example up arrow would look like the following. Notice the ASCII art :)
0,0,0,0,1,0,0,0,0
0,0,0,1,1,1,0,0,0
0,0,1,1,1,1,1,0,0
0,1,1,1,1,1,1,1,0
1,1,1,1,1,1,1,1,1
All the rows are merged into one long row for training purposes. Similarly, a down arrow would look like the following.
1,1,1,1,1,1,1,1,1
0,1,1,1,1,1,1,1,0
0,0,1,1,1,1,1,0,0
0,0,0,1,1,1,0,0,0
0,0,0,0,1,0,0,0,0
To create a training set, simply flip a few 1s to 0s at random and create an equal number of instances of each class.
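If you would rather generate arrow.csv than type it out, here is one way to do it. The amount of noise (two flipped bits per example), the number of examples per class, and the helper names are arbitrary choices of mine:

# One possible way to generate the arrow.csv training file
up   <- c(0,0,0,0,1,0,0,0,0,
          0,0,0,1,1,1,0,0,0,
          0,0,1,1,1,1,1,0,0,
          0,1,1,1,1,1,1,1,0,
          1,1,1,1,1,1,1,1,1)
down <- rev(up)                          # each row is symmetric, so reversing flips the arrow

noisy <- function(a, flips = 2) {        # flip a couple of 1s to 0s at random
  idx <- sample(which(a == 1), flips)
  a[idx] <- 0
  a
}

set.seed(1)
examples <- rbind(t(replicate(20, noisy(up))),
                  t(replicate(20, noisy(down))))
write.table(examples, "arrow.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)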
The R code for the RBM is shown below.
# Stochastic sigmoid unit: returns 1 with probability 1/(1 + exp(-a)), else 0
act.f <- function(a){
  p = 1/(1 + exp(-a))
  ifelse(p > runif(1,0,1),return(1),return(0))
}

# "Dream": given a visible vector xx, sample the hidden units and then
# regenerate (and return) the visible units from them
dream <- function(xx,w,nc){
  a1 = w %*% xx
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=nc)
  pos.e = hidden.state %*% t(xx)
  a1 = t(w) %*% hidden.state
  visible.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)
  return(visible.state)
}

# One CD pass for a single training example: accumulate the positive and
# negative phase statistics into the global matrices pos.e.iter / neg.e.iter
update.wgt <- function(xx,w,nc){
  a1 = w %*% xx
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=nc)
  pos.e = hidden.state %*% t(xx)                              # positive phase
  a1 = t(w) %*% hidden.state
  visible.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)    # reconstruction
  a1 = w %*% visible.state
  hidden.state = as.matrix(apply(a1,c(1,2),act.f),nrow=1)
  neg.e = hidden.state %*% t(visible.state)                   # negative phase
  pos.e.iter <<- pos.e.iter + pos.e
  neg.e.iter <<- neg.e.iter + neg.e
}

# Read the training data and prepend a bias column of 1s
x = read.table("arrow.csv",sep=",",header=F)
x = cbind(rep(1,nrow(x)),x)
x = as.matrix(x[,1:(length(x))],ncol=length(x))

num.hiddenstates = 3
w = matrix(runif(ncol(x)*num.hiddenstates,-1,1),nrow=num.hiddenstates)

# Training: 2000 iterations of CD with a learning rate of 0.01
system.time({
  for(iter in seq(1:2000)){
    pos.e.iter = matrix(rep(0,ncol(x)*num.hiddenstates),nrow=num.hiddenstates)
    neg.e.iter = matrix(rep(0,ncol(x)*num.hiddenstates),nrow=num.hiddenstates)
    apply(x,1,update.wgt,w,num.hiddenstates)
    w = w + 0.01*(pos.e.iter - neg.e.iter)
  }
})
print(t(w))

# Two test cases (bias element first): a noisy down arrow and a noisy up arrow
x.test1 = as.matrix(c(1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0),ncol=1)
x.test2 = as.matrix(c(1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,1),ncol=1)

# Hidden unit activations for each test case
a1 = w %*% x.test1
a1
a1 = w %*% x.test2
a1

# Dream twice, seeded by x.test2; drop the bias element and print a 5 x 9 grid
d.vector = dream(x.test2,w,3)
d.vector = d.vector[-1]
d.state = matrix(d.vector,byrow=TRUE,nrow=5)
print(d.state)

d.vector = dream(x.test2,w,3)
d.vector = d.vector[-1]
d.state = matrix(d.vector,byrow=TRUE,nrow=5)
print(d.state)
The above code spends most of its time in the training phase. Also included are two example test cases. The number of iterations is hard-wired into this code at 2000 for simplicity. Three hidden units are chosen, but you can change that through the variable num.hiddenstates. x.test1 and x.test2 are two matrices holding the example cases. The hidden-node activations for the two test cases are also printed out; when you run this code, notice that the pattern of a1 values differs between the two cases. Due to the stochastic nature of the hidden units, you will not get exactly the same numbers for the hidden units every time you run this code. This is important to understand and note. A typical (and good) way to use an RBM is to feed this hidden-unit output into another classifier (say a neural network or a decision tree) which in turn does the prediction. Such architectures are called "deep learning" architectures.
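As a sketch of that idea (this is not part of the code above), you could compute the hidden-unit probabilities for every training row and feed them to an ordinary classifier such as logistic regression. The label vector y is an assumption here; arrow.csv as described does not contain labels, so you would have to supply them yourself:

# Sketch: RBM hidden units as features for a separate classifier.
# x and w come from the code above; y is an assumed 0/1 label vector
# (e.g. 0 for down arrows, 1 for up arrows).
sigmoid <- function(a) 1/(1 + exp(-a))

hidden.features <- t(sigmoid(w %*% t(x)))        # one row of hidden probabilities per example
d   <- data.frame(hidden.features, y = y)
fit <- glm(y ~ ., data = d, family = binomial)   # logistic regression on the RBM features
predict(fit, type = "response")                  # predicted class probabilities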
Dreaming
The final leg of the RBM code does a cool trick: it can "dream"! This simply means that the hidden units are used to regenerate the visible units. To seed the dreaming process we give it one example (in the code I give it the test case itself), see which units in the hidden layer get activated, and then run the negative phase of the CD algorithm to regenerate the visible units. This can be repeated as many times as one pleases. I've set it to dream up just two cases of an ASCII-art up arrow. Here is how they showed up on my machine.
[1,] 0 0 0 0 1 0 0 0 0
[2,] 0 0 0 1 1 1 0 0 0
[3,] 0 0 1 1 0 1 1 0 0
[4,] 0 1 1 1 1 1 1 1 0
[5,] 1 1 1 0 1 1 0 1 1
[1,] 0 0 0 0 1 0 0 0 0
[2,] 0 0 1 1 1 1 0 0 0
[3,] 0 0 1 0 0 1 1 0 0
[4,] 0 1 1 1 1 1 1 1 0
[5,] 1 1 1 1 1 1 1 1 1

Hopefully this write-up was helpful in gaining a better understanding of this nice machine learning algorithm. If you are looking to learn the art of machine learning, here are some good books to own:
Bayesian Reasoning and Machine Learning
Ensemble Methods: Foundations and Algorithms (Chapman & Hall/Crc Machine Learnig & Pattern Recognition)
Neural Networks for Pattern Recognition
A good book for understanding why RBMs and neural networks are able to approximate the functions we want to learn.
Recommender Systems Handbook
Collective Intelligence in Action
Comments

Hi,
Thanks for posting this code, I found it really useful. I've recoded it to avoid the apply() function and let R call the functions directly with vectors, and got a 10x speedup. Some code fragments:
boltzmann.prob <- function(a) {
1/(1 + exp(-a))
}
# Forward pass, to get hidden unit activations
hidden.prob <- as.vector( boltzmann.prob( visible.state %*% w ) )
hidden.state <- ifelse(hidden.prob > runif(hidden.states,0,1), 1, 0)
Anyway, I should have some code worth sharing in a day or three.
Regards,
Alan.
Great.