
Machine learning for package users with R (4): Neural Network

These days almost everybody appears to love a variation of Neural Network (NN): Deep Learning. I already discussed how Deep Learning works and what kinds of parameters characterize it in the previous post.



A brief trial on a short version of the MNIST dataset


In this post I discuss its origin, the usual 3-layer NN (with only a single hidden layer).

> library(nnet)
> train<-read.csv("short_prac_train.csv")
> test<-read.csv("short_prac_test.csv")
> train$label<-as.factor(train$label)
> test$label<-as.factor(test$label)
> train.nnet<-nnet(label~.,train,size=100,decay=0.01,maxit=100,MaxNWts=80000)
# weights:  79510
initial  value 21847.796288 
iter  10 value 5304.553496
iter  20 value 3625.223363
iter  30 value 3286.108136
iter  40 value 2882.645985
iter  50 value 2544.534809
iter  60 value 2426.173862
iter  70 value 2300.468337
iter  80 value 2200.327282
iter  90 value 2100.783283
iter 100 value 2042.220475
final  value 2042.220475 
stopped after 100 iterations
# took mmmmmmmmmmmmuch long time...

> table(test$label,predict(train.nnet,newdata=test[,-1],type='class'))
   
     0  1  2  3  4  5  6  7  8  9
  0 93  0  1  0  0  2  2  0  2  0
  1  0 93  1  0  0  2  0  2  2  0
  2  3  3 83  3  1  0  1  2  4  0
  3  2  0  3 79  0  7  1  2  4  2
  4  0  1  1  0 86  2  3  0  1  6
  5  2  0  0  6  2 79  4  1  4  2
  6  2  0  3  0  4  3 87  0  0  1
  7  1  1  1  0  1  0  0 88  2  6
  8  0  0  4  2  1  8  0  2 81  2
  9  0  0  1  2  6  0  0  4  4 83
> sum(diag(table(test$label,predict(train.nnet,newdata=test[,-1],type='class'))))/nrow(test)
[1] 0.852 # OMG, it's low...


Not surprisingly, its accuracy is low :P) In principle NN is hard to tune, which is why some people call it "a kind of masterpiece": only "masters" can tune it well.
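By the way, you can verify the "# weights:  79510" line above: a fully connected 784-100-10 network with bias terms has

 (784 + 1) \times 100 + (100 + 1) \times 10 = 78500 + 1010 = 79510

weights in total, which is why MaxNWts has to be raised from its default of 1000 to 80000 before nnet() will accept this model.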


Algorithm summary


As is well known, NN is a perceptron with a hidden layer, trained by "back-propagation". Very fortunately, the famous textbook ESL clearly explains its algorithm on p.396, as below.

Here is back-propagation in detail for squared error loss. Let z_{mi} = \sigma(\alpha_{0m} + \alpha^T_m x_i), from (11.5) and let  z_i = (z_{1i}, z_{2i}, \cdots, z_{Mi}). Then we have


 R(\theta) \equiv \displaystyle \sum^N_{i=1} R_i = \displaystyle \sum^N_{i=1} \sum^K_{k=1} (y_{ik} - f_k (x_i))^2 (11.11)


with derivatives


 \frac{\partial R_i}{\partial \beta_{km}} = -2 (y_{ik} - f_k (x_i)) g_k' (\beta^T_k z_i) z_{mi} ,

 \frac{\partial R_i}{\partial \alpha_{ml}} = - \displaystyle \sum^K_{k=1} 2 (y_{ik} - f_k (x_i)) g_k' (\beta^T_k z_i) \beta_{km} \sigma' (\alpha^T_m x_i) x_{il} (11.12)


Given these derivatives, a gradient descent update at the (r + 1)st iteration has the form


 \beta^{(r+1)}_{km} = \beta^{(r)}_{km} - \gamma_r \displaystyle \sum^N_{i=1} \frac{\partial R_i}{\partial \beta^{(r)}_{km}},

 \alpha^{(r+1)}_{ml} = \alpha^{(r)}_{ml} - \gamma_r \displaystyle \sum^N_{i=1} \frac{\partial R_i}{\partial \alpha^{(r)}_{ml}}, (11.13)


where \gamma_r is the learning rate, discussed below.


Now write (11.12) as


 \frac{\partial R_i}{\partial \beta_{km}} = \delta_{ki} z_{mi},

 \frac{\partial R_i}{\partial \alpha_{ml}} = s_{mi} x_{il}. (11.14)


The quantities \delta_{ki} and s_{mi} are "errors" from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy


 s_{mi} = \sigma' (\alpha^T_m x_i) \displaystyle \sum^K_{k=1} \beta_{km} \delta_{ki}, (11.15)


known as the back-propagation equations. Using this, the updates in (11.13) can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values \hat{f_k}(x_i) are computed from formula (11.5). In the backward pass, the errors \delta_{ki} are computed, and then back-propagated via (11.15) to give the errors s_{mi}. Both sets of errors are then used to compute the gradients for the updates in (11.13), via (11.14).

f:id:TJO:20131213162022p:plain:w300

(wikipedia:en:Artificial_neural_networks)


This algorithm summary shows that NN is an iterative optimization procedure rather than a "network". Many people regard NN as a "network", but from the viewpoint of its algorithm it is merely a two-pass optimization. This view will be important for understanding how Deep Learning, an extension of NN, works.
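To make the "two-pass" view concrete, below is a minimal R sketch of batch gradient descent with back-propagation for a single hidden layer and squared error loss, following (11.11)-(11.15). The toy data, network size and fixed learning rate gamma are hypothetical choices of mine for illustration; note that {nnet} itself optimizes by a quasi-Newton (BFGS) method rather than this plain gradient descent.

# A toy 2-5-1 network trained by batch gradient descent with back-propagation,
# following ESL (11.11)-(11.15). Data and settings are hypothetical.
set.seed(1)
sigmoid<-function(u) 1/(1+exp(-u))
N<-200; M<-5                                  # samples and hidden units (K = 1 output)
X<-matrix(runif(N*2,-4,4),N,2)                # XOR-like toy inputs
y<-as.numeric(X[,1]*X[,2]>0)                  # class 1 in quadrants I and III
alpha<-matrix(rnorm(M*2,sd=0.5),M,2); alpha0<-rnorm(M,sd=0.5)
beta<-rnorm(M,sd=0.5); beta0<-rnorm(1,sd=0.5)
gamma<-0.005                                  # fixed learning rate gamma_r
for (r in 1:5000) {
  # Forward pass: z_mi = sigma(alpha_0m + alpha_m^T x_i), f(x_i) = g(beta^T z_i)
  Z<-sigmoid(sweep(X%*%t(alpha),2,alpha0,"+"))  # N x M hidden activations
  f<-sigmoid(as.vector(Z%*%beta+beta0))         # predicted values, g = sigmoid
  # Backward pass: output errors delta_i, then hidden errors s_mi via (11.15)
  delta<-(-2)*(y-f)*f*(1-f)                     # delta_ki of (11.14), K = 1
  S<-outer(delta,beta)*Z*(1-Z)                  # s_mi of (11.15)
  # Gradient descent updates (11.13), with gradients from (11.14)
  beta<-beta-gamma*as.vector(t(Z)%*%delta)
  beta0<-beta0-gamma*sum(delta)
  alpha<-alpha-gamma*t(S)%*%X
  alpha0<-alpha0-gamma*colSums(S)
}
mean((f>0.5)==y)  # training accuracy of this toy run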


How it works on XOR patterns and linearly separable patterns

XOR patterns


OK, let's try it on the XOR patterns. I assume you have already downloaded the simple and complex XOR patterns from my GitHub repository. First let's run a parameter tuning routine: load the {nnet} and {caret} packages and run as below.

> library(nnet)
> library(caret)
> xors<-read.table("xor_simple.txt",header=T)
> xors$label<-as.factor(xors$label-1)
> xorc<-read.table("xor_complex.txt",header=T)
> xorc$label<-as.factor(xorc$label-1)

# Tune for simple XOR pattern
> xors.tune<-train(label~.,data=xors,method="nnet",tuneLength=4,maxit=100,trace=F)
Loading required package: class
> print(xors.tune)
Neural Network 

100 samples
  2 predictors
  2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 

Resampling results across tuning parameters:

  size  decay    Accuracy  Kappa  Accuracy SD  Kappa SD
  1     0        0.659     0.326  0.06         0.099   
  1     1e-04    0.65      0.283  0.0466       0.0892  
  1     0.00316  0.645     0.284  0.0544       0.0934  
  1     0.1      0.643     0.296  0.0611       0.105   
  3     0        0.892     0.784  0.0773       0.149   
  3     1e-04    0.888     0.772  0.101        0.206   
  3     0.00316  0.924     0.846  0.0732       0.146   
  3     0.1      0.923     0.845  0.0694       0.139   
  5     0        0.913     0.825  0.0512       0.097   
  5     1e-04    0.933     0.865  0.0403       0.0802  
  5     0.00316  0.951     0.901  0.0405       0.0803  
  5     0.1      0.963     0.925  0.0304       0.0612  
  7     0        0.925     0.849  0.0394       0.0782  
  7     1e-04    0.951     0.901  0.0433       0.086   
  7     0.00316  0.95      0.9    0.0431       0.0846  
  7     0.1      0.961     0.921  0.0328       0.0657  

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were size = 5 and decay = 0.1. 

# Tune for complex XOR pattern
> xorc.tune<-train(label~.,data=xorc,method="nnet",tuneLength=4,maxit=100,trace=F)
> print(xorc.tune)
Neural Network 

100 samples
  2 predictors
  2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 

Resampling results across tuning parameters:

  size  decay    Accuracy  Kappa   Accuracy SD  Kappa SD
  1     0        0.521     0.0583  0.0808       0.153   
  1     1e-04    0.509     0.0332  0.0822       0.152   
  1     0.00316  0.531     0.0876  0.0839       0.158   
  1     0.1      0.519     0.0638  0.0879       0.176   
  3     0        0.652     0.302   0.0985       0.193   
  3     1e-04    0.667     0.333   0.0676       0.133   
  3     0.00316  0.678     0.359   0.0808       0.148   
  3     0.1      0.698     0.394   0.0609       0.119   
  5     0        0.677     0.349   0.08         0.161   
  5     1e-04    0.673     0.345   0.0758       0.145   
  5     0.00316  0.678     0.355   0.0628       0.126   
  5     0.1      0.715     0.43    0.079        0.152   
  7     0        0.685     0.371   0.0942       0.184   
  7     1e-04    0.68      0.361   0.067        0.135   
  7     0.00316  0.681     0.363   0.0672       0.131   
  7     0.1      0.715     0.431   0.0745       0.144   

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were size = 7 and decay = 0.1. 


Now we have a set of best parameters for both the simple and complex XOR patterns. Let's fit nnet models with them and draw the resulting decision boundaries over a grid.

# Fit NNs with the tuned parameters
> xors.nnet<-nnet(label~.,xors,size=xors.tune$bestTune$size,decay=xors.tune$bestTune$decay)
> xorc.nnet<-nnet(label~.,xorc,size=xorc.tune$bestTune$size,decay=xorc.tune$bestTune$decay)

# Set a grid
> px<-seq(-4,4,0.03)
> py<-seq(-4,4,0.03)
> pgrid<-expand.grid(px,py)
> names(pgrid)<-names(xors)[-3]

# Simple XOR pattern
> plot(c(),type='n',xlim=c(-4,4),ylim=c(-4,4))
> par(new=T)
> rect(0,0,4,4,col='#aaaaff')
> par(new=T)
> rect(-4,0,0,4,col='#ffaaaa')
> par(new=T)
> rect(-4,-4,0,0,col='#aaaaff')
> par(new=T)
> rect(0,-4,4,0,col='#ffaaaa')
> par(new=T)
> plot(xors[,-3],col=c(rep('blue',50),rep('red',50)),xlim=c(-4,4),ylim=c(-4,4),pch=19,cex=2.5)
> par(new=T)
> contour(px,py,array(predict(xors.nnet,newdata=pgrid),dim = c(length(px),length(py))),levels = 0.5,drawlabels = T,col='purple',lwd=5,xlim=c(-4,4),ylim=c(-4,4))

# Complex XOR pattern
> plot(c(),type='n',xlim=c(-4,4),ylim=c(-4,4))
> par(new=T)
> rect(0,0,4,4,col='#aaaaff')
> par(new=T)
> rect(-4,0,0,4,col='#ffaaaa')
> par(new=T)
> rect(-4,-4,0,0,col='#aaaaff')
> par(new=T)
> rect(0,-4,4,0,col='#ffaaaa')
> par(new=T)
> plot(xorc[,-3],col=c(rep('blue',50),rep('red',50)),xlim=c(-4,4),ylim=c(-4,4),pch=19,cex=2.5)
> par(new=T)
> contour(px,py,array(predict(xorc.nnet,newdata=pgrid),dim = c(length(px),length(py))),levels = 0.5,drawlabels = T,col='purple',lwd=5,xlim=c(-4,4),ylim=c(-4,4))

f:id:TJO:20150522125128p:plain:w250 f:id:TJO:20150522125200p:plain:w250


Both look nice. Even for the complex XOR pattern, this NN model draws smooth, well-generalized decision boundaries.


But we can also make it over-generalize or overfit the complex XOR pattern arbitrarily, simply by tuning it in a bad manner. See below.

# Try to excessively generalize
> xorc.nnet2<-nnet(label~.,xorc,size=2,decay=0.1)
# weights:  9
initial  value 69.974985 
iter  10 value 68.637481
iter  20 value 61.866228
iter  30 value 58.071456
final  value 58.067022 
converged
> plot(c(),type='n',xlim=c(-4,4),ylim=c(-4,4))
> par(new=T)
> rect(0,0,4,4,col='#aaaaff')
> par(new=T)
> rect(-4,0,0,4,col='#ffaaaa')
> par(new=T)
> rect(-4,-4,0,0,col='#aaaaff')
> par(new=T)
> rect(0,-4,4,0,col='#ffaaaa')
> par(new=T)
> plot(xorc[,-3],col=c(rep('blue',50),rep('red',50)),xlim=c(-4,4),ylim=c(-4,4),pch=19,cex=2.5)
> par(new=T)
> contour(px,py,array(predict(xorc.nnet2,newdata=pgrid),dim = c(length(px),length(py))),levels = 0.5,drawlabels = T,col='purple',lwd=5,xlim=c(-4,4),ylim=c(-4,4))

# Try to overfit
> xorc.nnet3<-nnet(label~.,xorc,size=30,decay=0.01,maxit = 500)
# weights:  121
initial  value 72.772471 
iter  10 value 45.654101
iter  20 value 38.793337
iter  30 value 36.433649
iter  40 value 35.542237
iter  50 value 35.151096
iter  60 value 34.962654
iter  70 value 34.787609
iter  80 value 34.711248
iter  90 value 34.677120
iter 100 value 34.647901
iter 110 value 34.627443
iter 120 value 34.617480
iter 130 value 34.611521
iter 140 value 34.607867
iter 150 value 34.605730
iter 160 value 34.604129
iter 170 value 34.602038
iter 180 value 34.600587
iter 190 value 34.599829
iter 200 value 34.599381
iter 210 value 34.599158
iter 220 value 34.599070
iter 230 value 34.598967
iter 240 value 34.598846
iter 250 value 34.598784
iter 260 value 34.598687
iter 270 value 34.598562
iter 280 value 34.598544
final  value 34.598544 
converged
> plot(c(),type='n',xlim=c(-4,4),ylim=c(-4,4))
> par(new=T)
> rect(0,0,4,4,col='#aaaaff')
> par(new=T)
> rect(-4,0,0,4,col='#ffaaaa')
> par(new=T)
> rect(-4,-4,0,0,col='#aaaaff')
> par(new=T)
> rect(0,-4,4,0,col='#ffaaaa')
> par(new=T)
> plot(xorc[,-3],col=c(rep('blue',50),rep('red',50)),xlim=c(-4,4),ylim=c(-4,4),pch=19,cex=2.5)
> par(new=T)
> contour(px,py,array(predict(xorc.nnet3,newdata=pgrid),dim = c(length(px),length(py))),levels = 0.5,drawlabels = T,col='purple',lwd=5,xlim=c(-4,4),ylim=c(-4,4))

f:id:TJO:20150522125609p:plain:w250 f:id:TJO:20150522125619p:plain:w250


Hehe, looks crazy! :P) You can see how easily NN can be tuned badly. This is why one of my former colleagues, an outstanding expert on NN, told me "NN is a MASTERPIECE". Yes, I agree with him.


Linearly separable patterns

> dbi<-read.table("linear_bi.txt",header=T)
> dbi$label<-as.factor(dbi$label)
> summary(dbi)
       x                y         label 
 Min.   :0.2112   Min.   :0.177   0:25  
 1st Qu.:1.0260   1st Qu.:1.133   1:25  
 Median :1.5830   Median :1.616         
 Mean   :1.5786   Mean   :1.606         
 3rd Qu.:2.0201   3rd Qu.:2.242         
 Max.   :3.2749   Max.   :2.994         
> dml<-read.table("linear_multi.txt",header=T)
> dml$label<-as.factor(dml$label)
> px2<-seq(-0.5,4.5,0.03)
> py2<-seq(-0.5,4.5,0.03)
> pgrid2<-expand.grid(px2,py2)
> names(pgrid2)<-names(dbi)[-3]
> dbi.tune<-train(label~.,data=dbi,method="nnet",tuneLength=4,maxit=100,trace=F)
> print(dbi.tune)
Neural Network 

50 samples
 2 predictors
 2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 

Resampling results across tuning parameters:

  size  decay    Accuracy  Kappa  Accuracy SD  Kappa SD
  1     0        0.884     0.763  0.0766       0.165   
  1     1e-04    0.878     0.751  0.0798       0.171   
  1     0.00316  0.884     0.766  0.0695       0.145   
  1     0.1      0.878     0.756  0.0786       0.16    
  3     0        0.864     0.724  0.0644       0.129   
  3     1e-04    0.874     0.745  0.0722       0.144   
  3     0.00316  0.846     0.688  0.0792       0.16    
  3     0.1      0.881     0.762  0.0735       0.147   
  5     0        0.841     0.676  0.0874       0.179   
  5     1e-04    0.834     0.666  0.0984       0.198   
  5     0.00316  0.831     0.66   0.0797       0.159   
  5     0.1      0.874     0.746  0.0853       0.172   
  7     0        0.854     0.705  0.077        0.152   
  7     1e-04    0.829     0.653  0.0894       0.18    
  7     0.00316  0.818     0.634  0.0904       0.18    
  7     0.1      0.874     0.746  0.0853       0.172   

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were size = 1 and decay = 0.00316. 
> dml.tune<-train(label~.,data=dml,method="nnet",tuneLength=4,maxit=100,trace=F)
> print(dml.tune)
Neural Network 

75 samples
 2 predictors
 3 classes: '0', '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 75, 75, 75, 75, 75, 75, ... 

Resampling results across tuning parameters:

  size  decay    Accuracy  Kappa  Accuracy SD  Kappa SD
  1     0        0.81      0.711  0.0948       0.145   
  1     1e-04    0.828     0.74   0.0582       0.0855  
  1     0.00316  0.835     0.75   0.0568       0.0842  
  1     0.1      0.784     0.678  0.078        0.108   
  3     0        0.823     0.731  0.051        0.0763  
  3     1e-04    0.812     0.711  0.0601       0.0978  
  3     0.00316  0.82      0.728  0.0728       0.114   
  3     0.1      0.835     0.75   0.0544       0.0813  
  5     0        0.801     0.701  0.0701       0.103   
  5     1e-04    0.81      0.713  0.0601       0.0906  
  5     0.00316  0.802     0.7    0.0646       0.102   
  5     0.1      0.83      0.743  0.0494       0.0737  
  7     0        0.79      0.684  0.0752       0.113   
  7     1e-04    0.785     0.676  0.066        0.0974  
  7     0.00316  0.804     0.705  0.0629       0.0961  
  7     0.1      0.826     0.738  0.0575       0.0841  

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were size = 3 and decay = 0.1. 

# Fit NNs
> dbi.nnet<-nnet(label~.,dbi,size=1,decay=dbi.tune$bestTune$decay)
# weights:  5
initial  value 35.776034 
iter  10 value 13.686940
iter  20 value 10.696310
iter  30 value 10.639279
iter  40 value 10.636601
final  value 10.636598 
converged
> dml.nnet<-nnet(label~.,dml,size=3,decay=dml.tune$bestTune$decay)
# weights:  21
initial  value 96.421336 
iter  10 value 47.021366
iter  20 value 38.167857
iter  30 value 37.854914
iter  40 value 37.735454
iter  50 value 37.730520
iter  60 value 37.729962
final  value 37.729953 
converged

# 2-classes
> plot(c(),type='n',xlim=c(-0.5,3.5),ylim=c(-0.5,3.5))
> par(new=T)
> polygon(c(-0.5,-0.5,3.5),c(3.5,-0.5,-0.5),col='#aaaaff')
> par(new=T)
> polygon(c(-0.5,3.5,3.5),c(3.5,-0.5,3.5),col='#ffaaaa')
> par(new=T)
> plot(dbi[,-3],pch=19,col=c(rep('blue',25),rep('red',25)),cex=3,xlim=c(-0.5,3.5),ylim=c(-0.5,3.5))
> par(new=T)
> contour(px2,py2,array(predict(dbi.nnet,newdata=pgrid2),dim = c(length(px2),length(py2))),xlim=c(-0.5,3.5),ylim=c(-0.5,3.5),lwd=6,col='purple',levels = 0.5,drawlabels = T)

# 3-classes
> plot(c(),type='n',xlim=c(-0.5,4.5),ylim=c(-0.5,4.5))
> par(new=T)
> polygon(c(-0.5,-0.5,3.5),c(3.5,-0.5,-0.5),col='#aaaaff')
> par(new=T)
> polygon(c(-0.5,3.5,4.5,4.5,1.0,-0.5),c(3.5,-0.5,-0.5,1.0,4.5,4.5),col='#ffaaaa')
> par(new=T)
> polygon(c(1.0,4.5,4.5),c(4.5,1.0,4.5),col='#ccffcc')
> par(new=T)
> plot(dml[,-3],pch=19,col=c(rep('blue',25),rep('red',25),rep('green',25)),cex=3,xlim=c(-0.5,4.5),ylim=c(-0.5,4.5))
> par(new=T)
# predict() with type='class' returns characters, so coerce to numeric for contour()
> contour(px2,py2,array(as.numeric(predict(dml.nnet,newdata=pgrid2,type='class')),dim=c(length(px2),length(py2))),xlim=c(-0.5,4.5),ylim=c(-0.5,4.5),col="purple",lwd=6,drawlabels=T)

f:id:TJO:20150522171518p:plain:w250 f:id:TJO:20150522171533p:plain:w250


It's confirmed that well-tuned NNs successfully draw decision boundaries along the assumed true ones. One of the advantages of NN, I think, is that because it is an extension of the linear perceptron, it can also handle ordinary linearly separable patterns if correctly tuned.
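As a quick check of this point, you can compare the size = 1 network with plain logistic regression. This is just a sketch: the two fits won't be numerically identical (glm() maximizes the binomial likelihood while nnet() also applies weight decay), but their decision boundaries should look almost the same.

# Compare the size-1 NN with logistic regression on the same data
> dbi.glm<-glm(label~.,data=dbi,family=binomial)
> table(dbi$label,ifelse(predict(dbi.glm,type='response')>0.5,1,0))
> table(dbi$label,predict(dbi.nnet,newdata=dbi[,-3],type='class'))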


Conclusions


Lessons we learned here are:

  1. NN with a single hidden layer is an iterative optimization procedure rather than a "network"
  2. The performance of NN is easily affected by parameter tuning
  3. Still, a well-tuned NN works well for both linearly separable and non-separable patterns


I feel these characteristics are inherited by Deep Learning and its family. In that sense, I think we should seriously review the NN with a single hidden layer.


Appendix


For reference, h2o.deeplearning() performed as below on the same short MNIST datasets.

> library(h2o)
> localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, nthreads=7)
> trData<-h2o.importFile(localH2O,path = "short_prac_train.csv")
> tsData<-h2o.importFile(localH2O,path = "short_prac_test.csv")
> res.dl <- h2o.deeplearning(x = 2:785, y = 1, data = trData, activation = "RectifierWithDropout",hidden=c(1024,1024,2048),epochs = 300, adaptive_rate = FALSE, rate=0.01, rate_annealing = 1.0e-6,rate_decay = 1.0, momentum_start = 0.5,momentum_ramp = 5000*18, momentum_stable = 0.99, input_dropout_ratio = 0.2,l1 = 1.0e-5,l2 = 0.0,max_w2 = 15.0, initial_weight_distribution = "Normal",initial_weight_scale = 0.01,nesterov_accelerated_gradient = T, loss = "CrossEntropy", fast_mode = T, diagnostics = T, ignore_const_cols = T,force_load_balance = T)
> pred.dl<-h2o.predict(object=res.dl,newdata=tsData[,-1])
> pred.dl.df<-as.data.frame(pred.dl)
> table(test$label,pred.dl.df[,1])
   
     0  1  2  3  4  5  6  7  8  9
  0 95  0  1  1  0  0  3  0  0  0
  1  0 99  1  0  0  0  0  0  0  0
  2  0  0 97  0  1  0  0  1  0  1
  3  0  0  1 96  0  2  0  1  0  0
  4  0  1  1  0 96  0  2  0  0  0
  5  0  2  0  3  0 94  1  0  0  0
  6  1  0  0  0  0  1 98  0  0  0
  7  0  0  0  0  3  0  0 97  0  0
  8  0  0  1  3  1  2  0  1 92  0
  9  0  0  0  0  4  0  0  2  0 94
> sum(diag(table(test$label,pred.dl.df[,1])))/nrow(test)
[1] 0.958


Deep Learning worked a little better than random forest, whose result will be shown in the next post. :D)