Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Bayesian modeling with R and Stan (2): Installation and an easy example

The previous post overviewed what and how is Stan on R.



Are you ready now? OK, this post reviews how to install Stan. Let's start here! :) In principle this post just follows a content of "RStan Getting Started" but some tips are added in order to fix less known problems.



Warning: this post assumes you are an Windows user. If you use Mac OS or Linux, please see notification for each OS.


Before installing Stan


Just a few days ago, at last {rstan} was registered to CRAN. Congratulations!!!!!! :D)


Indeed since registered to CRAN, installing {rstan} now becomes much easy. But even still now there are a lot of technical points to install it safely. See below.


Rtools and/or gcc is required


For any OS, Stan needs a C++ compiler because Stan is implemented by C++. For R in Windows, CRAN distributes a package of C++ builder and other dependencies as Rtools and here we need it. But Stan project kindly shows us how to install Rtools.



Before installing Rtools, please verify the version of the downloaded Rtools matches the version of your R already installed on your machine.


Verify your gcc version for Stan


After installed Rtools, please run a script below to verify your gcc, C++ compiler, works as expected.

> library(inline) 
> library(Rcpp)
> src <- '
+ std::vector<std::string> s; 
+ s.push_back("hello");
+ s.push_back("world");
+ return Rcpp::wrap(s);
+ '
> hellofun <- cxxfunction(body = src, includes = '', plugin = 'Rcpp', verbose = FALSE)
> cat(hellofun(), '\n') 
hello world


But in some cases it may fail as follows...

> hellofun <- cxxfunction(body = src, includes = '', plugin = 'Rcpp', verbose = FALSE)
C:/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf:196: warning: overriding recipe for target `.m.o'
C:/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf:189: warning: ignoring old recipe for target `.m.o'
filea072b4c44.cpp:1:0: sorry, unimplemented: 64-bit mode not compiled in
make: *** [filea072b4c44.o] Error 1

ERROR(s) during compilation: source code errors or compiler configuration errors!

Program source:
  1: 
  2: // includes from the plugin
  3: 
  4: #include <Rcpp.h>
  5: 
  6: 
  7: #ifndef BEGIN_RCPP
  8: #define BEGIN_RCPP
  9: #endif
 10: 
 11: #ifndef END_RCPP
 12: #define END_RCPP
 13: #endif
 14: 
 15: using namespace Rcpp;
 16: 
 17: 
 18: // user includes
 19: 
 20: 
 21: // declarations
 22: extern "C" {
 23: SEXP filea072b4c44( ) ;
 24: }
 25: 
 26: // definition
 27: 
 28: SEXP filea072b4c44(  ){
 29: BEGIN_RCPP
 30:  
 31: std::vector<std::string> s; 
 32: s.push_back("hello");
 33: s.push_back("world");
 34: return Rcpp::wrap(s);
 35: 
 36: END_RCPP
 37: }
 38: 
 39: 
Error in compileCode(f, code, language = language, verbose = verbose) : 
  Compilation ERROR, function(s)/method(s) not created! C:/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf:196: warning: overriding recipe for target `.m.o'
C:/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf:189: warning: ignoring old recipe for target `.m.o'
filea072b4c44.cpp:1:0: sorry, unimplemented: 64-bit mode not compiled in
make: *** [filea072b4c44.o] Error 1
In addition: Warning message:
running command 'C:/PROGRA~1/R/R-30~1.2/bin/x64/R CMD SHLIB filea072b4c44.cpp 2> filea072b4c44.cpp.err.txt' had status 1 
> cat(hellofun(), '\n')
Error in cat(hellofun(), "\n") : could not find function "hellofun"


OMG, what happens? As far as I've known, this kind of trouble is caused by mismatching of the version of gcc and its PATH.


For many versions of Stan, gcc 4.6.3 or later is recommended; but in some packages of other programming languages, e.g. Python (more in particular Python xy) or MinGW, some older version of gcc is included and usually PATH is edited to set itself prior to the other version of gcc manually installed.


In my own case, MinGW includes an older version of gcc 4.5.2 and its PATH setting prevented me from installing gcc 4.6.3 appropriately. What I had to do was just to remove a folder path below from PATH,

C:\MinGW32-xy\lib\gcc\mingw32\4.5.2

and to add folder paths below.

C:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin


Just for your information, recently I tried to install {rstan} onto an EC2 Amazon Linux instance but it failed. Wrong version of gcc? Any lack of required software or library? No, no, just because of out of memory :( For Amazon EC2, you have to prepare a larger instance.


Get Stan and install it


OK, let's install Stan onto R. Actually {rstan} requires some dependencies such as {Rcpp} and {inline}, but those packages would be installed at the same time with {rstan]. Don't worry.


In the latest version, you only have to run as below.

> Sys.setenv(MAKEFLAGS = "-j4") 
> install.packages("rstan", repos = "http://cran.rstudio.com", dependencies = TRUE)


According to RStan GitHub Wiki, sometimes Windows shows a warning message below and fails.

Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘rstan’
  These will not be installed


If so, please install it from the archive; maybe it's already built.

> install.packages("http://win-builder.r-project.org/N627Ss1f1WNR/rstan_2.7.0.zip")


Even if you still see any error message and fail, please read carefully and fix it. As far as I know, the most popular reason of failed installation is mismatching of versions of dependencies (and/or builders). Just as a usual advice for installing anything on your machine, please verify all software on your machine are updated to the latest version.


Run an easy example of binary logistic regression with {rstan}


Now you can run Stan on R as {rstan} anytime. :) Please make sure {rstan} is ready.

> library(rstan)


OK, let's try it with a very easy example. Please download "conflict_sample.txt" from my GitHub repository and import it as a data frame "dat". This is a really simple data set to be easily modeled by generalized linear modeling, such as logistic regression.



Unfortunately the current version of Stan cannot handle discrete and categorical variables so please convert a "cv" column into numeric.

> dat$cv <- as.numeric(dat$cv)-1

Generate MC samples and get a simple result


Next, prepare a Stan code; in principle {rstan} just works as an interface of Stan for R so it needs a source code to be compiled. Please write a code below and save it as "conflict.stan" with any editor you like.

data {
  int<lower=0> N;
  real x1[N];
  real x2[N];
  real x3[N];
  real x4[N];
  real x5[N];
  real x6[N];
  real x7[N];
  int<lower=0,upper=1> y[N];
}
parameters {
  real b0;
  real b1;
  real b2;
  real b3;
  real b4;
  real b5;
  real b6;
  real b7;
}
model {
  for (n in 1:N) 
      y[n] ~ bernoulli(inv_logit(b0 + b1*x1[n] + b2*x2[n] + b3*x3[n] + b4*x4[n] + b5*x5[n] + b6*x6[n] + b7*x7[n]));
}


This code means merely a simple binary logistic regression. Of course it never requires such a heavy computation with MCMC or HMC in order to estimate a maximum likelihood*1, so please take it as just an easy example.

Actually this code can be re-written using vectorization. It looks further simpler :)

data {
	int<lower=0> N;
	int<lower=0> M;
	matrix[N, M] X;
	int<lower=0, upper=1> y[N];
}
parameters {
	real beta0;
	vector[M] beta;
}
model {
	for (i in 1:N)
		y[i] ~ bernoulli(inv_logit (beta0 + dot_product(X[i] , beta)));
}


Finally, just run as below on R. If you want to make it parallel, please use {foreach} package.

# If you use the former Stan code
> conflict.dat<-list(x1=dat$a1,x2=dat$a2,x3=dat$a3,x4=dat$a4,x5=dat$a5,x6=dat$a6,x7=dat$a7,y=dat$cv,N=3000)
> conflict.fit<-stan(file="conflict.stan",data=conflict.dat,iter=1000,chains=4)

TRANSLATING MODEL 'conflict' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'conflict' NOW.
cygwin warning:
  MS-DOS style path detected: C:/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf
  Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-30~1.2/etc/x64/Makeconf
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
In file included from C:/Program Files/R/R-3.0.2/library/rstan/include/rstan/rstaninc.hpp:3:0,
                 from file11ac1de92a02.cpp:6:
C:/Program Files/R/R-3.0.2/library/rstan/include/rstan/stan_fit.hpp: In constructor 'rstan::stan_fit<Model, RNG>::stan_fit(SEXP, SEXP) [with Model = model11ac79001cdd_conflict_namespace::model11ac79001cdd_conflict, RNG = boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> >, SEXP = SEXPREC*]':

...

SAMPLING FOR MODEL 'conflict' NOW (CHAIN 1).
Iteration: 1000 / 1000 [100%]  (Sampling)

SAMPLING FOR MODEL 'conflict' NOW (CHAIN 2).
Iteration: 1000 / 1000 [100%]  (Sampling)

SAMPLING FOR MODEL 'conflict' NOW (CHAIN 3).
Iteration: 1000 / 1000 [100%]  (Sampling)

SAMPLING FOR MODEL 'conflict' NOW (CHAIN 4).
Iteration: 1000 / 1000 [100%]  (Sampling)

# Details can be different from actual outputs depending on version


In principle, running Stan means a kind of Monte Carlo simulation and it takes long time. Please enjoy a cup of coffee until the process finishes. :)


To see its result at a glance, just run as below.

> conflict.fit
Inference for Stan model: conflict.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

       mean se_mean  sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
b0     -1.4     0.0 0.3   -1.9   -1.6   -1.4   -1.2   -0.9  1053    1
b1      1.1     0.0 0.2    0.7    1.0    1.1    1.2    1.4  1355    1
b2     -0.5     0.0 0.2   -0.9   -0.7   -0.6   -0.4   -0.2  1456    1
b3      0.1     0.0 0.2   -0.2    0.0    0.1    0.2    0.4  1425    1
b4     -3.0     0.0 0.2   -3.5   -3.2   -3.0   -2.9   -2.6  1040    1
b5      1.5     0.0 0.2    1.2    1.4    1.5    1.7    1.9  1384    1
b6      5.4     0.0 0.2    5.0    5.2    5.4    5.5    5.8   826    1
b7      0.1     0.0 0.2   -0.3    0.0    0.1    0.2    0.4  1425    1
lp__ -526.3     0.1 2.0 -530.8 -527.4 -526.0 -524.8 -523.4   727    1

Samples were drawn using NUTS2 at Wed Nov 06 12:38:51 2013.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Check convergence with plotting


Usually Bayesian practitioners like to check whether a model estimation converges well or not. If you want so, you have to plot a distribution (histogram) of each estimated parameter. The simplest way is using {coda}.

> library(coda)
> conflict.fit.coda<-mcmc.list(lapply(1:ncol(conflict.fit),function(x) mcmc(as.array(conflict.fit)[,x,])))
> plot(conflict.fit.coda)

f:id:TJO:20131106124829p:plain

f:id:TJO:20131106124839p:plain

f:id:TJO:20131106125651p:plain


You can see b6 and b6 converge well but in particular b4 and b5 are not good. This is just a plotting histogram so you can also use {reshape} and {ggplot2} to draw more sophisticated plots.

> d.ext<-extract(d.fit,permuted=T)
> N.mcmc<-length(d.ext$beta0)
> b1<-d.ext$beta[1:2000]
> b2<-d.ext$beta[2001:4000]
> b3<-d.ext$beta[4001:6000]
> b4<-d.ext$beta[6001:8000]
> b5<-d.ext$beta[8001:10000]
> b6<-d.ext$beta[10001:12000]
> b7<-d.ext$beta[12001:14000]
> bs2<-data.frame(b1=b1,b2=b2,b3=b3,b4=b4,b5=b5,b6=b6,b7=b7)
> bs2.melt<-melt(bs2,id=c(),variable.name="param")
> bs2.qua.melt<-ddply(bs2.melt,.(param),summarize,
+ median=median(value),
+ ymax=quantile(value,prob=0.975),
+ ymin=quantile(value,prob=0.025))
> colnames(bs2.qua.melt)[2]<-"value"

> bs2.melt<-data.frame(bs2.melt,ymax=rep(0,N.mcmc),ymin=rep(0,N.mcmc))
> p<-ggplot(bs2.melt,aes(x=param,y=value,group=param,ymax=ymax,ymin=ymin,color=param))
> p<-p+geom_violin(trim=F,fill="#5B423D",linetype="blank",alpha=I(1/3))
> p<-p+geom_pointrange(data=bs2.qua.melt,size=0.4)
> p<-p+labs(x="",y="")+theme(axis.text.x=element_text(size=11),axis.text.y=element_text(size=11))
> ggsave(file="d7.png",plot=p,dpi=300,width=4,height=3)

f:id:TJO:20140128131312p:plain


It's beautiful but needs a little sophisticated skills of {ggplot2} I think. :P) In the next post, I'll argue about how to use Stan for a dataset with random effects; that is, a simple example of hierarchical Bayesian models.

*1:Yeah, it only requires MLE