Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Bayesian modeling with R and Stan (2): Installation and an easy example

The previous post gave an overview of what Stan is and how it works with R.



Are you ready now? OK, this post reviews how to install Stan. Let's start here! :) In principle this post just follows the content of "RStan Getting Started", but some tips are added to fix lesser-known problems.



Warning: this post assumes you are a Windows user. If you use Mac OS or Linux, please see the notes for your OS.


Bayesian modeling with R and Stan (1): Overview

Although I've written a series of posts titled "Machine Learning for package uses in R", I don't usually run machine learning in my daily analytic work, because my current role mainly covers so-called ad-hoc analysis.


Instead of machine learning, ad-hoc analysts often use statistical modeling such as linear models (generally called "multiple regression"), generalized linear models (GLM) and/or econometric time series analysis. But in some situations linear models and their variants do not work, because of nonlinear components and/or individual variation, the so-called "random effects".


In general, random effects can be handled well by generalized linear mixed models (GLMM), and CRAN, for example, has some related packages. But in some cases random effects cannot be formulated concisely and explicitly. If so, we have a strong alternative: Bayesian modeling using the Markov Chain Monte Carlo (MCMC) method.




As it is one of the strongest methods for ad-hoc analysis, a series of posts will discuss Bayesian modeling with MCMC and its application. As the first of the series, this post gives an overview.


Univariate stats sometimes fail, while multivariate modelings work well

In digital marketing, especially online, marketers and analysts love to apply A/B tests in order to find the metric with the most influence on KGIs/KPIs from a huge set of explanatory metrics, such as creative components of the UI, choice of ads, background images of the page, etc.


Such influential metrics are sometimes called "golden features" or "golden metrics" (even though it sounds ridiculous), and many people look very hard for them, as they firmly believe "once the metric is found, we can very easily raise revenue and/or profit just by raising the golden metric!!". Ironically, quite a few A/B tests are run on such a basis.


But is it really true? If you find such a golden metric, can you really raise revenue, gather more users, or get more conversions? Yes, in some cases it may be true; however, you should also look at a case where, in theory, it cannot be.


Below is a link to a dataset we use here. Please download "men.txt" and "women.txt" and import them as "men" and "women" respectively.



This dataset is reproduced from the "Tennis Major Tournament Match Statistics" dataset distributed by the UC Irvine Machine Learning Repository. It contains match results and match stats recorded in the four Grand Slam tournaments of tennis, for both men (ATP) and women (WTA). You'll see some very famous players such as Djokovic, Nadal, Federer and Murray.


The details of our reproduced datasets are as follows:

Result  Result of the match (0/1), from Player 1's point of view: Result = 1 if Player 1 wins (FNL.1 > FNL.2)
FSP.1   First Serve Percentage for player 1 (Real Number)
FSW.1   First Serve Won by player 1 (Real Number)
SSP.1   Second Serve Percentage for player 1 (Real Number)
SSW.1   Second Serve Won by player 1 (Real Number)
ACE.1   Aces won by player 1 (Numeric-Integer)
DBF.1   Double Faults committed by player 1 (Numeric-Integer)
WNR.1   Winners earned by player 1 (Numeric)
UFE.1   Unforced Errors committed by player 1 (Numeric)
BPC.1   Break Points Created by player 1 (Numeric)
BPW.1   Break Points Won by player 1 (Numeric)
NPA.1   Net Points Attempted by player 1 (Numeric)
NPW.1   Net Points Won by player 1 (Numeric)
FSP.2   First Serve Percentage for player 2 (Real Number)
FSW.2   First Serve Won by player 2 (Real Number)
SSP.2   Second Serve Percentage for player 2 (Real Number)
SSW.2   Second Serve Won by player 2 (Real Number)
ACE.2   Aces won by player 2 (Numeric-Integer)
DBF.2   Double Faults committed by player 2 (Numeric-Integer)
WNR.2   Winners earned by player 2 (Numeric)
UFE.2   Unforced Errors committed by player 2 (Numeric)
BPC.2   Break Points Created by player 2 (Numeric)
BPW.2   Break Points Won by player 2 (Numeric)
NPA.2   Net Points Attempted by player 2 (Numeric)
NPW.2   Net Points Won by player 2 (Numeric)


To make them easier to analyze, let's run the following in R.

> dm<-read.delim("men.txt")
> dw<-read.delim("women.txt")
> dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
> dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]

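The negative index vector above drops columns by position and keeps everything else. Here is a minimal sketch of that idiom on a toy data frame; the column names are hypothetical and are not those of the tennis data.

```r
# Toy data frame with hypothetical columns (not the tennis data)
toy <- data.frame(id = 1:3,
                  name = c("a", "b", "c"),
                  x = c(0.1, 0.2, 0.3),
                  y = c(1, 0, 1))

# A negative index vector removes the listed column positions
kept <- toy[, -c(1, 2)]
names(kept)

# A run of consecutive positions can also be written as a range,
# e.g. 16:21 instead of 16,17,18,19,20,21
same <- identical(c(1, 2, 16:21), c(1, 2, 16, 17, 18, 19, 20, 21))
same
```

Checking `names()` or `str()` on the result right after dropping columns is a cheap way to confirm that the intended columns remain.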

Here we set our mission as follows:

  1. To determine "golden" metrics or to build a model from the men's dataset
  2. To predict women's results from a model given by rules or built with the men's dataset
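The two steps can be sketched roughly as below. This is only an illustration under loud assumptions: it uses synthetic data rather than the tennis files, the features `x1` and `x2` and the helper `make_data` are hypothetical, and logistic regression via `glm` is just one possible choice of model, not necessarily the one this post ends up with.

```r
set.seed(1)

# Synthetic stand-in for a dataset with a 0/1 Result column
# (x1, x2 and make_data are hypothetical, for illustration only)
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  p  <- plogis(1.5 * x1 - 1.0 * x2)
  data.frame(Result = rbinom(n, 1, p), x1 = x1, x2 = x2)
}
train <- make_data(200)  # plays the role of the men's dataset
test  <- make_data(100)  # plays the role of the women's dataset

# Step 1: build a model from the first dataset
fit <- glm(Result ~ ., data = train, family = binomial)

# Step 2: predict the other dataset's results and threshold at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
head(pred)
```

The point of the design is that the model never sees the second dataset while fitting, so the predictions measure how well the relationships learned from one group carry over to the other.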


The result is to be evaluated using a confusion matrix.
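For readers new to the term, a confusion matrix is simply a cross-tabulation of predicted against actual outcomes. Here is a minimal sketch with base R's `table()`; the two vectors are made-up values for illustration, not results from the tennis data.

```r
# Hypothetical actual and predicted outcomes (1 = Player 1 wins)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Rows are predictions, columns are the true results
cm <- table(Predicted = predicted, Actual = actual)
cm

# Accuracy is the share of matches on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.75 for these toy vectors
```

The off-diagonal cells show the two kinds of errors separately, which a single accuracy number hides.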
