Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Univariate stats sometimes fail, while multivariate modeling works well

In many cases of digital marketing, especially online, marketers or analysts love to apply A/B tests in order to find the metric that most influences KGIs/KPIs among a huge set of explanatory metrics, such as creative components of the UI, choice of ads, background images of the page, etc.


Such an influential metric is sometimes called a "golden feature" or "golden metric" (even though it sounds ridiculous), and many people look for it very hard, firmly believing that "once the metric is found, we can easily raise revenue and/or profit just by raising the golden metric!!" Ironically, not a few A/B tests are run on that basis.


But is it really true? If you find such a golden metric, can you really raise revenue, gather more users, or get more conversions? Yes, in some cases it may be true; however, here you will see a case in which, theoretically, it cannot be.


Below is a link to the dataset we use here. Please download "men.txt" and "women.txt" and import them as "men" and "women" respectively.



This dataset is reproduced from the "Tennis Major Tournament Match Statistics" dataset distributed by the UC Irvine Machine Learning Repository. It contains match results and match stats recorded at the four Grand Slam tournaments of tennis, for both men (ATP) and women (WTA). You'll see some very famous players such as Djokovic, Nadal, Federer and Murray.


Below are the details of our reproduced datasets:

Result Result of the match (0/1), referenced on Player 1; Result = 1 if Player 1 wins (FNL.1 > FNL.2)
FSP.1 First Serve Percentage for player 1 (Real Number)
FSW.1 First Serve Won by player 1 (Real Number)
SSP.1 Second Serve Percentage for player 1 (Real Number)
SSW.1 Second Serve Won by player 1 (Real Number)
ACE.1 Aces won by player 1 (Numeric-Integer)
DBF.1 Double Faults committed by player 1 (Numeric-Integer)
WNR.1 Winners earned by player 1 (Numeric)
UFE.1 Unforced Errors committed by player 1 (Numeric)
BPC.1 Break Points Created by player 1 (Numeric)
BPW.1 Break Points Won by player 1 (Numeric)
NPA.1 Net Points Attempted by player 1 (Numeric)
NPW.1 Net Points Won by player 1 (Numeric)
FSP.2 First Serve Percentage for player 2 (Real Number)
FSW.2 First Serve Won by player 2 (Real Number)
SSP.2 Second Serve Percentage for player 2 (Real Number)
SSW.2 Second Serve Won by player 2 (Real Number)
ACE.2 Aces won by player 2 (Numeric-Integer)
DBF.2 Double Faults committed by player 2 (Numeric-Integer)
WNR.2 Winners earned by player 2 (Numeric)
UFE.2 Unforced Errors committed by player 2 (Numeric)
BPC.2 Break Points Created by player 2 (Numeric)
BPW.2 Break Points Won by player 2 (Numeric)
NPA.2 Net Points Attempted by player 2 (Numeric)
NPW.2 Net Points Won by player 2 (Numeric)


To make them easier to analyze, let's run the following in R.

> dm<-read.delim("men.txt")   # men's (ATP) matches
> dw<-read.delim("women.txt") # women's (WTA) matches
> # drop the columns we don't use as explanatory variables here
> dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
> dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]


Here we set our mission as follows:

  1. To determine "golden" metrics or to build a model from the men's dataset
  2. To predict the women's results with the rules or the model obtained from the men's dataset


The result is to be evaluated with a confusion matrix.
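
As a minimal sketch of step 2 and the evaluation (using a plain logistic regression only as a placeholder for whatever rules or model you actually derive from the men's dataset), it would look like this:

> dm.glm <- glm(Result ~ ., data = dm, family = binomial)  # fit on the men's data
> pred <- ifelse(predict(dm.glm, newdata = dw, type = "response") > 0.5, 1, 0)  # predict the women's results
> table(dw$Result, pred)  # confusion matrix: rows = actual, columns = predicted

Any other classifier can be plugged into the same scheme: train on dm, predict on dw, and tabulate the predictions against dw$Result.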

Read more

Machine learning for package users with R (6): Xgboost (eXtreme Gradient Boosting)

As far as I know, Xgboost is one of the most successful machine learning classifiers in competitions such as Kaggle or the KDD Cup. Indeed, the winning team of the Higgs Boson competition used Xgboost, and their code release is linked below.



Fortunately, we have good implementations for both R and Python, distributed via GitHub. You can get them from the link below.



The {xgboost} package is not on CRAN yet, so you have to install it with the {devtools} package. The README.md of the GitHub repository above shows how to install it, and you can easily do it.
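
A minimal sketch of that installation (the repository path below is an assumption; follow the repository's README for the exact, current command):

> install.packages("devtools")  # if {devtools} is not installed yet
> devtools::install_github("dmlc/xgboost", subdir = "R-package")  # install {xgboost} from GitHub
> library(xgboost)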

Read more

Machine learning for package users with R (5): Random Forest

Random Forest is still one of the strongest supervised learning methods, although these days many people prefer Deep Learning or convolutional neural networks. That is largely because of its simple architecture and the many implementations available in various environments and languages, e.g. Python and R.


The point I want to emphasize here is that Random Forest is a strong method not only for classification but also for estimating variable importance, a capability inherited from its origin in decision / regression trees.
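
For example, with the {randomForest} package (reusing the men's tennis data frame dm loaded above purely as an illustration), variable importance comes almost for free:

> library(randomForest)
> dm.rf <- randomForest(as.factor(Result) ~ ., data = dm)  # grow a classification forest
> importance(dm.rf)   # mean decrease in Gini for each explanatory variable
> varImpPlot(dm.rf)   # dotchart of the variable importance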

Read more