Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

{mxnet} R package from MXnet, an intuitive Deep Learning framework including CNN & RNN

Actually I've known about MXnet for weeks as one of the most popular library / packages in Kaggler, but just recently I heard bug fix has been almost done and some friends say the latest version looks stable, so at last I installed it.



MXnet is a framework distributed by DMLC, the team also known as a distributor of Xgboost. Now its documentation looks to be completed and even pre-trained models for ImageNet are distributed. I think this should be a good news for R-users loving machine learning... so let's go.

Read more

Can multivariate modeling predict taste of wine? Beyond human intuition and univariate reductionism


At a certain meetup on the other day, I talked about a brand-new relationship between taste of wine (i.e. professional tasting) and data science. This talk was inspired by a book "Wine Science: The Application of Science in Winemaking". Below is its Japanese edition.


新しいワインの科学

新しいワインの科学


For readers who can't read Japanese, I summarized the content of the talk in this post. Just for your information, I myself am a super wine lover :) and I'm also much interested in how data science explain taste of wine.


In order to run analytics below, I prepared an R workspace in my GitHub repository. Please download and import it into your R environment.



(Note: All quotes from Goode's book here are reversely translated from the Japanese edition and it may contain not a few difference from the original version)

Read more

Univariate stats sometimes fail, while multivariate modelings work well

In many cases of digital marketing especially if it's online, marketers or analysts usually love to apply A/B tests in order to find the most influential metric on KGI/KPIs from a huge set of explanatory metrics, such as creative components of UI, choice of ads, background images of the page, etc.


Such influential metrics are sometimes called "golden feature" or "golden metric" -- even though it sounds ridiculous -- and many people are looking for it very hard, as they firmly believe "once the metric is found, we can very easily raise revenue and/or profit with just raising the golden metric!!". Ironically, not a few A/B tests are run on such a basis.


But, is it really true? If you find any kind of such golden metrics, can you really raise revenue, gather more users, or get more conversions? Yes, in some cases it may be true; however you have to see a case that theoretically it cannot be.


Below is a link to a dataset we use here. Please download "men.txt" and "women.txt" and import them as "men" and "women" respectively.



This dataset is reproduced from "Tennis Major Tournament Match Statistics" dataset distributed by UC Irvine Machine Learning Repository. It contains match results and match stats recorded in 4 major grand slam tournaments of tennis, for both men (ATP) and women (WTA). You'll see some very famous players such as Djokovic, Nadal, Federer and Murray.


Below is a detail of our reproduced datasets:

Result Result of the match (0/1) - Referenced on Player 1 is Result = 1 if Player 1 wins (FNL.1>FNL.2)
FSP.1 First Serve Percentage for player 1 (Real Number)
FSW.1 First Serve Won by player 1 (Real Number)
SSP.1 Second Serve Percentage for player 1 (Real Number)
SSW.1 Second Serve Won by player 1 (Real Number)
ACE.1 Aces won by player 1 (Numeric-Integer)
DBF.1 Double Faults committed by player 1 (Numeric-Integer)
WNR.1 Winners earned by player 1 (Numeric)
UFE.1 Unforced Errors committed by player 1 (Numeric)
BPC.1 Break Points Created by player 1 (Numeric)
BPW.1 Break Points Won by player 1 (Numeric)
NPA.1 Net Points Attempted by player 1 (Numeric)
NPW.1 Net Points Won by player 1 (Numeric)
FSP.2 First Serve Percentage for player 2 (Real Number)
FSW.2 First Serve Won by player 2 (Real Number)
SSP.2 Second Serve Percentage for player 2 (Real Number)
SSW.2 Second Serve Won by player 2 (Real Number)
ACE.2 Aces won by player 2 (Numeric-Integer)
DBF.2 Double Faults committed by player 2 (Numeric-Integer)
WNR.2 Winners earned by player 2 (Numeric)
UFE.2 Unforced Errors committed by player 2 (Numeric)
BPC.2 Break Points Created by player 2 (Numeric)
BPW.2 Break Points Won by player 2 (Numeric)
NPA.2 Net Points Attempted by player 2 (Numeric)
NPW.2 Net Points Won by player 2 (Numeric)


To get easier to analyze them, let's run as below on R.

> dm<-read.delim("men.txt")
> dw<-read.delim("women.txt")
> dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
> dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]


Here we set a mission as below:

  1. To determine "golden" metrics or to build a model from the men's dataset
  2. To predict women's results from a model given by rules or built with the men's dataset


The result is to be evaluated using confusion matrix.

Read more