Univariate statistics sometimes fail, while multivariate modeling works well
In digital marketing, especially online, marketers and analysts love to run A/B tests to find the metric that most influences their KGIs/KPIs among a huge set of explanatory metrics: creative components of the UI, the choice of ads, the background image of the page, and so on.
Such an influential metric is sometimes called a "golden feature" or "golden metric" -- even though it sounds ridiculous -- and many people hunt for it very hard, firmly believing that "once the metric is found, we can easily raise revenue and/or profit just by raising that golden metric!" Ironically, quite a few A/B tests are run on exactly that premise.
But is it really true? If you find such a golden metric, can you really raise revenue, gather more users, or get more conversions? In some cases it may be true; below, however, is a case where it theoretically cannot be.
Below is a link to a dataset we use here. Please download "men.txt" and "women.txt" and import them as "men" and "women" respectively.
This dataset is reproduced from the "Tennis Major Tournament Match Statistics" dataset distributed by the UC Irvine Machine Learning Repository. It contains match results and match stats recorded at the four tennis Grand Slam tournaments, for both men (ATP) and women (WTA). You'll see some very famous players such as Djokovic, Nadal, Federer, and Murray.
Below are the details of the reproduced datasets:
Result Result of the match (0/1), referenced to Player 1: Result = 1 if Player 1 wins (FNL.1 > FNL.2)
FSP.1 First Serve Percentage for player 1 (Real Number)
FSW.1 First Serve Won by player 1 (Real Number)
SSP.1 Second Serve Percentage for player 1 (Real Number)
SSW.1 Second Serve Won by player 1 (Real Number)
ACE.1 Aces won by player 1 (Numeric-Integer)
DBF.1 Double Faults committed by player 1 (Numeric-Integer)
WNR.1 Winners earned by player 1 (Numeric)
UFE.1 Unforced Errors committed by player 1 (Numeric)
BPC.1 Break Points Created by player 1 (Numeric)
BPW.1 Break Points Won by player 1 (Numeric)
NPA.1 Net Points Attempted by player 1 (Numeric)
NPW.1 Net Points Won by player 1 (Numeric)
FSP.2 First Serve Percentage for player 2 (Real Number)
FSW.2 First Serve Won by player 2 (Real Number)
SSP.2 Second Serve Percentage for player 2 (Real Number)
SSW.2 Second Serve Won by player 2 (Real Number)
ACE.2 Aces won by player 2 (Numeric-Integer)
DBF.2 Double Faults committed by player 2 (Numeric-Integer)
WNR.2 Winners earned by player 2 (Numeric)
UFE.2 Unforced Errors committed by player 2 (Numeric)
BPC.2 Break Points Created by player 2 (Numeric)
BPW.2 Break Points Won by player 2 (Numeric)
NPA.2 Net Points Attempted by player 2 (Numeric)
NPW.2 Net Points Won by player 2 (Numeric)
To make them easier to analyze, run the following in R.
> dm<-read.delim("men.txt")
> dw<-read.delim("women.txt")
> dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
> dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
Here we set our mission as follows:
- To determine "golden" metrics or to build a model from the men's dataset
- To predict the women's results using the rules or the model derived from the men's dataset
The results are evaluated with a confusion matrix (a small helper sketch for that follows below).
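Throughout this post the evaluation boils down to a confusion matrix and the share of its diagonal, so the accuracy one-liners later on all follow the same pattern. Just as a convenience, here is a tiny helper sketch; the accuracy() function is my own shorthand, not part of the original analysis.

# Hypothetical helper: accuracy = sum of the confusion matrix diagonal
# divided by the total number of observations.
accuracy <- function(actual, predicted) {
  cm <- table(actual, predicted)  # confusion matrix
  sum(diag(cm)) / sum(cm)         # proportion of correct predictions
}
# e.g. accuracy(dw$Result, ifelse(dw$FSP.1 >= dw$FSP.2, 1, 0))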
A/B testing and rule-based prediction
OK, first let's run a t-test as a univariate analysis on each explanatory variable. With this style of analysis we expect to end up with some "golden" metrics, from which we can derive rules to predict the outcome for new datasets. Below is the structure of the men's dataset.
> str(dm)
'data.frame':	491 obs. of  25 variables:
 $ Result: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 2 1 2 ...
 $ FSP.1 : int  61 61 52 53 76 65 68 47 64 77 ...
 $ FSW.1 : int  35 31 53 39 63 51 73 18 26 76 ...
 $ SSP.1 : int  39 39 48 47 24 35 32 53 36 23 ...
 $ SSW.1 : int  18 13 20 24 12 22 24 15 12 11 ...
 $ ACE.1 : int  5 13 8 8 0 9 5 3 3 6 ...
 $ DBF.1 : int  1 1 4 6 4 3 3 4 0 4 ...
 $ WNR.1 : int  17 13 37 8 16 35 41 21 20 6 ...
 $ UFE.1 : int  29 1 50 6 35 41 50 31 39 4 ...
 $ BPC.1 : int  1 7 1 6 3 2 9 6 3 7 ...
 $ BPW.1 : int  3 14 9 9 12 7 17 20 7 24 ...
 $ NPA.1 : int  8 0 16 0 9 6 14 6 5 0 ...
 $ NPW.1 : int  11 0 23 0 13 12 30 9 14 0 ...
 $ FSP.2 : int  68 60 77 50 53 63 60 54 67 60 ...
 $ FSW.2 : int  45 23 57 24 59 60 66 26 42 68 ...
 $ SSP.2 : int  32 40 23 50 47 37 40 46 33 40 ...
 $ SSW.2 : int  17 9 15 19 32 22 34 13 14 25 ...
 $ ACE.2 : int  10 1 9 1 17 24 2 0 12 8 ...
 $ DBF.2 : int  0 4 1 8 11 4 6 11 0 12 ...
 $ WNR.2 : int  40 1 41 1 59 47 57 11 32 8 ...
 $ UFE.2 : int  30 4 41 8 79 45 72 46 20 12 ...
 $ BPC.2 : int  4 0 4 1 3 4 10 2 7 6 ...
 $ BPW.2 : int  8 0 13 7 5 7 17 6 10 14 ...
 $ NPA.2 : int  8 0 12 0 16 14 25 8 8 0 ...
 $ NPW.2 : int  9 0 16 0 28 17 36 12 11 0 ...
In principle we have to run a t-test on each pair, such as FSP.1 vs. FSP.2, one by one; if a test shows a significant difference in means between the two, we can take that pair as a "golden" metric and set up a rule-based predictor as below.
> table(dw$Result,ifelse(dw$FSP.1>=dw$FSP.2,1,0))
This is a very simple rule-based predictor that returns 1 (won) if FSP.1 >= FSP.2 and 0 (lost) otherwise. Let's run a series of t-tests.
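(If you'd rather not type twelve calls by hand, a loop along these lines, which is my own shorthand rather than anything from the original analysis, collects all the unpaired p-values in one pass; the individual results follow anyway.)

# Run an unpaired Welch t-test for every ".1" vs ".2" metric pair
# and gather the p-values in one named vector.
metrics <- c("FSP","FSW","SSP","SSW","ACE","DBF","WNR","UFE","BPC","BPW","NPA","NPW")
pvals <- sapply(metrics, function(m) t.test(dm[[paste0(m, ".1")]], dm[[paste0(m, ".2")]])$p.value)
sort(pvals)  # smallest p-values first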
> t.test(dm$FSP.1,dm$FSP.2)
	Welch Two Sample t-test
data:  dm$FSP.1 and dm$FSP.2
t = 1.2133, df = 978.539, p-value = 0.2253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.3533561 1.4979590
sample estimates: mean of x 61.89613, mean of y 61.32383

> t.test(dm$FSW.1,dm$FSW.2)
	Welch Two Sample t-test
data:  dm$FSW.1 and dm$FSW.2
t = 0.3966, df = 979.277, p-value = 0.6917
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -1.640232 2.471189
sample estimates: mean of x 49.24236, mean of y 48.82688

> t.test(dm$SSP.1,dm$SSP.2)
	Welch Two Sample t-test
data:  dm$SSP.1 and dm$SSP.2
t = -1.2133, df = 978.539, p-value = 0.2253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -1.4979590 0.3533561
sample estimates: mean of x 38.10387, mean of y 38.67617

> t.test(dm$SSW.1,dm$SSW.2)
	Welch Two Sample t-test
data:  dm$SSW.1 and dm$SSW.2
t = -0.7498, df = 979.997, p-value = 0.4536
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -1.4661090 0.6555183
sample estimates: mean of x 21.41752, mean of y 21.82281

> t.test(dm$ACE.1,dm$ACE.2)
	Welch Two Sample t-test
data:  dm$ACE.1 and dm$ACE.2
t = 0.1519, df = 979.99, p-value = 0.8793
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.7524564 0.8787293
sample estimates: mean of x 9.034623, mean of y 8.971487

> t.test(dm$DBF.1,dm$DBF.2)
	Welch Two Sample t-test
data:  dm$DBF.1 and dm$DBF.2
t = -0.646, df = 979.909, p-value = 0.5184
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.4769591 0.2407066
sample estimates: mean of x 3.926680, mean of y 4.044807

> t.test(dm$WNR.1,dm$WNR.2)
	Welch Two Sample t-test
data:  dm$WNR.1 and dm$WNR.2
t = 0.1385, df = 978.411, p-value = 0.8899
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -2.521403 2.904295
sample estimates: mean of x 27.30550, mean of y 27.11405

> t.test(dm$UFE.1,dm$UFE.2)
	Welch Two Sample t-test
data:  dm$UFE.1 and dm$UFE.2
t = -0.5165, df = 978.942, p-value = 0.6056
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.147494 1.835885
sample estimates: mean of x 22.97963, mean of y 23.63544

> t.test(dm$BPC.1,dm$BPC.2)
	Welch Two Sample t-test
data:  dm$BPC.1 and dm$BPC.2
t = 1.2245, df = 979.305, p-value = 0.2211
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.1791842 0.7738889
sample estimates: mean of x 5.052953, mean of y 4.755601

> t.test(dm$BPW.1,dm$BPW.2)
	Welch Two Sample t-test
data:  dm$BPW.1 and dm$BPW.2
t = 0.3754, df = 979.784, p-value = 0.7075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.5682952 0.8371344
sample estimates: mean of x 8.095723, mean of y 7.961303

> t.test(dm$NPA.1,dm$NPA.2)
	Welch Two Sample t-test
data:  dm$NPA.1 and dm$NPA.2
t = -0.6571, df = 979.895, p-value = 0.5113
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -2.354531 1.173268
sample estimates: mean of x 17.60081, mean of y 18.19145

> t.test(dm$NPW.1,dm$NPW.2)
	Welch Two Sample t-test
data:  dm$NPW.1 and dm$NPW.2
t = -0.8552, df = 969.859, p-value = 0.3926
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -2.549798 1.001936
sample estimates: mean of x 20.57230, mean of y 21.34623
Nope, not a single t-test shows a significant difference... ah, well, I guess that's because those were "unpaired" t-tests. OK, let's run them again as "paired" t-tests, here in the form of one-sample t-tests on the within-match differences, which is equivalent.
> t.test(dm$FSP.1-dm$FSP.2)
	One Sample t-test
data:  dm$FSP.1 - dm$FSP.2
t = 1.2852, df = 490, p-value = 0.1993
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.3026416 1.4472445
sample estimates: mean of x 0.5723014

> t.test(dm$FSW.1-dm$FSW.2)
	One Sample t-test
data:  dm$FSW.1 - dm$FSW.2
t = 0.953, df = 490, p-value = 0.3411
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.4411259 1.2720831
sample estimates: mean of x 0.4154786

> t.test(dm$SSP.1-dm$SSP.2)
	One Sample t-test
data:  dm$SSP.1 - dm$SSP.2
t = -1.2852, df = 490, p-value = 0.1993
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.4472445 0.3026416
sample estimates: mean of x -0.5723014

> t.test(dm$SSW.1-dm$SSW.2)
	One Sample t-test
data:  dm$SSW.1 - dm$SSW.2
t = -1.1533, df = 490, p-value = 0.2493
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.0957532 0.2851626
sample estimates: mean of x -0.4052953

> t.test(dm$ACE.1-dm$ACE.2)
	One Sample t-test
data:  dm$ACE.1 - dm$ACE.2
t = 0.1696, df = 490, p-value = 0.8654
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.6684912 0.7947642
sample estimates: mean of x 0.06313646

> t.test(dm$DBF.1-dm$DBF.2)
	One Sample t-test
data:  dm$DBF.1 - dm$DBF.2
t = -0.6925, df = 490, p-value = 0.489
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.4532985 0.2170459
sample estimates: mean of x -0.1181263

> t.test(dm$WNR.1-dm$WNR.2)
	One Sample t-test
data:  dm$WNR.1 - dm$WNR.2
t = 0.2815, df = 490, p-value = 0.7784
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.144589 1.527481
sample estimates: mean of x 0.191446

> t.test(dm$UFE.1-dm$UFE.2)
	One Sample t-test
data:  dm$UFE.1 - dm$UFE.2
t = -1.0618, df = 490, p-value = 0.2888
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.8693327 0.5577237
sample estimates: mean of x -0.6558045

> t.test(dm$BPC.1-dm$BPC.2)
	One Sample t-test
data:  dm$BPC.1 - dm$BPC.2
t = 1.3144, df = 490, p-value = 0.1893
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.1471336 0.7418383
sample estimates: mean of x 0.2973523

> t.test(dm$BPW.1-dm$BPW.2)
	One Sample t-test
data:  dm$BPW.1 - dm$BPW.2
t = 0.4239, df = 490, p-value = 0.6718
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.4886140 0.7574531
sample estimates: mean of x 0.1344196

> t.test(dm$NPA.1-dm$NPA.2)
	One Sample t-test
data:  dm$NPA.1 - dm$NPA.2
t = -1.0838, df = 490, p-value = 0.279
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.6613778 0.4801151
sample estimates: mean of x -0.5906314

> t.test(dm$NPW.1-dm$NPW.2)
	One Sample t-test
data:  dm$NPW.1 - dm$NPW.2
t = -1.2888, df = 490, p-value = 0.1981
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.9538517 0.4059902
sample estimates: mean of x -0.7739308
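(By the way, a one-sample t-test on the within-match differences is mathematically identical to a paired t-test, so the explicit paired form reproduces exactly the same numbers, for example:)

# Identical to t.test(dm$FSP.1 - dm$FSP.2) above
t.test(dm$FSP.1, dm$FSP.2, paired = TRUE)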
OMG, even with the "paired" t-tests, no significant difference in means appears. This result is not so surprising: see the plot below, which simply shows the mean and standard deviation (as error bars) of each metric.
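(The original figure isn't reproduced here, but a sketch along these lines, using base R graphics, would draw it: the mean of each metric with an error bar of plus or minus one standard deviation. The styling of the original plot may have differed.)

# Mean of each explanatory metric with +/- 1 SD error bars (men's data)
vars <- setdiff(names(dm), "Result")
mns  <- sapply(dm[, vars], mean)
sds  <- sapply(dm[, vars], sd)
x    <- seq_along(vars)
plot(x, mns, pch = 19, xaxt = "n", xlab = "", ylab = "mean +/- SD",
     ylim = range(mns - sds, mns + sds))
axis(1, at = x, labels = vars, las = 2, cex.axis = 0.7)
arrows(x, mns - sds, x, mns + sds, angle = 90, code = 3, length = 0.03)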
Almost all the metrics have error bars that are far too large. :( Just for your information, I tried building a rule-based predictor with the metric pair that showed the lowest p-value, BPC.1 vs. BPC.2.
> table(dw$Result,ifelse(dw$BPC.1>=dw$BPC.2,1,0))
      0   1
  0 202  25
  1  21 204
> sum(diag(table(dw$Result,ifelse(dw$BPC.1>=dw$BPC.2,1,0))))/nrow(dw)
[1] 0.8982301   # Accuracy 89.8%...
It appears that even a metric pair with no significant difference in means can predict the women's results to some extent... but do you really want to conclude that these match stats are never useful for predicting match results?
Multivariate modeling
In short, I don't think so. In such a case, I know multivariate modeling works well. Below are a couple of examples.
# L1-penalized logistic regression
> library(glmnet)
> dm.cv.glmnet<-cv.glmnet(as.matrix(dm[,-1]),as.matrix(dm[,1]),family="binomial",alpha=1)
> coef(dm.cv.glmnet,s=dm.cv.glmnet$lambda.min)
25 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  1.507835e-01
FSP.1        4.779216e-02
FSW.1        1.252779e-01
SSP.1       -4.022443e-05
SSW.1        1.629827e-01
ACE.1        .
DBF.1       -9.460367e-02
WNR.1        3.979866e-02
UFE.1       -7.996179e-03
BPC.1        3.731964e-01   # Best parameter!
BPW.1        2.176386e-01
NPA.1        .
NPW.1        .
FSP.2       -3.429355e-02
FSW.2       -1.680302e-01
SSP.2        .
SSW.2       -1.451930e-01
ACE.2        1.487694e-02
DBF.2        4.696238e-02
WNR.2       -2.227043e-02
UFE.2       -1.778775e-03
BPC.2       -3.599556e-01   # Best parameter!
BPW.2       -2.105379e-01
NPA.2        .
NPW.2        1.424483e-02
> table(dw$Result,round(predict(dm.cv.glmnet,newx=as.matrix(dw[,-1]),type='response',s=dm.cv.glmnet$lambda.min),0))
      0   1
  0 216  11
  1  19 206
> sum(diag(table(dw$Result,round(predict(dm.cv.glmnet,newx=as.matrix(dw[,-1]),type='response',s=dm.cv.glmnet$lambda.min),0))))/nrow(dw)
[1] 0.9336283   # Accuracy 93.4%
Yeah, the L1-penalized logistic regression says the BPC.1/BPC.2 pair is the most influential on the result. But I'm not sure whether this is the best model.
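(If you want the fitted object to report that ranking directly instead of eyeballing the coefficient table, a small sketch like this, again my own addition, pulls the nonzero coefficients out and sorts them by absolute size. Keep in mind the predictors are on different scales, so this ranking is only a rough guide.)

# Nonzero lasso coefficients at lambda.min, ranked by absolute magnitude
cf <- as.matrix(coef(dm.cv.glmnet, s = dm.cv.glmnet$lambda.min))
cf <- cf[rownames(cf) != "(Intercept)", , drop = FALSE]
nz <- cf[cf[, 1] != 0, 1]
sort(abs(nz), decreasing = TRUE)  # BPC.1 / BPC.2 come out on top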
# Linear SVM (not Gaussian kernel SVM!)
> library(e1071)
> dm$Result<-as.factor(dm$Result)
> dw$Result<-as.factor(dw$Result)
> dm.tune<-tune.svm(Result~.,data=dm,kernel='linear')
> dm.tune$best.model

Call:
best.svm(x = Result ~ ., data = dm, kernel = "linear")

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  1
      gamma:  0.04166667

Number of Support Vectors:  97

> dm.linear.svm<-svm(Result~.,dm,kernel='linear',cost=dm.tune$best.model$cost,gamma=dm.tune$best.model$gamma)
> table(dw$Result,predict(dm.linear.svm,newdata=dw[,-1]))
      0   1
  0 214  13
  1  16 209
> sum(diag(table(dw$Result,predict(dm.linear.svm,newdata=dw[,-1]))))/nrow(dw)
[1] 0.9358407   # Accuracy 93.6% !!!
Actually, I had already tried a wide variety of machine learning classifiers, and this one was the best model for these tennis datasets. :P
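(For reference, here is roughly what one of those other attempts would look like in the same train-on-men / evaluate-on-women setup. This sketch assumes the randomForest package and makes no claim about the accuracy it reaches; it is not from the original post.)

# One more classifier, same protocol: fit on men's matches, score women's
library(randomForest)
set.seed(1)
dm.rf <- randomForest(Result ~ ., data = dm)   # Result is already a factor here
cm <- table(dw$Result, predict(dm.rf, newdata = dw[, -1]))
sum(diag(cm)) / nrow(dw)                       # accuracy on the women's dataset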
Conclusions
These results tell us that univariate statistics, and the rule-based predictors derived from the usual hypothesis tests on them, sometimes fail, while multivariate models, whether (generalized) linear models or machine learning classifiers, work well.
In general, multi-dimensional, multivariate features represent more complex information and more of the internal structure of a dataset than univariate features do. Yet in many marketing situations, quite a few people neglect the importance of multivariate information and insist on running univariate A/B tests, hunting for "golden" features or metrics.
Even when multiple features are partially correlated with each other, such univariate A/B testing can go wrong, because partial correlation easily distorts the outcome of ordinary univariate correlation (and of univariate testing as well).
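To see why, here is a toy simulation (purely synthetic data, nothing to do with the tennis set): two strongly correlated features each show almost no group-wise mean difference, yet a model using both of them classifies well, because the signal lives in their difference.

# Toy illustration: univariate tests miss a signal that a multivariate model finds
set.seed(123)
n  <- 500
z  <- rnorm(n, sd = 20)                            # large shared component -> strong correlation
x1 <- z + rnorm(n)
x2 <- z + rnorm(n)
y  <- factor(rbinom(n, 1, plogis(2 * (x1 - x2))))  # outcome driven by the *difference*

t.test(x1 ~ y)$p.value   # typically far from significant
t.test(x2 ~ y)$p.value   # same here

fit <- glm(y ~ x1 + x2, family = binomial)
mean((predict(fit, type = "response") > 0.5) == (y == "1"))  # in-sample accuracy, typically high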
If you have a multivariate dataset, please try multivariate modeling, and don't cling to univariate A/B testing.