In digital marketing, especially online, marketers and analysts love to run A/B tests to find the metric that most influences their KGIs/KPIs among a huge set of explanatory metrics, such as creative components of the UI, choice of ads, background images of the page, and so on.

Such influential metrics are sometimes called "golden features" or "golden metrics" (ridiculous as that sounds), and many people search hard for them, firmly believing that "once the metric is found, we can very easily raise revenue and/or profit just by raising the golden metric!!". Ironically, quite a few A/B tests are run on this basis.

But is it really true? If you find such a golden metric, can you really raise revenue, gather more users, or get more conversions? Yes, in some cases it may be true; however, you should look at a case where, in theory, it cannot be.

Below is a link to a dataset we use here. Please download "men.txt" and "women.txt" and import them as "men" and "women" respectively.

This dataset is reproduced from the "Tennis Major Tournament Match Statistics" dataset distributed by the UC Irvine Machine Learning Repository. It contains match results and match stats recorded at the four Grand Slam tournaments, for both men (ATP) and women (WTA). You'll see some very famous players such as Djokovic, Nadal, Federer and Murray.

The variables in our reproduced datasets are as follows:

| Variable | Description | Type |
|---|---|---|
| Result | Result of the match, referenced on Player 1: Result = 1 if Player 1 wins (FNL.1>FNL.2) | 0/1 |
| FSP.1 | First Serve Percentage for player 1 | Real Number |
| FSW.1 | First Serve Won by player 1 | Real Number |
| SSP.1 | Second Serve Percentage for player 1 | Real Number |
| SSW.1 | Second Serve Won by player 1 | Real Number |
| ACE.1 | Aces won by player 1 | Numeric-Integer |
| DBF.1 | Double Faults committed by player 1 | Numeric-Integer |
| WNR.1 | Winners earned by player 1 | Numeric |
| UFE.1 | Unforced Errors committed by player 1 | Numeric |
| BPC.1 | Break Points Created by player 1 | Numeric |
| BPW.1 | Break Points Won by player 1 | Numeric |
| NPA.1 | Net Points Attempted by player 1 | Numeric |
| NPW.1 | Net Points Won by player 1 | Numeric |
| FSP.2 | First Serve Percentage for player 2 | Real Number |
| FSW.2 | First Serve Won by player 2 | Real Number |
| SSP.2 | Second Serve Percentage for player 2 | Real Number |
| SSW.2 | Second Serve Won by player 2 | Real Number |
| ACE.2 | Aces won by player 2 | Numeric-Integer |
| DBF.2 | Double Faults committed by player 2 | Numeric-Integer |
| WNR.2 | Winners earned by player 2 | Numeric |
| UFE.2 | Unforced Errors committed by player 2 | Numeric |
| BPC.2 | Break Points Created by player 2 | Numeric |
| BPW.2 | Break Points Won by player 2 | Numeric |
| NPA.2 | Net Points Attempted by player 2 | Numeric |
| NPW.2 | Net Points Won by player 2 | Numeric |

To make the data easier to analyze, let's run the following in R.

    > dm<-read.delim("men.txt")
    > dw<-read.delim("women.txt")
    > dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
    > dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]

Here is our mission:

- To determine "golden" metrics or to build a model from the men's dataset
- To predict women's results from a model given by rules or built with the men's dataset

The results are evaluated with a confusion matrix.
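Just as a minimal sketch of that evaluation step: the helper below builds a confusion matrix and computes accuracy from it. The function name `confusion_and_accuracy` and the toy label vectors are my own illustrative assumptions, not part of the tennis data.

```r
# Hypothetical helper: build a confusion matrix and compute accuracy.
# Rows are the actual labels, columns are the predicted labels.
confusion_and_accuracy <- function(actual, predicted, levels = c(0, 1)) {
  # Fixing the factor levels keeps the table square even if one class
  # never appears among the predictions.
  cm <- table(Actual = factor(actual, levels = levels),
              Predicted = factor(predicted, levels = levels))
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Toy example with made-up labels (NOT the tennis data)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)
res <- confusion_and_accuracy(actual, predicted)
print(res$confusion)
print(res$accuracy)
```

The diagonal of the table counts correct predictions, so accuracy is simply the diagonal sum divided by the total number of matches.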

### A/B testing and rule-based prediction

OK, first let's run t-tests as a univariate analysis on each explanatory variable. With this kind of analysis, we expect to end up with some "golden" metrics from which we can define rules to predict outcomes on new data. Below is the structure of the men's dataset.

    > str(dm)
    'data.frame': 491 obs. of 25 variables:
     $ Result: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 2 1 2 ...
     $ FSP.1 : int 61 61 52 53 76 65 68 47 64 77 ...
     $ FSW.1 : int 35 31 53 39 63 51 73 18 26 76 ...
     $ SSP.1 : int 39 39 48 47 24 35 32 53 36 23 ...
     $ SSW.1 : int 18 13 20 24 12 22 24 15 12 11 ...
     $ ACE.1 : int 5 13 8 8 0 9 5 3 3 6 ...
     $ DBF.1 : int 1 1 4 6 4 3 3 4 0 4 ...
     $ WNR.1 : int 17 13 37 8 16 35 41 21 20 6 ...
     $ UFE.1 : int 29 1 50 6 35 41 50 31 39 4 ...
     $ BPC.1 : int 1 7 1 6 3 2 9 6 3 7 ...
     $ BPW.1 : int 3 14 9 9 12 7 17 20 7 24 ...
     $ NPA.1 : int 8 0 16 0 9 6 14 6 5 0 ...
     $ NPW.1 : int 11 0 23 0 13 12 30 9 14 0 ...
     $ FSP.2 : int 68 60 77 50 53 63 60 54 67 60 ...
     $ FSW.2 : int 45 23 57 24 59 60 66 26 42 68 ...
     $ SSP.2 : int 32 40 23 50 47 37 40 46 33 40 ...
     $ SSW.2 : int 17 9 15 19 32 22 34 13 14 25 ...
     $ ACE.2 : int 10 1 9 1 17 24 2 0 12 8 ...
     $ DBF.2 : int 0 4 1 8 11 4 6 11 0 12 ...
     $ WNR.2 : int 40 1 41 1 59 47 57 11 32 8 ...
     $ UFE.2 : int 30 4 41 8 79 45 72 46 20 12 ...
     $ BPC.2 : int 4 0 4 1 3 4 10 2 7 6 ...
     $ BPW.2 : int 8 0 13 7 5 7 17 6 10 14 ...
     $ NPA.2 : int 8 0 12 0 16 14 25 8 8 0 ...
     $ NPW.2 : int 9 0 16 0 28 17 36 12 11 0 ...

In principle, we run a t-test on each pair, such as FSP.1 and FSP.2, one by one. If a test shows a significant difference in means, we take that pair as one of the "golden" metrics and set up a rule-based predictor like the one below.

    > table(dw$Result,ifelse(dw$FSP.1>=dw$FSP.2,1,0))

This is a very simple rule-based predictor that returns 1 (won) if FSP.1 >= FSP.2 and 0 (lost) otherwise. Let's run a series of t-tests.

    > t.test(dm$FSP.1,dm$FSP.2)
            Welch Two Sample t-test
    data:  dm$FSP.1 and dm$FSP.2
    t = 1.2133, df = 978.539, p-value = 0.2253
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.3533561  1.4979590
    sample estimates:
    mean of x mean of y
     61.89613  61.32383

    > t.test(dm$FSW.1,dm$FSW.2)
            Welch Two Sample t-test
    data:  dm$FSW.1 and dm$FSW.2
    t = 0.3966, df = 979.277, p-value = 0.6917
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.640232  2.471189
    sample estimates:
    mean of x mean of y
     49.24236  48.82688

    > t.test(dm$SSP.1,dm$SSP.2)
            Welch Two Sample t-test
    data:  dm$SSP.1 and dm$SSP.2
    t = -1.2133, df = 978.539, p-value = 0.2253
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.4979590  0.3533561
    sample estimates:
    mean of x mean of y
     38.10387  38.67617

    > t.test(dm$SSW.1,dm$SSW.2)
            Welch Two Sample t-test
    data:  dm$SSW.1 and dm$SSW.2
    t = -0.7498, df = 979.997, p-value = 0.4536
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.4661090  0.6555183
    sample estimates:
    mean of x mean of y
     21.41752  21.82281

    > t.test(dm$ACE.1,dm$ACE.2)
            Welch Two Sample t-test
    data:  dm$ACE.1 and dm$ACE.2
    t = 0.1519, df = 979.99, p-value = 0.8793
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.7524564  0.8787293
    sample estimates:
    mean of x mean of y
     9.034623  8.971487

    > t.test(dm$DBF.1,dm$DBF.2)
            Welch Two Sample t-test
    data:  dm$DBF.1 and dm$DBF.2
    t = -0.646, df = 979.909, p-value = 0.5184
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.4769591  0.2407066
    sample estimates:
    mean of x mean of y
     3.926680  4.044807

    > t.test(dm$WNR.1,dm$WNR.2)
            Welch Two Sample t-test
    data:  dm$WNR.1 and dm$WNR.2
    t = 0.1385, df = 978.411, p-value = 0.8899
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.521403  2.904295
    sample estimates:
    mean of x mean of y
     27.30550  27.11405

    > t.test(dm$UFE.1,dm$UFE.2)
            Welch Two Sample t-test
    data:  dm$UFE.1 and dm$UFE.2
    t = -0.5165, df = 978.942, p-value = 0.6056
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.147494  1.835885
    sample estimates:
    mean of x mean of y
     22.97963  23.63544

    > t.test(dm$BPC.1,dm$BPC.2)
            Welch Two Sample t-test
    data:  dm$BPC.1 and dm$BPC.2
    t = 1.2245, df = 979.305, p-value = 0.2211
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.1791842  0.7738889
    sample estimates:
    mean of x mean of y
     5.052953  4.755601

    > t.test(dm$BPW.1,dm$BPW.2)
            Welch Two Sample t-test
    data:  dm$BPW.1 and dm$BPW.2
    t = 0.3754, df = 979.784, p-value = 0.7075
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.5682952  0.8371344
    sample estimates:
    mean of x mean of y
     8.095723  7.961303

    > t.test(dm$NPA.1,dm$NPA.2)
            Welch Two Sample t-test
    data:  dm$NPA.1 and dm$NPA.2
    t = -0.6571, df = 979.895, p-value = 0.5113
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.354531  1.173268
    sample estimates:
    mean of x mean of y
     17.60081  18.19145

    > t.test(dm$NPW.1,dm$NPW.2)
            Welch Two Sample t-test
    data:  dm$NPW.1 and dm$NPW.2
    t = -0.8552, df = 969.859, p-value = 0.3926
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.549798  1.001936
    sample estimates:
    mean of x mean of y
     20.57230  21.34623

Nope, NONE of the t-tests shows a significant difference... ah, well, I guess that's because those were "unpaired" t-tests. OK, let's run them again as "paired" t-tests.

    > t.test(dm$FSP.1-dm$FSP.2)
            One Sample t-test
    data:  dm$FSP.1 - dm$FSP.2
    t = 1.2852, df = 490, p-value = 0.1993
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.3026416  1.4472445
    sample estimates:
    mean of x
    0.5723014

    > t.test(dm$FSW.1-dm$FSW.2)
            One Sample t-test
    data:  dm$FSW.1 - dm$FSW.2
    t = 0.953, df = 490, p-value = 0.3411
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.4411259  1.2720831
    sample estimates:
    mean of x
    0.4154786

    > t.test(dm$SSP.1-dm$SSP.2)
            One Sample t-test
    data:  dm$SSP.1 - dm$SSP.2
    t = -1.2852, df = 490, p-value = 0.1993
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.4472445  0.3026416
    sample estimates:
     mean of x
    -0.5723014

    > t.test(dm$SSW.1-dm$SSW.2)
            One Sample t-test
    data:  dm$SSW.1 - dm$SSW.2
    t = -1.1533, df = 490, p-value = 0.2493
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.0957532  0.2851626
    sample estimates:
     mean of x
    -0.4052953

    > t.test(dm$ACE.1-dm$ACE.2)
            One Sample t-test
    data:  dm$ACE.1 - dm$ACE.2
    t = 0.1696, df = 490, p-value = 0.8654
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.6684912  0.7947642
    sample estimates:
     mean of x
    0.06313646

    > t.test(dm$DBF.1-dm$DBF.2)
            One Sample t-test
    data:  dm$DBF.1 - dm$DBF.2
    t = -0.6925, df = 490, p-value = 0.489
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.4532985  0.2170459
    sample estimates:
     mean of x
    -0.1181263

    > t.test(dm$WNR.1-dm$WNR.2)
            One Sample t-test
    data:  dm$WNR.1 - dm$WNR.2
    t = 0.2815, df = 490, p-value = 0.7784
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.144589  1.527481
    sample estimates:
    mean of x
     0.191446

    > t.test(dm$UFE.1-dm$UFE.2)
            One Sample t-test
    data:  dm$UFE.1 - dm$UFE.2
    t = -1.0618, df = 490, p-value = 0.2888
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.8693327  0.5577237
    sample estimates:
     mean of x
    -0.6558045

    > t.test(dm$BPC.1-dm$BPC.2)
            One Sample t-test
    data:  dm$BPC.1 - dm$BPC.2
    t = 1.3144, df = 490, p-value = 0.1893
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.1471336  0.7418383
    sample estimates:
    mean of x
    0.2973523

    > t.test(dm$BPW.1-dm$BPW.2)
            One Sample t-test
    data:  dm$BPW.1 - dm$BPW.2
    t = 0.4239, df = 490, p-value = 0.6718
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -0.4886140  0.7574531
    sample estimates:
    mean of x
    0.1344196

    > t.test(dm$NPA.1-dm$NPA.2)
            One Sample t-test
    data:  dm$NPA.1 - dm$NPA.2
    t = -1.0838, df = 490, p-value = 0.279
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.6613778  0.4801151
    sample estimates:
     mean of x
    -0.5906314

    > t.test(dm$NPW.1-dm$NPW.2)
            One Sample t-test
    data:  dm$NPW.1 - dm$NPW.2
    t = -1.2888, df = 490, p-value = 0.1981
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.9538517  0.4059902
    sample estimates:
     mean of x
    -0.7739308
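By the way, typing each test by hand gets tedious. Here is a minimal sketch of how the whole series could be run in one go, assuming the ".1"/".2" column naming used in this post. The helper name `pairwise_t_tests` and the simulated `toy` data frame are my own illustrative assumptions, not the real men's dataset. It also shows that a one-sample t-test on the differences is the same thing as `t.test(x, y, paired = TRUE)`.

```r
# Hypothetical helper: run Welch and paired t-tests for every ".1"/".2" pair.
# Assumes column names such as "FSP.1" and "FSP.2"; other columns are ignored.
pairwise_t_tests <- function(df) {
  metrics <- sub("\\.1$", "", grep("\\.1$", names(df), value = TRUE))
  rows <- lapply(metrics, function(m) {
    x <- df[[paste0(m, ".1")]]
    y <- df[[paste0(m, ".2")]]
    data.frame(metric   = m,
               p_welch  = t.test(x, y)$p.value,
               p_paired = t.test(x, y, paired = TRUE)$p.value,
               # One-sample test on the differences: identical to the paired test
               p_diff   = t.test(x - y)$p.value)
  })
  do.call(rbind, rows)
}

# Simulated stand-in for the men's data (NOT the real dataset)
set.seed(1)
toy <- data.frame(FSP.1 = rnorm(100, 60, 8), FSP.2 = rnorm(100, 60, 8),
                  ACE.1 = rpois(100, 9),     ACE.2 = rpois(100, 9))
out <- pairwise_t_tests(toy)
print(out)
```

With the real data you would call `pairwise_t_tests(dm)` and sort by `p_paired` to find the pair with the lowest p-value.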

OMG, even with "paired" t-tests, no significant difference in means appeared. This result is not so surprising: see the plot below, which just shows the mean of each metric with its standard deviation as an error bar.

Almost all metrics show error bars that are far too large. :( Just for your information, I tried building a rule-based predictor with the pair of metrics showing the lowest p-value, BPC.1 and BPC.2.

    > table(dw$Result,ifelse(dw$BPC.1>=dw$BPC.2,1,0))
          0   1
      0 202  25
      1  21 204
    > sum(diag(table(dw$Result,ifelse(dw$BPC.1>=dw$BPC.2,1,0))))/nrow(dw)
    [1] 0.8982301 # Accuracy 89.8%...

It appears that even metrics with no significant difference in means can predict the women's results to some extent... so do you really want to conclude from the t-tests that these match stats are useless for predicting match results?
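To see how a metric can fail the t-test and still predict well, here is a small simulation sketch. It is entirely made-up data under my own assumptions: the difference `d` between the two players is symmetric around zero (as if it were arbitrary which player is listed first), yet the player with the larger value usually wins.

```r
set.seed(42)
n_half <- 250
z <- abs(rnorm(n_half))
# Symmetric by construction, so the mean of d is (numerically) zero.
d <- c(z, -z)
# The player with the larger value usually wins; noise flips some results.
result <- ifelse(d + rnorm(2 * n_half, sd = 0.3) > 0, 1, 0)

p_value  <- t.test(d)$p.value        # tests whether mean(d) differs from 0
pred     <- ifelse(d >= 0, 1, 0)     # same kind of rule as used in this post
accuracy <- mean(pred == result)

print(p_value)
print(accuracy)
```

In this sketch the t-test has nothing to detect, because it only asks whether the mean difference is zero and never looks at the result, while the simple rule still predicts most outcomes.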

### Multivariate modeling

In short, I don't think so. I know that multivariate modeling works well in such a case. Below are some examples.

    # L1-penalized logistic regression
    > library(glmnet)
    > dm.cv.glmnet<-cv.glmnet(as.matrix(dm[,-1]),as.matrix(dm[,1]),family="binomial",alpha=1)
    > coef(dm.cv.glmnet,s=dm.cv.glmnet$lambda.min)
    25 x 1 sparse Matrix of class "dgCMatrix"
                            1
    (Intercept)  1.507835e-01
    FSP.1        4.779216e-02
    FSW.1        1.252779e-01
    SSP.1       -4.022443e-05
    SSW.1        1.629827e-01
    ACE.1        .
    DBF.1       -9.460367e-02
    WNR.1        3.979866e-02
    UFE.1       -7.996179e-03
    BPC.1        3.731964e-01 # Best parameter!
    BPW.1        2.176386e-01
    NPA.1        .
    NPW.1        .
    FSP.2       -3.429355e-02
    FSW.2       -1.680302e-01
    SSP.2        .
    SSW.2       -1.451930e-01
    ACE.2        1.487694e-02
    DBF.2        4.696238e-02
    WNR.2       -2.227043e-02
    UFE.2       -1.778775e-03
    BPC.2       -3.599556e-01 # Best parameter!
    BPW.2       -2.105379e-01
    NPA.2        .
    NPW.2        1.424483e-02
    > table(dw$Result,round(predict(dm.cv.glmnet,newx=as.matrix(dw[,-1]),type='response',s=dm.cv.glmnet$lambda.min),0))
          0   1
      0 216  11
      1  19 206
    > sum(diag(table(dw$Result,round(predict(dm.cv.glmnet,newx=as.matrix(dw[,-1]),type='response',s=dm.cv.glmnet$lambda.min),0))))/nrow(dw)
    [1] 0.9336283 # Accuracy 93.4%

Yeah, the L1-penalized logistic regression shows that the "BPC.1/BPC.2" parameters have the largest coefficients and are the most influential on the results. But I'm not sure whether this model is the best.

    # Linear SVM (not Gaussian kernel SVM!)
    > library(e1071)
    > dm$Result<-as.factor(dm$Result)
    > dw$Result<-as.factor(dw$Result)
    > dm.tune<-tune.svm(Result~.,data=dm,kernel='linear')
    > dm.tune$best.model

    Call:
    best.svm(x = Result ~ ., data = dm, kernel = "linear")

    Parameters:
       SVM-Type:  C-classification
     SVM-Kernel:  linear
           cost:  1
          gamma:  0.04166667

    Number of Support Vectors:  97

    > dm.linear.svm<-svm(Result~.,dm,kernel='linear',cost=dm.tune$best.model$cost,gamma=dm.tune$best.model$gamma)
    > table(dw$Result,predict(dm.linear.svm,newdata=dw[,-1]))
          0   1
      0 214  13
      1  16 209
    > sum(diag(table(dw$Result,predict(dm.linear.svm,newdata=dw[,-1]))))/nrow(dw)
    [1] 0.9358407 # Accuracy 93.6% !!!

Actually, I had already tried a wide variety of machine learning classifiers, and this one was the best model for this tennis dataset. :P

### Conclusions

The results tell us that univariate stats, and rule-based predictors derived from the usual hypothesis tests on them, sometimes fail, while multivariate modeling with (generalized) linear models or machine learning classifiers works well.

In general, multivariate features capture more complex information and more of the internal structure of a dataset than univariate features do. But in many marketing situations, quite a few people neglect the importance of multivariate information and persist in running univariate A/B tests, looking for "golden features" or "golden metrics".

Even when features are only "partially" correlated, such univariate A/B testing can go wrong, because partial correlations easily distort the usual univariate correlations (and univariate tests as well).

If you have a multivariate dataset, please try multivariate modeling rather than persisting in univariate A/B testing.
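Here is a small simulation sketch of that point, using made-up data under my own assumptions (nothing here comes from the tennis dataset). Two features `x1` and `x2` share a large common component, and the outcome depends only on their difference. Each feature alone looks nearly useless, but a plain logistic regression with base R's `glm` on both features predicts well.

```r
set.seed(123)
n <- 2000
shared <- rnorm(n, sd = 3)           # component common to both features
x1 <- shared + rnorm(n)
x2 <- shared + rnorm(n)
# The outcome depends on the difference, which each feature alone barely reveals
y <- as.integer(x1 - x2 + rnorm(n, sd = 0.5) > 0)
sim   <- data.frame(y = y, x1 = x1, x2 = x2)
train <- sim[1:1000, ]               # plays the role of the men's data
test  <- sim[1001:2000, ]            # plays the role of the women's data

# Univariate view: logistic regression on x1 only
fit_uni <- glm(y ~ x1, data = train, family = binomial)
acc_uni <- mean((predict(fit_uni, test, type = "response") > 0.5) == test$y)

# Multivariate view: both features together
fit_multi <- glm(y ~ x1 + x2, data = train, family = binomial)
acc_multi <- mean((predict(fit_multi, test, type = "response") > 0.5) == test$y)

print(c(univariate = acc_uni, multivariate = acc_multi))
```

The marginal correlation between `x1` and the outcome is weak because the shared component swamps it, but once `x2` is held fixed, `x1` becomes highly informative. That is the partial-correlation effect a univariate test cannot see.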