Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Simple analytics works fast, but cannot avoid third-party effects

(The original posts in Japanese version are here and here )

In Japan, from my own experience, there is something of a dichotomy between "analytics" and "data science". Real business matters are said to require rapid analyses and rapid actions, so people usually prefer simple and rapid analytic work to data science work, which is time-consuming and demands a lot of expertise. Consequently, quite a few companies like to hire "analysts" as analytic experts and have them run a rapid analysis on each business project.

For example, some of my previous colleagues loved the kind of simple analytics that merely describes which UI component is good for KPIs. Imagine you have to set an order of priority on the UI/UX components of a web service, and you have a data frame of a conversion (CV) flag and UI component flags with 0/1 values, as below.

a1 a2 a3 a4 a5 a6 a7 cv
1 1 1 0 1 1 0 Yes
0 1 0 1 0 0 0 No
0 0 0 1 1 1 0 Yes
1 0 0 1 1 1 0 Yes
0 0 1 1 0 0 1 No
... ... ... ... ... ... ... ...

Simple analytics lovers often compute the table below and conclude:

a1 a2 a3 a4 a5 a6 a7 CV
40.1% 58.3% 47.9% 94.2% 30.7% 5.6% 50.0% No
60.5% 41.7% 49.4% 43.6% 68.4% 92.7% 49.3% Yes
20.3% -16.6% 1.5% 50.6% 37.7% 87.1% -0.7% Yes - No

"a1, a3, a5 and a6 increase CV (because they're positive), but a2, a4 and a7 decrease CV (because negative), as a priority order"

This is the result of very, very simple analytics: they just compute the ratio (percentage) of each flag within the Yes and No groups, take the difference, and show it to you. Yes, it looks somewhat plausible... but is it really OK?
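For readers who want to reproduce the recipe, here is a minimal Python sketch (not the tool the analysts actually used; the toy rows are made up to mirror the table above):

```python
# A toy version of the data frame above: each row is one user session
# with flags a1..a7 plus a conversion label. The rows here are made up
# for illustration. The "simple analytics" recipe is just: for each
# flag, compare its mean rate among converters vs non-converters.

rows = [
    # a1 a2 a3 a4 a5 a6 a7  cv
    (1, 1, 1, 0, 1, 1, 0, "Yes"),
    (0, 1, 0, 1, 0, 0, 0, "No"),
    (0, 0, 0, 1, 1, 1, 0, "Yes"),
    (1, 0, 0, 1, 1, 1, 0, "Yes"),
    (0, 0, 1, 1, 0, 0, 1, "No"),
]

def flag_rates(rows, cv_label):
    """Mean of each flag over the rows with the given cv label."""
    subset = [r for r in rows if r[7] == cv_label]
    return [sum(r[j] for r in subset) / len(subset) for j in range(7)]

yes_rates = flag_rates(rows, "Yes")
no_rates = flag_rates(rows, "No")
diff = [y - n for y, n in zip(yes_rates, no_rates)]  # the "Yes - No" row
```

The `diff` list is exactly the "Yes - No" row of the table: a positive entry is read as "this component increases CV", a negative one as "it decreases CV".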

Partial correlation reveals third-party effects

In general, the data above is multivariate data, and multivariate data usually contains partial correlations. We can get partial correlation values with the pcor(){ppcor} function in R. Below is the result; here we have to focus on the partial correlation between a7 and cv.

a1 a2 a3 a4 a5 a6 a7
a2 -0.027
a3 -0.005 0.003
a4 -0.005 0.012 0.019
a5 0.027 -0.007 -0.025 0.015
a6 -0.031 -0.013 0.018 -0.020 0.015
a7 0.006 -0.029 -0.031 0.007 -0.003 -0.011
cv 0.112 -0.059 0.003 -0.284 0.176 0.807 0.006
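If you prefer Python to R's pcor(){ppcor}, the partial correlation matrix can be computed from the inverse of the correlation matrix. Below is a minimal sketch; the variables x1, x2 and z and their relationships are my own made-up example of a third-party effect, not the data above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# z is a hidden driver: x1 and x2 both load on it, so they show a
# strong *marginal* correlation even though, given z, they are
# independent. This is a toy stand-in for a "third-party effect".
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)
x2 = z + rng.normal(size=n)
X = np.column_stack([x1, x2, z])

def partial_corr(X):
    """Partial correlation matrix via the inverse correlation matrix."""
    prec = np.linalg.inv(np.corrcoef(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

marginal = np.corrcoef(X, rowvar=False)
partial = partial_corr(X)
# marginal[0, 1] is around 0.5, while partial[0, 1] (x1 vs x2,
# controlling for z) is near zero.
```

Comparing `marginal` and `partial` shows the point of this section: the x1-x2 relationship that looks strong marginally vanishes once the third variable is controlled for.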

Partial correlation shows that the value between a7 and cv is positive. Wait, isn't that the opposite of the result given by simple analytics above??? ...Yes, exactly. Actually, I manipulated this artificial data so that a part of a7 strongly correlates with a part of a6 under the "Yes" condition; that means the relation between a7 and cv is skewed by this localized correlation between a6 and a7, which can be called a "third-party effect".

This result was produced by my arbitrary intervention on a specific part of the explanatory variables, but I know such curious situations easily occur with various kinds of marketing data in the real world. The situation is not so rare, because combinations of explanatory variables get very complicated in micro-marketing with behavioral data from action logs. It's no wonder that parts of some explanatory variables accidentally correlate with parts of others. That is one of the representative features of multivariate data: one variable easily affects another.

You may say that the difference between the skewed, incorrect value of a7 and the correct value is very slight. Yes, I agree; but in many business scenes we require not only the values themselves but also an order of priority, even if some of the values are not statistically significant. We cannot ignore even such a slight difference in some cases.

Indeed, I once saw the same situation in real data in my previous job, and I was convinced there were some third-party effects... but nobody (none of the analytics lovers) cared, I remember.

Multivariate statistics can avoid it

Now we know there may be third-party effects in multivariate data. But running a partial correlation looks annoying, and its result, a matrix, is hard to read. Is there any better way to avoid or correctly evaluate third-party effects?

Yes, we have one: multivariate statistics, including all kinds of linear multiple regression and generalized linear models (GLMs). They can avoid such third-party effects much more easily. In the case above, logistic regression (the logit model) should be the best choice. Logistic regression obtains its regression coefficients by maximizing the log-likelihood below:

\log L(\beta_0, \beta_1, \cdots, \beta_m) = \sum_{i=1}^{n} \left\{ Y_i \log F\left(\beta_0 + \sum_{j=1}^{m} \beta_j X_{ij}\right) + (1 - Y_i) \log \left[ 1 - F\left(\beta_0 + \sum_{j=1}^{m} \beta_j X_{ij}\right) \right] \right\}

Solving the simultaneous partial derivative equations below yields the regression coefficients \beta_0, \beta_1, \cdots, \beta_m,

\frac{\partial \log L (\beta_0, \beta_1, \cdots, \beta_m)}{\partial \beta_0} = 0
\frac{\partial \log L (\beta_0, \beta_1, \cdots, \beta_m)}{\partial \beta_1} = 0
\vdots
\frac{\partial \log L (\beta_0, \beta_1, \cdots, \beta_m)}{\partial \beta_m} = 0

(F is the cumulative distribution function of the logistic distribution)

as a lot of statistical texts describe. Of course it can be solved in R or Python with a numerical solution for the MLE, or in any other statistical framework, including MCMC*1. Below is an example run with the glm(){stats} function in R.

> d.glm <- glm(cv~., d, family="binomial")
> summary(d.glm)

glm(formula = cv ~ ., family = "binomial", data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6404  -0.2242  -0.0358   0.2162   3.1418  

            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.37793    0.25979  -5.304 1.13e-07 ***
a1           1.05846    0.17344   6.103 1.04e-09 ***
a2          -0.54914    0.16752  -3.278  0.00105 ** 
a3           0.12035    0.16803   0.716  0.47386    
a4          -3.00110    0.21653 -13.860  < 2e-16 ***
a5           1.53098    0.17349   8.824  < 2e-16 ***
a6           5.33547    0.19191  27.802  < 2e-16 ***
a7           0.07811    0.16725   0.467  0.64048 # <- See here!
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4158.9  on 2999  degrees of freedom
Residual deviance: 1044.4  on 2992  degrees of freedom
AIC: 1060.4

Number of Fisher Scoring iterations: 7
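The same logic can be reproduced from scratch. Below is a Python sketch using my own simulation, not the data set above: it builds an a7 that merely co-occurs with a6, then maximizes the log-likelihood from the equations above by plain gradient descent (a crude stand-in for R's Fisher scoring). The naive group comparison makes a7 look predictive, while the regression coefficient for a7, adjusted for a6, stays near zero:

```python
import math
import random

random.seed(42)

# Simulated data, not the data set above: cv follows a logit model
# with a6 only (logit(p) = -1 + 2*a6), while a7 merely co-occurs
# with a6 and has no effect of its own: a built-in third-party effect.
n = 1000
data = []
for _ in range(n):
    a6 = 1 if random.random() < 0.5 else 0
    a7 = 1 if random.random() < (0.7 if a6 else 0.3) else 0
    p = 1 / (1 + math.exp(-(-1 + 2 * a6)))
    cv = 1 if random.random() < p else 0
    data.append((a6, a7, cv))

# The "simple analytics" number for a7: its rate among Yes minus among No.
yes = [r for r in data if r[2] == 1]
no = [r for r in data if r[2] == 0]
naive_diff = sum(r[1] for r in yes) / len(yes) - sum(r[1] for r in no) / len(no)

def mean_grad(beta):
    """Average gradient of the negative log-likelihood given above."""
    g = [0.0, 0.0, 0.0]
    for a6, a7, cv in data:
        x = (1.0, a6, a7)
        eta = sum(b * xi for b, xi in zip(beta, x))
        p = 1 / (1 + math.exp(-eta))
        for j in range(3):
            g[j] -= (cv - p) * x[j] / len(data)
    return g

# Plain gradient descent on the convex negative log-likelihood.
beta = [0.0, 0.0, 0.0]  # intercept, a6, a7
for _ in range(1000):
    g = mean_grad(beta)
    beta = [b - 1.0 * gj for b, gj in zip(beta, g)]

# naive_diff comes out clearly positive, yet beta[2] (the a7
# coefficient) is adjusted for a6 and stays near zero.
```

This is exactly the a7 story from the tables above, reproduced in a controlled simulation: simple analytics reports a spurious effect, the regression does not.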

What I want to emphasize here is that we use partial derivatives to solve generalized linear models such as logistic regression (and other models, including the normal linear model). This is very important, because the coefficients are estimated jointly in the system of partial derivative equations, so each coefficient is adjusted for all the others and free from third-party effects*2. In principle, generalized linear models and/or other multivariate statistics can avoid and/or exclude such third-party effects.

In essence, any multivariate data can include third-party effects: univariate statistics, like the simple analytics shown at the top of this post*3, cannot avoid them; only multivariate statistics can.

Unfortunately, simple analytics lovers are usually unwilling to use such multivariate statistics, because most of them consider learning statistics hard and time-consuming. But once you omit multivariate statistics, you can fall into the pitfall of third-party effects at any time.

A trade-off between simple analytics and multivariate statistics

From the viewpoint of business solutions, the relationship between simple analytics and multivariate statistics is just a trade-off.

Simple analytics is really simple, fast, and easy to learn and implement. You can easily run it in Excel or any other BI or common analytic tool. On the other hand, multivariate statistics is comparatively complicated, a little slow, and hard for typical business people to learn (and implement).

In several missions, I've seen the dichotomy stated above: a lot of marketers using only simple analytics struggle with enormous data and report numerous results daily or hourly, but sometimes incorrectly; versus a few data scientists or statisticians using the full coverage of multivariate statistics or machine learning, who handle the data and report... weekly or every 3 days, but correctly in terms of data science.

I know it's a question without an answer, but I believe that at least we have to set a priority: fast or slow, sometimes incorrect or precisely correct. I myself prefer the latter, as a data scientist.


In the original Japanese post, I visualized how a1, ..., a7 correlate with "Yes" or "No" with a graphical representation of association rules, using {arules} and {arulesViz} in R.


I drew this graph with the Fruchterman-Reingold algorithm, so the nearer two nodes are, the more they (really) correlate with each other. You can easily see that a6 correlates with "Yes" the most, and also that a7 is actually rather near to "Yes", not "No".
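The association-rule graph itself is not reproduced here, but a force-directed picture of this kind can also be sketched in Python with networkx, whose spring_layout implements the Fruchterman-Reingold algorithm. The nodes and edge weights below are invented for illustration, not the real {arules} output:

```python
import networkx as nx

# Hypothetical co-occurrence graph: flag nodes plus the two cv outcomes,
# with made-up association strengths as edge weights.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a6", "Yes", 5.0), ("a7", "a6", 3.0), ("a5", "Yes", 2.0),
    ("a1", "Yes", 1.5), ("a4", "No", 3.0), ("a2", "No", 1.0),
    ("a3", "Yes", 0.5), ("a7", "Yes", 0.5),
])

# Fruchterman-Reingold layout: heavier (more associated) edges pull
# their endpoints closer together.
pos = nx.spring_layout(G, weight="weight", seed=42)
```

Plotting `pos` (for example with nx.draw) gives the same kind of picture: a7 sits near a6 and hence near "Yes", despite its weak direct link.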

I believe this kind of visualization also helps you understand how they*4 really correlate with each other.

*1: These simultaneous equations cannot be solved analytically because the function F is too complicated; they are solved mainly by numerical methods

*2: But of course only if it's free from multicollinearity

*3: Simple one-by-one comparisons using a simple average on each category

*4:Both explanatory and dependent variables