Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Undersampling + bagging = better generalized classification for imbalanced dataset

This post is reproduced from a post on my Japanese blog.

A friend of mine, an academic researcher in the machine learning field, tweeted as below.

I've been studying how to handle imbalanced data, and I found that Wallace et al. (ICDM'11) concluded that you should do "undersampling + bagging".

In another post on my Japanese blog, I discussed how to handle imbalanced data with "class weight", in which the cost of the negative samples in the loss function is reduced according to the ratio of negative to positive samples.
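As a rough sketch (this is a generic form of class-weighted loss, not the exact objective used by randomForest or any other particular library), the idea can be written as

\[
L(\theta) \;=\; \sum_{i:\,y_i=1} \ell\bigl(f_\theta(x_i), y_i\bigr) \;+\; \frac{N_+}{N_-} \sum_{i:\,y_i=0} \ell\bigl(f_\theta(x_i), y_i\bigr),
\]

where \(N_+\) and \(N_-\) are the numbers of positive and negative samples. Down-weighting the negatives by \(N_+/N_-\) is equivalent, up to an overall constant, to up-weighting the positives by \(N_-/N_+\), which is the ratio given to "classwt" in the code below.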

However, I thought "undersampling + bagging" would work better, so I decided to try it here. Please note that I only used randomForest {randomForest} in R for this trial, just for simplicity and to keep the computational cost low. If you're interested in any other classifier, including deep NNs, please try it yourself :P)

Note

If you use Python, there is already a good package for "undersampling + bagging".

Dataset


I prepared a dataset with 250 positive and 3750 negative samples. Please get it from my GitHub repository below.

First, import it as a data frame "d". In addition, let's create a grid to draw decision boundaries.
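As a minimal sketch of the import step (the file name "imbalanced_data.csv" is hypothetical; use the actual file from the repository, and switch to read.table if it is not comma-separated):

> # hypothetical file name; replace it with the actual file from the repository
> d <- read.csv("imbalanced_data.csv", header=TRUE)
> dim(d) # should be 4000 rows x 3 columns: x, y and label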

> px <- seq(-4,4,0.05)
> py <- seq(-4,4,0.05)
> pgrid <- expand.grid(px, py)
> names(pgrid) <- names(d)[-3]

OK, let's proceed.


Class weight


If you use randomForest, all you have to do is give the ratio of negative to positive samples (here 3750 / 250) as the weight of the positive class via the "classwt" argument.

> d.rf <- randomForest(as.factor(label)~., d, classwt=c(1, 3750/250))
> out.rf <- predict(d.rf, newdata=pgrid)
> plot(d[,-3], col=d[,3]+1, xlim=c(-4,4), ylim=c(-4,4), cex=0.5, pch=19)
> par(new=T)
> contour(px, py, array(out.rf, c(length(px), length(py))), levels=0.5, col='purple', lwd=5, drawlabels=F)

[Figure: decision boundary obtained with "classwt"]

The result was as I expected: a decision boundary that is a little more expanded than the original (unweighted) one.


Undersampling + bagging


First, I tried bagging with 10 sub-classifiers. Please accept my dirty code :P)

> outbag.rf <- c()
> for (i in 1:10){
+     set.seed(i)
+     # undersample 250 of the 3750 negative samples (rows 1-3750)
+     train0 <- d[sample(3750, 250, replace=F),]
+     # use all 250 positive samples (rows 3751-4000)
+     train1 <- d[3751:4000,]
+     train <- rbind(train0, train1)
+     model <- randomForest(as.factor(label)~., train)
+     tmp <- predict(model, newdata=pgrid)
+     outbag.rf <- cbind(outbag.rf, tmp)
+ }
> # average the sub-classifiers' votes (factor codes 1/2) and shift them into [0, 1]
> outbag.rf.grid <- apply(outbag.rf, 1, mean)-1
> plot(d[,-3], col=d[,3]+1, xlim=c(-4,4), ylim=c(-4,4), cex=0.5, pch=19)
> par(new=T)
> contour(px, py, array(outbag.rf.grid, c(length(px), length(py))), levels=0.5, col='purple', lwd=5, drawlabels=F)

[Figure: undersampling + bagging with 10 sub-classifiers]

It looks like the classification region of the positive samples expanded a little. How about 50 sub-classifiers?
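The code for 50 and 100 sub-classifiers is the same as above except for the loop count. Just as a sketch, the loop could be wrapped into a function (undersample_bag() is a hypothetical helper, not something from the original post) so that only the number of bags changes:

> # hypothetical helper: undersampling + bagging with n_bags sub-classifiers;
> # neg_idx / pos_idx are the row indices of the negative / positive samples
> undersample_bag <- function(data, grid, n_bags, neg_idx, pos_idx) {
+     votes <- c()
+     for (i in 1:n_bags) {
+         set.seed(i)
+         train0 <- data[sample(neg_idx, length(pos_idx), replace=F),] # undersampled negatives
+         train1 <- data[pos_idx,] # all positives
+         model <- randomForest(as.factor(label)~., rbind(train0, train1))
+         votes <- cbind(votes, predict(model, newdata=grid))
+     }
+     apply(votes, 1, mean) - 1 # average of the factor codes (1/2), shifted into [0, 1]
+ }
> out50.rf.grid <- undersample_bag(d, pgrid, 50, 1:3750, 3751:4000)
> # then plot d and draw the contour of out50.rf.grid at level 0.5, as before

With levels=0.5, the contour corresponds to the majority vote of the sub-classifiers.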

[Figure: undersampling + bagging with 50 sub-classifiers]

It appears that the jagged parts were reduced. OK, let's try 100 sub-classifiers.

[Figure: undersampling + bagging with 100 sub-classifiers]

Got it; this is how it looks once all 100 runs finished.


Advanced: positive samples embedded in the negative samples


The dataset above was generated by the script below.

> # three broad clusters of 1,000 negative samples each
> set.seed(1001)
> x1 <- cbind(rnorm(1000, 1, 1), rnorm(1000, 1, 1))
> set.seed(1002)
> x2 <- cbind(rnorm(1000, -1, 1), rnorm(1000, 1, 1))
> set.seed(1003)
> x3 <- cbind(rnorm(1000, -1, 1), rnorm(1000, -1, 1))
> # four small clusters of 250 samples each in the 4th quadrant
> set.seed(4001)
> x41 <- cbind(rnorm(250, 0.5, 0.5), rnorm(250, -0.5, 0.5))
> set.seed(4002)
> x42 <- cbind(rnorm(250, 1, 0.5), rnorm(250, -0.5, 0.5))
> set.seed(4003)
> x43 <- cbind(rnorm(250, 0.5, 0.5), rnorm(250, -1, 0.5))
> set.seed(4004)
> x44 <- cbind(rnorm(250, 1, 0.5), rnorm(250, -1, 0.5))
> # rows 1-3750 (x1-x3 and x41-x43) are negative; rows 3751-4000 (x44) are positive
> d <- rbind(x1,x2,x3,x41,x42,x43,x44)
> d <- data.frame(x = d[,1], y = d[,2], label=c(rep(0, 3750), rep(1, 250)))

In short, the positive samples in this dataset are concentrated in the lower-right part of the 4th quadrant, so I think it's natural that we got the result above. Thus, I tried to move the positive label further inside, so that the positive samples are embedded in the negative ones.

> d1 <- d
> # relabel: now rows 3001-3250 (x41, centered near (0.5, -0.5)) are the positive class
> d1$label <- c(rep(0,3000), rep(1,250), rep(0,750))

First, let's try "classwt".

> d1.rf <- randomForest(as.factor(label)~., d1, classwt=c(1, 3750/250))
> out.rf <- predict(d1.rf, newdata=pgrid)
> plot(d1[,-3], col=d1[,3]+1, xlim=c(-4,4), ylim=c(-4,4), cex=0.5, pch=19)
> par(new=T)
> contour(px, py, array(out.rf, c(length(px), length(py))), levels=0.5, col='purple', lwd=5, drawlabels=F)

[Figure: "classwt" on the embedded dataset]

OMG... what a tiny decision boundary; it even looks like overfitting. OK, let's try "undersampling + bagging". For simplicity, I used 100 sub-classifiers here.

> outbag.rf <- c()
> for (i in 1:100){
+     set.seed(i)
+     # undersample 250 of the 3750 negative samples
+     train.tmp <- d1[d1$label==0,]
+     train0 <- train.tmp[sample(3750, 250, replace=F),]
+     # rows 3001-3250 are now the positive samples
+     train1 <- d1[3001:3250,]
+     train <- rbind(train0, train1)
+     model <- randomForest(as.factor(label)~., train)
+     tmp <- predict(model, newdata=pgrid)
+     outbag.rf <- cbind(outbag.rf, tmp)
+ }
> outbag.rf.grid <- apply(outbag.rf, 1, mean)-1
> plot(d1[,-3], col=d1[,3]+1, xlim=c(-4,4), ylim=c(-4,4), cex=0.5, pch=19)
> par(new=T)
> contour(px, py, array(outbag.rf.grid, c(length(px), length(py))), levels=0.5, col='purple', lwd=5, drawlabels=F)

[Figure: undersampling + bagging on the embedded dataset, 100 sub-classifiers]

Yes, I did it. You can see a broad decision boundary in the 4th quadrant near (0, 0), which looks a little more generalized than the "classwt" one.


My comment


These two cases clearly showed that "undersampling + bagging" does better than "class weight" at producing a well-generalized decision boundary. To tell the truth, I was rather shocked because "classwt" caused overfitting-like boundaries in both of these cases...