2018-05-22

Undersampling + bagging = better generalized classification for imbalanced dataset

R machine learning

This post is reproduced from a post on my Japanese blog.

A friend of mine, an academic researcher in the machine learning field, tweeted the following.

imbalanced data に対する対処を勉強していたのだけど，[Wallace et al. ICDM'11] https://t.co/ltQ942lKPm … で「undersampling + bagging をせよ」という結論が出ていた．
— ™ (@tmaehara) July 29, 2017

(Translation: I was studying how to handle imbalanced data, and found that [Wallace et al. ICDM'11] concluded that you should do "undersampling + bagging".)

In another post on my Japanese blog, I discussed how to handle imbalanced data with "class weight", in which the cost of negative samples in the loss function is reduced according to the ratio of negative to positive samples.
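To make the "class weight" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the function name `log_loss`, the 10-vs-990 class counts, and the choice of `n_pos / n_neg` as the weight on negative samples are all made up for this example and are not taken from the earlier post.

```python
import math

def log_loss(y_true, p_pred, neg_weight=1.0):
    """Weighted binary cross-entropy.

    Each negative sample's loss is multiplied by neg_weight
    (positives keep weight 1), then the weighted mean is returned.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() away from 0
        if y == 1:
            w, loss = 1.0, -math.log(p)
        else:
            w, loss = neg_weight, -math.log(1 - p)
        total += w * loss
        weight_sum += w
    return total / weight_sum

# Hypothetical imbalanced labels: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Assumed weighting: shrink the negative class so both classes
# contribute the same total weight to the loss.
neg_weight = n_pos / n_neg

# A lazy model that says "almost surely negative" for every sample.
lazy_pred = [0.01] * len(y_true)
unweighted = log_loss(y_true, lazy_pred)
weighted = log_loss(y_true, lazy_pred, neg_weight)
print(unweighted, weighted)
```

With the unweighted loss the lazy "always negative" model looks good, while the weighted loss penalizes it heavily for missing the positives, which is the point of class weighting.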

However, I thought "undersampling + bagging" would work better, so I try it here. Please note that I only used randomForest {randomForest} in R for this trial, for simplicity and to keep the computational cost low. If you're interested in any other classifier, including deep NNs, please try it yourself :P
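For readers who want to see the bare idea of "undersampling + bagging", below is a rough sketch in plain Python. It is not the randomForest-based experiment of this post: the base learner is a toy nearest-centroid classifier chosen only to keep the example dependency-free, and the data, function names and parameters (`undersample`, `fit_bagged`, `n_models`, and so on) are all hypothetical.

```python
import random

def undersample(X, y, rng):
    """Keep every minority sample plus an equally sized random
    subset of the majority class (sampled without replacement)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    idx = minority + rng.sample(majority, len(minority))
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_centroids(X, y):
    """Toy base learner (an assumption): one centroid per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist2(centroids[1]) < dist2(centroids[0]) else 0

def fit_bagged(X, y, n_models=25, seed=0):
    """Train one base learner per independently undersampled subset."""
    rng = random.Random(seed)
    return [fit_centroids(*undersample(X, y, rng)) for _ in range(n_models)]

def predict_bagged(models, x):
    """Majority vote over the ensemble."""
    votes = sum(predict_one(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0

# Hypothetical imbalanced data: 500 negatives around (0, 0),
# 25 positives around (3, 3).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(25)]
y = [0] * 500 + [1] * 25

models = fit_bagged(X, y)
print(predict_bagged(models, [4.0, 4.0]), predict_bagged(models, [-1.0, -1.0]))
```

Each model sees a balanced subset, so no single model is dominated by the majority class, while the vote across many different subsets makes use of most of the majority samples that any single undersampled set would throw away.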

Note

If you use Python, there is already a good package for "undersampling + bagging".

2018-05-14

What kinds of mathematics are needed if you want to learn machine learning

Python machine learning

This post is a reproduced version of a post on my Japanese blog.

For years, many beginners in machine learning have asked me questions such as "Do I have to learn mathematics? What kind? To what extent?", and I've sometimes found it very hard to answer in a few words. Fortunately, I learned linear algebra and calculus as a student in an engineering department, so they are very familiar to me and useful for understanding the theoretical aspects of machine learning.

But recently more and more beginners are rushing into the machine learning or "AI" field to find more opportunities or even jobs. As far as I've seen, some of them have never learned university-level mathematics, although ML requires it. Unfortunately, most of them graduated from university years ago and have little opportunity to learn mathematics in a classroom. I think they need some guidelines on which mathematics to learn in order to understand machine learning.

In this post, I'd like to review, for such beginners, a few kinds of mathematics that may be required for understanding modern machine learning. One disclaimer: I'm by no means a mathematical expert, so there may be mathematical errors in this post. If you find any, don't hesitate to let me know!

2017-01-13

In Japan, now "Artificial Intelligence" comes to be a super star, while "Data Scientist" has been forgotten

Japan data scientist data science business

Almost two years ago, I wrote a post about the situation of "Data Scientist" and "Artificial Intelligence" at that time.

Two years on, what is happening now, and what do we see? Below is a summary of the current situation of data science, data scientists, artificial intelligence and related topics in Japan.

Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English
