Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

machine learning

Undersampling + bagging = better generalized classification for imbalanced dataset

This post is reproduced from a post on my Japanese blog. A friend of mine, an academic researcher in the machine learning field, tweeted as below: "I've been studying how to handle imbalanced data, and [Wallace et al. ICDM'11] https://t.co/ltQ942lKP…
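As a rough illustration of the idea (not the exact code from Wallace et al. or from the original post), a minimal R sketch of undersampling + bagging could look like the following, assuming a hypothetical data frame d with a binary factor column label (levels "0" and "1") and using {rpart} just as an example base learner:

library(rpart)

undersample_bag <- function(d, n_bags = 25) {
  pos <- d[d$label == "1", ]   # minority class
  neg <- d[d$label == "0", ]   # majority class
  lapply(seq_len(n_bags), function(i) {
    # undersample the majority class down to the minority class size
    neg_s <- neg[sample(nrow(neg), nrow(pos)), ]
    rpart(label ~ ., data = rbind(pos, neg_s), method = "class")
  })
}

predict_bag <- function(models, newdata) {
  # average the predicted probabilities of the minority class over all bags
  probs <- sapply(models, function(m) predict(m, newdata, type = "prob")[, "1"])
  rowMeans(probs)
}

Each bag sees a balanced sample, so no single model is dominated by the majority class, and averaging the bags restores stability lost by throwing away majority-class rows.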

What kinds of mathematics are needed if you want to learn machine learning

This post is a reproduced version of the post in my Japanese blog. For years, a lot of beginners in machine learning have asked me questions such as "Do I have to learn mathematics? What kind? To what extent?" and sometimes I've found it very hard to…

10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Two years ago, I published a book -- written in Japanese, so I'm afraid most readers can't read it :'( Actually this book was written as a summary of 10 major data science methods. But as two years have passed, the content of the book …

{mxnet} R package from MXnet, an intuitive Deep Learning framework including CNN & RNN

Actually I've known about MXnet for weeks as one of the most popular libraries / packages among Kagglers, but just recently I heard the bug fixes were almost done and some friends said the latest version looks stable, so at last I installed it. MXn…
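For reference, a minimal sketch of the {mxnet} R interface based on its high-level mx.mlp() helper as described in the package tutorial of that time; the objects train.x / train.y / test.x are hypothetical, and argument names may differ in other versions:

require(mxnet)
mx.set.seed(0)
# train.x: numeric feature matrix, train.y: integer class labels (hypothetical)
model <- mx.mlp(train.x, train.y, hidden_node = 10, out_node = 2,
                out_activation = "softmax", num.round = 20,
                array.batch.size = 15, learning.rate = 0.07, momentum = 0.9,
                eval.metric = mx.metric.accuracy)
preds <- predict(model, test.x)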

Can multivariate modeling predict taste of wine? Beyond human intuition and univariate reductionism

Taste of Wine vs. Data Science from Takashi J OZAKI At a meetup the other day, I talked about a brand-new relationship between the taste of wine (i.e. professional tasting) and data science. This talk was inspired by a book "Wine Sc…

Univariate stats sometimes fail, while multivariate modelings work well

In many cases of digital marketing, especially online, marketers or analysts love to apply A/B tests in order to find the most influential metric on KGIs/KPIs from a huge set of explanatory metrics, such as creative component…
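To see why looking at one metric at a time can mislead, here is a toy R sketch (simulated data, not from any real campaign) in which a univariate test points to a merely correlated metric while a multivariate model recovers the true driver:

set.seed(42)
x1  <- rnorm(500)                  # the metric that actually drives the KPI
x2  <- x1 + rnorm(500, sd = 0.3)   # a correlated metric with no direct effect
kpi <- 2 * x1 + rnorm(500)
cor.test(x2, kpi)$estimate         # univariate view: x2 looks strongly "influential"
summary(lm(kpi ~ x1 + x2))         # multivariate view: only x1 gets a meaningful coefficient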

Machine learning for package users with R (6): Xgboost (eXtreme Gradient Boosting)

As far as I know, Xgboost has been the most successful machine learning classifier in several machine learning competitions, e.g. Kaggle or the KDD Cup. Indeed the team that won the Higgs Boson competition used Xgboost, and below is their code rel…
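For reference, a minimal sketch of calling {xgboost} from R, assuming hypothetical train.x / train.y / test.x objects; the hyperparameters below are just placeholders, not the winning team's settings:

library(xgboost)
# train.x, test.x: numeric matrices; train.y: 0/1 vector (hypothetical)
dtrain <- xgb.DMatrix(data = train.x, label = train.y)
bst <- xgboost(data = dtrain, nrounds = 100, max_depth = 6, eta = 0.1,
               objective = "binary:logistic")
pred <- predict(bst, xgb.DMatrix(test.x))   # predicted probabilities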

Machine learning for package users with R (5): Random Forest

Random Forest is still one of the strongest supervised learning methods, although these days many people love to use Deep Learning or Convolutional NNs. Of course, thanks to its simple architecture and the many implementations in various enviro…
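As a quick reminder of how simple it is to use, a minimal {randomForest} example on the built-in iris data (just an illustration, not the code from this post):

library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)        # OOB error estimate and confusion matrix
varImpPlot(rf)   # variable importance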

Machine learning for package users with R (4): Neural Network

These days almost everybody appears to love a variation of the Neural Network (NN) -- Deep Learning. I already discussed how Deep Learning works and what kinds of parameters characterize it in a previous post. What kind of decision bounda…
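For comparison with the deep variants, a minimal single-hidden-layer network with the classic {nnet} package on the built-in iris data (an illustration only, with placeholder settings):

library(nnet)
nn <- nnet(Species ~ ., data = iris, size = 5, decay = 5e-4, maxit = 200)
table(predict(nn, iris, type = "class"), iris$Species)   # training confusion matrix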

Machine learning for package users with R (3): Support Vector Machine

Actually the support vector machine (SVM) is the one I love most among the various machine learning classifiers... because of its strong generalization and beautiful decision boundaries (in high-dimensional space). Although there are other …
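A minimal {e1071} example on the built-in iris data (cost and gamma would normally be tuned, e.g. with tune.svm(); this is just an illustration):

library(e1071)
sv <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1, gamma = 0.25)
table(predict(sv, iris), iris$Species)   # training confusion matrix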

Machine learning for package users with R (2): Logistic Regression

I think a lot of people love logistic regression because it's pretty light and fast. But we know it's just a linear classifier -- I mean it works only for linearly separable patterns, not linearly non-separable ones. Its primary ide…
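A tiny simulated example of that limitation: a logistic regression fitted with glm() stays near chance level on an XOR pattern, precisely because its decision boundary is a straight line:

set.seed(1)
x1 <- runif(200); x2 <- runif(200)
y  <- as.integer(xor(x1 > 0.5, x2 > 0.5))   # linearly non-separable XOR labels
fit <- glm(y ~ x1 + x2, family = binomial)
mean((predict(fit, type = "response") > 0.5) == y)   # accuracy stays close to 0.5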

Machine learning for package users with R (1): Decision Tree

Notice: the {mvpart} package has been removed from CRAN due to the expiration of its support. For installation, 1) please download the latest (but expired) package archive from the old archive site and 2) install it following the procedur…
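A minimal sketch of that installation procedure; the exact file name and version number below are assumptions, so check the CRAN archive page for the actual latest archive:

# version number is an assumption; see https://cran.r-project.org/src/contrib/Archive/mvpart/
url <- "https://cran.r-project.org/src/contrib/Archive/mvpart/mvpart_1.6-2.tar.gz"
install.packages(url, repos = NULL, type = "source")
library(mvpart)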

Machine learning for package users with R (0): Prologue

Below is the most popular post on this blog; it recorded an enormous number of page views and received a lot of comments both here and outside this blog. Comparing machine learning classifiers based on their hyperplanes or decision boundaries - Da…

Deep Learning with {h2o} on MNIST dataset (and Kaggle competition)

In the previous post we saw how Deep Learning with {h2o} works and how Deep Belief Nets implemented by h2o.deeplearning draw decision boundaries for XOR patterns. What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? P…
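For reference, a minimal sketch of fitting such a model with the {h2o} R package, assuming the Kaggle MNIST CSV files train.csv / test.csv with a label column; argument names follow recent h2o versions and the layer sizes are placeholders:

library(h2o)
h2o.init(nthreads = -1)
train <- h2o.importFile("train.csv")        # hypothetical Kaggle MNIST training file
train$label <- as.factor(train$label)       # treat digits as classes
dl <- h2o.deeplearning(x = setdiff(colnames(train), "label"), y = "label",
                       training_frame = train, hidden = c(200, 200), epochs = 10)
pred <- h2o.predict(dl, h2o.importFile("test.csv"))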

What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? Practice with R and {h2o} package

For a while (at least several months since many people began to implement it with Python and/or Theano, PyLearn2 or something like that), I had nearly given up practicing Deep Learning with R and I felt I had been left alone, much further away…

Visualizing supervised machine learning with association rules and graphical modeling

On Apr 17, I joined Global TokyoR #1 and gave the talk below. Visualization of Supervised Learning with {arules} + {arulesViz} from Takashi J Ozaki (Note: please install the {igraph} package before installing {arulesViz}) By the way, th…
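A minimal {arules} + {arulesViz} sketch on the built-in Groceries data (just an illustration of the workflow, not the slides' code; the support / confidence thresholds are placeholders):

library(arules)
library(arulesViz)   # install {igraph} first, as noted above
data(Groceries)
rules <- apriori(Groceries, parameter = list(supp = 0.005, conf = 0.3))
# graph-style visualization of the 20 rules with the highest lift
plot(head(sort(rules, by = "lift"), 20), method = "graph")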

Comparing machine learning classifiers based on their hyperplanes or decision boundaries

In the Japanese version of this blog, I've written a series of posts about how each kind of machine learning classifier draws various classification hyperplanes or decision boundaries. So in this post I want to show you a summary of the serie…
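As a generic recipe (not the exact code of the series), one way to draw such a decision boundary in R is to predict a classifier's labels over a fine grid of two features and draw the contour between the classes:

library(e1071)
d <- iris[iris$Species != "setosa", c("Petal.Length", "Petal.Width", "Species")]
d$Species <- droplevels(d$Species)
fit <- svm(Species ~ ., data = d)   # any classifier with a predict() method works here
grid <- expand.grid(Petal.Length = seq(3, 7, length.out = 200),
                    Petal.Width  = seq(0.9, 2.6, length.out = 200))
plot(d$Petal.Length, d$Petal.Width, col = d$Species, pch = 19)
contour(unique(grid$Petal.Length), unique(grid$Petal.Width),
        matrix(as.numeric(predict(fit, grid)), 200, 200),
        levels = 1.5, add = TRUE, drawlabels = FALSE)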