Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Undersampling + bagging = better generalized classification for imbalanced dataset

This post is reproduced from a post on my Japanese blog. A friend of mine, an academic researcher in the machine learning field, tweeted the following: I was studying how to handle imbalanced data, and [Wallace et al. ICDM'11] https://t.co/ltQ942lKP…

What kinds of mathematics are needed if you want to learn machine learning

This post is a reproduced version of a post on my Japanese blog. For years, many beginners in machine learning have asked me questions such as "Do I have to learn mathematics? What kind? To what extent?" and sometimes I've found it very hard to…

In Japan, now "Artificial Intelligence" comes to be a super star, while "Data Scientist" has been forgotten

Almost two years ago, I wrote a post about the situation of "Data Scientist" and "Artificial Intelligence" at that time. Two years on, what is happening now and what do we see? Below is a summary of the current situation of data s…

10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Two years ago, I published a book -- written in Japanese, so I'm afraid most readers can't read it :'( The book was written as a summary of 10 major data science methods. But now that two years have passed, the content of the book …

{mxnet} R package from MXnet, an intuitive Deep Learning framework including CNN & RNN

I've known about MXnet for weeks as one of the most popular libraries / packages among Kagglers, but just recently I heard that the bug fixes are almost done and some friends say the latest version looks stable, so at last I installed it. MXn…

Can multivariate modeling predict taste of wine? Beyond human intuition and univariate reductionism

Taste of Wine vs. Data Science from Takashi J OZAKI. At a certain meetup the other day, I talked about a brand-new relationship between the taste of wine (i.e. professional tasting) and data science. This talk was inspired by a book "Wine Sc…

Bayesian modeling with R and Stan (5): Time series with seasonality

In the previous post, we successfully estimated a model with a nonlinear trend by using Stan. But please remember this is a time series dataset. Does it include any other kind of nonlinear component? Yes, we have to be careful about seasona…

Bayesian modeling with R and Stan (4): Time series with a nonlinear trend

The previous post reviewed how to estimate a simple hierarchical Bayesian model. You can see more complicated cases in the great textbook "The BUGS Book". But personally, I find hierarchical Bayesian modeling most useful for time-series anal…

Bayesian modeling with R and Stan (3): Simple hierarchical Bayesian model

In the 2 previous posts, you learned what Bayesian modeling and Stan are and how to install Stan. Now you are ready to try them on some very Bayesian problems that many people love, such as hierarchical Bayesian models. Definition of hierarchica…

Bayesian modeling with R and Stan (2): Installation and an easy example

The previous post gave an overview of what Stan on R is and how it works: "Bayesian modeling with R and Stan (1): Overview - Data Scientist in Ginza, Tokyo". Are you ready now? OK, this post reviews how to install Stan. Let's start here! :) In principle this p…

Bayesian modeling with R and Stan (1): Overview

Although I've written a series of posts titled "Machine learning for package users with R", I usually don't run machine learning in my daily analytic work because my current coverage is so-called ad-hoc analysis. Instead of machine learning,…

Univariate stats sometimes fail, while multivariate modelings work well

In many cases of digital marketing, especially online, marketers and analysts love to apply A/B tests to find the metric with the most influence on KGIs/KPIs from a huge set of explanatory metrics, such as creative component…

Machine learning for package users with R (6): Xgboost (eXtreme Gradient Boosting)

As far as I know, Xgboost is the most successful classifier in several machine learning competitions, e.g. Kaggle or KDD Cup. Indeed, the team that won the Higgs Boson competition used Xgboost, and below is their code rel…

Machine learning for package users with R (5): Random Forest

Random Forest is still one of the strongest supervised learning methods, although these days many people love to use Deep Learning or Convolutional NN. Of course, this is because of its simple architecture and the many implementations in various enviro…

Machine learning for package users with R (4): Neural Network

These days almost everybody appears to love a variation of the Neural Network (NN) -- Deep Learning. I already discussed how Deep Learning works and what kinds of parameters characterize it in the previous post. What kind of decision bounda…

Machine learning for package users with R (3): Support Vector Machine

The support vector machine (SVM) is the one I love the most among machine learning classifiers... because of its strong generalization and beautiful decision boundary (in high-dimensional space). Although there are other …

Machine learning for package users with R (2): Logistic Regression

I think a lot of people love logistic regression because it's pretty light and fast. But we know it's just a linear classifier -- I mean it's only for linearly separable patterns, not linearly non-separable ones. Its primary ide…

Machine learning for package users with R (1): Decision Tree

Notice: The {mvpart} package has been removed from CRAN because its support expired. To install it, 1) download the latest (but expired) package archive from the old archive site and 2) install it following the procedur…

Machine learning for package users with R (0): Prologue

Below is the most popular post on this blog, which recorded an enormous number of page views and received many comments both here and outside this blog. Comparing machine learning classifiers based on their hyperplanes or decision boundaries - Da…

Deep Learning with {h2o} on MNIST dataset (and Kaggle competition)

In the previous post we saw how Deep Learning with {h2o} works and how Deep Belief Nets implemented by h2o.deeplearning draw decision boundaries for XOR patterns. What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? P…

In Japan "Data Scientist" has gone and "Artificial Intelligence" is explosively rising

More than a year ago, I pointed out that "Data Scientist" was attracting less attention than ever. Puzzling situation of "Data Scientist" in Japanese market - Data Scientist in Ginza, Tokyo. So, what's going on in 2015?... yes, I think not a f…

What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? Practice with R and {h2o} package

For a while (at least the several months since many people began to implement it with Python and/or Theano, PyLearn2 or something like that), I had nearly given up practicing Deep Learning with R, and I felt I had been left alone much further away…

Visualizing supervised machine learning with association rules and graphical modeling

On Apr 17, I joined Global TokyoR #1 and gave the talk below. Visualization of Supervised Learning with {arules} + {arulesViz} from Takashi J Ozaki (Note: please install the {igraph} package before installing {arulesViz}.) By the way, th…

Answers to "10 questions about big data and data science" from Japan

I read a set of very interesting questions by Dr. Vincent Granville, as below: 10 questions about big data and data science - Data Science Central. Should companies embrace big data? Which ones (start-ups, big-companies, tech companies, retai…

Simple analytics work fast, but cannot avoid third-party effects

(The original Japanese posts are here and here.) In Japan, in my own experience, there may be a dichotomy between "analytics" and "data science". It has been said that real business matters require rapid analyses and rapid act…

Pitfall of "regression to the mean" in growth hacking

(The original Japanese post is here.) In several marketing teams I've worked on, and from quite a few marketing people at other companies, I've heard complaints like the following: "We're working hard to improve and optimiz…

Puzzling situation of "Data Scientist" in Japanese market

I'm just a Data Scientist at a company -- but at the same time I'm working as an evangelist for data science and data scientists themselves. I've been watching how people think of data scientists and how deeply they are accepted in Jap…

Comparing machine learning classifiers based on their hyperplanes or decision boundaries

In the Japanese version of this blog, I've written a series of posts about how each kind of machine learning classifier draws its classification hyperplanes or decision boundaries. So in this post I want to show you a summary of the serie…

Greetings from Ginza in Tokyo, Japan

Hello everybody in the data science community -- I'm TJO from Ginza, Tokyo. Ginza is one of the most bustling downtown areas not only in Tokyo but in all of Japan. After a 6-year academic career in experimental neuroscience, I moved to data sci…