Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

What kinds of mathematics are needed if you want to learn machine learning

This post is an English reproduction of a post on my Japanese blog.

For years, many beginners in machine learning have asked me questions such as "Do I have to learn mathematics? What kind? To what extent?", and I have sometimes found it very hard to answer in a few words. Fortunately, I learned linear algebra and calculus as a student in a department of engineering, so they are very familiar to me and useful for understanding the theoretical aspects of machine learning.

But recently more and more beginners are rushing into the machine learning or "AI" field to find more opportunities or even jobs. As far as I've seen, some of them have never learned university-level mathematics, although ML requires it. Unfortunately, most of them graduated from university years ago and have little opportunity to learn mathematics in class. I think they need some guidelines on which mathematics to learn in order to understand machine learning.

In this post, I'd like to review, for such beginners, a few kinds of mathematics that may be required for understanding modern machine learning. One disclaimer: I'm by no means a mathematical expert, so some points may be mathematically incorrect. If you find any, don't hesitate to let me know!


In Japan, now "Artificial Intelligence" comes to be a super star, while "Data Scientist" has been forgotten

Almost two years ago, I wrote a post about the situation of "Data Scientist" and "Artificial Intelligence" at that time.

Two years have passed. What is happening now, and what do we see? Below is a summary of the current situation of data science, data scientists, artificial intelligence and related topics in Japan.


10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Two years ago, I published a book -- written in Japanese so I'm afraid most of the readers can't read it :'(

Actually, this book was written as a summary of 10 major data science methods. But two years have passed and the content of the book is now out of date; it clearly needs an update that covers more recent advances in statistics and machine learning. Below is a list of the 10+2 methods that I believe every data scientist must know in 2016.

  1. Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
  2. Multiple Regression (Linear Models)
  3. Generalized Linear Models (GLM: Logistic Regression, Poisson Regression)
  4. Random Forest
  5. XGBoost (eXtreme Gradient Boosting)
  6. Deep Learning
  7. Bayesian Modeling with MCMC
  8. word2vec
  9. K-means Clustering
  10. Graph Theory & Network Analysis
  • (A1) Latent Dirichlet Allocation & Topic Modeling
  • (A2) Factorization (SVD, NMF)

The first 10 methods are ones I know well and actually run in my daily work. I have never applied the last 2 methods myself to actual business problems; they were run by my colleagues in my previous job, although I'm familiar with them as an operation manager. So the former come with R or Python scripts as practical examples, while the latter only come with the standard examples provided by help sources. Some of them require a gcc / clang compiler or, as with H2O, a Java runtime environment. OK, let's go.


  • This post gives a 'perspective' for those who want an overview of all of the methods; some descriptions may be less strict and/or incorrect, and it is not meant to teach how to implement them from scratch.
  • Please look up how to install packages, libraries or other build environments by yourself.
  • Comments or critiques on any incorrect points in this post are welcome.
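
As a small taste of the first method in the list (the t-test), here is a minimal sketch in Python. It is not taken from the original post: the function name `welch_t` and the two tiny samples are made up for illustration, and it uses only the standard library rather than any particular statistics package. It computes Welch's t statistic and the approximate degrees of freedom for two independent samples; turning these into a p-value would still need a t-distribution table or a statistics library.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test.

    Hypothetical helper for illustration; assumes each sample has
    at least two observations and non-zero variance overall.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)

    # statistics.variance is the unbiased sample variance (divides by n - 1).
    # Dividing by n gives the squared standard error of each sample mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b

    # t statistic: difference of means over its estimated standard error.
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up example data: two small groups of measurements.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_value, dof = welch_t(group_a, group_b)
print(t_value, dof)
```

Welch's version is chosen here only because it does not assume equal variances in the two groups; the classic Student's t-test would pool the variances instead.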