Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Two years ago, I published a book -- written in Japanese so I'm afraid most of the readers can't read it :'(


Actually this book was written as a summary of 10 major data science methods. But as two years have gone, the content of the book is now out-of-date; obviously it needs further update, including some more advances in statistics and machine learning. Below is a list of the 10+2 methods that I believe every data scientist must know in 2016.

  1. Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
  2. Multiple Regression (Linear Models)
  3. General Linear Models (GLM: Logistic Regression, Poisson Regression)
  4. Random Forest
  5. Xgboost (eXtreme Gradient Boosted Trees)
  6. Deep Learning
  7. Bayesian Modeling with MCMC
  8. word2vec
  9. K-means Clustering
  10. Graph Theory & Network Analysis
  • (A1) Latent Dirichlet Allocation & Topic Modeling
  • (A2) Factorization (SVD, NMF)


The first 10 methods are the ones I know well and indeed I'm running in my daily works, but I've never tried the last 2 methods by my own hand for actual business and they've been run by my colleagues in my previous job, although I'm also familiar with them as an operation manager. So the former includes R or Python scripts to run as practical examples, while the latter only includes ordinary examples provided by help sources. Some of them require gcc / clang compiler, or Java runtime environment such as H2O. OK, let's go.

Disclaimer

  • This post gives you a 'perspective' for those who want to overview all of the methods; there may be some less strict and/or incorrect description, and it was not supposed to provide any knowledge on implementation from scratch.
  • Please search how to install packages, libraries or other build environment by yourself.
  • Any comments or critiques on any incorrect points in this post are welcome.
Read more

{mxnet} R package from MXnet, an intuitive Deep Learning framework including CNN & RNN

Actually I've known about MXnet for weeks as one of the most popular library / packages in Kaggler, but just recently I heard bug fix has been almost done and some friends say the latest version looks stable, so at last I installed it.



MXnet is a framework distributed by DMLC, the team also known as a distributor of Xgboost. Now its documentation looks to be completed and even pre-trained models for ImageNet are distributed. I think this should be a good news for R-users loving machine learning... so let's go.

Read more

Can multivariate modeling predict taste of wine? Beyond human intuition and univariate reductionism


At a certain meetup on the other day, I talked about a brand-new relationship between taste of wine (i.e. professional tasting) and data science. This talk was inspired by a book "Wine Science: The Application of Science in Winemaking". Below is its Japanese edition.


新しいワインの科学

新しいワインの科学


For readers who can't read Japanese, I summarized the content of the talk in this post. Just for your information, I myself am a super wine lover :) and I'm also much interested in how data science explain taste of wine.


In order to run analytics below, I prepared an R workspace in my GitHub repository. Please download and import it into your R environment.



(Note: All quotes from Goode's book here are reversely translated from the Japanese edition and it may contain not a few difference from the original version)

Read more