Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

10+2 Data Science Methods that Every Data Scientist Should Know in 2016

Two years ago, I published a book -- written in Japanese, so I'm afraid most readers can't read it :'(


This book was actually written as a summary of 10 major data science methods. But two years have passed and its content is now out of date; it obviously needs an update that covers more recent advances in statistics and machine learning. Below is a list of the 10+2 methods that I believe every data scientist must know in 2016.

  1. Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
  2. Multiple Regression (Linear Models)
  3. Generalized Linear Models (GLM: Logistic Regression, Poisson Regression)
  4. Random Forest
  5. XGBoost (eXtreme Gradient Boosting)
  6. Deep Learning
  7. Bayesian Modeling with MCMC
  8. word2vec
  9. K-means Clustering
  10. Graph Theory & Network Analysis
  • (A1) Latent Dirichlet Allocation & Topic Modeling
  • (A2) Matrix Factorization (SVD, NMF)


The first 10 methods are ones I know well and actually run in my daily work. I have never run the last 2 myself for actual business; my colleagues ran them in my previous job, although I became familiar with them as an operation manager. So the first 10 come with R or Python scripts as practical examples, while the last 2 only come with the ordinary examples provided in the help documentation. Some of the tools require a gcc / clang compiler, or a Java runtime environment (as H2O does). OK, let's go.

Disclaimer

  • This post offers a 'perspective' for those who want an overview of all of the methods; some descriptions may be less than strict and/or incorrect, and the post is not meant to teach how to implement anything from scratch.
  • Please look up for yourself how to install the packages, libraries and any other build environments.
  • Any comments or critiques on any incorrect points in this post are welcome.