Almost two years ago, I wrote a post about the situation of "Data Scientist" and "Artificial Intelligence" at that time.
Two years have now passed, so what is happening and what do we see? Below is a summary of the current situation of data science, data scientists, artificial intelligence, and related topics in Japan.
Two years ago, I published a book, written in Japanese, so I'm afraid most readers can't read it :'(
The book was written as a summary of 10 major data science methods. But two years have gone by and its content is now out of date; it clearly needs an update, including some more recent advances in statistics and machine learning. Below is a list of the 10+2 methods that I believe every data scientist must know in 2016.
- Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
- Multiple Regression (Linear Models)
- Generalized Linear Models (GLM: Logistic Regression, Poisson Regression)
- Random Forest
- Xgboost (eXtreme Gradient Boosting)
- Deep Learning
- Bayesian Modeling with MCMC
- K-means Clustering
- Graph Theory & Network Analysis
- (A1) Latent Dirichlet Allocation & Topic Modeling
- (A2) Factorization (SVD, NMF)
The first 10 methods are ones I know well and run in my daily work. I have never run the last 2 myself for actual business; my colleagues ran them at my previous job, although I am familiar with them as an operations manager. So the former come with R or Python scripts as practical examples, while the latter only come with the standard examples found in help pages. Some of them require a gcc / clang compiler or, as with H2O, a Java runtime environment. OK, let's go.
- This post offers a 'perspective' for those who want an overview of all the methods; some descriptions may be loose and/or incorrect, and it is not intended to teach how to implement anything from scratch.
- Please look up for yourself how to install the packages, libraries, or other build environments.
- Any comments or critiques on any incorrect points in this post are welcome.
I've known about MXnet for weeks as one of the most popular libraries / packages among Kagglers, but only recently did I hear that the bug fixes are almost done, and some friends say the latest version looks stable, so at last I installed it.
MXnet is a framework distributed by DMLC, the team also known as the distributor of Xgboost. Its documentation now looks complete, and even pre-trained models for ImageNet are distributed. I think this is good news for R users who love machine learning... so let's go.