Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Machine learning for package users with R (1): Decision Tree

Notice

The {mvpart} package has been removed from CRAN because its support expired. To install it, 1) please download the latest (but archived) package tarball from the CRAN archive site and 2) install it following the procedure shown below.

In short, on Mac or Linux the R command below will work.

> install.packages("[URL of the package's tar.gz here]", repos=NULL, type="source")

For Windows, you first have to edit PATH to include the R binary folder, and then type the following in Command Prompt or PowerShell.

R CMD INSTALL XYZ.tar.gz


In the previous post, we saw how to evaluate machine learning classifiers using typical XOR patterns and by drawing their decision boundaries on the XY plane. In this post, let's see how Decision Tree, one of the lightest machine learning classifiers, works.


A brief trial on a short version of MNIST datasets


I think it's good to see how each classifier works on the well-known MNIST dataset. Prior to writing this post, I prepared a short version of the MNIST dataset and uploaded it to my GitHub repository below.

Please download 'short_prac_train.csv' and 'short_prac_test.csv' into your R working directory and run the code below.
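As a minimal sketch of the workflow, here is how a decision tree could be fit to these CSVs. Column names are assumptions (I assume a 'label' column holding the digit class and one column per pixel; please check the actual file headers), and {rpart} is used here because it provides the same rpart() interface that {mvpart} extends. If the CSVs are not present, the sketch falls back to a tiny synthetic XOR-like stand-in so it runs anywhere.

```r
library(rpart)

if (file.exists("short_prac_train.csv") && file.exists("short_prac_test.csv")) {
  train <- read.csv("short_prac_train.csv")
  test  <- read.csv("short_prac_test.csv")
} else {
  # Synthetic stand-in: two "pixel" features and an XOR-ish label,
  # just so the sketch is runnable without the downloaded files
  set.seed(1)
  n  <- 200
  x1 <- runif(n); x2 <- runif(n)
  label <- as.integer(xor(x1 > 0.5, x2 > 0.5))
  d <- data.frame(label = label, px1 = x1, px2 = x2)
  train <- d[1:150, ]; test <- d[151:200, ]
}

# Treat the label as a categorical class, not a number
train$label <- as.factor(train$label)
test$label  <- as.factor(test$label)

# Grow a classification tree on all remaining columns
fit <- rpart(label ~ ., data = train, method = "class")

# Accuracy on the held-out test set
pred <- predict(fit, newdata = test, type = "class")
acc  <- mean(pred == test$label)
print(acc)
```

Plotting the fitted tree with plot(fit); text(fit) gives a quick view of which pixels the splits use.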

Read more

Machine learning for package users with R (0): Prologue

Below is the most popular post on this blog; it recorded an enormous number of page views and received many comments, both here and elsewhere.



In this post, I discussed two aspects of the performance of machine learning classifiers: overfitting and generalization. Of course, these roughly correspond to the bias-variance tradeoff: low-bias models are easily overfitted, while low-variance models generalize well. To illustrate this characteristic for each machine learning classifier, I applied each one to fixed datasets: "simple" and "complex" 2D XOR samples.


In fact, I wrote a series of posts running a similar experiment with each classifier on the Japanese version of my blog. The title of that series translates to "Machine learning for package users with R".



Quite a few people have asked me to translate these posts into English, and I agree it would be helpful for readers overseas, so in this series of posts I'll translate them and discuss some important points for understanding them.

Read more

Deep Learning with {h2o} on MNIST dataset (and Kaggle competition)

In the previous post, we saw how Deep Learning with {h2o} works and how the Deep Belief Nets implemented by h2o.deeplearning draw decision boundaries for XOR patterns.

Of course, exactly the same framework can be applied to other common datasets, including Kaggle competitions. Just out of curiosity, I was looking for a free MNIST dataset, and fortunately I found that Kaggle provides one, as below.

I know Convolutional Neural Networks (ConvNets or CNNs) work better for such 2D image classification tasks than Deep Belief Nets, and there are well-known, well-established libraries such as Caffe, CUDA-ConvNet, Torch7, etc., but they may take a little more effort for (lazy) me to set up. So here I ran a brief and quick trial of h2o.deeplearning on the MNIST dataset to check its performance.

Read more