Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Machine learning for package users with R (1): Decision Tree

Notice

The {mvpart} package has been removed from CRAN because its support expired. To install it, 1) please download the latest (but archived) package tarball from the CRAN archive site and 2) install it following the procedure shown below.

In short, on Mac or Linux the R command below will work.

> install.packages("[URL of the package's tar.gz here]", repos=NULL, type="source")

For Windows, you first have to edit PATH to include the R binary folder, and then type the following in Command Prompt or PowerShell.

R CMD INSTALL XYZ.tar.gz


In the previous post, we saw how to evaluate machine learning classifiers using typical XOR patterns and by drawing their decision boundaries on the XY plane. In this post, let's see how Decision Tree, one of the lightest machine learning classifiers, works.


A brief trial on a short version of MNIST datasets


I think it's good to see how each classifier works on the well-known MNIST dataset. Prior to writing this post, I prepared a short version of the MNIST dataset and uploaded it to my GitHub repository below.

Please download 'short_prac_train.csv' and 'short_prac_test.csv' into your R working directory and run the code below.
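As a minimal sketch of the workflow, here is how a decision tree could be fit to these CSVs. Column names are assumptions (I assume a 'label' column holding the digit class and one column per pixel; please check the actual file headers), and {rpart} is used here because it provides the same rpart() interface that {mvpart} extends. If the CSVs are not present, the sketch falls back to a tiny synthetic XOR-like stand-in so it runs anywhere.

```r
library(rpart)

if (file.exists("short_prac_train.csv") && file.exists("short_prac_test.csv")) {
  train <- read.csv("short_prac_train.csv")
  test  <- read.csv("short_prac_test.csv")
} else {
  # Synthetic stand-in: two "pixel" features and an XOR-ish label,
  # just so the sketch is runnable without the downloaded files
  set.seed(1)
  n  <- 200
  x1 <- runif(n); x2 <- runif(n)
  label <- as.integer(xor(x1 > 0.5, x2 > 0.5))
  d <- data.frame(label = label, px1 = x1, px2 = x2)
  train <- d[1:150, ]; test <- d[151:200, ]
}

# Treat the label as a categorical class, not a number
train$label <- as.factor(train$label)
test$label  <- as.factor(test$label)

# Grow a classification tree on all remaining columns
fit <- rpart(label ~ ., data = train, method = "class")

# Accuracy on the held-out test set
pred <- predict(fit, newdata = test, type = "class")
acc  <- mean(pred == test$label)
print(acc)
```

Plotting the fitted tree with plot(fit); text(fit) gives a quick view of which pixels the splits use.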

Read more

Machine learning for package users with R (0): Prologue

Below is the most popular post on this blog; it recorded an enormous number of page views and received many comments, both here and elsewhere.



In this post, I discussed two aspects of the performance of machine learning classifiers: overfitting and generalization. Of course, these roughly correspond to the bias-variance tradeoff: low-bias models are easily overfitted, while low-variance models generalize well. To illustrate this characteristic for each machine learning classifier, I applied each one to fixed datasets: "simple" and "complex" 2D XOR samples.


In fact, I wrote a series of posts running a similar experiment with each classifier on the Japanese version of my blog. The title of that series translates to "Machine learning for package users with R".



Quite a few people have asked me to translate these posts into English, and I agree it would be helpful for readers overseas, so in this series of posts I'll translate them and discuss some important points for understanding them.

Read more

Deep Learning with {h2o} on MNIST dataset (and Kaggle competition)

In the previous post, we saw how Deep Learning with {h2o} works and how the Deep Belief Nets implemented by h2o.deeplearning draw decision boundaries for XOR patterns.

Of course, exactly the same framework can be applied to other common datasets, including Kaggle competitions. Just out of curiosity, I was looking for a free MNIST dataset, and fortunately I found that Kaggle provides one, as below.

I know Convolutional Neural Networks (ConvNets or CNNs) work better for such 2D image classification tasks than Deep Belief Nets, and there are well-known, well-established libraries such as Caffe, CUDA-ConvNet, Torch7, etc., but they may take a little more effort for (lazy) me to set up. So here I ran a brief and quick trial of h2o.deeplearning on the MNIST dataset to check its performance.

Read more