Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? Practice with R and {h2o} package

For a while (at least the several months since many people began implementing it with Python and/or Theano, PyLearn2 or the like), I had nearly given up on practicing Deep Learning with R, and I felt I had been left far behind the cutting edge of the technology...


But now we (not only I!) have a great masterpiece: {h2o}, an R implementation of the H2O framework. I believe {h2o} is the easiest way to apply Deep Learning to our own datasets, because we don't even have to write any scripts; we only have to specify some parameters. That is, with {h2o} we are free from complicated code and can focus on the underlying essence and theory.


Using {h2o} in R, in principle we can implement a "Deep Belief Net", the original form of Deep Learning*1. I know it's no longer the state-of-the-art style of Deep Learning, but it should still be helpful for understanding how Deep Learning works on actual datasets. If you have read this blog before, please recall a previous post arguing that decision boundaries tell us how each classifier behaves in terms of overfitting or generalization. :)

It is quite simple to tell which classifiers overfit and which generalize well on the given dataset, which is generated from 4 fixed 2D normal distributions. My points are: 1) if the decision boundaries look smooth, the classifier is well generalized; 2) if they look too complicated, it is overfitting, because the underlying true distributions can be cleanly divided into 4 quadrants by 2 perpendicular axes.
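A dataset like this can be sketched in R as below. This is not the author's exact script — the column names, means and standard deviations are my assumptions — but it illustrates the setup: one 2D Gaussian per quadrant, with labels arranged so that the true class boundaries are the two perpendicular axes.

```r
# Hypothetical reconstruction: 4 fixed 2D Gaussians, one per quadrant.
# Labels alternate across quadrants, so the ideal decision boundaries
# are simply the x1 and x2 axes.
set.seed(1)
n <- 100
train <- rbind(
  data.frame(x1 = rnorm(n,  1, 0.5), x2 = rnorm(n,  1, 0.5), label = "A"),
  data.frame(x1 = rnorm(n, -1, 0.5), x2 = rnorm(n,  1, 0.5), label = "B"),
  data.frame(x1 = rnorm(n, -1, 0.5), x2 = rnorm(n, -1, 0.5), label = "A"),
  data.frame(x1 = rnorm(n,  1, 0.5), x2 = rnorm(n, -1, 0.5), label = "B")
)
train$label <- factor(train$label)

# Quick look at the clusters
plot(train$x1, train$x2, col = train$label, pch = 19)
```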


OK, let's run the same trial with the Deep Learning of {h2o} in R, in order to see how DL behaves on the given dataset.
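A minimal sketch of such a trial is below. Note that parameter names in {h2o} have changed across package versions (older releases used `data =` rather than `training_frame =`), so check the documentation of your installed version; the column names `x1`, `x2` and `label` are my assumptions, not from the original post.

```r
# Sketch: fit a deep net with {h2o} and predict over a grid,
# so the predictions can be used to draw the decision boundary.
library(h2o)
h2o.init(nthreads = -1)      # start a local H2O cluster, all cores

train.h2o <- as.h2o(train)   # 'train' has columns x1, x2, label (factor)
fit <- h2o.deeplearning(
  x = c("x1", "x2"),         # predictor columns
  y = "label",               # class label
  training_frame = train.h2o,
  hidden = c(20, 20),        # two hidden layers of 20 units each
  epochs = 50
)

# Predict on a dense grid covering the feature space
grid <- expand.grid(x1 = seq(-3, 3, 0.1), x2 = seq(-3, 3, 0.1))
pred <- as.data.frame(h2o.predict(fit, as.h2o(grid)))

# Coloring the grid by predicted class reveals the decision boundary
plot(grid$x1, grid$x2, col = pred$predict, pch = ".")
```

The key point of the post stands out here: aside from data handling, the whole "implementation" of Deep Learning is the single `h2o.deeplearning()` call and its parameters.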

*1: Of course, I think the latest topic in Deep Learning is the "ConvNet", a convolutional neural network with convolution and max-pooling layers


Visualizing supervised machine learning with association rules and graphical modeling

On Apr 17, I joined Global TokyoR #1 and gave the talk below.



(Note: please install {igraph} package before installing {arulesViz})
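The note above amounts to the following install order (a sketch; {arulesViz} uses {igraph} for its graph-based visualizations, so installing {igraph} first avoids dependency errors):

```r
# Install {igraph} first, then {arulesViz}, as noted above
install.packages("igraph")
install.packages("arulesViz")
library(arulesViz)
```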


By the way, the main guest of this seminar was Markus Gesmann, the developer and maintainer of the CRAN package {googleVis}. His presentation was great; enjoy it via the link below.


Interactive charts with googleVis in R


We Japanese in the Tokyo metropolitan area have several regular seminars and meetups on data science and its technologies, so if you'd like to join any of them when you visit Tokyo, don't hesitate to ask me!


P.S.


Mr. Gesmann commented on his blog as below.

http://www.magesblog.com/2014/04/notes-from-tokyo-r-user-group-meeting.html

Answers to "10 questions about big data and data science" from Japan

I read a set of very interesting questions by Dr. Vincent Granville, as below:

  1. Should companies embrace big data? Which ones (start-ups, big-companies, tech companies, retail, health care)? And how? Using vendors, outsourcing or by hiring employees? And how do you measure ROI on big data? Should they use redundant data to consolidate KPI's?
  2. What do you consider to be big data? I tend to think of big data as anything 10 times larger (in terms of megabytes per day) than the maximum you are used to. Also, sparse data might not be as big as it looks, but can be costly to process. Is there a price per megabyte, for big data storage, big data transfers, and big data analytics?
  3. How did you become interested in data science?
  4. What is the difference between data science, statistics, machine learning, and data engineering? Do you think a hybrid role (cross-disciplinary) would be helpful (helpful to small companies, or helpful to the analytic practitioner as it opens up more job opportunities)?
  5. What kind of training do you recommend for future data scientists? Any specific program in mind?
  6. How to get university professors more involved in teaching students how to process real live, big data sets? Should curricula be adapted, outdated material removed, new material introduced?
  7. During my first year in my PhD program, I worked part-time for a high-tech small company, in partnership with my stats lab. This was a great experience - being exposed to the real world, and decently paid to do my PhD (in Belgium in 1988). How to encourage such initiatives in US?
  8. Besides Hadoop-like and graph database environments, do you see other technology that would make data plumbing easier for big data?
  9. Does it make sense to try to structure un-structured data (using tags, NLP, taxonomies, etc.)?
  10. Can you tell me 5 business activities that would benefit most from big data, and 5 that would benefit least?


I think these are very common questions about big data and data science outside Japan. But as far as I know, there has been no such discussion in Japan yet, because the culture of data science gets less attention here... I'm afraid nobody from Japan will answer them.


Although I'm by no means a leader in the data science or big data field in Japan, I dare to answer the questions above, in order to report to people in the global market on the situation in the Japanese market related to the topics these questions mention.
