Data Scientist TJO in Tokyo

Data science, statistics or machine learning in broken English

Answers to "10 questions about big data and data science" from Japan

I read a set of very interesting questions by Dr. Vincent Granville, listed below:

  1. Should companies embrace big data? Which ones (start-ups, big-companies, tech companies, retail, health care)? And how? Using vendors, outsourcing or by hiring employees? And how do you measure ROI on big data? Should they use redundant data to consolidate KPI's?
  2. What do you consider to be big data? I tend to think of big data as anything 10 times larger (in terms of megabytes per day) than the maximum you are used to. Also, sparse data might not be as big as they look, can be costly to process. Is there a price per megabyte, for big data storage, big data transfers, and big data analytics?
  3. How did you become interested in data science?
  4. What is the difference between data science, statistics, machine learning, and data engineering? Do you think a hybrid role (cross-disciplines) would be helpful (helpful to small companies, or helpful to the analytic practitioner as it opens up more job opportunities)?
  5. What kind of training do you recommend for future data scientists? Any specific program in mind?
  6. How to get university professors more involved in teaching students how to process real live, big data sets? Should curricula be adapted, outdated material removed, new material introduced?
  7. During my first year in my PhD program, I worked part-time for a high-tech small company, in partnership with my stats lab. This was a great experience - being exposed to the real world, and decently paid to do my PhD (in Belgium in 1988). How to encourage such initiatives in US?
  8. Besides Hadoop-like and graph database environments, do you see other technology that would make data plumbing easier for big data?
  9. Does it make sense to try to structure un-structured data (using tags, NLP, taxonomies, etc.)?
  10. Can you tell me 5 business activities that would benefit most from big data, and 5 that would benefit least?


I think these are quite common questions about big data and data science outside Japan. But as far as I know, there has been no discussion like this in Japan yet, because the culture of data science gets little attention here... I'm afraid nobody will answer them from Japan.


Although I'm by no means a leader in the data science or big data field in Japan, I dare to answer the questions above, in order to report to people in the global market on the situation in the Japanese market regarding the topics these questions raise.


A1: In Japan, only big companies should, and in-house, with a variety of data experts

Q1: Should companies embrace big data? Which ones (start-ups, big-companies, tech companies, retail, health care)? And how? Using vendors, outsourcing or by hiring employees? And how do you measure ROI on big data? Should they use redundant data to consolidate KPI's?


As far as I've seen in Japan, only big companies are embracing big data, and I think that's reasonable, because only they can afford to struggle with such huge data -- while others cannot.


As described in a previous post on this blog, Japanese people in general don't really trust big data, or any kind of data science or analytics, in an objective manner. Uniformity is a remarkable feature of this country -- in small companies, if the executives don't like big data or data science, the employees probably don't either.


In Japan, only big companies have some diversity*1, so they can accept big data or data science. They prefer in-house analytics and hiring their own employees to outsourcing.


As far as I know, there are still only a few companies that can measure ROI on big data. Big data and data science are still immature here.


A2: Data so huge that no desktop machine can handle it directly

Q2: What do you consider to be big data? I tend to think of big data as anything 10 times larger (in terms of megabytes per day) than the maximum you are used to. Also, sparse data might not be as big as they look, can be costly to process. Is there a price per megabyte, for big data storage, big data transfers, and big data analytics?


As a joke, some IT experts say "Big data is data that cannot be handled by MS Excel". I think that's not so bad -- indeed, the only way I can deal with my own overly huge data is to use a Hadoop + Hive framework.


Usually, after preprocessing data on a Hadoop + Hive framework, I download the summarized data and run all kinds of data mining on my desktop machine, which has plenty of physical memory and storage. So I think big data is data so huge that no desktop machine can handle it.
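
As a rough illustration of that "summarize first, then analyze on the desktop" workflow, here is a minimal Python sketch. It is not my actual Hive setup: the function names `summarize` and `fake_log`, the user keys and the values are all hypothetical, and the generator only stands in for a raw log that would really live on the cluster. The aggregation is roughly what a GROUP BY with COUNT and AVG would do.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def summarize(records: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[int, float]]:
    """Reduce a stream of (key, value) rows to (count, mean) per key.

    Only one row is held at a time, so the raw data never has to fit
    in memory; only the small per-key summary does.
    """
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for key, value in records:
        counts[key] += 1
        totals[key] += value
    return {key: (counts[key], totals[key] / counts[key]) for key in counts}


def fake_log(n: int) -> Iterator[Tuple[str, float]]:
    """Hypothetical raw log, generated lazily instead of being stored."""
    for i in range(n):
        yield ("user%d" % (i % 3), float(i % 5))


# The "big" side: one pass over many rows, producing a tiny table.
summary = summarize(fake_log(300_000))

# The "desktop" side: the summary is small enough for any ordinary tool.
for user, (count, mean) in sorted(summary.items()):
    print(user, count, mean)
```

The point is only the shape of the pipeline: the heavy pass touches every row once, and everything after it works on a table with a handful of rows.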

A3: I just needed a job where my data science skills would work, after my postdoc career

Q3: How did you become interested in data science?


That's all, as stated above :P) I used to be a postdoc in the neuroscience field*2 for 6 years, and I saw a lot of problems with the academic career, including serious difficulty in finding employment*3. Data science in the corporate sector was a brilliant job for me at that time.


A4: The difference depends on their origins in each academic field, and any hybrid role would be welcome, especially in Japan

Q4: What is the difference between data science, statistics, machine learning, and data engineering? Do you think a hybrid role (cross-disciplines) would be helpful (helpful to small companies, or helpful to the analytic practitioner as it opens up more job opportunities)?


Needless to say, statistics and machine learning are different from each other in principle*4. I understand data science to be just a revised version of "data mining" or "knowledge discovery". In universities in Japan, statistics and machine learning are taught in separate courses, so usually an expert in one is not an expert in the other. I hope we will have more hybrid experts in both.


As far as I know, there are few courses on data engineering. I'm not sure whether data engineering will be a major player in the data science field.


A5: A broad range of entry-level academic skills in statistics, machine learning and data mining, plus strong skills in data engineering

Q5: What kind of training do you recommend for future data scientists? Any specific program in mind?


In my opinion, data science in business here requires only about entry-level statistics and machine learning, not deep knowledge of them; in contrast, it requires plenty of skills in data engineering (including infrastructure or cloud engineering). So of course novices should learn exactly that: entry-level statistics and machine learning, and a lot about data engineering.


A6: This question is nonsense here, because almost no university professors in Japan agree with it

Q6: How to get university professors more involved in teaching students how to process real live, big data sets? Should curricula be adapted, outdated material removed, new material introduced?


As my previous post argued, university professors in Japan are not interested in either big data or data science in real business. I feel the question is nonsense here.


A7: I'm not sure, given the situation in Japan, but a win-win relationship, in which labs get funding from companies and companies get intern students from labs as good part-time analysts, would encourage it

Q7: During my first year in my PhD program, I worked part-time for a high-tech small company, in partnership with my stats lab. This was a great experience - being exposed to the real world, and decently paid to do my PhD (in Belgium in 1988). How to encourage such initiatives in US?


As repeatedly mentioned above, I don't expect the academic and corporate sectors to collaborate in the future... but ideally, I agree that they should have a win-win relationship with each other, in terms of funding for academia and human resources or academic knowledge for companies.


A8: Distributed machine learning or multivariate statistics, such as Jubatus or Hivemall by Japanese developers, although they just work on a Hadoop ecosystem

Q8: Besides Hadoop-like and graph database environments, do you see other technology that would make data plumbing easier for big data?


I know distributed ML is one of the hottest topics even in the ML research field. It should become a strong solution for big data in the near future.

A9: In my opinion it should make sense, but there may still be a lot of problems

Q9: Does it make sense to try to structure un-structured data (using tags, NLP, taxonomies, etc.)?


I hope it will work. But if one applies it to data from SNS such as Twitter, I have to say there are still a lot of problems, because there is an enormous amount of trash data in it. In the worst case, structuring un-structured data itself costs too much and never pays off.


A10: 5 choices for each are too many, I think

Q10: Can you tell me 5 business activities that would benefit most from big data, and 5 that would benefit least?


Advertising, marketing, social games, finance... oops, I could imagine only 4 activities that would benefit most. Ah, 5 that would benefit least? What are you asking? Anything other than the 4 I mentioned.

*1:I know it's still not enough here

*2:In particular, cognitive neuroscience of the human brain

*3:Only 2% of postdocs per year successfully get employed as tenured professors

*4:But at the same time I know they have some things in common, such as logistic regression or other statistical learning algorithms