Based on what I am seeing in the blogosphere as well as in offerings from companies, I think a transition from Big Data to Smart Data is happening, and it is a good thing. A quick diagram first and then some musings …

  • Big Data = data at scale, connected and probably with hidden variables …
  • Smart Data = Big Data + context + embedded interactive (inference/reasoning) models
  • [Update via @kidehen] Smart Data = Linked (Big) Data + Context + Inference
  • And Analytics clouds turn Big Data into Smart Data
  • Smart Data is (inference) model driven & declaratively interactive
  • Contextualization – includes integration with BI systems as well as data mashup with structured and unstructured data

For example,

  • Information like Wikipedia is big data; the in-memory representation that Watson referred to is smart data
  • Device logs from 1000 good mobile handsets and 1000 not-so-good phones are big data; a GAM or GLM (generalized additive or generalized linear model) fitted over the log data after it has run through several stages of MapReduce is smart data
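
To make the handset example a bit more concrete, here is a minimal sketch of the idea in plain Python, not a real pipeline. Everything in it is an assumption made for illustration: the synthetic logs, the crash probabilities, and the names `make_logs`, `map_stage`, `reduce_stage` and `fit_logistic` are all hypothetical. It imitates one map/reduce stage that boils raw events down to a per-handset crash rate, then fits a logistic regression (a GLM with a binomial family and logit link) by simple gradient ascent.

```python
import math
import random
from collections import defaultdict


def make_logs(n_handsets=200, events_per_handset=50, seed=0):
    """Generate made-up (handset_id, event) log records plus good/bad labels."""
    rng = random.Random(seed)
    logs, labels = [], {}
    for handset in range(n_handsets):
        good = handset % 2 == 0
        labels[handset] = 1 if good else 0
        crash_p = 0.05 if good else 0.25  # assumed crash probabilities
        for _ in range(events_per_handset):
            logs.append((handset, "crash" if rng.random() < crash_p else "ok"))
    return logs, labels


def map_stage(record):
    """Map: emit (handset_id, (crash_flag, 1)) for each log record."""
    handset, event = record
    yield handset, (1 if event == "crash" else 0, 1)


def reduce_stage(pairs):
    """Reduce: sum per handset and return crashes per 10 events."""
    totals = defaultdict(lambda: [0, 0])
    for key, (crashes, events) in pairs:
        totals[key][0] += crashes
        totals[key][1] += events
    return {key: 10.0 * c / n for key, (c, n) in totals.items()}


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit P(good) = sigmoid(b0 + b1 * x) by batch gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


if __name__ == "__main__":
    logs, labels = make_logs()
    rates = reduce_stage(kv for rec in logs for kv in map_stage(rec))
    ids = sorted(rates)
    b0, b1 = fit_logistic([rates[h] for h in ids], [labels[h] for h in ids])
    print("intercept:", b0, "slope:", b1)
```

The raw event list plays the role of big data here; the two fitted coefficients are the small, queryable model that sits on top of it. A real setup would use an actual MapReduce job and a proper GLM or GAM library rather than this hand-rolled fit.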

We have been seeing many waypoints along this trend towards smart data:

  • GigaOm mentions: “Organizations won’t necessarily need data scientists to ‘turn information into gold’ if the data scientists employed by their software vendors have already done most of the work. Think about it like functions within spreadsheet applications tuned to specific industries, or … PaaS … Just feed the application some data, push a button, and get results …”
    • My take: big data will get smarter with embedded models, and Analytics clouds will turn big data into PaaS! My thoughts exactly! Analytics frameworks will mature!
    • So while the profession of Data Scientists is in no danger, smarter data will make it easy for everybody to understand data and make inferences …
  • An interesting article in Channel Register screams “Platform wants to out-map, out-reduce Hadoop Teaching financial grids to dance like stuffed elephants”
    • While I don’t want to comment on the headline force-fitting all the Hadoop buzzwords, I do think this is an important development! From framework to de facto interface standard!
    • Makes sense in many ways
      • The Hadoop API has a declarative way of specifying the data parallelism, while still enumerating what one wants to do at the data processing stages
      • There is nothing else available and, as you mention, it has become the de facto standard
      • Actually the popularity is not accidental, because it has been evolved by many as they use it to solve their problems (in fact the evolution was much better than what Google has done internally – see link)
      • One reason Platform chose to implement the API might be because then it is easy for folks to try out with Hadoop and then migrate to Platform’s Symphony – proposing and implementing a new set of programming models and interfaces would have been unsuccessful
      • Actually Hadoop is a PaaS, with its own stack that reaches as deep as the network adjacency of racks (simple yet effective) and as high as application topology, and everything in between (from storage to block size to number of tasks)
      • What it did not do was the cloud-aware elastic infrastructure, which is being addressed with Hadoop 2.0 (another link)
  • [Update] “Data will be the platform for Web 3.0”, says LinkedIn’s Reid Hoffman. Yep, of course, smart data!
  • And MapIncrease can add pizazz …
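
On the point about the Hadoop API being declarative about data parallelism: the toy sketch below tries to show what that means, under heavy assumptions. It is not the Hadoop API and does not run anything in parallel; `run_job`, `mapper` and `reducer` are hypothetical names of my own. The user only states what happens to one record and to one key's values, while the splitting, shuffling and partitioning are left to the (here, simulated) framework.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """What to do with one input record: emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """What to do with all values of one key: sum them."""
    return word, sum(counts)


def run_job(records, mapper, reducer, num_partitions=3):
    """Toy stand-in for the framework: split, map, shuffle, reduce."""
    # Split the input; each split could be mapped on a different node.
    splits = [records[i::num_partitions] for i in range(num_partitions)]
    mapped = [
        list(chain.from_iterable(mapper(rec) for rec in split))
        for split in splits
    ]
    # Shuffle: route every key to one partition and group its values.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in chain.from_iterable(mapped):
        partitions[hash(key) % num_partitions][key].append(value)
    # Reduce: each partition could be reduced independently.
    return dict(
        reducer(key, values)
        for part in partitions
        for key, values in part.items()
    )


if __name__ == "__main__":
    print(run_job(["big data smart data", "smart data"], mapper, reducer))
```

Because the job is described only through `mapper` and `reducer`, another engine could in principle run the same job behind the same interface, which is roughly why implementing the existing API looks like a sensible move for a vendor.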

P.S.: Interesting blog on data blog influencers. We are #274, so we just made it into the top 300!

