See also my more recent blog ‘Top 10 Steps to a Pragmatic Big Data Pipeline’, if you haven’t seen it. If you are coming from there no need for a recursion ;o)

[update 11/25/11 : A copy of my guest lecture for Ph.D. students at the Naval Postgraduate School, ‘The Art Of Big Data’, is at Slideshare]

There seems to be a small difference of opinion on what ‘BIG DATA’ really is.

  • Curt Monash, in his blog, puts forward the argument that Big Data is really a bit bucket of data from a multitude of sources, nothing more, nothing less.
  • Brian Hopkins is of the theory that Big Data = Volume + Variety + Variability + Velocity
  • Methinks:
    • I think ‘Big Data’ is a function of connectedness & context.
    • I read somewhere (I don’t want to misquote anybody, so let us keep it anonymous) that Hadoop excels at applying simple algorithms to large amounts of data
  • Makes sense – once one starts applying sophisticated algorithms to a mass of data, one would need chaining of MapReduce tasks and many times the algorithms do not even fit into a MapReduce paradigm. MapReduce NextGen addresses some of those, but not all
  • There is also the Big Data vs. Smart Data distinction, where the data carries its models and semantics.
    • Rather than the Vs, it is the Cs – Context-uality & Connected-ness that make big data ‘BIG DATA’. I call it ‘Smart Data’; monikers aside, most probably it is the ability to apply different models, and the capability to infer/predict and visualize, that makes it a little different from the traditional usage. Maybe it is not the organization at all, but how we use it that differentiates ‘big data’ from ‘BIG DATA’.
    • Curt also has a problem with ‘Big Data Analytics’. I kind of agree, in the sense that Big_Data_Analytics is a processing pipeline that spans collection, transformation, storage, analysis, inference/prediction and, most importantly, visualization/infographics.
    • Big Data Analytics, even without calling it that, is not a point function
    • One point on which I agree with an article in SOA World is that there is also the dimension of semi-structured data
    • Which, BTW, is not the same as Curt’s multi-vs-poly structure. I don’t think Curt got that right – while changing structure is a big thing (that is one of the reasons why NOSQL came into existence), it by itself doesn’t make data any bigger !
    • And Gartner’s ‘Extreme Data’ moniker is no better than ‘Big Data’ – it is still vague (or very general) … ‘Smart Data’ might be better …
    • As Merv Adrian says, where is Mr. Dundee when we need him !
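
To make the point above about chaining MapReduce tasks a little more concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop and uses no Hadoop API; the helper `map_reduce` and the two toy jobs are hypothetical names made up for illustration. Even a question as modest as "which words share the same frequency?" already needs two passes, with the output of the first job feeding the second.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass: map, group by key, then reduce.

    This is a toy stand-in for a real framework (an assumption for
    illustration only); it keeps everything in a single process.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: the classic word count.
def count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, ones):
    return word, sum(ones)


# Job 2: regroup the (word, count) pairs from job 1 by count.
def invert_mapper(pair):
    word, count = pair
    yield count, word


def invert_reducer(count, words):
    return count, sorted(words)


def words_by_frequency(lines):
    """Chain the two jobs: job 2 consumes the output of job 1."""
    counts = map_reduce(lines, count_mapper, count_reducer)
    return map_reduce(counts, invert_mapper, invert_reducer)


if __name__ == "__main__":
    sample = ["big data big context", "smart data"]
    print(words_by_frequency(sample))
```

Each extra question adds another pass over the data, which is roughly why more sophisticated algorithms tend to turn into long job chains, or do not fit the paradigm at all.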


    Update [10/23/11] Came across this link on Defining Big Data

    Update [10/24/11] “Last week there were several events that convinced me that one of the great tech bubbles inflating right now is around what people have agreed to call ‘Big Data.’” Ouch ! NYTimes Bits
