Top 10 Steps to a Pragmatic Big Data Pipeline


As you know, Big Data is capturing a lot of press time. Which is good, but what does it mean to the person in the trenches? Some thoughts … as a Top 10 list:

[update 11/25/11 : A copy of my guest lecture for Ph.D. students at the Naval Postgraduate School, The Art Of Big Data, is at Slideshare]

10. Think of the data pipeline in multiple dimensions rather than as a point technology & evolve the pipeline with a focus on all aspects of the stages

  • While technologies are interesting, they do not work in isolation, and neither should you think that way
  • Dimension 1 : Big Data (I had touched upon this in my earlier blog “What is Big Data anyway“). One should not only look at the Volume-Velocity-Variety-Variability dimensions but also at Connectedness & Context.
  • Dimension 2 : Stages – the degrees of separation, as in the collect, store, transform, model/reason & infer stages
  • Dimension 3 : Technology – this is the SQL vs. NoSQL, MapReduce vs. Dryad, BI vs. other forms discussion, et al.
  • I have captured the dimensions in the picture. Did I succeed? Let me know
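The stages dimension above can be sketched as a chain of composable steps. A minimal Python sketch, assuming toy in-memory data; the function bodies are hypothetical stand-ins for what would be Flume jobs, HDFS, MapReduce and so on in a real pipeline:

```python
# Illustrative sketch of the collect -> store -> transform -> model -> infer
# stages. Each stage is a plain function over toy data; all names are made up.

def collect(sources):
    """Gather raw events from all sources into one list."""
    return [event for source in sources for event in source]

def store(events):
    """Persist events; here we just index them by an id."""
    return {i: e for i, e in enumerate(events)}

def transform(stored):
    """Normalize values (e.g. lowercase string fields)."""
    return {k: v.lower() for k, v in stored.items()}

def model(transformed):
    """Count occurrences -- a stand-in for real modeling/reasoning."""
    counts = {}
    for v in transformed.values():
        counts[v] = counts.get(v, 0) + 1
    return counts

def infer(counts):
    """Derive the 'story to tell': the most frequent item."""
    return max(counts, key=counts.get)

# Wire the stages together end to end.
raw = [["Click", "View"], ["click", "CLICK"]]
print(infer(model(transform(store(collect(raw))))))  # -> click
```

The point of the sketch is the shape, not the bodies: each stage has one input and one output, so you can evolve or swap a stage without rethinking the whole pipeline.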

9. Evolve incrementally, focusing on the business values – stories to tell, inferences to derive, feature sets to influence & recommendations to make

Don’t get into the technologies & pipeline until there are valid business cases. The use cases are not hard to find, but they won’t come if you are caught up in the hype and forget to do the homework and due diligence …

8. Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. No doubt Hadoop & NoSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL

7. Match the impedance of the use case with the technologies

The stack in my diagram earlier is not required for all cases:

  • For example, if you want to leverage big data for product metrics from logs in Splunk, you might only need a modest Hadoop infrastructure, plus an interface to the existing dashboard, plus Hive for analysts who want to perform analytics
  • But if you want behavioral analytics with A/B testing at a 10-minute latency, a full-fledged Big Data infrastructure – say Hadoop, HDFS, HBase plus some modeling interfaces – would be appropriate
  • I had written an earlier blog about the Hadoop infrastructure as a function of the degrees of separation from the analytics end point
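The impedance-matching idea can be made concrete as a small rule of thumb. A hypothetical sketch; the thresholds and component choices here are illustrative, not a recommendation:

```python
def pick_stack(latency_minutes, needs_random_reads, analysts_know_sql):
    """Suggest pipeline components for a use case (illustrative thresholds)."""
    stack = ["hadoop", "hdfs"]           # batch core for any big data job
    if needs_random_reads or latency_minutes <= 10:
        stack.append("hbase")            # low-latency random-access layer
    if analysts_know_sql:
        stack.append("hive")             # SQL interface lowers the entry barrier
    return stack

# Product metrics from logs, daily latency: modest stack plus Hive.
print(pick_stack(latency_minutes=24 * 60, needs_random_reads=False,
                 analysts_know_sql=True))   # -> ['hadoop', 'hdfs', 'hive']

# Behavioral analytics with A/B testing at ~10 min latency: full stack.
print(pick_stack(latency_minutes=10, needs_random_reads=True,
                 analysts_know_sql=True))   # -> ['hadoop', 'hdfs', 'hbase', 'hive']
```

The real decision involves many more variables, of course; the sketch only shows that the use case, not the technology, should drive what goes into the stack.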

6. Don’t be afraid to jump the chasm when the time is appropriate

Big Data systems have a critical mass at each stage – that means lots of storage, or maybe a few fast machines for analytics, depending on the proposed project. If you have done your homework from a business and technology perspective, and have proven your chops with effective projects on a modest budget, that is a good time to make your move for a higher budget. When the time is right, be ready to get the support for a dramatic increase & make the move …

5. Trust But Verify

True for working with teenagers, arms treaties between superpowers, a card game, and, closer to our discussion, Big Data Analytics. In fact, one of the core competencies of a Data Scientist is a healthy dose of skepticism – said John Rauser [here & here]. I would add that as you delegate more and more inferences to a big data infrastructure across the stages, make sure there are checks and balances – independent verification of some of the things the big data is telling you.
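One way to put “trust but verify” into practice is to re-estimate an aggregate the pipeline reports from an independently drawn sample of the raw events. A minimal sketch, assuming made-up data and an arbitrary tolerance:

```python
import random

def verify_rate(pipeline_rate, raw_events, sample_size=1000, tolerance=0.05):
    """Re-estimate a rate from a fresh sample and flag large disagreement.

    raw_events is a list of 0/1 outcomes; pipeline_rate is the figure the
    big data pipeline reported. Returns True when the independent estimate
    agrees within the tolerance.
    """
    sample = random.sample(raw_events, min(sample_size, len(raw_events)))
    independent_rate = sum(sample) / len(sample)
    return abs(independent_rate - pipeline_rate) <= tolerance

events = [1] * 300 + [0] * 700      # ground truth: a 30% conversion rate
print(verify_rate(0.31, events))    # close to the truth -> True
print(verify_rate(0.60, events))    # pipeline figure is way off -> False
```

The check is deliberately dumb: its value is that it does not share code, storage, or assumptions with the pipeline it is auditing.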

Another side note along the same lines is oscillation – as the feedback rate, volume and velocity increase, there is also a tendency to overreact. Don’t equate feedback velocity with response velocity – for example, don’t change your product feature set, based on high-velocity big-data product metrics, at a faster rate than the users can consume. Have a healthy respect for the cycles involved. For example, I came across an article that talks about fast & slow big data – interesting. OTOH, be ready to make dramatic changes when faster feedback indicates things are not working well, for whatever reason.

4. Morph from Reactive to Predictive & Adaptive Analytics, thus simplifying and leveraging the power of Big Data

As I was writing this blog, I came across Vinod Khosla’s speech at the Nasscom meeting. A must read – here & here. His #1 and #2 in the ‘cool dozen’ were about Big Data! The ability to infer the essentials from an onslaught of data is in fact the core of a big data infrastructure. Always make sure you can make a few fundamental, succinct inferences that matter out of your infrastructure. In short, deliver “actionable” …
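The shift from reactive to predictive can be shown in a few lines: instead of only reporting yesterday’s number, fit a trend and project the next one. A toy sketch with made-up data; a real pipeline would use proper models:

```python
def predict_next(series):
    """Least-squares linear fit over the index, extrapolated one step ahead."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope * n + intercept

# Reactive: "we had 130 signups today." Predictive: "expect ~140 tomorrow."
daily_signups = [100, 110, 120, 130]
print(round(predict_next(daily_signups)))  # -> 140
```

Adaptive analytics then closes the loop: the projection feeds a decision, and the outcome of that decision feeds the next fit.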

3. Pay attention to the How & the Who

Edd wrote about this on Google+. Traditional IT builds the infrastructure for the Collect and Store stages in a Big Data pipeline. It also builds and maintains infrastructure for analytics processing, like Hadoop, and visualization layers, like Tableau. But the realm of Analyze, Model, Reason and the rest requires a business view, which a Data Analyst or a Data Scientist would provide. Pontificating further, it makes sense for IT to move in this direction by providing a ‘landing zone’ for the business-savvy Data Scientists & Analysts, and thus lead the new way of thinking about computing, computing resources and talent …

2. Create a Big Data Strategy considering the agility, latencies & the transitory nature of the underlying world

[1/28/12] Two interesting articles – one, “7 steps for business success with big data” in GigaOm, and another, “Why Big Data Won’t Make You Smart, Rich, Or Pretty” in Fast Company. Putting aside the well-publicized big data projects, a normal big data project in an organization has its own pitfalls and opportunities for success. Prototyping is essential, modeling and verifying the models is essential, and above all, have a fluid strategy that can leverage this domain …

<This is WIP. Am collecting the thoughts and working thru the list – deliberately keeping two slots (and maybe one more to make a baker’s dozen!), pl bear with me … But I know how it ends ;o)>

1. And, develop the Art of Data Science

As Edd points out on Google+, big data is also about exploration, and the art of Data Science is an essential element. IMHO this involves more work in the contextual, modeling and inference space, with R and so forth – resulting in new insights, new products, order-of-magnitude performance, a new customer base, et al. While this stage is effortless and obvious in some domains, it is not that easy in others …
