Jeff Dean : Lessons Learned While Building Infrastructure Software at Google

Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning a series of blog posts highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

Google’s Jeff Dean had an interesting presentation about his experience building GFS, MapReduce, BigTable & Spanner. For those interested in these papers, I have organized them in A Path through NOSQL Reading.

Highlights in pictures (Full slides at XLDB 2013 site):



Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms

I have been following the 2013 predictions for Big Data. Naturally, there are lots of interesting predictions. Here are a few that I understand and (sort of) agree with :

What or Who is a Data Scientist ?

DataScientist = Part Hacker + Part Technologist + Part Detective + Part Scientist + Part Business Analyst + Part Visual Artist

Is Hadoop the new stored procedure ?

Two things happened today that made me ask this question, though I am not offering any serious answer, yet !

  • First, I had a quick chat with folks at MongoDB, which got me thinking about MapReduce in Mongo and where it could go. MongoDB also has the new declarative aggregation framework.
  • My thesis is that, while the MongoDB aggregation framework is now JSON semantics+$ keywords, it could look a lot like a functional programming language – with higher-order declarative functions like map/reduce, discriminated unions (like F#) and currying.
  • And later in the day I read Edd’s blog “5 Big Data Predictions”, also in Forbes. (While both are the same blog, there might be interesting comments in each)
  • Lots of interesting observations from Edd. He is predicting better programming language support, but maybe we are looking at it the wrong way – what we need is better stored procedure support in the data layer. It could also be the next point Edd was talking about – streaming data processing ! Where better could we have that feature than at the data layer ?
  • Would we be able to write a social science data platform using the MongoDB aggregators ? Would MongoDB mapReduce fit the bill now ? If not, what would it take to make it so ?
  • There are two obvious paths – a connector to an application artifact (for example a Hadoop connector) or embedding the map/reduce in the data layer. Both have their advantages and disadvantages. With the connector the mapReduce can scale orthogonally, but with the embedded feature, one can achieve real-time processing (within limits). Maybe this is the time for an application data store !
  • Would datastores like MongoDB gain features like Twitter Storm, real-time map reduce, hierarchical iterative functional aggregators and so forth ?
  • GreenPlum’s Chorus is interesting – Can NOSQL datastores gain some of the relevant capabilities that Chorus has?
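
To make the "aggregation as functional programming" thesis a bit more concrete, here is a minimal sketch in plain Python. It is a toy and an assumption on my part: the stage names `match` and `group` only mimic the flavor of the `$` keywords, `pipeline` is a hypothetical helper, and nothing here calls MongoDB or reflects how it is implemented.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy stages: they imitate the idea of declarative "$" stages,
# but this is plain Python and does not use any MongoDB API.

def match(predicate):
    """Return a stage that keeps only documents satisfying the predicate."""
    return lambda docs: (d for d in docs if predicate(d))

def group(key, field):
    """Return a stage that sums `field` for each distinct value of `key`."""
    def stage(docs):
        totals = defaultdict(int)
        for d in docs:
            totals[d[key]] += d[field]
        return ({"_id": k, "total": v} for k, v in totals.items())
    return stage

def pipeline(*stages):
    """Compose stages left to right into one function over documents."""
    return lambda docs: reduce(lambda acc, stage: stage(acc), stages, docs)

events = [
    {"user": "a", "kind": "click", "n": 2},
    {"user": "b", "kind": "click", "n": 5},
    {"user": "a", "kind": "view", "n": 7},
    {"user": "a", "kind": "click", "n": 1},
]

# Stages are ordinary higher-order functions, so they curry and compose.
clicks_per_user = pipeline(
    match(lambda d: d["kind"] == "click"),
    group("user", "n"),
)

result = sorted(clicks_per_user(events), key=lambda d: d["_id"])
print(result)
```

The point of the sketch is only that each stage is a function returning a function, so a pipeline is just composition; whether a data layer should expose something like this is exactly the open question.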

Finally, the beginning as the end,

  1. Is Hadoop the new stored procedure or would the new stored procedures look like Hadoop ?
  2. Are Data and Application becoming inseparable at scale ?
  3. What says thee?

Top 10 Steps to a Pragmatic Big Data Pipeline

As you know Big Data is capturing lots of press time. Which is good, but what does it mean to the person in the trenches ? Some thoughts … as a Top 10 List :

[update 11/25/11 : A copy of my guest lecture for Ph.D. students at the Naval Postgraduate School, The Art Of Big Data, is at Slideshare]

10. Think of the data pipeline in multiple dimensions rather than as a point technology & evolve the pipeline with a focus on all aspects of the stages

  • While technologies are interesting, they do not work in isolation and neither should you think that way
  • Dimension 1 : Big Data (I had touched upon this in my earlier blog “What is Big Data anyway“) One should not only look at the Volume-Velocity-Variety-Variability but also at the Connectedness – Context dimensions.
  • Dimension 2 : Stages – The degrees of separation as in collect, store, transform, model/reason & infer stages
  • Dimension 3 : Technology – This is the discussion of SQL vs. NOSQL, mapreduce vs. Dryad, BI vs. other forms et al
  • I have captured the dimensions in the picture. Did I succeed ? Let me know

9. Evolve incrementally focussing on the business values – stories to tell, inferences to derive, feature sets to influence & recommendations to make

Don’t get into the technologies & pipeline until there are valid business cases. The use cases are not hard to find, but they won’t come if you are caught up in the hype and forget to do the homework and due diligence …

8. Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also lower the barrier to entry for folks who are familiar with SQL.

7. Match the impedance of the use case with the technologies

The stack in my diagram earlier is not required for all cases:

  • For example, if you want to leverage big data for Product Metrics from logs in Splunk, you might only need a modest Hadoop infrastructure plus an interface to the existing dashboard plus Hive for analysts who want to perform analytics
  • But if you want Behavioral Analytics with A/B testing with a 10min latency, a full-fledged Big Data infrastructure with, say, Hadoop, HDFS, HBase plus some modeling interfaces would be appropriate
  • I had written an earlier blog about the Hadoop infrastructure as a function of the degrees of separation from the analytics end point

6. Don’t be afraid to jump the chasm when the time is appropriate

Big Data systems have a critical mass at each stage – that means lots of storage or maybe a few fast machines for analytics, depending on the proposed project. If you have done your homework from a business and technology perspective, and have proven your chops with effective projects on a modest budget, this would be a good time to make your move for a higher budget. And when the time is right, be ready to get the support for a dramatic increase & make the move …

5. Trust But Verify

True for working with teenagers, arms treaties between superpowers, a card game and, closer to our discussion, Big Data Analytics. In fact, one of the core competencies of a Data Scientist is a healthy dose of skepticism – said John Rauser [here & here]. I would add that as you delegate more and more inferences to a big data infrastructure across the stages, make sure there are checks and balances, including independent verification of some of the stuff the big data is telling you.
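
One cheap form of independent verification is to recompute a reported number on a random sample, outside the main pipeline, and see whether the two agree. The sketch below is only an illustration under my own assumptions: `verify_against_sample`, its parameters and the tolerance are hypothetical names and values, not something prescribed by any tool.

```python
import random

def verify_against_sample(records, reported_mean, sample_size, tolerance, seed=0):
    """Recompute a mean on a random sample and compare it with the reported one.

    Returns (ok, sample_mean). All names and thresholds here are illustrative.
    """
    if not records:
        raise ValueError("need at least one record to verify against")
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    sample_mean = sum(sample) / len(sample)
    return abs(sample_mean - reported_mean) <= tolerance, sample_mean

# Toy data: every record has latency 10, so any sample has mean 10.
latencies = [10] * 100

ok, sample_mean = verify_against_sample(latencies, reported_mean=10, sample_size=20, tolerance=1)
suspicious, _ = verify_against_sample(latencies, reported_mean=15, sample_size=20, tolerance=1)
print(ok, suspicious)
```

A failed check does not say which side is wrong; it only says that someone should look before acting on the number.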

Another side note along the same lines is oscillation – as the feedback rate, volume and velocity increase, there is also a tendency to overreact. Don’t equate the feedback velocity with the response velocity – for example, don’t change your product feature set, based on high-velocity big data product metrics, at a faster rate than the users can consume. Have a healthy respect for the cycles involved. For example, I came across an article that talks about fast & slow big data – interesting. OTOH, be ready to make dramatic changes when you get faster feedback indicating that things are not working well, for whatever reason.
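
The difference between feedback velocity and response velocity can be sketched as a smoothed metric plus a cool-down between actions. This is a minimal sketch under my own assumptions: the class `DampedResponder`, the smoothing weight, the gap and the threshold are all hypothetical and would need tuning for any real product cycle.

```python
class DampedResponder:
    """Smooth a fast metric stream and limit how often we act on it."""

    def __init__(self, alpha, min_gap, threshold):
        self.alpha = alpha          # weight of the newest sample (0..1)
        self.min_gap = min_gap      # minimum number of ticks between two actions
        self.threshold = threshold  # act only when the smoothed value is below this
        self.smoothed = None
        self.last_action_tick = None

    def observe(self, tick, value):
        """Feed one sample; return True if this tick should trigger a change."""
        if self.smoothed is None:
            self.smoothed = value
        else:
            # Exponentially weighted moving average of the metric.
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
        cooled_down = (
            self.last_action_tick is None
            or tick - self.last_action_tick >= self.min_gap
        )
        if self.smoothed < self.threshold and cooled_down:
            self.last_action_tick = tick
            return True
        return False

responder = DampedResponder(alpha=0.5, min_gap=3, threshold=0.5)
samples = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # noisy, then persistently bad
actions = [t for t, v in enumerate(samples) if responder.observe(t, v)]
print(actions)
```

With these toy numbers the single bad sample at tick 1 does not trigger anything, while the sustained drop does, and even then only every third tick. Making dramatic changes when the signal is unambiguous would be a separate, deliberate override.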

4. Morph from Reactive to Predictive & Adaptive Analytics, thus simplifying and leveraging the power of Big Data

As I was writing this blog, I came across Vinod Khosla’s speech at the Nasscom meeting. A must read – here & here. His #1 and #2 in the ‘cool dozen’ were about Big Data! The ability to infer the essentials from an onslaught of data is in fact the core of a big data infrastructure. Always make sure you can draw a few fundamental, succinct inferences that matter out of your infrastructure. In short deliver “actionable” …

3. Pay attention to How & the Who

Edd wrote about this in Google+. Traditional IT builds the infrastructure for the Collect and Store stages in a Big Data Pipeline. It also builds and maintains infrastructure for analytics processing, like Hadoop, and a visualization layer, like Tableau. But the realm of Analyze, Model, Reason and the rest requires a business view, which a Data Analyst or a Data Scientist would provide. Pontificating further, it makes sense for IT to move in this direction by providing a ‘landing zone’ for the business-savvy Data Scientists & Analysts and thus lead the new way of thinking about computing, computing resources and talents …

2. Create a Big Data Strategy considering the agility, latencies & the transitory nature of the underlying world

[1/28/12] Two interesting articles – one “7 steps for business success with big data” in GigaOm and another “Why Big Data Won’t Make You Smart, Rich, Or Pretty” in Fast Company. Putting aside the well-publicized big data projects, a normal big data project in an organization has its own pitfalls and opportunities for success. Prototyping is essential, modelling and verifying the models is essential and, above all, have a fluid strategy that can leverage this domain …

<This is WIP. Am collecting the thoughts and working thru the list – deliberately keeping two slots open (and maybe one more to make a baker’s dozen!), pl bear with me … But I know how it ends ;o)>

1. And, develop the Art of Data Science

As Edd points out in Google+, big data is also about exploration and the art of Data Science is an essential element. IMHO this involves more work in the contextual, modeling and inference space, with R and so forth – resulting in new insights, new products, order of magnitude performance, new customer base et al. While this stage is effortless and obvious in some domains, it is not that easy in others …

Six Degrees of Hadoop Hardware

Computer hardware has evolved a lot since the early days of Hadoop – I am always looking for insights on this topic. At my previous company we had spent some time on the infrastructure and architecture side of Hadoop.

So when I came across Eric’s blog on the Hortonworks site, I spent some time reading and thinking it through.

Suddenly it dawned on me

Hadoop Hardware selection is not an abstract science at all ! – it is a function of the degrees of separation between the Hadoop Infrastructure & the Analytics end point (which could be a visualization system like Tableau or a recommendation/inference end point or even an R program) !

Let me enumerate a couple of examples, showing how the degrees of separation affect the hardware selection and the Hadoop infrastructure:

  1. Six Degrees Of Separation :
    • When the Hadoop infrastructure is this far away (logically speaking, of course), most probably it is used as a feeder in an ETL flow – for example a phone manufacturer processing logs from the carrier or an on-line game operator capturing events from the game system or an e-commerce system that is aggregating multiple streams…
    • Hardware characteristics :
    • Speeds and feeds rule here. More boxes (e.g. Dell 310s), higher GHz, more memory (24 GB+), smaller disk space (4 TB/box) & a Gb network
  2. Four degrees of separation :
    • Hadoop as an intermediate layer – you have some data in the Hadoop infrastructure, there is some integration but your Data Warehouse has all your analytics systems
    • Hardware Characteristics:
    • Definitely more storage than the previous systems, but the GHz and memory depend on the level of integration (for example joins) and so forth
  3. Two Degrees of Separation
    • You are using the Hadoop & Co (Hbase, Pig et al) as a processing ‘big data’ store, most probably with moderate data integration from other systems.
    • You also might have HBase, your users are employing Pig/Hive & the Hadoop infrastructure is directly wired to visualization tools like Tableau.
    • Most probably you are augmenting or delegating some data warehouse functions to the Hadoop Infrastructure
    • Hardware Characteristics:
    • 10 Gb is in the cards; start with 1 Gb and explore link aggregation
    • 6 or 12 spindles per box would be optimal; 2 TB or 3 TB drives are becoming affordable anyway (An interesting link from hothardware)
    • Here qualification and quantification of jobs matter – because you can expect different types of jobs, some CPU intensive, others I/O intensive (during the map, shuffle or reduce stages)
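
The mapping from degrees of separation to hardware emphasis can be written down as a small lookup. This is only a sketch of the examples above, not a sizing tool: `HardwareProfile`, `PROFILES` and `profile_for` are hypothetical names, and picking the nearest listed degree for in-between values is my own assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareProfile:
    role: str      # what Hadoop is doing at this distance from the analytics end point
    emphasis: str  # where the hardware budget goes
    network: str   # network guidance, where the examples give one

# Values paraphrase the examples in this post; they are not benchmarks.
PROFILES = {
    6: HardwareProfile(
        "ETL feeder",
        "more boxes, higher GHz, 24 GB+ memory, about 4 TB disk per box",
        "gigabit",
    ),
    4: HardwareProfile(
        "intermediate layer",
        "more storage; GHz and memory sized by the level of integration (joins)",
        "not specified",
    ),
    2: HardwareProfile(
        "processing big data store",
        "6 or 12 spindles per box with 2 TB or 3 TB drives",
        "start with 1 Gb and link aggregation, plan for 10 Gb",
    ),
}

def profile_for(degrees):
    """Return the profile whose degree of separation is closest to `degrees`."""
    nearest = min(PROFILES, key=lambda d: abs(d - degrees))
    return PROFILES[nearest]

print(profile_for(6).role)
print(profile_for(1).emphasis)
```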
  • General Thoughts:
  • Is it MHz or network-storage ? As Eric pointed out, the ‘dedicated disk’ thread and Scott’s answer are interesting
  • Brad’s blog has some insights on the network side of things
  • The last time I talked with the Datascope project at JHU, they were experimenting with running Hadoop on Atom boards !
  • A lot depends on the data pipeline – if ETL into DW systems, then less storage is OK as it is transitory anyway
  • But if you are running HBase et al, then 2U 12 spindle or 6 spindle is ideal. 2 TB definitely, 3 TB is on its way
  • Another dimension is the complexity
  • If you are doing joins – map-side or reduce-side, or using distributed cache, boxes with more memory would clearly help.
  • For joins, I like the memcache based solution, as you can set up a few memcache nodes, and this also opens the door to more system-wide integration in an orthogonally extensible way …
  • Finally one caveat: The new YARN/HadoopNextgen/0.23 is a wild card. We have to study how that beast behaves vis-a-vis memory, disk and network. I see some newer opportunities (and challenges) !
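
Why memory matters for joins is easiest to see in a map-side join: the smaller table is held in memory on every mapper while the larger one streams past. The sketch below shows the idea in plain Python under my own assumptions; `map_side_join` and the sample tables are hypothetical, and it does not use the Hadoop API, the distributed cache or memcache.

```python
def map_side_join(small_table, large_stream, key):
    """Inner-join streamed records against a small table held in memory."""
    # The whole small table becomes an in-memory lookup, which is why
    # boxes with more memory help for this kind of join.
    lookup = {row[key]: row for row in small_table}
    for record in large_stream:
        match = lookup.get(record[key])
        if match is not None:  # inner join: records without a match are dropped
            yield {**match, **record}

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
clicks = [
    {"uid": 1, "page": "/a"},
    {"uid": 3, "page": "/b"},  # no matching user, so it is dropped
    {"uid": 2, "page": "/c"},
]

joined = list(map_side_join(users, clicks, "uid"))
print(joined)
```

Swapping the in-process dictionary for a few shared cache nodes is the memcache variant mentioned above: the lookup moves out of each box, at the cost of a network hop per record.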