Jeff Dean : Lessons Learned While Building Infrastructure Software at Google

Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning on a series of blogs highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

Google’s Jeff Dean had an interesting presentation about his experience building GFS, MapReduce, BigTable & Spanner. For those interested in these papers, I have organized them – A Path through NOSQL Reading 

Highlights in pictures (Full slides at XLDB 2013 site):



Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms

I have been following the 2013 predictions for Big Data. Naturally, there are lots of interesting predictions. Here are a few that I understand and (sort of) agree with :

What or Who is a Data Scientist ?

DataScientist = Part Hacker + Part Technologist + Part Detective + Part Scientist + Part Business Analyst + Part  Visual Artist

Is Hadoop the new stored procedure ?

Two things happened today that made me ask this question, and I am not offering any serious answer, yet !

  • First, I had a quick chat with folks at MongoDB, which got me thinking about the MapReduce in Mongo and where it could go. MongoDB also has the new declarative aggregation framework.
  • My thesis is that, while the MongoDB aggregation framework is now JSON semantics + $ keywords, it could look a lot like a functional programming language – with higher-order declarative functions like map/reduce, discriminated unions (like F#) and currying.
  • And later in the day I read Edd’s blog “5 Big Data Predictions”, also in Forbes. (While both are the same blog, there might be interesting comments in each)
  • Lots of interesting observations from Edd. He is predicting better programming language support, but maybe we are looking at it the wrong way – what we need is better stored procedure support in the data layer. It could also be the next point Edd was talking about – streaming data processing ! Where better could we have that feature than at the data layer ?
  • Would we be able to write a social science data platform using the MongoDB aggregators ? Would MongoDB mapReduce fit the bill now ? If not, what would it take to make it so ?
  • There are two obvious paths – a connector to an application artifact (for example a Hadoop connector), or embedding the map/reduce in the data layer. Both have their advantages and disadvantages. With the connector the mapReduce can scale orthogonally, but with the embedded feature, one can achieve real-time processing (within limits). Maybe this is the time for an application data store !
  • Would the datastores like MongoDB gain features like the Twitter Storm, Real-Time map reduce, hierarchical iterative functional aggregators  and so forth ?
  • GreenPlum’s Chorus is interesting – Can NOSQL datastores gain some of the relevant capabilities that Chorus has?
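
To make the "functional" thesis above a bit more concrete, here is a minimal Python sketch. It is not MongoDB code: the helpers `match`, `group_sum` and `pipeline` are hypothetical names invented for illustration, and the sample documents are made up. The idea is only to show how an aggregation pipeline could be written as curried, higher-order functions composed in order. The JSON-style pipeline in the comment uses MongoDB's `$match`, `$group` and `$sum` operators purely to show the shape being mirrored.

```python
from collections import defaultdict
from functools import partial, reduce

# For comparison, the JSON-style pipeline this mirrors (not executed here):
# [{"$match": {"status": "active"}},
#  {"$group": {"_id": "$country", "total": {"$sum": "$amount"}}}]

def match(field, value, docs):
    """Keep only documents whose `field` equals `value`."""
    return [d for d in docs if d.get(field) == value]

def group_sum(key_field, sum_field, docs):
    """Group by `key_field` and sum `sum_field` within each group."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return dict(totals)

def pipeline(*stages):
    """Compose stages left to right into a single function over documents."""
    return lambda docs: reduce(lambda acc, stage: stage(acc), stages, docs)

# Currying via partial: fix the stage parameters now, supply the data later.
active_totals_by_country = pipeline(
    partial(match, "status", "active"),
    partial(group_sum, "country", "amount"),
)

# Made-up sample documents.
docs = [
    {"country": "US", "status": "active", "amount": 10},
    {"country": "US", "status": "inactive", "amount": 99},
    {"country": "IN", "status": "active", "amount": 5},
    {"country": "US", "status": "active", "amount": 7},
]

result = active_totals_by_country(docs)
print(result)
```

Because each stage is just a function waiting for its data, a datastore could in principle ship such a composed function to where the documents live, which is the "stored procedure" flavour of the question.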

Finally, the beginning as the end,

  1. Is hadoop the new stored procedure or would the new stored procedures look like Hadoop ?
  2. Is Data and Application becoming inseparable at scale ?
  3. What says thee?

Top 10 Steps to a Pragmatic Big Data Pipeline

As you know Big Data is capturing lots of press time. Which is good, but what does it mean to the person in the trenches ? Some thoughts … as a Top 10 List :

[update 11/25/11 : Copy of my guest lecture for Ph.D students at the Naval Post Graduate School The Art Of Big Data is at Slideshare]

10. Think of the data pipeline in multiple dimensions rather than as a point technology & evolve the pipeline with focus on all the aspects of the stages

  • While technologies are interesting, they do not work in isolation and neither should you think that way
  • Dimension 1 : Big Data (I had touched upon this in my earlier blog “What is Big Data anyway“) One should not only look at the Volume-Velocity-Variety-Variability but also at the Connectedness – Context dimensions.
  • Dimension 2 : Stages – The degrees of separation as in collect, store, transform, model/reason & infer stages
  • Dimension 3 : Technology – This is the discussion SQL vs. NOSQL, mapreduce vs Dryad, BI vs other forms et al
  • I have captured the dimensions in the picture. Did I succeed ? Let me know

9. Evolve incrementally focussing on the business values – stories to tell, inferences to derive, feature sets to influence & recommendations to make

Don’t get into the technologies & pipeline until there are valid business cases. The use cases are not hard to find, but they won’t come if you are caught up in the hype and forget to do the homework and due diligence …

8. Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence leveraging currently installed technologies & skill set. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL

7. Match the impedance of the use case with the technologies

The stack in my diagram earlier is not required for all cases:

  • for example if you want to leverage big data for a Product Metrics from logs in Splunk, you might only need a modest hadoop infrastructure plus an interface to existing dashboard plus Hive for analysts who want to perform analytics
  • But if you want Behavioral Analytics with A/B testing with a 10min latency, a full fledged Big Data infrastructure with say hadoop, HDFS, HBase plus some modeling interfaces, would be appropriate
  • I had written an earlier blog about the Hadoop infrastructure as a function of the degrees of separation from the analytics end point

6. Don’t be afraid to jump the chasm when the time is appropriate

Big Data systems have a critical mass at each stage – that means lots of storage or maybe a few fast machines for analytics, depending on the proposed project. If you have done your homework from a business and technology perspective, and have proven your chops with effective projects on a modest budget, this would be a good time to make your move for a higher budget. And when the time is right, be ready to get the support for a dramatic increase & make the move …

5. Trust But Verify

True for working with teenagers, arms treaties between superpowers, a card game and, closer to our discussion, Big Data Analytics. In fact, one of the core competencies of a Data Scientist is a healthy dose of skepticism - said John Rauser [here & here] . I would add that as you delegate more and more inferences to a big data infrastructure across the stages, make sure there are checks and balances and independent verification of some of the stuff the big data is telling you.

Another side note along the same line is the oscillation – as the feedback rate, volume and velocity increase, there is also a tendency to overreact. Don’t equate the feedback velocity to the response velocity – for example don’t change your product feature set, based on high velocity big data based product metrics, at a faster rate than the users can consume. Have a healthy respect for the cycles involved. For example I came across an article that talks about fast & slow big data – interesting. OTOH, be ready to make dramatic changes when you get faster feedback that indicates things are not working well, for whatever reason.
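
One way to picture "feedback velocity is not response velocity" is a small gate that decides whether to act on a signal. This is only a sketch under my own assumptions: the class name `ResponseGate`, the one-week cooldown and the 0.2 alarm threshold are hypothetical, chosen to illustrate the idea rather than recommend values. Routine signals are held back until a cooldown has passed, while a severe drop bypasses the cooldown.

```python
from dataclasses import dataclass

@dataclass
class ResponseGate:
    """Decide whether to act on a metric signal, however fast signals arrive."""
    min_interval: float      # seconds users need to absorb a change (assumed value)
    alarm_threshold: float   # size of drop that justifies acting at once (assumed value)
    last_change_at: float = float("-inf")

    def should_act(self, now: float, metric_delta: float) -> bool:
        # A severe drop means something is broken: respond immediately.
        severe = metric_delta <= -self.alarm_threshold
        # Otherwise respond only if users have had time to absorb the last change.
        cooled_down = (now - self.last_change_at) >= self.min_interval
        if severe or cooled_down:
            self.last_change_at = now
            return True
        return False

# Hypothetical signals: (timestamp in seconds, change in a product metric).
gate = ResponseGate(min_interval=7 * 24 * 3600, alarm_threshold=0.2)
signals = [(0, -0.01), (3600, -0.02), (7200, -0.5), (10800, -0.03)]
decisions = [gate.should_act(t, delta) for t, delta in signals]
print(decisions)
```

The hourly signals keep arriving, but only the first one and the severe drop lead to a change.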

4.   Morph from Reactive to Predictive & Adaptive Analytics, thus simplifying and leveraging the power of Big Data

As I was writing this blog, came across Vinod Khosla’s speech at Nasscom meeting. A must read – here & here. His #1 and #2 in ‘cool dozen’ were about Big Data! The ability to infer the essentials from an onslaught of data is in fact the core of a big data infrastructure. Always make sure you can make a few fundamental succinct inferences that matter, out of your infrastructure. In short deliver “actionable” …

3. Pay attention to How & the Who

Edd wrote about this in Google+. Traditional IT builds the infrastructure for Collect and Store stages in a Big Data Pipeline. It also builds and maintains infrastructure for analytics processing, like Hadoop, and a visualization layer like Tableau. But the realm of Analyze, Model, Reason and the rest requires a business view, which a Data Analyst or a Data Scientist would provide. Pontificating further, it makes sense for IT to move in this direction by providing a ‘landing zone’ for the business savvy Data Scientists & Analysts and thus lead the new way of thinking about computing, computing resources and talents …

2. Create a Big Data Strategy considering the agility, latencies & the transitory nature of the underlying world

[1/28/12] Two interesting articles – one “7 steps for business success with big data” in GigaOm and another “Why Big Data Won’t Make You Smart, Rich, Or Pretty” in Fast Company. Putting aside the well published big data projects, a normal big data project in an organization has its own pitfalls and opportunities for success. Prototyping is essential, modelling and verifying the models is essential, and above all have a fluid strategy that can leverage this domain …

<This is WIP. Am collecting the thoughts and working thru the list – deliberately keeping two slots (and maybe one more to make a baker’s dozen!), pl bear with me … But I know how it ends ;o)>

1. And, develop the Art of Data Science

As Edd points out in Google+, big data is also about exploration and the art of Data Science is an essential element. IMHO this involves more work in the contextual, modeling and inference space, with R and so forth – resulting in new insights, new products, order of magnitude performance, new customer base et al.  While this stage is effortless and obvious in some domains, it is not that easy in others …

Six Degrees of Hadoop Hardware

Computer hardware has evolved a lot since the early days of Hadoop – I am always looking for insights on this topic. At my previous company we had spent some time on the infrastructure and architecture side of Hadoop.

So when I came across Eric’s blog at the Hortonworks site, I spent some time reading and thinking it through.

Suddenly it dawned on me

Hadoop Hardware selection is not an abstract science at all ! – it is a function of the degrees of separation between the Hadoop Infrastructure & the Analytics end point (which could be a visualization system like Tableau or a recommendation/inference end point or even R program) !

Let me walk through a couple of examples, showing how the degrees of separation affect the hardware selection and the Hadoop infrastructure:

  1. Six Degrees Of Separation :
    • When the Hadoop infrastructure is this far away (logically speaking, of course), most probably it is used as feeder in an ETL flow – for example a phone manufacturer processing logs from the carrier or an on-line game operator capturing events from the game system or an e-commerce system that is aggregating multiple streams…
    • Hardware characteristics :
    • Speeds and feeds rule here. More boxes (e.g Dell 310s), higher GHz, more memory (24 GB+), smaller disk space (4 TB/box) & a Gb network
  2. Four degrees of separation :
    • Hadoop as an intermediate layer – you have some data in the Hadoop infrastructure, there is some integration but your Data Warehouse has all your analytics systems
    • Hardware Characteristics
    • Definitely more storage than the previous systems, but the GHz and memory depend on the level of integration (for example joins) and so forth
  3. Two Degrees of Separation
    • You are using the Hadoop & Co (Hbase, Pig et al) as a processing ‘big data’ store, most probably with moderate data integration from other systems.
    • You also might have HBase, your users are employing Pig/Hive & the Hadoop infrastructure is directly wired to visualization tools like Tableau.
    • Most probably you are augmenting or delegating some data warehouse functions to the Hadoop Infrastructure
    • Hardware Characteristics:
    • 10 Gb is in your cards; start with 1Gb and explore link aggregation
    • 6 or 12 spindles per box would be optimal; 2 TB or 3 TB drives are becoming affordable anyway (An interesting link from hothardware)
    • Here qualification and quantification of jobs matter – because you can expect different types of jobs, some CPU intensive, others I/O intensive (either during map or shuffle or reduce stages)
  • General Thoughts:
  • Is it MHz or network-storage ? As Eric pointed out, the ‘dedicated disk’ thread and Scott’s answer are interesting
  • Brad’s blog has some insights on the network side of things
  • Last time I had talked with the Datascope project at JHU, they were experimenting with running Hadoop on Atom boards !
  • A lot depends on the data pipeline – if ETL into DW systems, then less storage is OK as it is transitory anyway
  • But if you are running HBase et al, then 2U 12 spindle or 6 spindle is ideal. 2 TB definitely, 3 TB is on its way
  • Another dimension is the complexity
  • If you are doing joins – mapside or reduce side, or using distributed cache, boxes with more memory would clearly help.
  • For joins, I like the memcache based solution, as you can setup a few memcache nodes and also the fact that this opens up for more system-wide integration in an orthogonally extensible way …
  • Finally one caveat: The new YARN/HadoopNextgen/0.23 is a wild card. We have to study how that beast behaves vis-a-vis memory, disk and network. I see some newer opportunities (and challenges) !
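
The point above about joins and memory can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: `load_lookup` and `map_side_join` are hypothetical names, and the records are made up. The small table is held entirely in memory on each mapper (the role a distributed cache plays), so every event is joined during the map step without a shuffle.

```python
def load_lookup(rows):
    """Build the in-memory side of the join (what a distributed cache would ship to each mapper)."""
    return {row["user_id"]: row["country"] for row in rows}

def map_side_join(events, lookup):
    """Join each event against the in-memory table during the map step; no shuffle needed."""
    for event in events:
        country = lookup.get(event["user_id"])
        if country is not None:  # inner join: drop events with no match
            yield {**event, "country": country}

# Made-up sample data: a small dimension table and a stream of events.
users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "IN"}]
events = [
    {"user_id": 1, "action": "click"},
    {"user_id": 3, "action": "view"},
    {"user_id": 2, "action": "buy"},
]

joined = list(map_side_join(events, load_lookup(users)))
print(joined)
```

The whole lookup table has to fit in each mapper's memory, which is why boxes with more memory help here. Swapping the local dict for a client talking to a few shared cache nodes would give the memcache-style variant, at the cost of a network hop per lookup.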

Waypoints From Big Data to Smart Data

Based on what I am seeing in the blog sphere as well as offerings from companies, I think a transition from Big Data to Smart Data is happening – and it is a good thing. A quick diagram first and then some musings …

  • Big Data = data at scale, connected and probably with hidden variables …
  • Smart Data = Big Data + context + embedded interactive (inference/reasoning) models
  • [Update via @kidehen] Smart Data = Linked (Big) Data + Context + Inference
  • And Analytic clouds turn Big Data into Smart Data
  • Smart Data is (inference) model driven & declaratively interactive
  • Contextualization – includes integration with BI systems as well as data mashup with structured and unstructured data

For example,

  • Information like Wikipedia is big data; the in-memory representation Watson referred to is smart data
  • Device logs from 1000 good mobile handsets and 1000 not-so-good phones are big data; a GAM or GLM fitted over the log data after running it through several stages of MapReduce is smart data
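
As a toy version of the handset example above, here is a hedged Python sketch: a map step and a reduce step turn raw log lines into a crash count per device, and a one-feature logistic regression (a simple GLM) is then fitted over those counts. Everything specific is an assumption for illustration: the `device,event` log format, the device names, the labels, and the plain gradient-descent fit with its learning rate and step count.

```python
import math
from collections import Counter

def map_log_line(line):
    """Map step: emit (device_id, 1) for a crash event, (device_id, 0) otherwise."""
    device_id, event = line.split(",")
    return (device_id, 1 if event == "crash" else 0)

def reduce_counts(pairs):
    """Reduce step: sum the emitted values per device."""
    totals = Counter()
    for device_id, value in pairs:
        totals[device_id] += value
    return dict(totals)

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by plain gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    """Probability that a device with x crashes is a not-so-good handset."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Made-up logs in an assumed "device,event" format; label 1 = not-so-good handset.
log_lines = (
    ["good-1,boot"]
    + ["good-2,crash"] * 1
    + ["good-3,crash"] * 3
    + ["bad-1,crash"] * 2
    + ["bad-2,crash"] * 4
    + ["bad-3,crash"] * 5
)
labels = {"good-1": 0, "good-2": 0, "good-3": 0, "bad-1": 1, "bad-2": 1, "bad-3": 1}

crash_counts = reduce_counts(map_log_line(line) for line in log_lines)
devices = sorted(crash_counts)
model = fit_logistic([crash_counts[d] for d in devices], [labels[d] for d in devices])
print(crash_counts, model)
```

The log lines are the big data; the two fitted numbers in `model`, which can answer "how likely is this device to be a bad one?" on demand, are the smart data.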

We have been seeing many waypoints along this trend towards smart data -

  • GigaOm mentions “Organizations won’t necessarily need data scientists to “turn information into gold” if the data scientists employed by their software vendors have already done most of the work. Think about it like functions within spreadsheet applications tuned to specific industries, or … PaaS … Just feed the application some data, push a button, and get results …”
    • My take – Big data will get smarter with embedded models and Analytics clouds will turn big data into PaaS ! My thoughts exactly ! Analytics frameworks will mature !
    • So while the profession of Data Scientists is in no danger, smarter data will make it easy for everybody to understand data and make inferences …
  • An interesting article in Channel Register screams “Platform wants to out-map, out-reduce Hadoop Teaching financial grids to dance like stuffed elephants”
    • While I don’t want to comment on the headline force-fitting all the Hadoop buzzwords, I do think this is an important development ! From framework to de facto interface standard !
    • Makes sense in many ways
      • The Hadoop API has a declarative way of specifying the data parallelism, while still enumerating what one wants to do at the data processing stages
      • There is nothing else available and, as you mention, it has become de facto
      • Actually the popularity is not accidental, because it has been evolved by many as they use it to solve their problems (in fact the evolution was much better than what Google has done internally – see link)
      • One reason Platform chose to implement the API might be because then it is easy for folks to try out with Hadoop and then migrate to Platform’s Symphony – proposing and implementing a new set of programming models and interfaces would have been unsuccessful
      • Actually Hadoop is a PaaS, with its own stack that reaches out as deep as the network adjacency of racks (simple yet effective) and as high as application topology and everything in between (from storage to block size to number of tasks)
      • What it did not do was the cloud aware elastic infrastructure which is being addressed with Hadoop 2.0 (another link)
  • [Update] “Data will be the platform for Web 3.0”, says LinkedIn’s Reid Hoffman – yep, of course, smart data !
  • And MapIncrease can add pizazz …
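
The point above about the Hadoop API (you declare what happens at the map and reduce stages, the framework owns the data parallelism) can be sketched with a toy runner. This is a single-process stand-in written for illustration, not the Hadoop API: `run_job`, `word_mapper` and `count_reducer` are hypothetical names. Any engine that honours the same mapper/reducer contract could replace `run_job` without touching the user code, which is what makes the interface a candidate de facto standard.

```python
from collections import defaultdict

def run_job(mapper, reducer, records):
    """Toy single-process stand-in for a MapReduce run: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}

def word_mapper(line):
    """User code: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    """User code: add up the counts collected for one word."""
    return sum(counts)

counts = run_job(word_mapper, count_reducer, ["Hadoop is a PaaS", "hadoop is a framework"])
print(counts)
```

The user code says nothing about how many machines, tasks or blocks are involved; those decisions stay with the engine behind `run_job`.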

P.S: Interesting blog on data blog influencers. we are #274, made it just in top 300 !

Watson at Jeopardy – A Race Of Machines ?

The new Computer Overlords – A Race Of machines Or The Rise Of the Machines ?

Just a collection of Links – There are lots of blogs and videos and I thought of collecting them in a single place.

Let me know when you find interesting links :


Hadoop NextGen – From a Framework To a Big Data Analytics Platform

Exciting News, Hadoop is evolving ! As I was reading Arun Murthy’s blog, multiple thoughts crossed my mind on the evolution of this lovable toy animal — My first impressions …

  • From Data at Scale To Data at Scale + with complexity – connected & contextual
    • This, I think is the essence – from generic computation framework to scalability, with the new Hadoop platform we can process data at scale and with complexity – connected & contextual. For example the Watson Jeopardy dataset [Link] [Link]
  • From (relatively) static MapReduce To (somewhat) dynamic analytic platform
    • While we might not see a real-time Hadoop soon, the proposed ideas do make the platform more dynamic
    • The “support for alternate programming paradigms to MapReduce” by decoupling the computation framework is an excellent idea
    • I think it is still Mapreduce at the core (am not sure if it will deviate) but generic computation frameworks can choose their own patterns ! I am looking forward to BioInformatics applications
    • The “Support for short-lived services” is interesting. I had blogged a little about this. Looking forward to how this evolves …
    • I am hoping that it would be possible via extensible programming models to interface with programming systems like R.
    • Embeddable, domain specific capabilities (for example algorithmics specific to bioinformatics) could be interesting

There are also a few things that might not be part of this evolution

  • From Cluster to Cloud ?
    • There is a proposed keynote by Dr. Todd Papaioannou/VP/Cloud Architecture at Yahoo, titled “Hadoop and the Future of Cloud Computing”.
    • Actually I would prefer to see “Cloud Computing in the future of Hadoop” ;o) Had a blog few weeks ago … I was hoping for a project fluffy !

      We need to move from a cluster to an elastic framework (from a compute and storage perspective) – especially as Hadoop moves to an Analytic Platform. “The separation of management of resources in the cluster from management of the life cycle of applications and their component tasks results” is a good first step, now the resources can be instantiated via different mechanisms – cloud being the premier one
  • GPU
    • In the context of my coursework at JHU (BioInformatics) had a couple of talks with the folks working on DataScope. They plan to run Hadoop as one of the applications in their GPU cluster !
    • GPU computing is accelerating, and capability for Hadoop to run on GPU cluster would be interesting
  • Streamlined logging, monitoring and metering ?
    • One of the challenges we are facing in our Unified Fabric Big Data project is that it is difficult to parse the logs and make inferences that help us to qualify & quantify MapReduce jobs.
    • This also will help to create an analytic platform based on the Hadoop eco system. Now services like EMR, most probably do the second order metering by charging for the cloud infrastructure, as they spin separate VMs for every user (from my limited view)
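
The separation quoted above, resource management on one side and application life cycle on the other, can be pictured with a toy model. To be clear, this is my own sketch and not the Hadoop NextGen API: the classes `ResourceManager` and `ApplicationMaster` below only borrow the role names, and the container counts and tasks are made up.

```python
class ResourceManager:
    """Owns cluster capacity only; knows nothing about what the containers run."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        """Grant as many containers as are free, up to the requested number."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        """Return containers to the free pool."""
        self.free += count


class ApplicationMaster:
    """Owns one application's life cycle: asks for containers, runs tasks, hands them back."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run(self, rm):
        while self.pending:
            granted = rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no capacity available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(task() for task in batch)
            rm.release(granted)
        return self.done


# Five made-up tasks on a two-container cluster: they run in waves of 2, 2 and 1.
rm = ResourceManager(total_containers=2)
app = ApplicationMaster("wordcount", [lambda i=i: i * i for i in range(5)])
results = app.run(rm)
print(results, rm.free)
```

Since the application only ever talks to `allocate` and `release`, a resource manager that provisions capacity from a cloud on demand could be dropped in without changing the application side, which is the cluster-to-cloud step hoped for above.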

In short, exciting times are ahead for Hadoop ! There is a talk tomorrow at the Bay Area HUG (Hadoop User Group) on this topic … plan to attend, and later contribute – this is exciting, cannot remain in the sidelines … Will blog on the points from tomorrow’s talk … [Update : HUG Presentations and Video Link]

I leave you with this picture from The Polar Express … time to jump aboard and enjoy the ride …