MapReduce Cloud Stack

As a continuation to my (budding) work on Big Data clouds, one pragmatic area is a cloud stack for the popular MapReduce framework.

First context:

  • The current MapReduce framework viz. Hadoop was developed with capabilities which are at least one generation behind.
  • We now have the cloud, lot more multi-core capable processing, faster network (10Gb and more) and even newer storage paradigms (iSCSI hardware/ToE over 10Gb, FC, FCoE, and so forth)
  • Moreover organizations are finding interesting applications for Mapreduce and have a need to graduate beyond a cluster and move to a cloud with multiple multi-tenant jobs running in their infrastructure
    • And the various internal BUs have spiky usage; which means elasticity
  • The AMZ EMR (Elastic MapReduce) has proven to be a market leader in terms of MapReduce Cloud Stack, but it is a public cloud; organizations could augment with internal MapReduce Cloud Frameworks.
    • BTW, I am a big fan of the AWS and encourage the usage of AWS on all appropriate projects. But compliance, security and other factors would force certain MapReduce projects to be run on internal private clouds or as an enterprise extension onto Service Provider clouds – extending the trust boundary and crossing the ownership boundary.
  • From a MapReduce stack perspective, it is clear that we need monitoring, authentication/authorization and charge-back. We also need the control plane – a job flow framework, some form of SLA as well as an execution engine.
  • The closest I could find ts the excellent blog by Edd, SMAQ. Very good concepts …
  • But we need one more transformation of the SMAQ stack – a morph if you will, decoupling the layers with an execution control plane and the compute stack added

Second, some alternatives I could think of:

  • First possibility is a new Apache project (say Fluffy) to develop a Hadoop Cloud Framework leveraging work already done incl Hive, cascade and so forth
  • Another interesting possibility is to add a MapReduce Control Plane Cloud Stack to the emerging OpenStack; of course, leveraging the Apache Hadoop
    • This can potentially give us the MapReduce framework from Hadoop and the compute framework from OpenStack’s Nova – best of both worlds!
  • If we assume an execution control plane, the obvious next question is “How would you give it instructions ?” and I hope it is a declarative form
    • Which means either a policy language (XACML et al) or a structured query of some sort (SQL/Pig) or a job flow language (JFL) or embedded primitives in the programming language – which is where LINQ, FlumeJava, Cascade and hive come in the picture
  • A LINQ-esque FlumeJava addition to the language with arbitrary MapReduce frameworks underneath is not out of the question, either. This is interesting …
  • And we need to consider the real-time/incremental processing paradigms – may it be the Google Dremel [8] or the Google Percolator [9]. I am not saying that we need a clone of these, but incorporate pieces from those frameworks that make sense and add value …
  • Answering to my own earlier question, a job control language (JCL) or a job flow language (JFL) might be an incremental evolution of Hadoop. May be AMZ folks reached the same conclusion w.r.t EMR ...

And finally, technical discussions and some interesting pointers I came across as I am thinking through this domain:

  1. Dryad : Distributed Data-Parallel Programs … [Link]
  2. Distributed Aggregation for Data-Parallel Computing [Link]
  3. Microsoft Dryad Site
  4. Presentations here, here and here
  5. SCOPE: easy and Efficient Parallel processing … [Link]
  6. Map-Reduce-Merge: Simplified …[Link]
  7. FlumeJava: … [Link]
  8. Dremel: Interactive Analysis of Web-Scale Datasets – Data as a Programming Paradigm
  9. Google percolator paper a.k.a Large Scale Incremental Processing [Link]
  10. MapReduce and Hadoop Future[Link]
  11. A Comparison of Approaches to Large-Scale Data Analysis [Link]
  12. Papers and blogs I haven’t yet had time to explore fully
    1. Job Scheduling for Multi-User MapReduce Clusters [Link]
    2. Data warehousing and analytics infrastructure at facebook [Link]
    3. HDFS Blog[Link]
  13. For completeness, the original Google papers
    1. MapReduce: Simplified Data Processing on Large Clusters
    2. Bigtable: A Distributed Storage System for Structured Data
    3. The Google File System
    4. Interpreting the Data: Parallel Analysis with Sawzall [Update 11/3/10 Sawzall is OpenSource ! Thanks Big G!]

I really like the Dryad papers [1][2]. Like life, we learn a lot from people who disagree with us; many times we are not defined by what we are but what we are not or even what others perceive what we are not. Both the papers take pains to show how Dryad is a lot more better than the Hadoop framework. I disagree, depends on the point-of-view I guess. But lots of nuggets of wisdom can be harvested out of those papers for a MapReduce Cloud framework!

The FlumeJava paper[7] is very interesting. It is a 2010 paper. Adding parallel processing natively into a programming language has merits – that is what Microsoft is doing with LINQ. Can we achieve the same capabilities with FlumeJava or more relevant to the current discussion – would language a library like FlumeJava be the MapReduce Cloud Stack ?

The next challenge is the real-time vs. batch or more precisely incremental processing – i.e. “latency proportional to the size of the update rather than the size of the repository“. The Google percolator[9] could be a precursor of things to come to Hadoop (I hope). Google has this in production since April ’10 ! Already questions are being asked

[Update 10/24/10] I had missed the analysis paper[11] by Abadi, Stonebraker, et al. Quoting “… Most importantly is that higher level interfaces, such as Pig, Hive, are being put on top of the MR foundation, and a number of tools similar in spirit but more expressive than MR are being developed, such as Dryad and Scope. This will make complex tasks easier to code in MR-style systems … ”

Going back to Hadoop, there are two new feature requests MAPREDUCE-2038 and MAPREDUCE-1849

  • MAPREDUCE-2038/Making reduce tasks locality-aware solves one problem – the storage virtualization.
    • If we choose the direction of partial hierarchical aggregation (a dremel-esque) and a multi-stage job flow paradigm, with a separate partialCombiner interface and appropriate configurations (intermediate node mount, et al) then we have the separation of storage from the stack (similar to what I have shown in my diagram).
    • And then we can create a scalable storage overlay (with semantics and associated directory configurations) that is efficient even when there are different job flows (with associated scenarios like optimization, failure monitor and so forth)
    • Things like aggregation queries and rack combiners coupled with ToR intermediate storage nodes might work !

In our lab, we have a collapsed network hierarchy with Nexus 7K, 10Gb network, UCS B-series compute blades, a few ToR intermediate storage overlay nodes (8TB UCS C-series – C200 boxes with 10Gb iSCSI h/w,ToE card Broadcom NetXtreme II 57711) and MDS 9222i multi-service modular storage switch. We are still working through the configurations and viability of the connections (for example we haven’t connected the 9222i yet, but are hoping we could leverage the iSAPI for a few good primitives ..)

Finally, may be this is a good idea, may be not – that is what I am trying to figure out.
In the mean time, what says thee ?



[Update 10/23/10] Interesting article in NYTimes : Beyond Hadoop: Next-Generation Big Data Architectures

[Update 10/28/10] Exchanged e-mails with the experts in this domain. A good POA in my follow-up blog.