Hadoop 2.0 & OpenStack – PB&J ?


New developments are happening in the world of Hadoop – I have even written a few blog posts about it !
The latest blog from Arun has more insight and ideas on the scheduler and resource management …

What caught my attention was the fact that two of my worlds suddenly converged – the world of Hadoop and the world of OpenStack … And they go well together … like the proverbial PB and Jelly … !

Let us quickly look at a few synergies …

  • From a Hadoop Cluster to a Hadoop Cloud
    • The decoupling of application management & resource allocation opens up a host of possibilities, from elasticity to enterprise extension and data adjacency. Now the Hadoop application (which BTW could be an acyclic graph of MapReduce jobs!) can ask for resources in a declarative manner, and the OpenStack cloud resource manager can allocate them based on the cloud infrastructure primitives
  • Another important capability is policy-based MapReduce for multi-tenancy, compliance and even just resource leveling. The same MapReduce application graph can be run in multiple domains and can reflect each domain's compliance, security and other infrastructure considerations
  • Leverage the Swift storage platform – for example, Cloud Files is a great interface for managing data, while Hadoop is good for processing data. Mind you, I am still working through the implications, but I can see an analytic cloud infrastructure combining the strengths of Swift and Hadoop.
    • One thought is to run HDFS with its artifacts separately as part of the Swift layer and then run MapReduce elastically as required – but this will raise the data latency issue … nobody said this will be easy ;o)
  • And extending the thought further, an analytic cloud which combines Hadoop 2.0, the OpenStack cloud platform and something like R is not that far off …
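
To make the "declarative request" idea a little more concrete, here is a minimal Python sketch of how an application might state what it needs and how a cloud-side allocator might answer. Everything in it is my own assumption for illustration: ResourceRequest, Flavor, allocate and the per-domain allowed set are hypothetical names, not the Hadoop 2.0 or OpenStack API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical declarative request made by a Hadoop application."""
    containers: int   # how many containers the application wants
    memory_mb: int    # memory needed per container
    vcpus: int        # virtual CPUs needed per container


@dataclass(frozen=True)
class Flavor:
    """Hypothetical stand-in for a cloud instance type."""
    name: str
    memory_mb: int
    vcpus: int


def allocate(request: ResourceRequest,
             flavors: Iterable[Flavor],
             allowed: Optional[Set[str]] = None) -> List[str]:
    """Return one flavor name per requested container.

    `allowed` models a per-domain policy (an assumption): when given,
    only flavors whose names are in the set may be used.
    An empty list means the request cannot be satisfied.
    """
    candidates = [
        f for f in flavors
        if f.memory_mb >= request.memory_mb
        and f.vcpus >= request.vcpus
        and (allowed is None or f.name in allowed)
    ]
    if not candidates:
        return []
    # Pick the smallest flavor that still fits, to avoid over-allocating.
    best = min(candidates, key=lambda f: (f.memory_mb, f.vcpus))
    return [best.name] * request.containers
```

The point of the sketch is only the shape of the conversation: the application says what it needs, and the same request can yield different allocations depending on the policy of the domain it runs in.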

As you can see, I can think of a few more ideas, and I am sure you can too … The possibilities are interesting … But before we get ahead of ourselves, there is pragmatic work to be done:

  • Obviously a Hadoop-aware scheduler framework in Nova is the first step
    • Need to figure out the best way to map the Application Manager, the Resource Manager, the Node Manager and the Container that Arun is talking about.
    • We also need to capture the declarative directives, resource requirements and application characteristics
  • Swift over HDFS & extending the data layer is next.
    • I really want to explore how we can address the latency effectively
    • And still manage a data cloud layer with tiered data storage matching the data lifecycle
    • And associated data services like replication, infrastructure redundancy, encryption and so forth
  • Maybe a Hadoop execution platform & a Hadoop PaaS over OpenStack would be on the horizon …
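
As a thought experiment on "tiered data storage matching the data lifecycle", a placement rule could be as simple as the Python sketch below. The tier names and the age thresholds are made up for illustration; nothing here comes from Swift or HDFS.

```python
from datetime import date

# Hypothetical tiers: (maximum age in days, tier name), checked in order.
TIERS = [(30, "hot"), (365, "warm")]
ARCHIVE_TIER = "cold"  # anything older than the last threshold


def tier_for(last_access: date, today: date) -> str:
    """Pick a storage tier from how long ago the data was last accessed."""
    age_days = (today - last_access).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return ARCHIVE_TIER
```

A real data cloud layer would also have to weigh replication, redundancy and encryption per tier, which is exactly the part that still needs to be worked out.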

Again, these are just my first impressions … What says thee ?