New developments are happening in the world of Hadoop, and I have even written a few blog posts about them!
The latest blog from Arun has more insight and ideas on the scheduler and resource management …
What caught my attention was the fact that two of my worlds suddenly converged – the world of Hadoop and the world of OpenStack … And they go well together … like the proverbial PB and Jelly … !
Let us quickly look at a few synergies …
- From a Hadoop Cluster to a Hadoop Cloud
- The decoupling of application management & resource allocation opens up a host of possibilities, from leveraging elasticity to enterprise extension and data adjacency. Now the Hadoop application (which BTW could be an acyclic graph of MapReduce jobs!) can ask for resources in a declarative manner, and the OpenStack cloud resource manager can allocate them based on the cloud infrastructure primitives.
- Another important capability is policy-based MapReduce for multi-tenancy, compliance and even just resource leveling. The same MapReduce application graph can be run in multiple domains and can reflect the compliance, security and other infrastructure considerations of each.
- Leverage the Swift storage platform: for example, Cloud Files is a great interface for managing data, while Hadoop is good for processing it. Mind you, I am still working through the implications, but I can see an analytic cloud infrastructure combining the strengths of Swift and Hadoop.
- One thought is to run HDFS with its artifacts separately as part of the Swift layer and then run MapReduce elastically as required, but this will raise the data latency issue … nobody said this would be easy ;o)
- And extending the thought further, an analytic cloud that combines Hadoop 2.0, the OpenStack cloud platform and something like R is not that far off …
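To make the "declarative request plus policy-based allocation" idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `ResourceRequest`, `Host`, `allocate` and the zone policy table are hypothetical names, not part of Hadoop or OpenStack, and the greedy placement is just one simple way such an allocator might behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical declarative request from a Hadoop application."""
    tenant: str
    containers: int          # number of task containers wanted
    memory_mb: int           # memory per container
    near_dataset: str = ""   # data-adjacency hint (a dataset name)


@dataclass
class Host:
    """Hypothetical view of one compute host in the cloud."""
    name: str
    zone: str
    free_mb: int
    datasets: frozenset = frozenset()


def allocate(request, hosts, allowed_zones):
    """Greedy placement: apply the tenant policy, then prefer data-adjacent hosts.

    Returns one host name per container placed. Note that it decrements
    free_mb on the hosts it uses.
    """
    zones = allowed_zones.get(request.tenant, set())
    # Policy step: only hosts in zones this tenant may use are considered.
    candidates = [h for h in hosts if h.zone in zones]
    # Hosts holding the dataset sort first; ties go to the most free memory.
    candidates.sort(
        key=lambda h: (request.near_dataset not in h.datasets, -h.free_mb)
    )
    placement = []
    for host in candidates:
        while (host.free_mb >= request.memory_mb
               and len(placement) < request.containers):
            host.free_mb -= request.memory_mb
            placement.append(host.name)
    return placement
```

The point of the sketch is the separation of concerns: the application only states what it needs, while the tenant policy and the data-adjacency preference live entirely on the allocator side.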
As you can see, I can think of a few more ideas, and I am sure you can too … The possibilities are interesting … But before we get ahead of ourselves, there is pragmatic work to do:
- Obviously, a Hadoop-aware scheduler framework in Nova is the first step
- We need to figure out the best way to map the Application Manager, the Resource Manager, the Node Manager and the container Arun is talking about.
- We also need to capture the declarative directives, resource requirements and application characteristics
- Swift over HDFS & extending the data layer is next.
- I really want to explore how we can address the latency effectively
- And still manage a data cloud layer with tiered data storage matching the data lifecycle
- And associated data services like replication, infrastructure redundancy, encryption and so forth
- Maybe a Hadoop execution platform & a Hadoop PaaS over OpenStack would be on the horizon …
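As a rough illustration of the "tiered data storage matching the data lifecycle" item in the list above, the following Python sketch picks a storage tier and replica count from the age of a dataset. The tier names, age thresholds and replica counts are made-up assumptions, not values from Hadoop, Swift or OpenStack; a real data layer would take them from policy.

```python
from datetime import date

# Hypothetical tiers: (max age in days, tier name, replica count).
TIERS = [
    (30, "hot", 3),
    (365, "warm", 2),
]
# Anything older than the last threshold falls through to this tier.
ARCHIVE = ("cold", 1)


def pick_tier(last_access, today):
    """Return (tier name, replica count) based on days since last access."""
    age_days = (today - last_access).days
    for max_age, name, replicas in TIERS:
        if age_days <= max_age:
            return name, replicas
    return ARCHIVE
```

A placement service could call something like `pick_tier` periodically and migrate data whose tier has changed, which is where the latency question from the list comes back in.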
Again, these are just my first impressions … What says thee ?
