Six Degrees of Hadoop Hardware

Computer hardware has evolved a lot since the early days of Hadoop, and I am always looking for insights on this topic. At my previous company we spent some time on the infrastructure and architecture side of Hadoop.

So when I came across Eric’s blog on the Hortonworks site, I spent some time reading it and thinking it through.

Suddenly it dawned on me:

Hadoop hardware selection is not an abstract science at all! It is a function of the degrees of separation between the Hadoop infrastructure and the analytics end point (which could be a visualization system like Tableau, a recommendation/inference end point or even an R program)!

Let me walk through a couple of examples showing how the degrees of separation affect the hardware selection and the Hadoop infrastructure:

  1. Six Degrees of Separation:
    • When the Hadoop infrastructure is this far away (logically speaking, of course), it is most probably used as a feeder in an ETL flow: for example, a phone manufacturer processing logs from the carrier, an online game operator capturing events from the game system, or an e-commerce system aggregating multiple streams…
    • Hardware Characteristics:
    • Speeds and feeds rule here. More boxes (e.g. Dell 310s), higher GHz, more memory (24 GB+), smaller disk space (4 TB/box) and a gigabit (Gb) network
  2. Four Degrees of Separation:
    • Hadoop as an intermediate layer: you have some data in the Hadoop infrastructure and there is some integration, but your data warehouse still feeds all your analytics systems
    • Hardware Characteristics:
    • Definitely more storage than the previous systems, but the GHz and memory depend on the level of integration (for example, joins) and so forth
  3. Two Degrees of Separation:
    • You are using Hadoop & Co (HBase, Pig et al.) as a processing ‘big data’ store, most probably with moderate data integration from other systems.
    • You might also have HBase, your users are employing Pig/Hive, and the Hadoop infrastructure is directly wired to visualization tools like Tableau.
    • Most probably you are augmenting the data warehouse with the Hadoop infrastructure, or delegating some data warehouse functions to it
    • Hardware Characteristics:
    • 10 Gb is in the cards; start with 1 Gb and explore link aggregation
    • 6 or 12 spindles per box would be optimal; 2 TB or 3 TB drives are becoming affordable anyway (An interesting link from hothardware)
    • Here, qualification and quantification of jobs matter, because you can expect different types of jobs: some CPU intensive, others I/O intensive (during the map, shuffle or reduce stages)
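The three tiers above can be jotted down as a small lookup, mostly as a way to keep the rules of thumb in one place. This is only a sketch: the names `HardwareProfile`, `PROFILES`, `suggest_profile`, `raw_tb_per_box` and `usable_tb_per_box` are made up for illustration, the profile text simply paraphrases the bullets above, and the capacity arithmetic assumes the HDFS default replication factor of 3 while ignoring space needed for intermediate job output.

```python
from dataclasses import dataclass

# Assumption: the cluster keeps the HDFS default replication factor of 3.
HDFS_REPLICATION = 3


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical container for one tier's rules of thumb."""
    role: str
    compute: str
    storage: str
    network: str


# Values paraphrase the tiers in this post; they are rules of thumb, not vendor specs.
PROFILES = {
    6: HardwareProfile(
        role="feeder in an ETL flow",
        compute="more boxes, higher GHz, 24 GB+ memory",
        storage="smaller disk space, about 4 TB per box",
        network="gigabit (Gb) network",
    ),
    4: HardwareProfile(
        role="intermediate layer in front of the data warehouse",
        compute="GHz and memory depend on the level of integration (e.g. joins)",
        storage="more storage than the ETL feeder tier",
        network="not specified in the post",
    ),
    2: HardwareProfile(
        role="processing 'big data' store wired to visualization tools",
        compute="depends on the job mix (CPU intensive vs I/O intensive)",
        storage="6 or 12 spindles per box, 2 TB or 3 TB drives",
        network="start with 1 Gb plus link aggregation, plan for 10 Gb",
    ),
}


def suggest_profile(degrees: int) -> HardwareProfile:
    """Return the sketched profile for a tier; only 6, 4 and 2 are described."""
    try:
        return PROFILES[degrees]
    except KeyError:
        raise ValueError(f"no profile sketched for {degrees} degrees") from None


def raw_tb_per_box(spindles: int, drive_tb: float) -> float:
    """Raw disk capacity of one box in TB."""
    return spindles * drive_tb


def usable_tb_per_box(spindles: int, drive_tb: float,
                      replication: int = HDFS_REPLICATION) -> float:
    """Rough HDFS capacity of one box after replication, ignoring temp space."""
    return raw_tb_per_box(spindles, drive_tb) / replication


if __name__ == "__main__":
    print(suggest_profile(2).storage)
    print(raw_tb_per_box(12, 2))     # 12 spindles x 2 TB = 24 TB raw
    print(usable_tb_per_box(12, 2))  # 24 TB / 3 replicas = 8.0 TB
```

Dividing by the replication factor is the part that tends to surprise people: a 12-spindle box with 2 TB drives holds roughly 8 TB of unique data, not 24.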
  • General Thoughts:
  • Is it MHz or network and storage? As Eric pointed out, the ‘dedicated disk’ thread and Scott’s answer are interesting
  • Brad’s blog has some insights on the network side of things
  • The last time I talked with the Datascope project at JHU, they were experimenting with running Hadoop on Atom boards!
  • A lot depends on the data pipeline: if it is ETL into DW systems, then less storage is OK, as the data is transitory anyway
  • But if you are running HBase et al., then a 2U box with 6 or 12 spindles is ideal. 2 TB drives definitely; 3 TB is on its way
  • Another dimension is the complexity
  • If you are doing joins (map-side or reduce-side) or using the distributed cache, boxes with more memory would clearly help.
  • For joins, I like the memcache-based solution, as you can set up a few memcache nodes, and it also opens the door to more system-wide integration in an orthogonally extensible way …
  • Finally, one caveat: the new YARN/Hadoop NextGen/0.23 is a wild card. We have to study how that beast behaves vis-a-vis memory, disk and network. I see some newer opportunities (and challenges)!
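To see why joins push you toward boxes with more memory, here is a minimal sketch of the map-side join idea in plain Python. It is not Hadoop API code: `map_side_join` and the sample records are hypothetical, and the point is only that the smaller data set has to sit in memory on every mapper while the larger one streams past.

```python
from typing import Dict, Iterable, Iterator, List


def map_side_join(small_table: List[Dict], large_records: Iterable[Dict],
                  key: str) -> Iterator[Dict]:
    """Join a stream of records against a small table held fully in memory."""
    # The whole small table becomes an in-memory lookup; in a real cluster
    # every map task would hold its own copy, hence the memory pressure.
    lookup = {row[key]: row for row in small_table}
    for record in large_records:
        match = lookup.get(record[key])
        if match is not None:
            # Inner join: emit only records that have a match.
            yield {**match, **record}


if __name__ == "__main__":
    users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
    events = [
        {"user_id": 1, "event": "login"},
        {"user_id": 3, "event": "purchase"},  # no matching user, dropped
        {"user_id": 2, "event": "logout"},
    ]
    for joined in map_side_join(users, events, "user_id"):
        print(joined)
```

Swapping the local dictionary for a lookup against a shared cache is the essence of the memcache variant: the memory cost moves from every task node to a few dedicated cache nodes, at the price of a network round trip per lookup.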
