Computer hardware has evolved a lot since the early days of Hadoop, and I am always looking for insights on this topic. At my previous company we spent some time on the infrastructure and architecture side of Hadoop.
So when I came across Eric’s blog at the Hortonworks site, I spent some time reading and thinking it through.
Suddenly it dawned on me:
Hadoop hardware selection is not an abstract science at all! It is a function of the degrees of separation between the Hadoop infrastructure and the analytics end point (which could be a visualization system like Tableau, a recommendation/inference end point, or even an R program)!
Let me walk through a couple of examples showing how the degrees of separation affect the hardware selection and the Hadoop infrastructure:
- Six Degrees Of Separation:
- When the Hadoop infrastructure is this far away (logically speaking, of course), it is most probably used as a feeder in an ETL flow – for example, a phone manufacturer processing logs from the carrier, an on-line game operator capturing events from the game system, or an e-commerce system aggregating multiple streams…
- Hardware characteristics:
- Speeds and feeds rule here. More boxes (e.g. Dell 310s), higher GHz, more memory (24 GB+), smaller disk space (4 TB/box) and a 1 Gb network
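"Speeds and feeds" can be sanity-checked with simple arithmetic. Here is a minimal Python sketch; the daily log volume and the 70% link-efficiency figure are illustrative assumptions, not measurements:

```python
# Rough ingest-time check for an ETL feeder cluster.
# Assumptions (illustrative): ~1 TB of logs arriving over a 1 Gb link,
# achieving about 70% of line rate in practice.

def ingest_hours(data_tb, link_gbps, efficiency=0.7):
    """Hours to move data_tb terabytes over a link_gbps link,
    assuming only the given fraction of line rate is achieved."""
    bits = data_tb * 8 * 1e12                      # TB -> bits (decimal TB)
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# 1 TB over a single 1 Gb link:
print(round(ingest_hours(1.0, 1.0), 1))   # ~3.2 hours
```

This kind of estimate is why, for a pure feeder role, a 1 Gb network is often adequate – the bottleneck discussion only starts to bite at the higher-integration tiers below.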
- Hadoop as an intermediate layer – you have some data in the Hadoop infrastructure and there is some integration, but your data warehouse hosts all your analytics systems
- Hardware Characteristics:
- Definitely more storage than the previous configuration, but the GHz and memory depend on the level of integration (for example, joins) and so forth
- You are using Hadoop & Co (HBase, Pig et al.) as a ‘big data’ processing store, most probably with moderate data integration from other systems.
- You also might have HBase, your users are employing Pig/Hive, and the Hadoop infrastructure is directly wired to visualization tools like Tableau.
- Most probably you are augmenting or delegating some data warehouse functions to the Hadoop Infrastructure
- Hardware Characteristics:
- 10 Gb is in the cards; start with 1 Gb and explore link aggregation
- 6 or 12 spindles per box would be optimal; 2 TB or 3 TB drives are becoming affordable anyway (an interesting link from HotHardware)
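The spindle counts above translate into usable capacity with some quick back-of-the-envelope math. A minimal Python sketch follows; the node count, the default HDFS replication factor of 3, and the 25% overhead reserve are illustrative assumptions:

```python
# Back-of-the-envelope HDFS capacity sketch.
# Assumptions (illustrative): HDFS replication factor of 3, and ~25% of
# raw disk reserved for intermediate data, OS, and overhead.

def usable_capacity_tb(nodes, spindles_per_node, tb_per_drive,
                       replication=3, overhead=0.25):
    """Rough usable HDFS capacity for a cluster, in TB."""
    raw = nodes * spindles_per_node * tb_per_drive
    return raw * (1 - overhead) / replication

# A hypothetical 10-node cluster of 2U boxes, 12 x 2 TB drives each:
print(usable_capacity_tb(10, 12, 2.0))   # 60.0 TB usable out of 240 TB raw
```

The 4x gap between raw and usable storage is the point: triple replication plus working space is why "more storage than you think" keeps coming up in these tiers.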
- General Thoughts:
- Is it CPU (GHz) or network and storage? As Eric pointed out, the ‘dedicated disk’ thread and Scott’s answer are interesting
- Brad’s blog has some insights on the network side of things
- The last time I talked with the Datascope project at JHU, they were experimenting with running Hadoop on Atom boards!
- A lot depends on the data pipeline – if it is ETL into DW systems, then less storage is OK, as the data is transitory anyway
- But if you are running HBase et al., then a 2U box with 12 or 6 spindles is ideal. 2 TB drives definitely; 3 TB is on its way
- Another dimension is the complexity
- If you are doing joins – map-side or reduce-side, or using the distributed cache – boxes with more memory would clearly help.
- For joins, I like the memcached-based solution: you can set up a few memcached nodes, and it also opens up more system-wide integration in an orthogonally extensible way…
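The map-side join pattern behind this memory discussion can be sketched outside Hadoop. Below is a minimal Python illustration: a small dimension table sits in memory on each mapper (the role the distributed cache, or a memcached tier, would play), and the large fact stream is joined record by record with no reduce phase. All names and data here are made up for illustration:

```python
# Sketch of a map-side (broadcast) join. The small table lives in memory
# on every mapper; in Hadoop the distributed cache (or a shared memcached
# tier) is what ships it there. Data and field names are illustrative.

# Small dimension table: user_id -> region (cached on each node)
users = {"u1": "US", "u2": "EU", "u3": "APAC"}

# Large fact stream: (user_id, amount) records flowing through a mapper
events = [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 2)]

def map_side_join(events, lookup):
    """Join each event against the in-memory table; no reduce step needed."""
    for user_id, amount in events:
        region = lookup.get(user_id)
        if region is not None:        # drop records with no match (inner join)
            yield (region, amount)

print(list(map_side_join(events, users)))
# [('US', 10), ('EU', 5), ('US', 7), ('APAC', 2)]
```

The memory pressure is easy to see from the sketch: the whole small table must fit in RAM on every node doing the join, which is exactly why boxes with more memory help here.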
- Finally, one caveat: the new YARN/Hadoop NextGen/0.23 is a wild card. We have to study how that beast behaves vis-à-vis memory, disk and network. I see some newer opportunities (and challenges)!