This week I am attending the biennial High Performance Transaction Systems Workshop – HPTS 2011 (Agenda). I was expecting exciting discussions, insightful wisdom and, overall, stimulating company, and I was not disappointed.
IMHO, the highlights through Day 1½ (Stardate -311188.01369863015) were the NOSQL & Big Data discussions by Netflix (Adrian), Facebook (Kannan), eBay (Tom Fastner) & Microsoft (Ed Harris). There were other good presentations (James Hamilton/Amazon, Ike Nassi/SAP, Charles Lamb/Oracle, …) which I will discuss in another blog post.
Highlights:
- We heard about Netflix's use of AWS, the Facebook Messaging Infrastructure, the eBay Analytics Platform & Microsoft OSD (On-line Services Division)
- Facebook Message Infrastructure:
- 6 B+ Messages/day
- Average write 16 records across multiple column families
- 2+PB LZO compressed (6+PB uncompressed) in HBase
- Growing 250 TB/Month
- eBay Analytics
- >100 PB
- >50TB new data/day
- eBay sacrificed concurrency for capacity & speed with Teradata
- They can do a full table scan across PB of data in 32s !
- eBay has a private network between the Vegas and Phoenix datacenters with 20-40GB bandwidth
- Each datacenter has the full Teradata, Singularity, Hadoop stack
- The NOSQL datastore, the deployment topology, DR and the HA practices reflect the path chosen by the respective companies
- Netflix wanted cross-DC replication with availability as the main criterion, so they are using Cassandra
- Facebook has a cell architecture with users sharded to one cell; their goal was strong consistency, automatic failover, MapReduce and so forth. So they chose HBase
- eBay has a very systematic way of looking at the continuum as Structured, Semi-structured and Unstructured.
- Structured – Analyze & Report (6PB data, compressed to 1.6PB)
- Unstructured – Discover & Explore (20 PB data)
- Semi-structured – both in some mixture (40 PB data)
- So they chose Teradata for structured & Hadoop for unstructured
- Microsoft is using Dryad and a declarative layer called SCOPE on a virtual cluster architecture for their analytics platform called Cosmos
- Netflix is ~100% cloud-based
- eBay has the largest Teradata installation – 256 active nodes w/ a capacity of 84PB & the 3rd largest Hadoop Installation!
- Most probably Netflix is among the three largest Cassandra installations
- Largest Cassandra installation (known) is 400 nodes, 300TB (I have a strong doubt it is Netflix!)
- Oracle NOSQL is CA (w.r.t CAP) because it is on top of BDB.
- The consensus is that AP or CP is more interesting from a NOSQL perspective
- The basic sources of behavioral analytics are (Microsoft):
- Web Pages,
- Search Log,
- Browser Log &
- Advertisement Log
- Connected, Contextual Big Data, as I had written before
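A quick back-of-the-envelope check on the figures above, as a minimal Python sketch. The inputs are the rounded lower bounds quoted in the talks ("2+PB", "6+PB" and so on). Three things are my assumptions and not from the presentations: decimal units (1 PB = 1000 TB), that Facebook's 250 TB/month growth refers to the compressed footprint, and that the eBay full table scan covers exactly 1 PB (the talk only said "PB of data"). The function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the highlights.
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def compression_ratio(uncompressed_pb, compressed_pb):
    """How many times smaller the data is after compression."""
    return uncompressed_pb / compressed_pb


def months_to_double(current_pb, growth_tb_per_month):
    """Months until the store doubles, assuming constant linear growth."""
    return current_pb * TB_PER_PB / growth_tb_per_month


def scan_throughput_tb_per_s(scanned_pb, seconds):
    """Aggregate throughput implied by scanning `scanned_pb` in `seconds`."""
    return scanned_pb * TB_PER_PB / seconds


# Facebook: 6+PB uncompressed stored as 2+PB LZO-compressed in HBase.
facebook_ratio = compression_ratio(6, 2)

# eBay structured data: 6PB compressed to 1.6PB.
ebay_ratio = compression_ratio(6, 1.6)

# Facebook: 2+PB growing 250 TB/month (assumed to be the compressed size).
facebook_doubling_months = months_to_double(2, 250)

# eBay: full table scan in 32s; 1 PB is an assumed size, not a quoted one.
ebay_scan_tb_per_s = scan_throughput_tb_per_s(1, 32)

if __name__ == "__main__":
    print(f"Facebook compression ratio: {facebook_ratio:.2f}x")
    print(f"eBay compression ratio:     {ebay_ratio:.2f}x")
    print(f"Facebook doubling time:     {facebook_doubling_months:.1f} months")
    print(f"eBay scan throughput:       {ebay_scan_tb_per_s:.2f} TB/s")
```

Under those assumptions, both stores compress at roughly 3x to 4x, the Facebook message store doubles in well under a year, and the eBay scan implies aggregate throughput in the tens of TB/s. Treat these as order-of-magnitude estimates only.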
Gory Details (a.k.a Guided tour through the slides):
- This section is WIP. The presentations are not yet up. I will point to the presentations when they are available
- No SQL Eco System <- Good slides with a couple of good observations
- Storage Infrastructure for Facebook messages [Slides]
- Slide #3 – Why they chose HBase is interesting
- Slide #11 – Shadow Testing strategy is informative. Testing at scale is always a challenge
- Slide #28 – Scares & Scars – a must read
- Some slides to study (I will point them out in the set once the presentations are on-line)
- The Netflix Cloud backup & DR topology covering all failure scenarios, even an AWS account malfunction! (Hint: There is a Read-Only copy in S3 with a different account!)
- They have done what I call “Design a control plane for failure and tune the data for normal ops.” in one of my blogs
- eBay’s table structure that has characteristics of SQL & NOSQL
- examples of Path Analysis extension to Teradata by eBay
- eBay’s Platform Metrics comparison of Teradata, Singularity and Hadoop
- While Hadoop has some good qualities, it also consumes more resources than Teradata
- Many more …
- Facebook presentation notes from James Hamilton
- Microsoft Cosmos notes from James Hamilton
And finally, some interesting remarks:
- The number of options for (NOSQL) persistence doubles every 1.5 years
- If it is not in memory, it is not data
- Analytics – combine data in surprising ways (Microsoft)
- By turning Big Data to Smart Data (My addition)
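To get a feel for the "doubles every 1.5 years" remark, here is a small sketch of the exponential growth it implies. The starting count of 10 options is purely hypothetical (nobody at the workshop gave a number), and `persistence_options` is just an illustrative name.

```python
def persistence_options(initial, years, doubling_period_years=1.5):
    """Projected number of options after `years`, doubling every period."""
    return initial * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Hypothetical starting point: 10 NOSQL persistence options today.
    for years in (0, 1.5, 3, 6):
        count = persistence_options(10, years)
        print(f"after {years} years: about {count:.0f} options")
```

With any starting count, three years means two doublings (4x) and six years means four doublings (16x), which is the real point of the remark: the choice of datastore gets harder, not easier.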