Big Data & NOSQL Nirvana : HPTS 2011 Day 1


This week I am attending the biennial High Performance Transaction Systems Workshop, HPTS 2011 (Agenda). I was expecting exciting discussions, insightful wisdom and, overall, stimulating company, and I was not disappointed.

IMHO, the highlights till Day 1½ (Stardate -311188.01369863015) were the NOSQL & Big Data discussions by Netflix (Adrian), Facebook (Kannan), eBay (Tom Fastner) & Microsoft (Ed Harris). There were other good presentations (James Hamilton/Amazon, Ike Nassi/SAP, Charles Lamb/Oracle, …) which I will discuss in another blog.

Highlights:

  • We heard about Netflix's use of AWS, Facebook's Messaging Infrastructure, eBay's Analytics Platform & Microsoft's OSD (Online Services Division)
  • Facebook Message Infrastructure:
    • 6 B+ Messages/day
    • The average write touches 16 records across multiple column families
    • 2+PB LZO compressed (6+PB uncompressed) in HBase
    • Growing 250 TB/Month
  • eBay Analytics:
    • >100 PB
    • >50TB new data/day
    • eBay sacrificed concurrency for capacity & speed with Teradata
    • They can do a full table scan across petabytes of data in 32s!
  • eBay has a private network across the Vegas and Phoenix datacenters with 20-40GB bandwidth
    • Each datacenter has the full Teradata, Singularity, Hadoop stack
  • The NOSQL datastore, the deployment topology, DR and the HA practices reflect the path chosen by the respective companies
    • Netflix wanted cross-DC replication with availability as the main criterion, so they are using Cassandra
    • Facebook has a cell architecture with users sharded to one cell; their goal was strong consistency, automatic failover, MapReduce and so forth. So they chose HBase
    • eBay has a very systematic way of looking at the continuum as Structured, Semi-structured and Unstructured.
      • Structured – Analyze & Report (6PB data, compressed to 1.6PB)
      • Unstructured – Discover & Explore (20 PB data)
      • Semi-structured – both in some mixture (40 PB data)
      • So they chose Teradata for structured & Hadoop for unstructured
    • Microsoft is using Dryad and a declarative layer called SCOPE on a virtual cluster architecture for their analytics platform called Cosmos
  • Netflix is ~100% cloud-based
  • eBay has the largest Teradata installation – 256 active nodes w/ a capacity of 84PB & the 3rd largest Hadoop Installation!
  • Most probably Netflix is among the top three largest Cassandra installations
  • Largest Cassandra installation (known) is 400 nodes, 300TB (I have a strong doubt it is Netflix!)
  • Oracle NOSQL is CA (w.r.t CAP) because it is on top of BDB.
    • The consensus is that AP or CP is more interesting from a NOSQL perspective
  • The basic sources of behavioral analytics are (Microsoft):
    • Web Pages,
    • Search Log,
    • Browser Log &
    • Advertisement Log
    • Connected, Contextual Big Data, as I have written before
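A quick back-of-the-envelope check of some of the numbers above, as a minimal Python sketch. The figures are the ones quoted in the highlights; the function names, the decimal units (1 PB = 1000 TB) and the assumption that Facebook's 250 TB/month growth refers to compressed data are mine, not something the speakers stated.

```python
# Figures quoted in the highlights above.
FB_COMPRESSED_PB = 2.0          # Facebook messages in HBase, LZO compressed
FB_UNCOMPRESSED_PB = 6.0        # same data, uncompressed
FB_GROWTH_TB_PER_MONTH = 250.0  # ASSUMPTION: growth is measured on compressed data
EBAY_RAW_PB = 6.0               # eBay structured data
EBAY_COMPRESSED_PB = 1.6        # eBay structured data, compressed

TB_PER_PB = 1000.0              # ASSUMPTION: decimal units


def compression_ratio(raw_pb: float, compressed_pb: float) -> float:
    """Return how many times smaller the compressed data is than the raw data."""
    return raw_pb / compressed_pb


def months_to_add(extra_pb: float, growth_tb_per_month: float) -> float:
    """Return the months needed to add extra_pb at a constant monthly growth rate."""
    return extra_pb * TB_PER_PB / growth_tb_per_month


fb_ratio = compression_ratio(FB_UNCOMPRESSED_PB, FB_COMPRESSED_PB)
ebay_ratio = compression_ratio(EBAY_RAW_PB, EBAY_COMPRESSED_PB)
fb_months_to_double = months_to_add(FB_COMPRESSED_PB, FB_GROWTH_TB_PER_MONTH)

print(f"Facebook compression ratio: {fb_ratio:.2f}:1")
print(f"eBay structured compression ratio: {ebay_ratio:.2f}:1")
print(f"Months for Facebook to add another 2 PB: {fb_months_to_double:.1f}")
```

Under those assumptions both stores compress at roughly 3-4x, and Facebook's compressed footprint would double in well under a year if the growth rate stayed constant (which it almost certainly will not).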

Gory Details (a.k.a Guided tour through the slides):

  • This section is WIP. The presentations are not yet up. I will point to the presentations when they are available
  • No SQL Eco System <- Good slides with a couple of good observations
  • Storage Infrastructure for Facebook messages [Slides]
    • Slide #3 – Why they chose HBase is interesting
    • Slide #11 – Shadow Testing strategy is informative. Testing at scale is always a challenge
    • Slide #28 – Scares & Scars – a must read
  • Some slides to study (I will point them out in the set once the presentations are online)
    • The Netflix Cloud backup & DR topology covering all failure scenarios, even an AWS account malfunction! (Hint: There is a Read-Only copy in S3 with a different account!)
      • They have done what I call “Design a control plane for failure and tune the data for normal ops.” in one of my blogs
    • eBay’s table structure that has characteristics of SQL & NOSQL
    • Examples of the Path Analysis extension to Teradata by eBay
    • eBay’s Platform Metrics comparison of Teradata, Singularity and Hadoop
      • While Hadoop has some good qualities, it also consumes more resources than Teradata
  • Many more …
  • Facebook presentation notes from James Hamilton
  • Microsoft Cosmos notes from James Hamilton

And Finally, Some Interesting Remarks:

  • The number of options for (NOSQL) persistence doubles every 1.5 years
  • If it is not in memory, it is not data
  • Analytics – combine data in surprising ways (Microsoft)
  • A datacenter “exothermic incident” caused by running analytics applications at 85% CPU (Microsoft)
  • One way or another, we are all part of some experiment (A/B testing or analytics-based preference), and all our actions end up in one of these platforms!
  • Believe it or not, even I ended up presenting on Precision Time Synchronization on Day 1! A last-minute fill-in, thanks to Pranta. In Hadoop terms, “speculative execution”!