Google’s Jeff Dean gave an excellent talk at Stanford as part of EE380 – it is worth one’s time to listen. Very informative, instructive and innovative. As I listened, I jotted down a few quick notes.
- Interesting comparison of the scale in search from 1999 to 2010
- Docs and queries are up 1000X, while query latency has decreased 5X
- Interesting to hear that in 1999 it took a month or two to update the web page store; that has since been reduced 50000X, to seconds!
- They have had 7 significant revisions in 11 years
- Trivia : They encounter very expensive queries – for example, “circle of death” requires ~30GB of I/O
- Trivia : In 2004, they did a rethink and refreshed the systems infrastructure from scratch
- He discussed encodings a little – an informative discussion of byte-aligned variable-length and group encoding schemes << I have to try it out …
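The group-encoding idea can be sketched roughly like this (a toy sketch based on my reading of the scheme, not Google’s code; all names are mine). One tag byte records the byte lengths of four uint32 values, so the decoder avoids the per-byte continuation-bit branches of plain varint:

```python
# Sketch of a "group varint" style encoding (assumption: my reading of
# the scheme Dean described). Four uint32 values share one tag byte;
# each 2-bit field in the tag stores (byte_length - 1) of one value.

def byte_len(v: int) -> int:
    """Number of bytes (1-4) needed to store a uint32."""
    n = 1
    while v >= 1 << (8 * n):
        n += 1
    return n

def group_encode(values):
    """Encode exactly four uint32s as a tag byte plus packed payload."""
    assert len(values) == 4
    tag = 0
    payload = bytearray()
    for i, v in enumerate(values):
        n = byte_len(v)
        tag |= (n - 1) << (2 * i)          # record length in the tag
        payload += v.to_bytes(n, "little")  # emit only the needed bytes
    return bytes([tag]) + bytes(payload)

def group_decode(buf):
    """Decode one group; return the four values and bytes consumed."""
    tag = buf[0]
    pos = 1
    out = []
    for i in range(4):
        n = ((tag >> (2 * i)) & 0b11) + 1
        out.append(int.from_bytes(buf[pos:pos + n], "little"))
        pos += n
    return out, pos

print(group_decode(group_encode([1, 300, 70000, 7]))[0])  # [1, 300, 70000, 7]
```

The win over byte-at-a-time varint is that all the branching collapses into one table-driven tag lookup per four values.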
- Trivia : They have had long-distance link failures caused by wild dogs, sharks, dead horses and (in Oregon) drunken hunters!
- Jeff talked at length about MapReduce. An interesting set of statistics on MapReduce over time
- MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
- Data has doubled while the number of machines has stayed constant from ’07 to ’10
- Machine usage has quadrupled while jobs completed have doubled from ’07 to ’10
- Trivia : Jeff shared an anecdote where network engineers were rewiring the network while Jeff & Co. were running MapReduce. They lost machines in a strange pattern and wondered what was going on; but the job succeeded, a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
- He is working on a project called Spanner, which spans datacenters. This looks like one of their hairy problems; Dean also mentioned it during Q & A. All their systems work well inside a datacenter but have no way of spanning datacenters. They have manual methods and task-specific tools to copy data across datacenters, monitor tasks across datacenters and so forth – but no systemic infrastructure.
- Declarative Policies, Common namespaces, transactions, strong and weak consistency and automation are all parts that Spanner addresses
- I think the most important part of the talk was the final section on experiences & patterns
- Break large systems into smaller services. << We have heard this from Amazon as well. A Google page touches 200+ services (same with an Amazon page)
- One should be able to estimate performance with back-of-the-envelope calculations
- I am compelled to insert Jeff’s “Numbers Everyone Should Know” as it is a very useful chart. I hope Jeff doesn’t mind. [Update 11/17/10] Thanks to Kevin Le, this chart comes from Norvig’s blog.
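As a back-of-the-envelope sketch using a few of those latencies (assumption: the figures below are the commonly quoted ones from the talk and Norvig’s page, rounded), one can estimate, say, the cost of a ~30GB scan like the “circle of death” query:

```python
# Back-of-the-envelope sketch with a few "Numbers Everyone Should Know"
# latencies (assumption: the commonly quoted figures, in nanoseconds).
DISK_SEEK = 10_000_000        # ~10 ms
READ_1MB_DISK = 20_000_000    # sequential 1 MB read from disk, ~20 ms
READ_1MB_MEM = 250_000        # sequential 1 MB read from memory
DC_ROUND_TRIP = 500_000       # round trip within a datacenter

# How long would ~30 GB of sequential disk I/O take (one seek)?
disk_ns = DISK_SEEK + 30 * 1024 * READ_1MB_DISK
print(f"~30 GB from disk: ~{disk_ns / 1e9:.0f} s")  # ~614 s

# The same data streamed from memory:
mem_ns = 30 * 1024 * READ_1MB_MEM
print(f"~30 GB from memory: ~{mem_ns / 1e9:.1f} s")  # ~7.7 s
```

A ten-minute-versus-eight-second gap from three constants and a multiply is exactly the kind of estimate the chart enables.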
- Identify common problems & build general systems to address them. Very important not to be all things to all people.
- Paraphrasing Jeff, “That final feature will drive you over the edge of complexity“!
- Don’t design to scale infinitely – consider 5X – 50X growth. But > 100X requires redesign << very insightful
- He likes the centralized-master design and so does not suggest a fully distributed system. Of course, the master should not be in the data plane, but it can be a control-plane artifact
- He also prefers multiple small units of work per machine to one monolithic job on a big machine. Smaller units of work are easier to recover, load balance and scale. << agreed!
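The small-units point can be illustrated with a toy sketch (my own illustration, not Google’s scheduler; all names are made up). When workers pull fine-grained shards from a queue, losing a worker costs only its few small shards, not a machine-sized chunk:

```python
# Toy illustration of fine-grained work units (assumption: my own
# sketch, not Google's scheduler). Workers pull small shards from a
# shared queue, so a dead worker only forfeits the shards it held.
from collections import deque

def schedule(num_shards, workers):
    """Round-robin assignment of shards to workers from a queue."""
    pending = deque(range(num_shards))
    assigned = {w: [] for w in workers}
    while pending:
        for w in workers:
            if not pending:
                break
            assigned[w].append(pending.popleft())
    return assigned

plan = schedule(12, ["w0", "w1", "w2"])
# Losing "w1" means re-queuing its 4 small shards; with one big job per
# machine, a third of the whole computation would restart instead.
print(plan)
```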
- He concludes the talk saying there are lots of interesting “Big Data” available.
- Couple of insights from the Q & A sessions
- They run a chained sequence of M/Rs, each implementing part of a larger algorithm as a sequence of steps, rather than a Map-Reduce-Reduce-… pattern
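Chaining stages this way can be sketched with a toy in-process model (assumption: illustrative only, not Google’s MapReduce; each stage’s output list simply feeds the next stage’s map):

```python
# Toy in-process MapReduce (assumption: illustration only). map_fn
# emits key/value pairs, the "shuffle" groups by key, reduce_fn folds
# each key's values. Chaining = feeding one stage's output to the next.
from collections import defaultdict

def mapreduce(inputs, map_fn, reduce_fn):
    shuffle = defaultdict(list)
    for item in inputs:
        for k, v in map_fn(item):
            shuffle[k].append(v)
    return [reduce_fn(k, vs) for k, vs in shuffle.items()]

# Stage 1: word count over a tiny corpus.
stage1 = mapreduce(
    ["a b a", "b c"],
    lambda doc: ((w, 1) for w in doc.split()),
    lambda w, ones: (w, sum(ones)),
)
# Stage 2: invert stage 1's output to group words by their count.
stage2 = mapreduce(
    stage1,
    lambda kv: [(kv[1], kv[0])],
    lambda count, words: (count, sorted(words)),
)
print(dict(stage2))  # {2: ['a', 'b'], 1: ['c']}
```

Each stage is a full map-shuffle-reduce, which matches the “sequence of M/Rs” framing rather than one map followed by stacked reduces.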
- The predictive search is more of a resource issue and not a fundamental change in the underlying infrastructure
- He wishes they had incorporated distributed transactions as part of their infrastructure, for example in GFS et al. Many internal apps have rolled their own. BTW, Spanner has distributed transactions
- Over all an excellent talk, as always … And a Big Thanks to Jeff Dean …
[Update 11/12/10] Jeff gave a similar talk on scalability last year at WSDM 2009. Video [here] and slides [here]. Notes [here] and [here].
[Update 5/11/12] Question in Quora “What are the numbers that every computer engineer should know, according to Jeff Dean?“
I think that ‘Numbers Everyone Should Know’ chart might’ve originally come from Peter Norvig’s ‘Teach Yourself Programming in Ten Years’ http://norvig.com/21-days.html, written 5 years ago.
The chart is also much more legible there, and it comes with a very interesting essay on developing a career as a programmer.
Thanks Kevin. I have updated my post
Krishna – nice summary.
Thanks. It was a good talk …