Scaling Big Data – Impermium


Came across an informative blog post on scaling big data – “Built to Scale: How does Impermium process data?” Quick notes from the post:

  1. Don’t fall in love with a technology so much that you cannot be separated – Be flexible in scaling as you grow

    • “Parting is such a sweet sorrow”, but change is an essential component of an infrastructure at scale 
    • The technology selection and consumption should be a continuous process, introducing new technologies as needed by the growth. I found Impermium’s path from grep to Solr to Elastic Search very illuminating; I have done the same before.  
  2. Technology needs are not static

    • A corollary of #1 above – Growth on all parts of the stack will not be uniform.
    • For example Impermium found scaling challenges in search and they moved to Solr & then to Elastic Search
  3. There are no perfect technologies

    • If you are doing interesting work, be ready to tango with open source code. This is essential – I also found this to be true.
    • Even if you don’t plan to change the code, many times deep understanding comes from reading the code
  4. Select technologies that you can dance with

    • The flip side is that you should select technologies whose internals you are comfortable working with under the hood.
    • In my case, while I love Erlang, I am not that comfortable with that language. So given a chance, I will go with Java or Scala
  5. Benchmark is nothing but a story in a specific context

    • So true. Benchmarks are transitory & personal.
    • Understand them, but they need not be true for your transforms, your data model and your processing.
    • Benchmark early & benchmark often … with your scenarios, models, transformations, mapreduces & data
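
To make point 5 concrete, here is a minimal sketch of a "benchmark with your own data" harness. Everything in it is my own illustration, not something from the Impermium post: the two dedupe functions and the dataset names are hypothetical stand-ins for whatever transformations and data you actually run.

```python
import timeit


def dedupe_with_list(records):
    """Remove duplicates by scanning a list (slow when most records are unique)."""
    seen = []
    for r in records:
        if r not in seen:
            seen.append(r)
    return seen


def dedupe_with_set(records):
    """Remove duplicates with a set, preserving first-seen order."""
    seen = set()
    out = []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out


def benchmark(fn, data, repeat=3, number=5):
    """Return the best observed per-call time, in seconds, of fn(data)."""
    timer = timeit.Timer(lambda: fn(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(candidates, datasets):
    """Time every candidate on every dataset; keys are (function, dataset) names."""
    results = {}
    for data_name, data in datasets.items():
        for fn in candidates:
            results[(fn.__name__, data_name)] = benchmark(fn, data)
    return results


if __name__ == "__main__":
    # Hypothetical data shapes; substitute samples of your real workload.
    datasets = {
        "mostly_unique": list(range(2000)),
        "mostly_duplicates": [i % 10 for i in range(2000)],
    }
    results = compare([dedupe_with_list, dedupe_with_set], datasets)
    for (fn_name, data_name), seconds in sorted(results.items()):
        print(f"{fn_name:18s} {data_name:18s} {seconds * 1e3:8.3f} ms")
```

The point of running both data shapes is that the ranking of the two candidates can look very different from one dataset to the next, which is exactly why somebody else's benchmark is only a story in their context.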

Thanks Young for the short but very interesting blog. Keep up the good work …

Cheers

<k/>

All the President’s DevOps


On the heels of “All the President’s Data Scientists”, another interesting article on the Obama campaign’s cloud infrastructure.

Update : A similar article The Atlantic’s “When the Nerds Go Marching In”

Update : Case Study from New Relic How the Obama For America team improved resilience


  • They realized the campaign needed a scalable system: “2008 was the ‘Jaws’ moment,” said Obama for America’s Chief Technology Officer Harper Reed. “It was, ‘Oh my God, we’re going to need a bigger boat.’”
  • They built a single shared data tier with APIs to build lots of interesting applications. “Being able to decouple all the apps from each other has such power; It allowed us to scale each app individually and to share a lot of data between the apps, and it really saved us a lot of time.”
  • They leveraged internet architecture: “We aggressively stood on the shoulders of giants like Amazon, and used technology that was built by other people.”
  • Doesn’t look like they used esoteric technologies. The system is built around Python APIs over RDS, SQS and so forth. Excellent, and the fact that systems can be built this way is a testament to the cloud capabilities – IaaS & PaaS
  • In short, Reed says it all: “When you break it down to programming, we didn’t build a data store or a faster queue. All we did was put these pieces together and arrange them in the right order to give the field organization the tools they needed to do their job. And it worked out. It didn’t hurt that we had a really great candidate and the best ground game that the world has ever seen.”
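
The decoupling idea in the notes above can be sketched in a few lines. This is only my illustration of the pattern, not the campaign's code: an in-memory `queue.Queue` stands in for a managed queue such as SQS, a dictionary stands in for the shared data store, and all the class, function and field names are hypothetical.

```python
import queue


class SharedDataTier:
    """In-memory stand-in for a shared data store reached through an API."""

    def __init__(self):
        self._contacts = {}

    def upsert_contact(self, contact_id, **fields):
        # Merge new fields into whatever is already known about the contact.
        self._contacts.setdefault(contact_id, {}).update(fields)

    def get_contact(self, contact_id):
        return dict(self._contacts.get(contact_id, {}))


def canvassing_app(events, contact_id, outcome):
    """One app: it only knows how to publish an event."""
    events.put({"type": "door_knock", "contact_id": contact_id, "outcome": outcome})


def call_tool_app(events, contact_id, outcome):
    """Another app: it shares nothing with the first except the queue."""
    events.put({"type": "phone_call", "contact_id": contact_id, "outcome": outcome})


def drain(events, data_tier):
    """Worker: apply every queued event to the shared data tier."""
    processed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return processed
        data_tier.upsert_contact(event["contact_id"], **{event["type"]: event["outcome"]})
        events.task_done()
        processed += 1


if __name__ == "__main__":
    events = queue.Queue()
    tier = SharedDataTier()
    canvassing_app(events, "c-1", "not home")
    call_tool_app(events, "c-1", "supporter")
    print(drain(events, tier), tier.get_contact("c-1"))
```

Neither app calls the other or touches the store directly, so each can be scaled (or replaced) on its own while the data stays shared.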

Facebook Infrastructure @ New Years Eve – A study in Scalability


Another interesting article on how Facebook is preparing for New Year’s Eve, this time from our own San Jose Mercury News, by Mike Swift.

Interesting points:

  • New Year is one of the busiest times for social network sites as people post pictures & exchange best wishes

CEO Mark Zuckerberg has long been focused on having the digital horsepower to support unbridled growth — are a key reason behind the .. network’s success

  • It received > 1 B photo uploads during Halloween 2010
  • Since then Facebook has added 200 million more members, so New Year’s Eve 2012 could see more than 1.5 B uploads!
  • My favorite quote from the article:

The primary reason Friendster died was because it couldn’t handle the volume of usage it had. … They (Mark,Dustin and Sean) always talked about not wanting to be ‘Friendstered,’ and they meant not being overwhelmed by excess usage that they hadn’t anticipated

  • The engineers at Facebook just finished a preflight checklist and are geared up for the scale
  • In terms of scale “Facebook now reaches 55 percent of the global Internet audience, according to Internet metrics firm comScore and accounts for one in every seven minutes spent online around the world.”
  • From a Big Data perspective, Facebook data has all the essential properties, viz. Connected & Contextual in addition to large scale – Volume & Velocity (see my earlier blog on big data)
  • Facebook has the “Emergency Parachutes” which let the site degrade gracefully  (for example display smaller photos when the site is heavily loaded)
  • Their infrastructure instrumentation is legendary (for example, the MySQL talk here)
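
As a rough sketch of the “Emergency Parachutes” idea (serve something smaller rather than fail when the site is loaded), consider the following. The thresholds, size names and the notion of a single 0-to-1 load number are all my assumptions for illustration; the article does not describe how Facebook actually implements this.

```python
# Hypothetical thresholds: (minimum load fraction, photo size to serve),
# checked from the most degraded option to the least.
DEFAULT_THRESHOLDS = ((0.9, "thumbnail"), (0.7, "medium"))


def choose_photo_size(load, thresholds=DEFAULT_THRESHOLDS):
    """Pick a photo size for the current load (0.0 = idle, 1.0 = saturated)."""
    for minimum_load, size in thresholds:
        if load >= minimum_load:
            return size
    return "full"


if __name__ == "__main__":
    for load in (0.2, 0.75, 0.95):
        print(load, choose_photo_size(load))
```

The useful property is that the degraded path is decided per request from a measured signal, so the site sheds work gradually instead of falling over all at once.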

To manage Facebook’s data infrastructure, you kind of need to have this sense of amnesia. Nothing you learned or read about earlier in your career applies here …

 
And finally, Our New Year Wishes to all readers & well wishers of this blog 

The Jobs Logs : An ode to an icon


My heart aches, and a drowsy numbness pains
My sense, as though of hemlock I had drunk,
Or emptied some dull opiate to the drains
One minute past, and Lethe-wards had sunk:

Steven Paul Jobs is no more … Hard to believe, harder to accept and almost impossible to imagine a future with out him … Instead of lamenting I decided to draw inspiration …

For me, his Stanford speech, aptly titled “How to live before you die”, personifies Steve

“It’s more fun to be a pirate than to join the Navy”! and other quotes from this Slideshare

“Work Hard to make simple” and other quotes from Marko Saric’s blog

Huffington Post got it right when it quoted Steve – “… focus and simplicity. Simple can be harder than complex …”

In many ways the essence of Steve is the very famous quote from Gizmodo and others:

When you’re a carpenter making a beautiful chest of drawers, you’re not going to use a piece of plywood on the back, even though it faces the wall and nobody will ever see it. You’ll know it’s there, so you’re going to use a beautiful piece of wood on the back. For you to sleep well at night, the aesthetic, the quality, has to be carried all the way through.—Playboy, 1987

On a technology level, “Jobs refused to accept that software and hardware were best designed and engineered separately. For him, the venerable insight summarized by Thomas Hughes, the grand historian of American technology, as ‘the system must be first’ …” – A good POV in IEEE spectrum by Pascal

[Update 11/7/11] Malcolm Gladwell’s review, The Tweaker, in The New Yorker is an excellent read. ‘Jobs’s sensibility was more editorial than inventive. “I’ll know it when I see it,” …’.

“Was Steve Jobs a Samuel Crompton (inventor) or was he a Richard Roberts (the tinkerer)?” asks Malcolm …

As the Washington Post says: He (Jobs) would be getting off here; we were to proceed without him into the unknown. Let it go and look ahead was the message all along.


How to Embrace Failure & Influence Scalability


As we continue our experiences with 10X scalability with our object store layer and get deeper into the design and development, it dawned on us that our first and foremost criterion is to befriend failures and architect for them! We have heard these ideas before, but it always becomes real when one feels their own growing pains.

We now understand very well what it means to “design a control plane for failure and tune the data for normal ops.”

… Read more at my Egnyte Engineering blog …

In the next blogs, we will talk about specific examples of how we embrace failures and influence scalability. The principles of Carnegie are not just for humans anymore – they are equally applicable to the machines we make, even when they are asleep and dreaming of electronic sheep! Or do they?

The Power of Curiosity and Inspiration – Jack Dorsey at Stanford


The little one was out on a sleepover, so I spent an hour of the free time watching (and noting down points from) the video from Stanford’s Entrepreneurship Corner

It is just beautiful. I urge all to watch it. My notes:

  • First, a couple of very insightful points
    • You have to make every single detail perfect and you have to limit the number of details. If you pay attention to the smallest things while knowing what’s important, then everything else takes care of itself
    • “Expect the unexpected and whenever possible, be the unexpected.” – marvelous & well said !
    • Apple is a theater company and Jack draws inspiration from Apple’s mode of operation!
      • Apple, I think, is run like a theater company. It has a great sense of pacing, a great sense of story & a great sense of execution. It’s all event-driven .. & stage-driven …
  • Birth Of Twitter :
    • He developed early version of Twitter in early 2000, but shelved it – “Wrong time, good idea, put it on the shelf”.
    • In 2005 tried it again – “.. was given two weeks and one other programmer in Biz Stone to write the software.”
    • “So, that’s how that sort of visualization and early desire to see the world led into Twitter” … “And we did it”, and the rest of course is …
  • Origin of Square :
  • Power of a working product:
    • “The thing that really inspires people is a working product. When you’re pitching someone, the best thing you can do is show them something that works.” – Very good point
  • Payment is a form of communication:
    • Focusing on the user experience of money rather than the mechanics of transferring the value
  • Instrument Everything
    • Another good view point – Log, measure and test your infrastructure. They have an inference team focusing on the infrastructure instrumentation
    • “You have to instrument everything. For the first two years of Twitter’s life, we were flying blind. We had no idea what was going on with the network….”
  • Power of Story Telling & User Narratives:
    • Tell the story from a user perspective – like a play. One epic cohesive story, not a chain of short stories – solve a big problem
    • The product features fall out naturally from the user story
  • CEO As Chief Editor of the company’s Story:
    • Every day the company generates “1000s of things that we could be doing but there’s only one or two that are important. As an editor, [the CEO is] constantly taking all these inputs and deciding on that one or that intersection of a few that makes sense for what we’re doing.”
      1. Editing People in and out
        • “… it’s always minding that team dynamic because at the end of the day, we’re just a group of people working on one single goal. If we can’t step in a cohesive coordinated fashion, then we’re going to trip all over the place…”
      2. Internal & External Communication stories
        • ” If you have that sort of high-level, this is where we’re going, this is the vision, this is the next 30 days … , it makes it very, very easy to set priorities and for all of the edges of the company to set their own priorities to do the right thing …”
      3. Editing the money (Revenue, Investors,…)
  • The Q & A had a few good insights as well.
    • Marketing Strategy
      • ” .. trying to do now is identify the key influencers in those merchant areas and make them distribution points.”
      • “A lot of the way I think about marketing is through the product itself. So, I think the marketing function, the best aspect and the best it can do is surface the product as much as possible.”
      • Understand the product introduction & adoption cycle
        • [They] have about three to five seconds to inspire someone to take action to actually get Square
        • then, [they] have about a week to get them to participate more – that’s by taking in transaction.
        • then, about a month to get them to be users forevermore.
      • Consumer Internet : “The more you can minimize the thinking around the mechanics in the moment, then more people are going to use it, more people are going to feel good about it.”
    • The importance of getting an idea out of one’s head & the cycle it follows -
      • ” … you need to get it out of your head. The reason you have to get it out of your head is you need to be able to see it on a surface that is not in your mind.”
      • “Once you can see it and once you can step back from it, then you can also decide to share it with others”
      • the idea either gathers momentum
      • or you can decide to shelve it
    • Square is “focused on the payment experience and all the information and all the platform around payments…. . It’s building that cohesive story end to end.”
    • “… I think of Square as a startup with many startups inside of it. That’s how we’re organizing the company internally. We’re going to have a lot of different projects. They’ll be coordinated by this one cohesive unit outside.” – interesting
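
The “Instrument Everything” note above is the most directly technical point in the talk, so here is a minimal sketch of what a first step might look like: a decorator that counts calls, errors and time spent per function. This is my own illustration; the `Metrics` class and its fields are hypothetical names, and a real system would ship these numbers to a monitoring service rather than keep them in memory.

```python
import functools
import time
from collections import defaultdict


class Metrics:
    """Tiny in-process registry of call counts, error counts and elapsed time."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def instrument(self, fn):
        """Decorator: record every call to fn, including calls that raise."""
        name = fn.__name__

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                self.errors[name] += 1
                raise
            finally:
                self.calls[name] += 1
                self.total_seconds[name] += time.perf_counter() - start

        return wrapper


if __name__ == "__main__":
    metrics = Metrics()

    @metrics.instrument
    def charge(amount_cents):
        # Hypothetical operation being measured.
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return amount_cents

    charge(500)
    try:
        charge(0)
    except ValueError:
        pass
    print(dict(metrics.calls), dict(metrics.errors))
```

Even something this small answers the “flying blind” questions: how often is a path hit, how often does it fail, and how long does it take.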

GigaOm has a good article – Jack Dorsey on Square, How It Works & Why It Disrupts

Another one in Technology Review – The New Money

My Next stop : Software For Data Analysis by Chambers

And after that : Battlestar Galactica – The Mini Series (relevant especially in light of the new Computer Overlords !)


Google – A study in Scalability and A little systems horse sense


Google’s Jeff Dean gave an excellent talk at Stanford as part of EE380 – it is worth one’s time to listen. Very informative, instructive and innovative. As I listened, I jotted a few quick notes.

  • Interesting comparison of the scale in search from 1999 to 2010
    • Docs and queries are up 1000X, while the query latency has decreased 5X
    • Interesting to hear that in 1999 they used to update a web page store in a month or two, but now it is reduced 50000X to seconds!
  • They have had 7 significant revisions in 11 years
  • Trivia : They encounter very expensive queries; for example, “circle of death” requires ~30GB of I/O
  • Trivia : In 2004, they did a rethink and refreshed the systems infrastructure from scratch
  • He discussed a little about encodings – informative discussion on Byte aligned variable length & group encoding schemes << I have to try it out …
  • Trivia : They have had long-distance link failures caused by wild dogs, sharks, dead horses and (in Oregon) drunken hunters !
  • Jeff talked at length about MapReduce. An interesting set of statistics of MapReduce over time
    • MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
    • Data has doubled while the number of machines has been constant from ‘07 to ‘10.
    • Machine usage has quadrupled while the job completion has doubled ‘07 to ‘10
    • Trivia : Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co were running MapReduce. They did lose machines in a strange pattern and were wondering what is going on; but the job succeeded, a little slower than normal and of course, the machines came back up !  Only after the fact did they hear about the network rewiring !
  • He is working on a project called Spanner, that spans data centers. Looks like this is one of their hairy problems. Dean also mentioned this during Q & A. All their systems work well inside a datacenter, but have no way of spanning datacenters. They have manual methods & task specific tools to copy data across datacenters, monitor tasks across datacenters and so forth. But no systemic infrastructure.
    • Declarative Policies, Common namespaces, transactions, strong and weak consistency and automation are all parts that Spanner addresses
  • I think the most important part of the talk was the final section on experiences & patterns
    • Break large systems into smaller services. << We have heard this from Amazon as well. The Google page touches 200+ services. (The same is true of an Amazon page.)
    • One should be able to estimate performance based on back-of-the-envelope calculations
    • I am compelled to insert Jeff’s “Numbers Everyone Should Know” as it is a very useful chart. I hope Jeff doesn’t mind. [Update 11/17/10] Thanks to Kevin Le, this chart comes from Norvig’s blog.
    • Identify common problems & build general systems to address them. Very important not to be all things to all people.
      • Paraphrasing Jeff, “That final feature will drive you over the edge of complexity“!
    • Don’t design to scale infinitely – consider 5X – 50X growth. But > 100X requires redesign << very insightful
    • He likes the centralized master design & so does not suggest a fully distributed system. Of course, the master should not be in the data plane but can be a control plane artifact
    • He also prefers multiple small units per machine rather than one huge job running on a big machine. Smaller units of work are easier to recover, load balance and scale. << agreed!
  • He concludes the talk saying there are lots of interesting “Big Data” available.
    • I have seen this emergence of Data Scientists from multiple sources, here and here. << I agree, my main focus as well …
  • Couple of insights from the Q & A sessions
    • They run a chained sequence of M/Rs, each implementing some part of a larger algorithm as a sequence of steps, rather than a Map-Reduce-Reduce-… pattern
    • The predictive search is more of a resource issue and not a fundamental change in the underlying infrastructure
    • He wishes they had incorporated distributed transactions as a part of their infrastructure for example : in GFS et al. Many internal apps have rolled their own. BTW, Spanner has distributed transactions
  • Overall an excellent talk, as always … And a Big Thanks to Jeff Dean …
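
Since I noted above that I wanted to try out the encodings, here is a minimal sketch of the two ideas: a byte-aligned variable-length integer (7 payload bits per byte plus a continuation bit) and a group encoding that packs four values behind a single tag byte holding their byte lengths. The exact bit layout of the tag and the little-endian byte order are my assumptions; the layout in Jeff's slides may differ.

```python
def varint_encode(n):
    """Encode a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    if n < 0:
        raise ValueError("varint_encode needs a non-negative integer")
    out = bytearray()
    while True:
        low_bits = n & 0x7F
        n >>= 7
        if n:
            out.append(low_bits | 0x80)  # continuation bit set
        else:
            out.append(low_bits)
            return bytes(out)


def varint_decode(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def group_encode(values):
    """Encode exactly four 32-bit values: one tag byte, then 1-4 bytes per value."""
    if len(values) != 4:
        raise ValueError("group_encode needs exactly four values")
    tag = 0
    body = bytearray()
    for i, v in enumerate(values):
        if not 0 <= v < 2 ** 32:
            raise ValueError("values must fit in 32 bits")
        length = max(1, (v.bit_length() + 7) // 8)
        tag |= (length - 1) << (2 * i)  # assumed layout: value i uses tag bits 2i, 2i+1
        body += v.to_bytes(length, "little")
    return bytes([tag]) + bytes(body)


def group_decode(data, pos=0):
    """Decode one group of four values; return (values, next position)."""
    tag = data[pos]
    pos += 1
    values = []
    for i in range(4):
        length = ((tag >> (2 * i)) & 0x3) + 1
        values.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return values, pos


if __name__ == "__main__":
    print(varint_encode(300).hex())
    print(group_encode([1, 15, 511, 131071]).hex())
```

The appeal of the group form is that the decoder reads all four lengths from one byte up front instead of testing a continuation bit on every byte, which means far fewer branches when decoding long lists.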

[Update 11/12/10] Last year Jeff gave a similar talk on scalability at WSDM 2009. Video [here] and slides [here]. Notes [here] and [here].

[Update 5/11/12] Question in Quora “What are the numbers that every computer engineer should know, according to Jeff Dean?”
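
To show how such numbers feed a back-of-the-envelope estimate, here is a small sketch comparing a serial and a parallel design for reading 30 thumbnails of 256 KB each from disk. The latency constants are the approximate, order-of-magnitude values as I remember them from the widely circulated chart, so treat them as assumptions; different versions of the chart give somewhat different figures, and real hardware varies.

```python
# Approximate latencies in nanoseconds (assumed values, order of magnitude only).
DISK_SEEK_NS = 10_000_000        # one disk seek, roughly 10 ms
READ_1MB_DISK_NS = 20_000_000    # read 1 MB sequentially from disk, roughly 20 ms


def disk_read_ns(size_bytes):
    """Estimate one seek plus a sequential read of size_bytes from disk."""
    return DISK_SEEK_NS + READ_1MB_DISK_NS * size_bytes // 1_000_000


def serial_ns(count, size_bytes):
    """Design A: read the items one after another from a single disk."""
    return count * disk_read_ns(size_bytes)


def parallel_ns(count, size_bytes):
    """Design B: issue all reads at once to different disks; wait for the slowest.

    The count does not change the estimate because every read is assumed
    to take the same time and to hit its own disk.
    """
    return disk_read_ns(size_bytes)


if __name__ == "__main__":
    thumbnails, size = 30, 256 * 1024
    print(f"serial:   {serial_ns(thumbnails, size) / 1e6:.0f} ms")
    print(f"parallel: {parallel_ns(thumbnails, size) / 1e6:.0f} ms")
```

The estimate ignores queueing, network hops and variance, but it is enough to see that the two designs differ by more than an order of magnitude before writing any real code.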

Dragnet in South Lake Union


Ladies and gentlemen: the story you are about to read may be true. Only some of the names have been changed to protect the guilty.
Tuesday, November 02, 2010.

It was cool in Silicon Valley, there was a slight drizzle and a chill in the air.
We were working the day watch out of the city. The boss is Captain Glenn. My partner’s Ram. He’s a good player. I am Detective Sergeant Joe Friday.
We got a call about an unsavory talk at Stanford and a blog. We checked it out. Looks like we were here only yesterday hauling Larry to the court.
The perspectives talk was still going on. One of the networkers spoke with us.
“It is terrible, officer,” she said. “He is talking about new business models, OpenFlow and even OPEN SOURCE … The world is coming to an end …”
“All we want are the facts, ma’am”
“ Shh ..” said Ram, “Let us listen to what James has to say …”
“… the networks are in my way of optimizing my cloud infrastructure …”, said James
Ram was showing off his new iPhone 4 and pointed to James’ blog. What next, checking e-mails on cell phones ? I thought …

I remember hearing the same logic last year, as we were cruising the campus. But this year Hamilton has more conviction in his voice; maybe the technologies have matured, I thought.
But just jerry-rigging merchant silicon and a multi-core CPU with open source software does not make a robust datacenter switch – even cops in this city understand that much technology, I mused. I remembered that the retailer had a big “gossip” failure some time back and a friend attributed it to having protocols not at the lowest optimum layer …
We hauled the complainants and the defendants to the court …
There was an Amicus Curiae filing for the complainants by a famous blogger waiting for us at the court …. I knew the blogger by fame – he has lots of “wisdom” especially about the clouds, so I listened …

“There is a common belief that routers are not power efficient because a fully configured router takes as much power as four racks full of servers – but it misses the fact that the router has to handle all the communication between those 100s of servers. The density of work reflects the power consumption and associated circuitry …”, the brief said. I agree … almost my words !
“A MapReduce work load will shuffle the data across the network more than once”, the brief continued, “and Hamilton is correct in asking for a flat fast network without the cumbersome overhead of traditional network protocols. In fact, we have it in our big data/hadoop lab infrastructure; and some of our customers do just that; and our switches work optimally when configured that way … but then you need to give them enough power to handle that load …”

I remember seeing somewhere in the net that when you need Arnold, Danny DeVito in elevator shoes will not work … A Prius engine won’t be fit for an airliner …

So while the technical point about a versatile, orthogonally extensible, programmable network control plane with a cohesive and coherent programming model plus appropriate interfaces is the right direction, all the hype about open source and merchant silicon at scale looked like a red herring to me. Also the domain that the retailer is talking about is very specific, not yet main stream with the enterprises …

But Hamilton has lots of valid points and we should give credit where credit is due. The smart and innovative engineers at the retailer’s shop have done pioneering work in the cloud space. So we all should listen to them … and ignore the red herrings! No doubt, the issues Hamilton raises will definitely be mainstream challenges of tomorrow …

The judge was intrigued by a software defined network and ordered John and Kevin to add Open Controller Interface to their widgets. The engineers at the network shops seemed relieved; it looks like they had been expecting this for some time…

His honor was not amused by the open source and new business model argument …
In his closing remarks, “As an advocate for open source at the network and server, will you open source your cloud implementation, in the name of a new improved business model?” he asked the retailer with the perspective … There was silence …

The court was adjourned with advice to revisit this topic after six months …