On the heels of “All the President’s Data Scientists”, another interesting article on the Obama campaign’s cloud infrastructure.
Update: A similar article, The Atlantic’s “When the Nerds Go Marching In”
Update: Case study from New Relic, “How the Obama For America team improved resilience”
- They realized the campaign needed a scalable system: “2008 was the ‘Jaws’ moment,” said Obama for America’s Chief Technology Officer Harper Reed. “It was, ‘Oh my God, we’re going to need a bigger boat.’”
- They built a single shared data tier with APIs to build lots of interesting applications: “Being able to decouple all the apps from each other has such power; it allowed us to scale each app individually and to share a lot of data between the apps, and it really saved us a lot of time.”
- They leveraged internet architecture: “We aggressively stood on the shoulders of giants like Amazon, and used technology that was built by other people.”
- It doesn’t look like they used esoteric technologies. The system is built around Python APIs over RDS, SQS and so forth. Excellent; the fact that systems can be built this way is a testament to the cloud capabilities – IaaS & PaaS
- In short, Reed says it all: “When you break it down to programming, we didn’t build a data store or a faster queue. All we did was put these pieces together and arrange them in the right order to give the field organization the tools they needed to do their job. And it worked out. It didn’t hurt that we had a really great candidate and the best ground game that the world has ever seen.”
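The decoupling Reed describes can be sketched in a few lines. This is only an illustration, not the campaign's code: Python's in-process `queue.Queue` stands in for a managed queue such as SQS, and the app names and event fields are hypothetical.

```python
import queue
import threading

# Stand-in for a managed queue such as SQS (assumption: in-process analogy only).
events = queue.Queue()
results = []

def canvassing_app(n):
    """Hypothetical producer app: publishes events instead of calling other apps."""
    for i in range(n):
        events.put({"type": "door_knock", "voter_id": i})
    events.put(None)  # sentinel: no more events

def analytics_app():
    """Hypothetical consumer app: drains the queue at its own pace."""
    while True:
        event = events.get()
        if event is None:
            break
        results.append(event["voter_id"])

consumer = threading.Thread(target=analytics_app)
consumer.start()
canvassing_app(5)
consumer.join()
print(results)
```

Neither app knows about the other; each only knows the queue, which is what lets them be scaled or replaced independently.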
Finally we have our VPC and Mongo replica sets working. I still have to figure out the snapshots. Some notes – would appreciate comments, ideas, insights & wisdom. I have the full slides at slideshare.
I will post my notes from snapshot configuration …
“To cloud or not to cloud … ” That is the question many are asking in the wake of the news surrounding the seizure of MegaUpload last week. Aside from being a pun on Hamlet’s soliloquy, this is a very poignant question, because cloud storage is becoming an inseparable part of modern life (be it Apple’s iCloud, Microsoft’s SkyDrive, Amazon’s Cloud Drive, Egnyte or the ‘box’es …) for consumers as well as enterprises.
In many ways the question is not whether to use cloud storage, but how one can use cloud storage effectively and minimize the risk of business disruption.
I have a few pointers in that direction …
- Don’t mix Consumer Cloud Services & Enterprise Class Services
- Use hybrid storage cloud rather than a pure cloud-only service
- Manage the data lifecycle effectively
- Match the business requirements and the domain impedance
- Pay attention to data interoperability
Please allow me to explain … The gory details at my Egnyte blog …
Another interesting article on how Facebook is preparing for New Year’s Eve, this time from our own San Jose Mercury News, by Mike Swift.
- New Year is one of the busiest times for social network sites as people post pictures & exchange best wishes
CEO Mark Zuckerberg has long been focused on having the digital horsepower to support unbridled growth — are a key reason behind the .. network’s success
- It received > 1 B photo uploads during Halloween 2010
- Since then Facebook has added 200 million more members, so New Year’s Eve 2012 could see more than 1.5 B uploads !
- My favorite quote from the article:
The primary reason Friendster died was because it couldn’t handle the volume of usage it had. … They (Mark, Dustin and Sean) always talked about not wanting to be ‘Friendstered,’ and they meant not being overwhelmed by excess usage that they hadn’t anticipated
- The engineers at Facebook just finished a preflight checklist and are geared up for the scale
- In terms of scale “Facebook now reaches 55 percent of the global Internet audience, according to Internet metrics firm comScore and accounts for one in every seven minutes spent online around the world.”
- From a Big Data perspective, Facebook data has all the essential properties, viz. Connected & Contextual, in addition to large scale – Volume & Velocity (see my earlier blog on big data)
- Facebook has the “Emergency Parachutes” which let the site degrade gracefully (for example display smaller photos when the site is heavily loaded)
- Their infrastructure instrumentation is legendary (for example, the MySQL talk here)
To manage Facebook’s data infrastructure, you kind of need to have this sense of amnesia. Nothing you learned or read about earlier in your career applies here …
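The graceful-degradation idea behind the “Emergency Parachutes” mentioned above can be sketched as a simple load-based switch. This is a minimal sketch under stated assumptions: the function name, the load thresholds and the size labels are all hypothetical and are not taken from Facebook's implementation.

```python
def choose_photo_size(load, thresholds=((0.9, "thumbnail"), (0.7, "medium"))):
    """Pick the photo size to serve for a site load in [0, 1].

    The thresholds are hypothetical; they are checked from highest to lowest,
    so heavier load yields a smaller (cheaper) photo.
    """
    for limit, size in thresholds:
        if load >= limit:
            return size
    return "full"

for load in (0.5, 0.75, 0.95):
    print(load, choose_photo_size(load))
```

The point of the design is that the site keeps answering under heavy load, just with cheaper responses, instead of failing outright.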
And finally, Our New Year Wishes to all readers & well wishers of this blog
This week I am attending the biennial High Performance Transaction Systems Workshop – HPTS 2011 (Agenda). I was expecting exciting discussions, insightful wisdom and overall stimulating company – and was not disappointed.
IMHO, the highlights till Day 1½ (Stardate -311188.01369863015) were the NOSQL & Big Data discussions by Netflix (Adrian), Facebook (Kannan), eBay (Tom Fastner) & Microsoft (Ed Harris). There were other good presentations (James Hamilton/Amazon, Ike Nassi/SAP, Charles Lamb/Oracle, …) which I will discuss in another blog.
- We heard about Netflix’s use of AWS, Facebook Messaging Infrastructure, eBay Analytics Platform & Microsoft OSD (On-line Services Division)
- Facebook Message Infrastructure:
- 6 B+ Messages/day
- Average write 16 records across multiple column families
- 2+PB LZO compressed (6+PB uncompressed) in HBase
- Growing 250 TB/Month
- eBay Analytics
- >100 PB
- >50TB new data/day
- eBay sacrificed concurrency for capacity & speed with Teradata
- They can do a full table scan across PB of data in 32s !
- eBay has a private network across the Vegas and Phoenix datacenter with 20-40GB bandwidth
- Each datacenter has the full Teradata, Singularity, Hadoop stack
- The NOSQL datastore, the deployment topology, DR and the HA practices reflect the path chosen by the respective companies
- Netflix wanted cross DC replication with Availability as the main criterion. So they are using Cassandra
- Facebook has a cell architecture with users sharded to one cell; their goal was strong consistency, automatic failover, MapReduce and so forth. So they chose HBase
- eBay has a very systematic way of looking at the continuum as Structured, Semi Structured and Unstructured.
- Structured – Analyze & Report (6PB data, compressed to 1.6PB)
- Unstructured – Discover & Explore (20 PB data)
- Semi-structured – both in some mixture (40 PB data)
- So they chose Teradata for structured & Hadoop for unstructured
- Microsoft is using Dryad and a declarative layer called SCOPE on a virtual cluster architecture for their analytics platform called Cosmos
- Netflix is ~100% cloud-based
- eBay has the largest Teradata installation – 256 active nodes w/ a capacity of 84PB & the 3rd largest Hadoop Installation!
- Most probably Netflix is among the top three largest Cassandra installations
- Largest Cassandra installation (known) is 400 nodes, 300TB (I have a strong doubt it is Netflix!)
- Oracle NOSQL is CA (w.r.t CAP) because it is on top of BDB.
- The consensus is that AP or CP is more interesting from a NOSQL perspective
- The basic sources of behavioral analytics are (Microsoft):
- Web Pages,
- Search Log,
- Browser Log &
- Advertisement Log
- Connected, Contextual Big Data, as I had written before
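A few of the figures quoted above are easier to appreciate as ratios and rates. The arithmetic below is a back-of-the-envelope sketch; it assumes decimal units (1 PB = 10^15 bytes), takes the “2+”, “6+” figures at their lower bounds, and assumes the eBay scan covered exactly 1 PB, which the talk did not specify.

```python
PB = 10**15  # decimal petabyte in bytes (assumption)
TB = 10**12  # decimal terabyte in bytes (assumption)

# Facebook messages: 6+ PB uncompressed stored as 2+ PB with LZO.
compression_ratio = (6 * PB) / (2 * PB)

# Facebook messages: growth of 250 TB/month, expressed per year in PB.
yearly_growth_pb = 250 * TB * 12 / PB

# eBay: full table scan in 32 s, assuming exactly 1 PB was scanned.
scan_rate_tb_per_s = (1 * PB / 32) / TB

print(compression_ratio, yearly_growth_pb, scan_rate_tb_per_s)
```

Under those assumptions the compression ratio is about 3x, the store grows by roughly 3 PB a year, and the scan corresponds to an aggregate rate of about 31 TB/s.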
Gory Details (a.k.a Guided tour through the slides):
- This section is WIP. The presentations are not yet up. I will point to the presentations when they are available
- No SQL Eco System <- Good slides with a couple of good observations
- Storage Infrastructure for Facebook messages [Slides]
- Slide #3 – Why they chose HBase is interesting
- Slide #11 – Shadow Testing strategy is informative. Testing at scale is always a challenge
- Slide #28 – Scares & Scars – a must read
- Some slides to study (I will point them out in the set as the presentations are on-line)
- The Netflix Cloud backup & DR topology covering all failure scenarios – even AWS account malfunction ! (Hint: There is a Read-Only copy in S3 with a different account!)
- They have done what I call “Design a control plane for failure and tune the data for normal ops.” in one of my blogs
- eBay’s table structure that has characteristics of SQL & NOSQL
- examples of Path Analysis extension to Teradata by eBay
- eBay’s Platform Metrics comparison of Teradata, Singularity and Hadoop
- While Hadoop has some good qualities, it also consumes more resources than Teradata
- Many more …
- Facebook presentation notes from James Hamilton
- Microsoft Cosmos notes from James Hamilton
And Finally Some interesting Remarks:
Datacenter exothermic incident caused by analytics applications running at 85% CPU (Microsoft)
One way or another we all are part of some experiment – A/B testing or analytics-based preference – and all our actions end up in one of these platforms !
Believe it or not, even I ended up presenting on Precision Time Synchronization on Day 1! Last minute fill-in, Thanks to Pranta. In Hadoop terms “speculative execution” !
- The number of options for (NOSQL) persistence doubles every 1.5 years
- If it is not in memory, it is not data
- Analytics – combine data in surprising ways (Microsoft)
I liked the book a lot; it reads like a thriller – at least to me. I couldn’t put it down and was reading it late at night, during work days – to the chagrin of the family !
Steven Levy has clearly chronicled Google’s ascent and the tribulations it encountered – internal and external – on the way. What is more interesting is that he has written a set of very crisp & detailed explanations of the innovations that Google brought into the search & advertisement domains.
I agree with Steven that Google is a “clever internet-startup-named-after-a-100-digit-number turned into a corporate phenomenon”. It is very interesting to read about its agony on the way to IPO (and the ecstasy of the investors!). If Google had its way, it would have added a requirement of a minimum SAT score (and a Stanford PhD – at least an MMDS Certificate) for buying its shares ! I am forced to quote Scott Reeves (Forbes, Aug 2004) on Google’s targeted price of $108/$135: “Only those who were dropped on their head at birth [will] plunk down that kind of cash for an IPO” – ouch ! (I myself was ready for around $50)
Google – A sum of its Obsessions
Search (Of course!)
- PageRank, of course, refers to Larry Page’s Ranking Algorithm ! The PageRank estimates the importance of a page by the web pages that link to it. “We convert the entire web into a big equation with several hundred million variables”
- The concept of signals – viz factors like terms, capitalization, font size, position et al – as traits added with PageRank is the secret sauce that made Google’s search very effective.
- The search engines get major and minor rewrites “like changing the components of a flying plane – without the passengers knowing about it, but the ride becomes more comfortable and they get there faster” – not a perfect analogy but an effective simile!
- The engineers fret about any queries that do not get answered on the first page – in many ways clicking next page in a search result is a failure of the brilliant engineers behind the search engine. You have to read about the query “Audrey Fino” that vexed Amit Singhal, Google’s search engine chief. The search showed lots of Audrey Hepburn and that bothered Amit – “There’s a person somewhere named Audrey Fino and we didn’t have the smarts in the system to know this” – and the remedy was, of course, to quote Steven, a multi-year name detection and name classifier “algorithmic therapy” with a dash of “bigram breakage” added to taste !
- Rock is rock unless it has Little in front of it (when it becomes the capital of a state) or, if preceded by Noah, becomes ark ! Another such query was “Eika Kerzen”, which requires translation (to German in this case) to get to the right search result.
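The “big equation” behind PageRank mentioned above can be illustrated with a tiny power iteration. This is a textbook-style sketch, not Google's production algorithm: the damping factor of 0.85 is the conventionally quoted value, the four-page `toy_web` graph is made up, and the signals described above are not modeled at all.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from the pages that link to each page.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))
```

Here "c" comes out on top because three pages link to it, and "a" follows closely because the important page "c" links to it; that transfer of importance through links is the core of the idea.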
Algorithmic purity & ubiquity
- Google is an algorithmic company driven by computer science ! We can see that everywhere – in successes and failures. For example, the amount Google set out to raise at its IPO ($2,718,281,828) spells out the digits of Napier’s constant e ! During the bidding for Nortel’s patents, Google was bidding numbers like pi
- Even the Google ad sales people consider themselves mediators between Madison Avenue and algorithms – only Google can put both words in the same sentence, make it sensible and in the process create an industry where it makes billions of dollars – as one SEO chief puts it “It is not we want to put all our boxes in one basket, but there is only one basket in the industry”
- The great lengths the Google team would go to make search relevant are exemplified by the “running shoes gnome sculpture”. The engineers believed in algorithmic purity – and before the launch of the Froogle product search, “running shoes” would show a “garden gnome sculpture that happened to wear sneakers”. The team could not ship a product that failed to differentiate between lawn art and footwear. It seems within a couple of days the offending link disappeared ! And the team learned that one of their teammates went ahead and bought the one-of-a-kind sculpture, thus taking it off the web site ! “The algorithm started showing the right results, … and we launched!”
- Search algorithmics sometimes had very strange effects – like showing the now defunct main office of Bell Telephone for a query “weather.com Philadelphia” – the reason being that the telephone company used to tell the weather over the phone and this factoid was unearthed by the search algorithm !
- It is interesting to read how Google re-invented the bidding system as a “Vickrey second-bid auction system” because the engineer (Eric Veach) wanted to avoid “bid shading”. In the end, like anything else that Google touches, they created an innovative system that combined factors like bidding and ad positioning, adding competition & customer satisfaction, and creating a rolling revenue stream in the order of billions of dollars for Google – all in all a nifty feat!
- The concept of compressing data to understand it was a brilliant stroke – the Google project called Phil (Probabilistic Hierarchical Inferential Learner) resulted in understanding the essence of web pages and …. Contextual matching ads with the web page’s content service called “Google content-targeted advertising” which later became AdSense (after acquiring the company Applied Semantics!)
- The success of their algorithms (which gave the eigenvector some credence) and the change of scale that came with it was what made Google Google ! As Luiz Barroso observed, “There are programs that do not run on anything smaller than a 1000 machines, which means you are looking at a datacenter as a computer”
- Google affects whatever it touches in unpredictable ways – for example, Google’s racks maxed out the power & cooling at Exodus to the point that Exodus drove an 18-wheeler up to the colo, punched 3 holes in the wall and pumped cold air into Google’s cage through PVC pipes!
- As I was reading the book, there were a few people I knew who played prominent roles in Google – I was wondering when Hal Varian would show up – he did (P.116) and stayed relevant in a lot of pages with his team of “econometricians”, a cross between statisticians and economists !
- I was also wondering when Sundar Pichai would show up; he did (P.205) and remained relevant as Steven eloquently narrated the advent of Google Chrome and the JavaScript engine V8 … leveraging Google’s insistence on speed …
- Steven has interviewed most of, if not all, the technology leaders and we get to meet them at the relevant topics.
- I think building 40 is called Building 0 or Nullplex. It is interesting as I work next door – in the only non-Google building among the sea of bicycle-trotting Googlers !
- Page’s Law according to Brin – “Every 18 months, software becomes twice as slow” !
- Danger, which Andy Rubin cofounded, moved into the Palo Alto office when Google moved out of it in 1999 ! Eventually he left Danger and started Android …
- Google was always structured like a PhD program dorm in a university – as Andy Rubin puts it “There is an implied grading on a 4.0 scale of the questions during interview and anybody less than 3.0 is rejected; the GPS (Google Product Strategy) meetings are run like PhD defenses”
- As told by Alan Eustace to Andy Rubin “Google’s brain is like a baby’s – an omnivorous sponge that was always getting smarter from the information it soaked up”!
- “We want Google to be that third half of your brain” – Sergey, P.386
- “It’s quite amazing how the horizon of impossibility is drifting these days” – Thrun
- The locus and trajectory of Google – “put Google in the driver’s seat on many decisions – large and small – that people make in the course of a day and their lives” ! [P.68]
- In this review I touched only on a minimal set of interesting points (interesting to me!). The book has a lot of good reading, from Google’s China syndrome to how the Googlers shaped the last presidential election and later worked for the Obama administration, to controversies like Google Street View and the struggle with digitizing books.
- One important development that Steven couldn’t include, due to the timing of the release of the book, was Google+. But don’t despair – Steven has written that part of the story as an article in Wired ! Best to read it after finishing the book.
- ReadWriteWeb has an article on the data scientist behind Google+
- And Steven’s blog on the Motorola Mobility purchase is another good read, again an important step by Google.
- I just now saw a write-up by InfoWorld on Google’s 5 biggest hits and misses.
- Next book on my reading list: “I’m Feeling Lucky: The Confessions of Google Employee Number 59” by Douglas Edwards; it is on hold, 3 of 7, from the San Jose public library.
The other day some of us visited the facilities of our systems supplier as a part of our new cloud object store development. It is a good facility with capabilities to rack and stack, plus a world class testing setup. …
We came back with an appreciation of what they do as well as a few important insights. Let me iterate the main points – you can find the gory details at my Egnyte Engineering Blog[Link].
- A modern digital multimedia world is the genesis for innovative designs in cloud storage systems – from dual systems connected by a 10GB internal link to temporal SSD caching …
- The life cycle of a cloud storage hardware system is interesting, to say the least – a lot of work is involved in the procurement-install-deploy-support-troubleshoot-maintain sequence, especially the latter parts
- Disk is the new tape, literally – companies are using 60 disk systems for backup of stuff like old checks, bills et al. Internal storage clouds (and even external storage clouds) will replace tape systems for many storage needs
More at my Egnyte Engineering Blog[Link]
Of course we all know that even Kansas is not the old Kansas anymore. Its capital is Google, KS and it has the fastest wireless!
Based on what I am seeing in the blogosphere as well as offerings from companies, I think a transition from Big Data to Smart Data is happening – and it is a good thing. A quick diagram first and then some musings …
- Big Data = data at scale, connected and probably with hidden variables …
- Smart Data = Big Data + context + embedded interactive (inference/reasoning) models
- [Update via @kidehen] Smart Data = Linked (Big) Data + Context + Inference
- And Analytic clouds turn Big Data into Smart Data
- Smart Data is (inference) model driven & declaratively interactive
- Contextualization – includes integration with BI systems as well as data mashup with structured and unstructured data
- The information like Wikipedia is big data; the in-memory representation Watson referred to is smart data
- Device logs from 1000 good mobile handsets and 1000 not-so-good phones is big data; a gam or glm over the log data after running through several stages of MapReduce is smart data
We have been seeing many waypoints along this trend towards smart data:
- GigaOm mentions “Organizations won’t necessarily need data scientists to “turn information into gold” if the data scientists employed by their software vendors have already done most of the work. Think about it like functions within spreadsheet applications tuned to specific industries, or … PaaS … Just feed the application some data, push a button, and get results …”
- My take – Big data will get smarter with embedded models and Analytics clouds will turn big data into PaaS ! My thoughts exactly ! Analytics frameworks will mature !
- So while the profession of Data Scientists is in no danger, smarter data will make it easy for everybody to understand data and make inferences …
- An interesting article in Channel Register screams “Platform wants to out-map, out-reduce Hadoop Teaching financial grids to dance like stuffed elephants”
- While I don’t want to comment on the headline force-fitting all the Hadoop buzzwords, I do think this is an important development ! From framework to de facto interface standard !
- Makes sense in many ways
- The Hadoop API has a declarative way of specifying the data parallelism, while still enumerating what one wants to do at the data processing stages
- There is nothing else available and, as you mention, it has become the de facto standard
- Actually the popularity is not accidental, because it has been evolved by many as they use it to solve their problems (in fact the evolution was much better than what Google has done internally – see link)
- One reason Platform chose to implement the API might be because then it is easy for folks to try out with Hadoop and then migrate to Platform’s Symphony – proposing and implementing a new set of programming models and interfaces would have been unsuccessful
- Actually Hadoop is a PaaS, with its own stack that reaches out as deep as the network adjacency of racks (simple yet effective) and as high as application topology and everything in between (from storage to block size to number of tasks)
- What it did not do was the cloud aware elastic infrastructure which is being addressed with Hadoop 2.0 (another link)
- [Update] “Data will be the platform for Web 3.0”, says LinkedIn’s Reid Hoffman – yep, of course, smart data !
- And MapIncrease can add pizazz …
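The split noted above, where the framework owns the data parallelism and the user only states what happens at each processing stage, can be sketched with a toy word count. This is a single-process Python illustration of the programming model only; it is not the Hadoop Java API, and `run_job`, `map_phase` and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict

def map_phase(record):
    """User code: emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """User code: combine all values seen for one key."""
    return key, sum(values)

def run_job(records, mapper, reducer):
    """Toy 'framework': groups mapper output by key, then calls the reducer.

    A real framework would distribute both stages across machines; here the
    shuffle is just an in-memory dictionary.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

counts = run_job(["big data", "smart data"], map_phase, reduce_phase)
print(counts)
```

Because the user code never mentions where or how the stages run, another engine can implement the same interface underneath, which is why implementing the API rather than inventing a new one makes sense.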
P.S: Interesting blog on data blog influencers. We are #274, just making it into the top 300 !
New developments are happening in the world of Hadoop – even I had written a few blogs about it !
The latest blog from Arun has more insight and ideas on the scheduler and resource management …
What caught my attention was the fact that two of my worlds suddenly converged – the world of Hadoop and the world of OpenStack … And they go well together … like the proverbial PB and Jelly … !
Let us quickly look at a few synergies …
- From a Hadoop Cluster to a Hadoop Cloud
- The decoupling of application management & resource allocation opens up a host of possibilities, from leveraging elasticity to enterprise extension and data adjacency. Now the Hadoop application (which BTW could be an acyclic graph of MapReduce jobs!) can ask for resources in a declarative manner and the OpenStack cloud resource manager can allocate based on the cloud infrastructure primitives
- Another important capability is the policy based MapReduce for multi-tenancy, compliance and even just resource leveling. The same MapReduce application graph can be run in multiple domains and can reflect the compliance, security and other infrastructure considerations
- Leverage the Swift storage platform – for example cloudfiles is a great interface for managing data while Hadoop is good for processing data. Mind you, I am still working through the implications, but I can see an analytic cloud infrastructure combining the strengths of Swift and Hadoop.
- One thought is to run HDFS with its artifacts separately as part of the Swift layer and then run the MapReduce elastically as required – but this will raise the data latency issue … nobody said this will be easy ;o)
- And extending the thought further, an analytic cloud which combines Hadoop 2.0, the OpenStack cloud platform and something like R is not that far off …
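The idea of an application asking for resources declaratively while a separate resource manager decides placement can be sketched as below. Everything here is hypothetical: `ResourceRequest`, `Node` and `allocate` are illustrative names and the first-fit policy is just one simple choice; none of it is an actual Hadoop or OpenStack API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceRequest:
    """What the application declares it needs (hypothetical shape)."""
    containers: int
    memory_mb: int

@dataclass
class Node:
    """What the cloud resource manager knows about a host (hypothetical shape)."""
    name: str
    free_memory_mb: int

def allocate(request, nodes):
    """First-fit placement: one node name per container, or None if it cannot fit."""
    free = {node.name: node.free_memory_mb for node in nodes}
    placement = []
    for _ in range(request.containers):
        for name in free:
            if free[name] >= request.memory_mb:
                free[name] -= request.memory_mb
                placement.append(name)
                break
        else:
            return None  # no node had room for this container
    return placement

nodes = [Node("node-1", 4096), Node("node-2", 2048)]
plan = allocate(ResourceRequest(containers=3, memory_mb=2048), nodes)
print(plan)
```

The application only states how many containers of what size it wants; swapping the first-fit loop for a policy-aware or data-adjacency-aware scheduler would not change the request at all.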
As you can see I can think of a few more ideas, and am sure you can too … The possibilities are interesting … But before we get ahead of ourselves, there is pragmatic work to do
- Obviously a Hadoop-aware scheduler framework in nova is the first step
- Need to figure out the best way to map the Application Manager, Resource manager, the node manager and the container Arun is talking about.
- We also need to capture the declarative directives, resource requirements and application characteristics
- Swift over HDFS & extending the data layer is next.
- I really want to explore how we can address the latency effectively
- And still manage a data cloud layer with tiered data storage matching the data lifecycle
- And associated data services like replication, infrastructure redundancy, encryption and so forth
- Maybe a Hadoop execution platform & a Hadoop PaaS over OpenStack would be on the horizon …
Again, these are just my first impressions … What says thee ?
Exciting News, Hadoop is evolving ! As I was reading Arun Murthy’s blog, multiple thoughts crossed my mind on the evolution of this lovable toy animal — My first impressions …
- From Data at Scale to Data at Scale + complexity – connected & contextual
- This, I think, is the essence – from generic computation framework to scalability, with the new Hadoop platform we can process data at scale and with complexity – connected & contextual. For example the Watson Jeopardy dataset [Link] [Link]
- From (relatively) static MapReduce To (somewhat) dynamic analytic platform
- While we might not see a real-time Hadoop soon, the proposed ideas do make the platform more dynamic
- The “support for alternate programming paradigms to MapReduce” by decoupling the computation framework is an excellent idea
- I think it is still Mapreduce at the core (am not sure if it will deviate) but generic computation frameworks can choose their own patterns ! I am looking forward to BioInformatics applications
- The “Support for short-lived services” is interesting. I had blogged a little about this. Looking forward to how this evolves …
- I am hoping that it would be possible via extensible programming models to interface with programming systems like R.
- Embeddable, domain specific capabilities (for example algorithmics specific to bioinformatics) could be interesting
There are also a few things that might not be part of this evolution
- From Cluster to Cloud ?
- There is a proposed keynote by Dr. Todd Papaioannou/VP/Cloud Architecture at Yahoo, titled “Hadoop and the Future of Cloud Computing”.
- Actually I would prefer to see “Cloud Computing in the future of Hadoop” ;o) Had a blog a few weeks ago … I was hoping for a project fluffy !
We need to move from a cluster to an elastic framework (from a compute and storage perspective) – especially as Hadoop moves to an Analytic Platform. “The separation of management of resources in the cluster from management of the life cycle of applications and their component tasks results” is a good first step; now the resources can be instantiated via different mechanisms – cloud being the premier one
- In the context of my coursework at JHU (BioInformatics) had a couple of talks with the folks working on DataScope. They plan to run Hadoop as one of the applications in their GPU cluster !
- GPU computing is accelerating, and capability for Hadoop to run on GPU cluster would be interesting
- Streamlined logging, monitoring and metering ?
- One of the challenges we are facing in our Unified Fabric Big Data project is that it is difficult to parse the logs and make inferences that help us to qualify & quantify MapReduce jobs.
- This will also help to create an analytic platform based on the Hadoop ecosystem. Now services like EMR most probably do second-order metering by charging for the cloud infrastructure, as they spin up separate VMs for every user (from my limited view)
In short, exciting times are ahead for Hadoop ! There is a talk tomorrow at the Bay Area HUG (Hadoop User Group) on this topic … plan to attend, and later contribute – this is exciting, cannot remain on the sidelines … Will blog on the points from tomorrow’s talk … [Update : HUG Presentations and Video Link]
I leave you with this picture from The Polar Express … time to jump aboard and enjoy the ride …