Greg Papadopoulos : Make it Big by Working Fast and Small


Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning on a series of blogs highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

NEA’s Greg Papadopoulos had an interesting viewpoint on innovation and startups. Highlights in pictures. Of course, you should read thru the full presentation.


I really liked the “Common Characteristics Of Success”. Golden words indeed!

Jeff Dean : Lessons Learned While Building Infrastructure Software at Google


Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning on a series of blogs highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

Google’s Jeff Dean had an interesting presentation about his experience building GFS, MapReduce, BigTable & Spanner. For those interested in these papers, I have organized them – A Path through NOSQL Reading 

Highlights in pictures (Full slides at XLDB 2013 site):


Scaling Big Data – Impermium


Came across an informative blog on scaling big data – “Built to Scale: How does Impermium process data?” Quick notes from the blog:

  1. Don’t fall in love with a technology so much that you cannot be separated from it – be flexible in scaling as you grow

    • “Parting is such a sweet sorrow”, but change is an essential component of an infrastructure at scale 
    • Technology selection and consumption should be a continuous process, introducing new technologies as growth demands. I found Impermium’s path from grep to Solr to Elastic Search very illuminating; I have done the same before.
  2. Technology needs are not static

    • A corollary of #1 above – Growth on all parts of the stack will not be uniform.
    • For example, Impermium found scaling challenges in search and moved to Solr & then to Elastic Search
  3. There are no perfect technologies

    • If you are doing interesting work, be ready to tango with open source code. This is essential – I also found this to be true.
    • Even if you don’t plan to change the code, many times deep understanding comes from reading the code
  4. Select technologies that you can dance with

    • The flip side is that you should select technologies whose internals you are comfortable working with under the hood.
    • In my case, while I love Erlang, I am not that comfortable with that language. So given a chance, I will go with Java or Scala
  5. A benchmark is nothing but a story in a specific context

    • So true. Benchmarks are transitory & personal.
    • Understand them, but they need not be true for your transforms, your data model and your processing.
    • Benchmark early & benchmark often … with your scenarios, models, transformations, MapReduce jobs & data
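
To make that last point concrete, here is a minimal, hypothetical harness (Python; the transform names and loader in the usage comments are made up) for timing candidates against your own data instead of trusting published numbers:

    import time
    import statistics

    def benchmark(fn, data, runs=5):
        """Run fn over your own sample data several times; return the median seconds."""
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            fn(data)
            timings.append(time.perf_counter() - start)
        return statistics.median(timings)

    # Hypothetical usage: compare two candidate transforms on a sample of *your* records.
    # sample = load_sample_records()                      # your loader, not shown here
    # print(benchmark(index_with_solr, sample))           # made-up candidate A
    # print(benchmark(index_with_elasticsearch, sample))  # made-up candidate B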

Thanks Young for the short but very interesting blog. Keep up the good work …

Cheers

<k/>

5 Steps to Pragmatic Data …er… Big Data


It is 2013 & Big Data is big news … Time to revisit my older (Nov’11) blog “Top 10 Steps to A Pragmatic Big Data Pipeline” … Some things have changed but many have remained the same …

5.  Chuck the hype, embrace the concept …

This seems to be the first obvious step for organizations. Everyone from Ed Dumbill (“Big data” is an imprecise term...) to TechCrunch (“Perhaps it’s about the actual functionality of apps vs. the data”) agrees with the concept, but the terms and marketing hype have hit the proverbial roof. The point is, there are many ponies in this pile & there is tremendous business value (so long as one is willing to discount the hype and think Big Data = All Data) …

I really like Mike Gualtieri’s very insightful definition of Big Data as

… the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers


4. Don’t implement a Technology, implement THE Big Data pipeline

Think of Big Data in multiple dimensions rather than as a point technology & evolve the pipeline focussing on all aspects of the stages


The technologies, the skill sets and the tools are evolving, and so are the business requirements.

Chris Taylor addresses this very clearly (“Big Data must not be an elephant riding a bicycle“) – viz. One has to address the entire spectrum to get value …

Simply applying distributed storage and processing (like Hadoop) to extremely large data sets is like putting an elephant on a bicycle .. it just doesn’t make business sense — Chris Taylor

3. Think Hybrid – Big Data Apps, Appliances & Infrastructure

I had addressed this one in my earlier blog (“Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms”)

The moral of the story: think out-of-the-box & inside-the-box.

Match the impedance of the use cases with appropriate technologies

2. Tell your stories, leveraging smart data, based on crisp business use cases & requirements

Evolve the systems incrementally focussing on the business values that determine the stories to tell, the inferences to derive, the feature sets to influence & the recommendations to make

Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. In fact, integration with BI is an interesting challenge for Big Data …

No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL.
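
For instance, an analyst who already knows SQL can get going with something like the hypothetical snippet below. It assumes a reachable HiveServer2 endpoint and the PyHive client; the host and table name are made up:

    from pyhive import hive  # assumes PyHive is installed and HiveServer2 is reachable

    # Hypothetical connection details and table name – adjust for your cluster.
    conn = hive.Connection(host="hive-gateway.example.com", port=10000, database="default")
    cur = conn.cursor()

    # Plain SQL-style aggregation; familiar territory for BI/SQL folks.
    cur.execute("SELECT dt, COUNT(*) AS page_views FROM web_logs GROUP BY dt")
    for dt, views in cur.fetchall():
        print(dt, views)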

From a business perspective Patrick Keddy of Iron Mountain has a few excellent suggestions on managing Big Data: 

Big data informs and enhances judgement and intuition; it should not replace them

Opt for progress over perfection

View the data in context

1. Apply the art of Data Science & Smart Data, paying attention to touch points

This still remains my #1. Data Science is the key differentiator resulting in new insights, new products, order of magnitude performance, new customer base et al – “a cohesive narrative from the numbers & statistics”

“Data science is about trying to create a process that allows you to create new ways of thinking about problems that are novel, or you are trying to use data to create or make something,” says D.J. Patil

Smart Data = Big Data + context + inference + declaratively interactive visualization


  • Smart Data is (inference) model driven & declaratively interactive
  • For example,
    • Information like Wikipedia is big data; the in-memory representation that Watson works from is smart data
    • Device logs from 1000 good mobile handsets and 1000 not-so-good phones are big data; a GAM or GLM over the log data, after running through several stages of MapReduce, is smart data, because it could give you insight into which factors or combinations of factors separate a good phone from a bad one
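
As an illustration of that last bullet, here is a hedged sketch (Python with pandas/statsmodels; the file and column names are hypothetical, assumed to be the output of earlier MapReduce stages) of fitting a GLM over per-handset features:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical per-handset features aggregated from raw device logs.
    df = pd.read_csv("handset_log_features.csv")  # columns: drop_rate, battery_drain, crash_count, is_good

    # Logistic-family GLM: which factors separate the good phones from the bad ones?
    X = sm.add_constant(df[["drop_rate", "battery_drain", "crash_count"]])
    model = sm.GLM(df["is_good"], X, family=sm.families.Binomial()).fit()

    print(model.summary())  # coefficients & p-values point at the influential factors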

Focus not only on the Vs (i.e. Volume, Velocity, Variability & Variety) but also on the Cs (i.e. Connectedness & Context)

The two main Big Data challenges in 2013 would be:

1st : Data integration across silos to get the comprehensive view &

2nd : Matching the real-time velocity of business viz. CEP, sense & respond et al.

For example, I have already seen folks looking outside Hadoop for CEP and near-real-time response

“.. 85% of respondents say the issue is not about the volume of data but the ability to analyze and act on data in real time,” says Ryan Hollenbeck, quoting a 2012 Cap Gemini study (italics mine)

Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms


I have been following the 2013 predictions for Big Data. Naturally, lots of interesting predictions. Here are a few that I understand and (sort of) agree with:

All the President’s DevOps


On the heels of “All the President’s Data Scientists”, here is another interesting article on the Obama campaign’s cloud infrastructure.

Update: A similar article, The Atlantic’s “When the Nerds Go Marching In”

Update: A case study from New Relic on how the Obama for America team improved resilience


  • They realized the campaign needed a scalable system: “2008 was the ‘Jaws’ moment,” said Obama for America’s Chief Technology Officer Harper Reed. “It was, ‘Oh my God, we’re going to need a bigger boat.’”
  • They built a single shared data tier with APIs to build lots of interesting applications: “Being able to decouple all the apps from each other has such power; it allowed us to scale each app individually and to share a lot of data between the apps, and it really saved us a lot of time.”
  • They leveraged internet architecture: “We aggressively stood on the shoulders of giants like Amazon, and used technology that was built by other people.”
  • It doesn’t look like they used esoteric technologies. The system is built around Python APIs over RDS, SQS and so forth (a rough sketch of the pattern follows this list). Excellent, and the fact that systems can be built this way is a testament to the cloud capabilities – IaaS & PaaS
  • In short, Reed says it all: “When you break it down to programming, we didn’t build a data store or a faster queue. All we did was put these pieces together and arrange them in the right order to give the field organization the tools they needed to do their job. And it worked out. It didn’t hurt that we had a really great candidate and the best ground game that the world has ever seen.”
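
As a rough, hypothetical sketch of that decoupling pattern (the queue URL, region and handler below are made up, and boto3 is today’s SDK rather than whatever the campaign actually used), producer and consumer apps sharing a queue look roughly like this:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/field-events"  # made-up queue

    def publish_event(event):
        """A producer app drops an event onto the shared queue and moves on."""
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))

    def drain_events(handle):
        """A separate consumer app, scaled independently, works through the same queue."""
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
        for msg in resp.get("Messages", []):
            handle(json.loads(msg["Body"]))  # 'handle' is whatever the consuming app does
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])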

MongoDB Patterns : The 13th record – Lost & Found


Motivation & Usage

  • You suddenly realize that you cannot find a few records in your production MongoDB & you want to restore them
    • The pattern assumes that you know which records to restore, i.e. you know a couple of unique field values in the lost records
  • Our use case: in our system we have lots of interesting Fact Tables & System Collections. While these collections are small, they are very complex & strategic to the application (the why is a topic for another day, another pattern). And we know how many records should be in these collections.
    • A few days ago we realized that we had only 12 records in one of our system collections; it should have 13. We know which record is missing, but it is not easy to recreate.
    • Luckily I have been backing up the database regularly.
    • So the mission, should I choose to accept it, is to find the 13th record in the latest backup that contains it and restore it to our prod database

Non-usage

  • As we are storing data in JSON format, this will not work if we have blobs
  • Not good for large scale data restore

Code & Program

  • Backup command we have been using:
    • mongodump --verbose --host <host_name> --port <host_port>
    • This creates a snapshot under the <dump> directory. I usually rename it as mongodump-2012-MM-DD
  • Start a local instance of mongo – mongod
  • Restore the database from the last backup (for example mongodump-2012-08-21) locally
  • Check to see if the missing record is in this backup and find the _id of the record using the command line. If not, try the next-to-last backup – for example mongodump-2012-08-17. (In our case, I found it in the next-to-last.)
  • Copy the record to lostandfound collection
    • db.<collection_where_the_record_is>.find({"_id": <id_of_the_record>}).forEach( function(x){ db.lostandfound.insert(x) } );
  • Export the collection
    • mongoexport -v -d <database_where_the_lostandfound_is> -c lostandfound -o paradise_regained.json
  • Import the collection to the prod database
    • ~/mongodb-osx-x86_64-2012-05-14/bin/mongoimport --verbose -u <user_name> -p <password> -h <host_name:port_number> -d <production_database> -c <collection_where_the_record_was_lost> --file paradise_regained.json
  • Check & verify that the correct document is recovered using command line
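
If you prefer to skip the export/import round trip, the same lost-and-found step can also be sketched with pymongo. This is only a hedged alternative to the CLI steps above; the database name, collection name, hostname and _id below are placeholders, and it assumes the backup has already been restored to a local mongod:

    from bson import ObjectId
    from pymongo import MongoClient

    # Local mongod holding the restored backup, and the production instance (names are placeholders).
    backup_db = MongoClient("mongodb://localhost:27017")["my_database"]
    prod_db = MongoClient("mongodb://prod-host.example.com:27017")["my_database"]

    # The _id found via the command line in the step above (placeholder value).
    missing_id = ObjectId("5046f6ae30044ee8a3a39e2b")

    # Fish the lost record out of the backup ...
    doc = backup_db["my_collection"].find_one({"_id": missing_id})

    # ... and, if it is there, put it back where it belongs.
    if doc is not None:
        prod_db["my_collection"].insert_one(doc)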

Notes & References

  • TBD

Facebook Infrastructure @ New Year’s Eve – A Study in Scalability


Another interesting article on how Facebook is preparing for New Year’s Eve, this time from our own San Jose Mercury News, by Mike Swift.

Interesting points:

  • New Year’s is one of the busiest times for social network sites, as people post pictures & exchange best wishes

CEO Mark Zuckerberg has long been focused on having the digital horsepower to support unbridled growth — are a key reason behind the .. network’s success

  • It received > 1 B photo uploads during Halloween 2010
  • Since then Facebook has added 200 million more members, so New Year’s Eve 2012 could see more than 1.5 B uploads!
  • My favorite quote from the article:

The primary reason Friendster died was because it couldn’t handle the volume of usage it had. … They (Mark,Dustin and Sean) always talked about not wanting to be ‘Friendstered,’ and they meant not being overwhelmed by excess usage that they hadn’t anticipated

  • The engineers at Facebook just finished a preflight checklist and are geared up for the scale
  • In terms of scale “Facebook now reaches 55 percent of the global Internet audience, according to Internet metrics firm comScore and accounts for one in every seven minutes spent online around the world.”
  • From a Big Data perspective, Facebook data has all the essential properties, viz. Connected & Contextual, in addition to large scale – Volume & Velocity (see my earlier blog on big data)
  • Facebook has the “Emergency Parachutes” which let the site degrade gracefully  (for example display smaller photos when the site is heavily loaded)
  • Their infrastructure instrumentation is legendary (for example, the MySQL talk here)

To manage Facebook’s data infrastructure, you kind of need to have this sense of amnesia. Nothing you learned or read about earlier in your career applies here …

 
And finally, our New Year wishes to all readers & well-wishers of this blog

Top 10 Steps to a Pragmatic Big Data Pipeline


As you know, Big Data is capturing lots of press time. Which is good, but what does it mean to the person in the trenches? Some thoughts … as a Top 10 list:

[update 11/25/11 : Copy of my guest lecture for Ph.D students at the Naval Post Graduate School The Art Of Big Data is at Slideshare]

10. Think of the data pipeline in multiple dimensions rather than as a point technology & evolve the pipeline with focus on all aspects of the stages

  • While technologies are interesting, they do not work in isolation, and neither should you think that way
  • Dimension 1 : Big Data (I had touched upon this in my earlier blog “What is Big Data anyway“) One should not only look at the Volume-Velocity-Variety-Variability but also at the Connectedness – Context dimensions.
  • Dimension 2 : Stages – The degrees of separation as in collect, store, transform, model/reason & infer stages
  • Dimension 3 : Technology – This is the discussion SQL vs. NOSQL, mapreduce vs Dryad, BI vs other forms et al
  • I have captured the dimensions in the picture. Did I succeed ? Let me know

9. Evolve incrementally focussing on the business values – stories to tell, inferences to derive, feature sets to influence & recommendations to make

Don’t get into the technologies & pipeline until there are valid business cases. The use cases are not hard to find, but they won’t come if you are caught up in the hype and forget to do the homework and due diligence …

8. Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL.

7. Match the impedance of the use case with the technologies

The stack in my diagram earlier is not required for all cases:

  • For example, if you want to leverage big data for product metrics from logs in Splunk, you might only need a modest Hadoop infrastructure, plus an interface to the existing dashboard, plus Hive for analysts who want to perform analytics
  • But if you want behavioral analytics with A/B testing at a 10-minute latency, a full-fledged Big Data infrastructure with, say, Hadoop, HDFS and HBase plus some modeling interfaces would be appropriate
  • I had written an earlier blog about the Hadoop infrastructure as a function of the degrees of separation from the analytics end point

6. Don’t be afraid to jump the chasm when the time is appropriate

Big Data systems have a critical mass at each stage – that means lots of storage or maybe a few fast machines for analytics, depending on the proposed project. If you have done your homework from a business and technology perspective, and have proven your chops with effective projects on a modest budget, this would be a good time to make your move for a higher budget. And when the time is right, be ready to get the support for a dramatic increase & make the move …

5. Trust But Verify

True for working with teenagers, arms treaties between superpowers, a card game, and, closer to our discussion, Big Data analytics. In fact, one of the core competencies of a Data Scientist is a healthy dose of skepticism – so said John Rauser [here & here]. I would add that as you delegate more and more inferences to a big data infrastructure across the stages, make sure there are checks and balances and independent verification of some of the stuff the big data is telling you.

Another side note along the same line is oscillation – as the feedback rate, volume and velocity increase, there is also a tendency to overreact. Don’t equate feedback velocity with response velocity – for example, don’t change your product feature set based on high-velocity big-data-based product metrics at a faster rate than the users can consume. Have a healthy respect for the cycles involved. For example, I came across an article that talks about fast & slow big data – interesting. OTOH, be ready to make dramatic changes when you get faster feedback that indicates things are not working well, for whatever reason.

4. Morph from Reactive to Predictive & Adaptive Analytics, thus simplifying and leveraging the power of Big Data

As I was writing this blog, I came across Vinod Khosla’s speech at the Nasscom meeting. A must read – here & here. His #1 and #2 in the ‘cool dozen’ were about Big Data! The ability to infer the essentials from an onslaught of data is in fact the core of a big data infrastructure. Always make sure you can draw a few fundamental, succinct inferences that matter out of your infrastructure. In short, deliver “actionable” …

3. Pay attention to the How & the Who

Edd wrote about this in Google+. Traditional IT builds the infrastructure for the Collect and Store stages in a Big Data pipeline. It also builds and maintains the infrastructure for analytics processing, like Hadoop, and the visualization layer, like Tableau. But the realm of Analyze, Model, Reason and the rest requires a business view, which a Data Analyst or a Data Scientist would provide. Pontificating further, it makes sense for IT to move in this direction by providing a ‘landing zone’ for the business-savvy Data Scientists & Analysts and thus lead the new way of thinking about computing, computing resources and talent …

2. Create a Big Data Strategy considering the agility, latencies & the transitory nature of the underlying world

[1/28/12] Two interesting articles – one “7 steps for business success with big data” in GigaOm and another “Why Big Data Won’t Make You Smart, Rich, Or Pretty” in Fast Company. Putting aside the well-publicized big data projects, a normal big data project in an organization has its own pitfalls and opportunities for success. Prototyping is essential, modeling and verifying the models is essential, and above all, have a fluid strategy that can leverage this domain …

<This is WIP. Am collecting the thoughts and working thru the list – deliberately keeping two slots (and maybe one more to make a baker’s dozen!), pl bear with me … But I know how it ends ;o)>

1. And, develop the Art of Data Science

As Ed points out in Google+, big data is also about exploration and the art of Data Science is an essential element. IMHO this involves more work in the contextual, modeling and inference space, with R and so forth – resulting in new insights, new products, order of magnitude performance, new customer base et al.  While this stage is effortless and obvious in some domains, it is not that easy in others …

Cloud Object Store Infrastructure – We are not in Kansas anymore, we are in Google


The other day some of us visited the facilities of our systems supplier as a part of our new cloud object store development. It is a good facility with capabilities to rack and stack, plus a world class testing setup. …

We came back with an appreciation of what they do as well as a few important insights. Let me enumerate the main points – you can find the gory details at my Egnyte Engineering Blog[Link].

  1. A modern digital multimedia world is the genesis for innovative designs in cloud storage systems – from dual systems connected by a 10GB internal link to temporal SSD caching …
  2. A cloud storage hardware system’s life-cycle is interesting, to say the least – a lot of work is involved in procurement-install-deploy-support-troubleshoot-maintain, especially the latter parts
  3. Disk is the new tape, literally – companies are using 60-disk systems for backup of stuff like old checks, bills et al. Internal storage clouds (and even external storage clouds) will replace tape systems for many storage needs
More at my Egnyte Engineering Blog[Link].
Of course, we all know that even Kansas is not the old Kansas anymore. Its capital is Google, KS, and it has the fastest wireless!