Business Users Shouldn't Touch Hadoop Even with a 99-Foot Pole!

Yep, I know, it is a 10-foot pole; and the origin is from the “10-foot poles that river boatmen used to pole their boats with” [1]

Back to the main feature: I was reading a piece by Andrew J. Brust at GigaOM arguing that “Hadoop needs a better front-end for business users”.

Yikes. This is terrible … I would argue – no, make that insist – that business users be kept as far away as possible from Hadoop (& similar frameworks).

Allow me to elaborate …

  • Business users do need highly interactive analytic dashboards with knobs & dials into our deep machine learning models and sliders onto our AI machines. No doubt.
  • We don’t want to abandon our beloved business users to static-rigid-Newtonian-deterministic artifacts; we want them to have living, (fire) breathing intelligent-inferential-predictive models.

  • But that control & interactivity belongs in a business analytics beast that has multiple layers – not directly on a Hadoop or Hadoop-like system.

Also, separating the “what” from the “how” via a declarative interface is very important.

You see, analytics has at least four layers viz. Infrastructure, Intelligence, Inference & Interface


  • Hadoop is Infrastructure, Spark is Infrastructure – The “How”
  • Machine Learning algorithms are Intelligence – Again lots of “How”
  • Models are Inference – the “What” Plus some “How”
  • Dashboard is the Interface (usually) – definitely the “How”
  • Interface can be recommendations, financial predictions, ad forecasts or even actual devices that interface to predictive models

  • And business needs knobs & dials at the Inference & Interface layers (see the sketch after this list)
    • The Infrastructure then appropriately fires up frameworks – Hadoop or Spark or Java or iPython …
  • Digging deeper, Hadoop itself has three layers – none of them operable by a business user, but real workhorses:

    • HDFS – the distributed file system
    • MapReduce – the distributed data-parallel computation engine
    • HBase – the NoSQL data store
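To make the layering concrete, here is a minimal sketch of the “what, not how” separation (every name and knob below is hypothetical, not from any real system): the business user turns dials at the Inference & Interface layers, and only the Infrastructure layer knows how the job actually runs.

```python
# A minimal sketch of a declarative interface: knobs for the business user,
# execution details hidden in the infrastructure layer. All names invented.

from dataclasses import dataclass

@dataclass
class ChurnModelKnobs:           # Inference-layer knobs -- the "what"
    horizon_days: int = 30       # predict churn over this window
    sensitivity: float = 0.5     # slider: recall vs. precision trade-off
    segment: str = "all"         # which customers to score

def run(knobs: ChurnModelKnobs) -> None:
    # Infrastructure layer -- the "how", opaque to the business user.
    engine = "spark" if knobs.segment == "all" else "local python"
    print(f"Scoring churn over {knobs.horizon_days} days "
          f"(sensitivity={knobs.sensitivity}) on {engine}")

# The dashboard only moves sliders; it never sees HDFS paths or MapReduce jobs.
run(ChurnModelKnobs(horizon_days=60, sensitivity=0.7))
```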

Back to Andrew’s points: Hadoop (and its cousins) should remain a tool for the chefs; but diners do need to express their choices and have the ability to “tweak” the seasonings, the portions, or even the amount of cooking. That declarative interface (which says what but not how) comes from the domain-specific menus catered by restaurants that focus on their respective culinary styles – or even a fusion!

Now I am getting hungry! I am on my way downstairs (I’m at the Hilton – NY Fashion District) to my favorite Chipotle – which in fact gives me that declarative freedom, without my getting into their kitchen or having to handle the saucepans ;o) It is better that way, because I am terrible with cooking and spice measures – I can say “less salt” but not the amount!



Building a Data Organization that works and works with business

One thing that caught my attention in Netflix’s Neil Hunt’s interview with Gigaom was this:

A Data Organization that works & works with business

Well said. That explains Netflix’s data science in a nutshell – something all data scientists should emulate!

From a Chief Data Scientist’s perspective, I really like their way of looking at data science, viz.:

  • The folks who do data science for the whole business
  • The folks who build algorithms &
  • The folks who do data engineering

In fact, I had a blog post on this specialization of data science skills.

Netflix is putting more weight on actual behavior! Interesting – we are also seeking similar effects, i.e. differentiating between falling asleep on the couch vs. actually watching a TV show! It is hard inference … Netflix has the blocker; I have nothing ;o(

Binge watching … interesting … We are actually working on algorithms to figure that out and change the ad mix. I plan to talk more at the TM Forum Digital Disruption Panel on December 9th !

Finally, the point that the importance of the Netflix recommendation engine is underrated is so true. In many ways recommendation algorithms and engines are core to many systems.

In fact, we have a reverse recommendation strategy ! We recommend users to ads !

The Best Of the Worst – AntiPatterns & Antidotes in Big Data

Recently I was a panelist at the Big Data Tech Con discussing the best of the worst practices in Big Data. We had an interesting and lively discussion with pragmatic, probing questions from the audience. I will post the video if they publish it.

Here are my top 5 (actually a baker’s 5 ;o)) anti-patterns – and, of course, the antidotes …

1. Data Swamp

Typical case of “ungoverned data stores addressing a limited data science audience“.

The scene reads like so :

The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure. Now everyone starts putting their data into this “lake”. After a few months the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the floor from the wires – but no one has any clue what data is in there, how consistent it is, or whether it is semantically coherent.

Larry in IT starts the data import process sometime on Friday night every week – they don’t want to load the DW with a Hadoop feed during customer hours. Sometimes Larry forgets, so you have no clue whether the customer data is updated every week or every 14 days; there could be duplicate or missing record sets …

Gartner has an interesting piece about data lakes turning into data swamps – The Data Lake Fallacy: All Water and Little Substance … a must read …

The antidote is Data Curation. It needn’t be very process-heavy – keep a consistent schema & publish it on a wiki page. Of course, if you are part of a big organization (say, retail) and have petabytes of data, that naturally calls for a streamlined process.

Data quality, data lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …

Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation and discipline.
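As a sketch of how lightweight that curation can be (the schema and field names below are invented for illustration), even a few lines that check each incoming feed against the wiki-published schema go a long way:

```python
# A lightweight curation check: validate each incoming record against the
# schema published on the wiki, so consumers know what they are getting.
# Schema and fields are illustrative, not from any real system.

EXPECTED_SCHEMA = {          # the wiki-published contract
    "customer_id": str,
    "order_total": float,
    "updated_at": str,       # ISO-8601 date string
}

def validate(record: dict) -> list:
    """Return a list of curation violations for one record."""
    problems = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"{field}: expected {ftype.__name__}")
    return problems

print(validate({"customer_id": "C42", "order_total": "oops"}))
# ['order_total: expected float', 'missing field: updated_at']
```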

2. Technology Stampede & Vendor Pollution

  • You start with a few machines, install Apache Hadoop and start loading data. Marketing folks catch up on your big data project and now you have a few apps running. A few projects come onboard and you have a decent-sized cluster.
  • A famous cloudy Hadoop vendor approaches the management and, before you know it, the company has an enterprise license and you are installing the vendor’s Hadoop Manager & an MPP database named after a GM car.
  • Naturally the management expects the organization to morph into a data-driven, buzz-compliant organization with this transition. And naturally, a Hadoop infrastructure alone is not going to suddenly turn the organization into a big data analytics poster child … vendor assertions aside …
  • And behold, the Pivotal guys approach another management layer, and inevitably an enterprise license deal follows … their engineers come in and revamp the architecture, data formats, flows, …
    • Now you have Apache, the cloudy Hadoop vendor, and this Pivotal thing – all Hadoop, all MPP databases; but of course subtle differences & proprietary formats prevent you from doing anything substantial across the products …
      • In fact, even though their offices are a few miles apart on US-101, their products look like they were developed by people on different planets!
  • While all this is going on, the web storefront analytics folks have installed Cassandra & are doing interesting transformations in the NoSQL layer …
  • Of course the brick-and-mortar side of the business uses HBase; now there is no easy way to combine inferences from the web & store walk-ins.
  • One or two applications (maybe more) are using MongoDB, and they also draw data from the HDFS store.
  • And the recent water-cooler conversations indicate that another analytics vendor from Silicon Valley is having top-level discussions with the CIO (who by now is frustrated with all the vendors and technology layers) … You might as well order another set of machines and brace for the next vendor wave …

The antidote is called Architecture & a Roadmap (I mean a real architecture & roadmap, based on business requirements).

Understand that no vendor has the silver bullet and that all their pitches are at best 30% true … One will need products from different vendors, but choose a few (and reject the rest) wisely ! 

3. Big Data to Nowhere

Unfortunately this is a very common scenario – IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data hub or lake or pool or … but no relevant business-facing apps. A conversation goes like this …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !

Business : … (unprintable)

The antidote is very simple: build the full stack, i.e. bits to business … do the data management (collect-store-transform) as well as a few apps that span model-reason-predict-infer.

Build incremental Decision Data Science & Product Data Science layers, as appropriate … For example, the following conversation is a lot better …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !

Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?

IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.

IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Now business has an app – not fully production-ready, but still useful. IT has proved the value of the platform and can ask for more $ !
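For what it’s worth, the inference IT quotes above is a few lines of pandas once the logs are in place – a sketch with invented column names and toy data:

```python
# Sketch of the inference from the conversation above: what share of
# transaction volume comes from 34-36 year olds buying late at night.
# Column names and data are illustrative only.

import pandas as pd

logs = pd.DataFrame({
    "age":  [35, 34, 36, 22, 35],
    "hour": [23,  1,  0, 14, 23],   # hour of transaction, 0-23
    "txn":  [1, 1, 1, 1, 1],
})

late_night = logs["hour"].isin([23, 0, 1, 2])   # 11 PM - 2 AM window
age_band   = logs["age"].between(34, 36)

share = logs[late_night & age_band]["txn"].sum() / logs["txn"].sum()
print(f"34-36yr late-night buyers: {share:.0%} of transaction volume")
```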

4. A Data too far

This is more of a technology challenge. When disparate systems write to a common store/file system, the resulting formats and elements will differ. You might get a few .gz files, a few .csv files and, of course, parquet files. Some will have IDs, some names; some are aggregated by week, some by day, and others are purely transactional. The challenge is that we have the data, but there is no easy way to combine it for interesting inferences …
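A toy illustration of the pain (all data and column names invented): two feeds describing the same customers, one keyed by ID at a daily grain, the other keyed by name at a weekly grain. The data exists; the mapping and re-graining is the real work:

```python
# "A data too far": same customers, different keys, different grains.
# Combining them needs an id<->name mapping and a common (weekly) grain.
# All data is invented for illustration.

import pandas as pd

daily_by_id = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "day":   ["2014-12-01", "2014-12-02", "2014-12-01"],
    "spend": [10.0, 5.0, 7.0],
})
weekly_by_name = pd.DataFrame({
    "customer_name": ["Alice", "Bob"],
    "week":  ["2014-W49", "2014-W49"],
    "spend": [15.0, 7.0],
})
id_map = pd.DataFrame({"customer_id": ["C1", "C2"],
                       "customer_name": ["Alice", "Bob"]})

# Re-grain the daily feed to ISO weeks, then bridge the keys via the map.
weekly_by_id = (daily_by_id
                .assign(week=pd.to_datetime(daily_by_id["day"])
                              .dt.strftime("%G-W%V"))
                .groupby(["customer_id", "week"], as_index=False)["spend"].sum()
                .merge(id_map, on="customer_id"))

print(weekly_by_id.merge(weekly_by_name, on=["customer_name", "week"],
                         suffixes=("_daily_feed", "_weekly_feed")))
```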

5. Technology Maze

This is the flip side of the data format challenge (#4 above). This one stems from the technology stampede and the resulting incompatible systems in which data is locked.

Antidote : Integrated workflows.

I came across an interesting article by Jay Kreps. A couple of quotes:

” .. being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products. But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.”

“…The current state of this kind of data flow is a scary thing. It is often built out of csv file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.”

“..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”

The solution: a pub-sub infrastructure (Kafka) + “surrounding code that made up this realtime data pipeline—code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of data that was delivered, or derived analytics on these streams”.
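A minimal sketch of the pub-sub idea using the kafka-python client (the broker address and topic name are placeholders): each system publishes structured events once, and every downstream consumer – DW load, analytics, monitoring – reads the same stream instead of trading csv dumps over rsync.

```python
# Minimal Kafka producer sketch (kafka-python). Broker and topic are
# placeholders; run against your own cluster.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A structured event, not a file dump: the schema travels with the pipeline.
producer.send("customer-events", {"customer_id": "C42",
                                  "action": "purchase",
                                  "amount": 31.4})
producer.flush()   # make sure the event actually leaves the buffer
```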

6. Where is the Tofu ?

This is succinctly articulated by Xavier Amatriain when he talked about the Netflix Prize:

It is very simple to produce “reasonable” recommendations and extremely difficult to improve them to become “great”. But there is a huge difference in business value between reasonable and great …

The Antidote : The insights and the algorithms should be relevant and scalable … There is a huge gap between Model-Reason and Deploy …

What sayest thou? Please add comments about the anti-patterns that you have observed and lived thru!


[Jan 3, 15] An interesting blog post in Forbes, “Four Common Mistakes That Can Make For A Toxic Data Lake”, says:

  1. Your big data strategy ends at the data lake

  2. The data in your data lake is abstruse

  3. The data in your data lake is entity-unaware

  4. The data in your data lake is not auditable





Google’s Jeff Dean on Scalable Predictive Deep Learning – A Kibitzer’s Notes from RecSys 2014

It is always interesting to hear from Jeff and understand what he is up to. I have blogged about his earlier talks at XLDB and at Stanford. Jeff Dean’s keynote at RecSys 2014 was no exception. The talk was interesting, the Q&A was stimulating, and the links to papers … now we have more work! – I have a reading list at the end.

Of course, you should watch it (YouTube link) and go thru his keynote slides at the ACM Conference on Information and Knowledge Management. Highlights of his talk, from my notes …


  • Build a system with simple algorithms and then throw lots of data at it – let the system build the abstractions. Interesting line of thought;
  • I remember hearing about it from Peter Norvig as well, i.e. Google is interested in algorithms that get better with data
  • An effective recommendation system requires context, i.e. understanding the user’s surroundings, the previous behavior of the user, the previous aggregated behavior of many other users, and finally textual understanding.


  • He then elaborated on one of the areas they are working on — semantic embeddings, paragraph vectors and similar mechanisms


Interesting concept: embedding similar things such that they are nearby in a high-dimensional space!
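A toy numpy illustration of that idea (the vectors here are made up, not learned): things that behave alike get vectors that sit close together, so cosine similarity in the embedding space becomes a usable notion of “semantically near”.

```python
# Toy embedding-similarity sketch: nearby vectors = similar things.
# Vectors are invented; real ones would come from word2vec or similar.

import numpy as np

emb = {
    "paris":  np.array([0.90, 0.10, 0.80]),
    "london": np.array([0.85, 0.15, 0.75]),
    "banana": np.array([0.10, 0.90, 0.20]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["paris"], emb["london"]))   # close to 1.0 -> semantically near
print(cosine(emb["paris"], emb["banana"]))   # much lower   -> far apart
```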

  • Jeff then talked about using LSTM (Long Short-Term Memory) Neural Networks for translation.


  • Notes from Q & A:
    • The async training of the model and random initialization means that different runs will result in different models; but results are within epsilon
    • Currently, they are handcrafting the topology of these networks, i.e. how many layers, how many nodes, the connections et al. Evolving the architecture (for example, adding a neuron when an interesting feature is discovered) is still a research topic.
      • Between ages of 2 & 4, our brain creates 500K neurons / sec and from 5 to 15, starts pruning them !
    • The models are opaque and do not have explainability. One way Google is approaching this is by building tools that introspect the models … interesting
    • These models work well for classification as well as ranking. (Note : I should try this – maybe for a Kaggle competition. 2015 RecSys Challenge !)
    • Training CTR system on a nightly basis ?
    • Connections & Scale of the models
      • Vision : Billions of connections
      • Language embeddings : 1000s of millions of connections
      • If one has more parameters than data, the model will overfit
      • Rule of thumb : For sparse representations, one parameter per record
    • A paragraph vector can capture granular levels, while a deep LSTM might be better at capturing the details – TBD
    • Debugging is still an art. Check the modelling; factor into smaller problems; see if different data is required
    • RBMs and energy-based models have not found their way into Google’s production; NNs are finding applications
    • Simplification & Complexity : NNs, once you get them working, form this nice “algorithmically simple computation mechanism” in a darkish-brown box ! Fewer subsystems, less human engineering ! At a different axis of complexity
    • Embedding editorial policies is not easy, better to overlay them … [Note : We have an architecture where the pre and post processors annotate the recommendations/results from a DL system]
  • There are some interesting papers on both of the topics Jeff mentioned (this is my reading list for the next few months! Hope it is useful to you as well!):
    1. Efficient Estimation of Word Representations in Vector Space [Link]
    2. Paragraph vector : Distributed Representations of Sentences and Documents [Link]
    3. [Quoc V. Le’s home page]
    4. Distributed Representations of Words and Phrases and their Compositionality [Link]
    5. Deep Visual-Semantic Embedding Model [Link]
    6. Sequence to Sequence Learning with Neural Networks [Link]
    7. Building high-level features using large scale unsupervised learning [Link]
    8. word2vec: tool for computing continuous distributed representations of words [Link]
    9. Large Scale Distributed Deep Networks [Link]
    10. Deep Neural Networks for Object Detection [Link]
    11. Playing Atari with Deep Reinforcement Learning [Link]
    12. Papers by Google’s Deep Learning Team [Link to Vincent Vanhoucke’s Page]
    13. And, last but not least, Jeff Dean’s Page

The talk was cut off after ~45 minutes. I am hoping they will publish the rest and the slides. Will add pointers when they are online. Drop me a note if you catch them …

Update [10/12/14 21:49] : They have posted the second half ! Am watching it now !

Context : I couldn’t attend RecSys 2014; luckily they have the sessions on YouTube. I plan to watch, take notes & blog the highlights; recommendation systems are one of my areas of interest.

  • Next : Netflix’s CPO Neil Hunt’s Keynote
  • Next + 1 : The Future of Recommender Systems
  • Next + 2 : Interesting Notes from rest of the sessions
  • Oh man, I really missed the RecSysTV session. We are working on some addressable recommendations and are already reading the papers. Didn’t see the video for the RecSysTV sessions ;o(

The Sense & Sensibility of a Data Scientist DevOps

The other day I was pondering the subject of the data scientist & model deployment at scale, as we are developing our data science layers consisting of Hadoop, HBase & Apache Spark. Interestingly, earlier today I came across two artifacts – a talk by Cloudera’s @josh_wills and a presentation by (again) Cloudera’s Ian Buss.

The talks made a lot of sense independently, but collectively they add a lot more insight ! The context, of course, is the exposition of the curious case of data scientists as devops. Data products need an evolving data science layer …

It is well worth your time to follow the links above and listen to Josh as well as go thru Ian’s slides. Let me highlight some of the points that I was able to internalize …

Let me start with one picture that “rules them all” & summarizes the synergy: the “Shift in Perspective” slide from Josh & the Spark slide from Ian.


The concept of Data Scientist devops is very relevant. It extends the curious case of the Data Scientist profession to the next level.

Data products live & breathe in the wild; they cannot be developed and maintained with a static set of data. Developing an R model and then throwing it over the wall for a developer to translate won’t work. Secondly, we need models that can learn & evolve in their parameter space.


I agree with the current wisdom that Apache Spark is a good framework that spans the reason, model & deploy stages of data.
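As a sketch of why that span matters (using the pyspark.ml Pipeline API; the data and path below are invented), the very object the data scientist fits is the artifact that gets saved and reloaded in production – no throwing models over the wall:

```python
# Spark spanning model & deploy: the fitted Pipeline IS the deployable
# artifact. Data, column names, and save path are illustrative only.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 3.2, 0.0), (0.0, 1.1, 1.0), (2.0, 0.4, 1.0)],
    ["visits", "spend", "label"],
)

pipe = Pipeline(stages=[
    VectorAssembler(inputCols=["visits", "spend"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipe.fit(df)                                # a PipelineModel

model.write().overwrite().save("/tmp/churn_model")  # deploy the same artifact
```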

Other interesting insights from Josh’s talk:


The virtue of being really smart is massively overrated; the virtue of being able to learn faster is massively underrated

Well said Josh.
P.S: Couldn’t find the video of Ian’s talk at the Data Science London meetup. Should be an interesting talk to watch …

AWS EC2 Price worksheet

It all started with a tweet …

  • It so happens that I have been working on a similar worksheet for pricing & configuring our analytics infrastructure.
  • I modified the one I was working on (inspired by the original at ec2 pricing_and_capacity) & morphed it into the one Otis wanted.
  • The Excel worksheet is hosted in GitHub. Feel free to modify it to fit your needs. Let me know as well …
  • I have four sets of prices, viz. on-demand, reserved-light, reserved-medium and reserved-heavy usage. The prices are calculated for one year (8640 hrs) off of cell M1 – one has to prorate the upfront fees to get the effective $/hr rate.
  • The worksheet has multiple uses – I use it to compute the price difference for different usage patterns: high memory for Spark, different sizes for an HBase cluster et al. As it is a spreadsheet, one can sort it on varying criteria or change the numbers (say, to 6 months) and see which model makes sense.
  • BTW, it is interesting to see that Light Reserved costs more in all cases except for the storage models.
  • A long time ago I had a graphical representation, which has become very dated. I might resurrect it with the new prices …

The Spreadsheet :

The left columns have the attributes of the various EC2 models.



The 8640 (hrs/year) is in M1. All the calculations are based on this cell. The reserved light is interesting … it costs more !


The reserved medium does save $. Moreover, one can stop the instances when not needed.


I have calculated the yearly price prorating the upfront fees et al. But for Heavy Reserved it is somewhat meaningless, as AWS charges for the whole year even if the instances are stopped. Still, changing the value in M1 gives a feel for the different tiers …
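The core calculation is easy to sanity-check outside the spreadsheet; here is a sketch (the tier prices below are placeholders, not actual AWS rates):

```python
# The worksheet's core calculation: prorate the upfront fee over the hours
# in cell M1 and add the metered hourly rate. Prices are placeholders only.

HOURS = 8640   # cell M1: hours in the pricing year used by the worksheet

def effective_rate(upfront: float, hourly: float, hours: int = HOURS) -> float:
    """Effective $/hr = prorated upfront fee + hourly rate."""
    return upfront / hours + hourly

tiers = {                       # (upfront $, $/hr) -- illustrative numbers
    "on-demand":       (0.0,   0.350),
    "reserved-light":  (243.0, 0.272),
    "reserved-medium": (554.0, 0.169),
    "reserved-heavy":  (676.0, 0.113),
}
for name, (upfront, hourly) in tiers.items():
    print(f"{name:16s} {effective_rate(upfront, hourly):.3f} $/hr")
```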


I would be happy to hear other inferences we can make and add columns to the worksheet …



The Chronicles of Robotics at First Lego League – Day 1

This week I am in St. Louis, volunteering at the First Lego League World Robotics Competition. I have been involved with First Robotics since 2004. Usually my position is Robot Design Judge – a front-seat view of interesting & innovative ideas on robots.

For 3 days we have the Edward Jones Dome & America’s Center in St. Louis, MO.

Day 1 : Stardate : 91913.81

Judges’ on-site meeting & briefing, allocations & FLL opening ceremonies.

Some quick pictures … Full day of judging starts tomorrow early morning  … looking forward to it ….

  • View From my Hotel

  • The Trophies

  • The Field 

  • The field occupies the stadium. It consists of six areas – Einstein (FLL), Galileo, Franklin, Newton, Edison & Curie. I have tried to capture the view of the fields from the ground and from the bleachers.
  • Galileo

  • NASA Truck to beam the competitions live via satellite

  • Franklin & Newton

  • Einstein (FLL and the venue of opening ceremonies – below)

  • FLL Opening Ceremonies

  • Tomorrow is a busy day – robot judging all day. I might not get time to take pictures.
  • Still have to cover the pits – a convention center hall filled with team stalls et al. One has to be there to understand the scale and the energy !