Business Users Shouldn’t touch Hadoop even with a 99-foot pole !

December 7, 2014 by ksankar

Yep, I know, it is a 10-foot pole; the phrase comes from the “10-foot poles that river boatmen used to pole their boats with”[1]

Back to the main feature: I was reading a piece by Andrew J. Brust at GigaOM arguing that “Hadoop needs a better front-end for business users”

Yikes. This is terrible … I would argue, no, make that insist, that business users be kept as far away as possible from Hadoop (& similar frameworks)

Allow me to elaborate …

Business users do need highly interactive analytic dashboards with knobs & dials into our deep machine learning models and sliders onto our AI machines, no doubt.
We don’t want to abandon our beloved business users with static-rigid-newtonian-deterministic artifacts; we want them to have living, (fire) breathing intelligent-inferential-predictive-models.
But that control & interactivity is into a business analytics beast that has multiple layers, not directly onto a Hadoop or Hadoop-like system.

Also, separating the “what” from the “how” by a declarative interface is very important

You see, analytics has at least four layers viz. Infrastructure, Intelligence, Inference & Interface

Hadoop is Infrastructure, Spark is Infrastructure – The “How”
Machine Learning algorithms are Intelligence – Again lots of “How”
Models are Inference – the “What” Plus some “How”
Dashboard is the Interface (usually) – definitely the “How”
Interface can be recommendations, financial predictions, ad forecasts or even actual devices that interface to predictive models
And business needs knobs & dials at the Inference & Interface layers
- The Infrastructure then appropriately fires up frameworks such as Hadoop or Spark or Java or iPython …
Digging deeper, Hadoop itself has three layers – none of them operable by a business user, but real workhorses
- HDFS – the distributed File System
- MapReduce – the distributed data parallel computation engine
- HBase – the NOSQL data store

Back to Andrew’s points, Hadoop (and its cousins) should remain a tool for the Chefs; but diners do need to express their choices and have the ability to “tweak” the seasonings, portions or even the amount of cooking. A declarative interface (which tells what but not how) comes from the domain specific menus catered by the restaurants which focus on respective culinary styles or even a fusion !
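
To make the “what” vs. “how” split a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Request` fields, the `plan` function, the engine names and the one-million-row cutoff are my assumptions, not any real product’s API. The diner fills in the order; the kitchen picks the saucepan.

```python
from dataclasses import dataclass


# Hypothetical order a business user might place: it states WHAT is wanted.
@dataclass(frozen=True)
class Request:
    question: str      # e.g. "forecast"
    metric: str        # e.g. "ad_revenue"
    horizon_days: int  # how far ahead to look


def plan(request: Request, rows_available: int) -> dict:
    """Hypothetical planner: the platform, not the user, decides HOW to run it."""
    # Assumed rule of thumb, for illustration only: small data runs locally,
    # large data goes to a cluster engine.
    engine = "local-python" if rows_available < 1_000_000 else "spark-cluster"
    return {
        "engine": engine,
        "job": f"{request.question}:{request.metric}",
        "horizon_days": request.horizon_days,
    }


order = Request(question="forecast", metric="ad_revenue", horizon_days=30)
small = plan(order, rows_available=50_000)
large = plan(order, rows_available=2_000_000_000)
print(small["engine"], large["engine"])
```

The same order runs on two very different kitchens, and the business user never has to know which one was fired up.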

Now I am getting Hungry ! On my way downstairs (am at the Hilton – NY Fashion District) to my favorite Chipotle – who in fact give me the declarative freedom, without getting into their kitchen and the need to handle the saucepans ;o) It is better that way because I am terrible with cooking and spice measures – I can tell less salt but not the amount !

[1] http://en.wiktionary.org/wiki/not_touch_something_with_a_ten_foot_pole

[2] Interface from http://img1.mxstatic.com/wallpapers/1bb91493c637d7c5ed6e1cefbef87ec1_large.jpeg

The Best Of the Worst – AntiPatterns & Antidotes in Big Data

November 6, 2014 by ksankar

Recently I was a panelist at the Big Data Tech Con discussing the Best of the Worst Practices in Big Data. We had an interesting and lively discussion with pragmatic, probing questions from the audience. I will post the video, if they publish it.

Here are my top 5 (Actually Baker’s 5 ;o) ) – Anti Patterns and of course, the Antidotes …

1. Data Swamp

Typical case of “ungoverned data stores addressing a limited data science audience“.

The scene reads like so :

The company proudly has crossed the chasm to the big data world with a new shiny Hadoop infrastructure. Now everyone starts putting their data into this “lake”. After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue about what data is in there, its consistency or its semantic coherence.

Larry at IT starts the data import process every week sometime on Friday night – they don’t want to load the DW with a Hadoop feed during customer hours. Sometimes Larry forgets and so you have no clue if the customer data is updated every week or every 14 days; there could be duplicates or missing record sets …

Gartner has an interesting piece about Data Lakes turning to Data Swamps – The Data Lake Fallacy: All Water and Little Substance … A must read …

The antidote is Data Curation. It needn’t be very process-heavy – have a consistent schema & publish it in a Wiki page. Of course, if you are part of a big organization (say retail) and have petabytes of data, naturally it calls for a streamlined process.

Data quality, data lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …

Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation and discipline
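
A lightweight curation check could be as small as the sketch below. It is only an illustration under my own assumptions: `CUSTOMER_SCHEMA`, the field names and `check_batch` are hypothetical, standing in for whatever schema you publish on the Wiki. It flags the duplicates and missing fields that Larry’s Friday-night import might leave behind.

```python
from datetime import date

# Hypothetical published schema (the kind of thing you would put on the Wiki page).
CUSTOMER_SCHEMA = {"customer_id": int, "name": str, "loaded_on": date}


def check_batch(rows, schema, key="customer_id"):
    """Return (clean_rows, problems) for one import batch."""
    clean, problems, seen = [], [], set()
    for i, row in enumerate(rows):
        missing = [k for k in schema if k not in row]
        wrong = [k for k, t in schema.items() if k in row and not isinstance(row[k], t)]
        if missing or wrong:
            problems.append((i, f"missing={missing} wrong_type={wrong}"))
            continue
        if row[key] in seen:
            problems.append((i, f"duplicate {key}"))
            continue
        seen.add(row[key])
        clean.append(row)
    return clean, problems


batch = [
    {"customer_id": 1, "name": "Ann", "loaded_on": date(2014, 10, 31)},
    {"customer_id": 1, "name": "Ann", "loaded_on": date(2014, 10, 31)},    # duplicate
    {"customer_id": "2", "name": "Bob", "loaded_on": date(2014, 10, 31)},  # wrong type
    {"customer_id": 3, "name": "Cy"},                                      # missing load date
]
clean, problems = check_batch(batch, CUSTOMER_SCHEMA)
print(len(clean), "clean rows;", len(problems), "problems")
```

Keeping the `loaded_on` field mandatory is the cheap bit of lineage here: it tells you whether the feed really arrived this week or 14 days ago.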

2. Technology Stampede & Vendor Pollution

You start with a few machines, install Apache Hadoop and start loading data. Marketing folks catch up on your big data project and now you have a few apps running. A few projects come onboard and you have a decent sized cluster
A famous cloudy Hadoop vendor approaches the management and before you know it, the company has an enterprise license and you are installing the vendor’s Hadoop Manager & MPP database named after a GM car.
Naturally the management expects the organization to morph into a data-driven, buzz-compliant organization with this transition. And, naturally a Hadoop infrastructure alone is not going to suddenly morph the organization to a big data analytics poster child … vendor assertions aside …
And behold, the Pivotal guys approach another management layer and inevitably an enterprise license deal follows … their engineers come in and revamp the architecture, data formats, flows,…
- Now you have Apache, the cloudy Hadoop vendor and this Pivotal thing – all Hadoop, all MPP databases, but of course, subtle differences & proprietary formats prevent you from doing anything substantial across the products …
  - In fact even though their offices are a few miles apart on US-101, their products look like they are developed by people on different planets !
While all is going on, the web store front analytics folks have installed Cassandra & are doing interesting transformations in the NOSQL layer …
Of course the brick and mortar side of the business uses HBase; now there is no easy way to combine inferences from the web & store-walk-ins
One or two applications (maybe more) are using MongoDB and they also draw data from the HDFS store
And, the recent water cooler conversations indicate that another analytics vendor from Silicon Valley is having top level discussions with the CIO (who by now is frustrated with all the vendors and technology layers) … You might as well order another set of machines and brace for the next vendor wave …

The antidote is called Architecture & a Roadmap (I mean real architecture & roadmap based on business requirements)

Understand that no vendor has the silver bullet and that all their pitches are at best 30% true … One will need products from different vendors, but choose a few (and reject the rest) wisely !

3. Big Data to Nowhere

Unfortunately this is a very common scenario – IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business facing apps. A conversation goes like this …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?

IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !

Business : … (unprintable)

The Antidote is very simple. Build the full stack ie bits to business … do the data management ie collect-store-transform as well as a few apps that span the model-reason-predict-infer

Build incremental Decision Data Science & Product Data Science layers, as appropriate … for example the following conversation is a lot better …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?

IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that Males between 34-36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !

Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?

IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.

IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Now, business got an app, not fully production ready, still useful. IT has proved the value of the platform and can ask for more $ !

4. A Data too far

This is more of a technology challenge. When disparate systems start writing to a common store/file system, the resulting formats and elements would be different. You might get a few .gz files, a few .csv files and of course, parquet files. Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional. The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
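
Here is a small sketch of what “combining them” takes in practice. The two feeds, the store names and IDs, and the `name_to_id` lookup are all hypothetical; the point is only that one feed has to be re-keyed (names to IDs) and re-grained (days to weeks) before the numbers can be added up.

```python
from collections import defaultdict
from datetime import date

# Hypothetical feed A: daily sales keyed by store name.
daily_by_name = [
    ("Downtown", date(2014, 11, 3), 120.0),
    ("Downtown", date(2014, 11, 4), 80.0),
    ("Airport", date(2014, 11, 5), 50.0),
]
# Hypothetical feed B: weekly sales keyed by store ID and ISO (year, week).
weekly_by_id = {(101, (2014, 45)): 30.0, (102, (2014, 45)): 10.0}
# Hypothetical lookup table that somebody has to curate.
name_to_id = {"Downtown": 101, "Airport": 102}


def to_weekly(daily, lookup):
    """Roll daily (name, day, amount) rows up to {(store_id, (year, week)): total}."""
    totals = defaultdict(float)
    for name, day, amount in daily:
        year, week, _ = day.isocalendar()
        totals[(lookup[name], (year, week))] += amount
    return dict(totals)


def combine(a, b):
    """Add two weekly dictionaries that share the same key shape."""
    merged = defaultdict(float)
    for source in (a, b):
        for key, amount in source.items():
            merged[key] += amount
    return dict(merged)


combined = combine(to_weekly(daily_by_name, name_to_id), weekly_by_id)
print(combined)
```

Note that the roll-up only works in one direction: you can aggregate days into weeks, but you cannot recover days from a weekly feed, so the coarsest feed dictates the grain of the combined view.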

5. Technology Maze

This is the flip side of the data format challenge (#4 above). This one is because of the technology stampede and the resulting incompatible systems where data is locked in.

Antidote : Integrated workflows.

Came across an interesting article by Jay Kreps. Couple of quotes:

” .. being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products. But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.”

“…The current state of this kind of data flow is a scary thing. It is often built out of csv file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.”

“..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”

The solution : A pub-sub infrastructure (Kafka) + “surrounding code that made up this realtime data pipeline—code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of data that was delivered, or derived analytics on these streams“
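
To show the pub-sub idea (and only the idea), here is a toy in-memory broker. To be clear, `TinyBroker` is my own stand-in and has nothing to do with the actual Kafka client API; it just illustrates why a shared, replayable log lets systems exchange data without point-to-point csv dumps.

```python
from collections import defaultdict


class TinyBroker:
    """In-memory stand-in for a pub-sub log; NOT the Kafka API."""

    def __init__(self):
        self._log = defaultdict(list)          # topic -> ordered messages
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback, replay=True):
        # A new consumer can first replay what was already published.
        if replay:
            for message in self._log[topic]:
                callback(message)
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Append to the topic's log, then fan out to current subscribers.
        self._log[topic].append(message)
        for callback in self._subscribers[topic]:
            callback(message)


broker = TinyBroker()
warehouse, dashboard = [], []

broker.subscribe("orders", warehouse.append)
broker.publish("orders", {"order_id": 1, "amount": 25.0})
broker.publish("orders", {"order_id": 2, "amount": 40.0})
# A late subscriber still sees both orders because the topic keeps a log.
broker.subscribe("orders", lambda m: dashboard.append(m["amount"]))

print(len(warehouse), sum(dashboard))
```

The producer publishes once and never needs to know who consumes; that decoupling is what replaces the rsync-and-duct-tape flows in the quotes above.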

6. Where is the Tofu ?

This is succinctly articulated by Xavier Amatriain when he talked about the Netflix Prize:

It is very simple to produce “reasonable” recommendations and extremely difficult to improve them to become “great”. But there is a huge difference in business value between reasonable and great …

The Antidote : The insights and the algorithms should be relevant and scalable … There is a huge gap between Model-Reason and Deploy …

What says thou ? Please add comments about the Anti Patterns that you have observed and lived thru!

Updates:

[Jan 3, 15] Interesting blog in Forbes “Four Common Mistakes That Can Make For A Toxic Data Lake” says :

Your big data strategy ends at the data lake
The data in your data lake is abstruse
The data in your data lake is entity-unaware
The data in your data lake is not auditable

Big Data on the other side of the Trough of Disillusionment

October 6, 2013 by ksankar

Ron Kasabian from Intel has an interesting blog at GigaOm on “Four tips to take you beyond the big data hype cycle”.
I also had similar points about the adoption of Big Data in my earlier blogs.
This might be a good time to reiterate a couple of points, incorporating insights from Ron

5. Don’t implement a technology infrastructure but the end-to-end pipeline a.k.a. Bytes To Business

Simple Reason : Business doesn’t care about a shiny infrastructure, but about capabilities they can take to market …

4. Think Business Relevance and agility from multiple points of view

Aggregate Even Bigger Datasets, Scenarios and Use Cases

Be flexible, tell your stories, leveraging smart data, based on ever changing crisp business use cases & requirements

3. Big Data cuts across enterprise silos – facilitate organization change and adoption

Data always has been siloed, with each function having its own datasets – transactional as well as data marts
Big Data, by definition, is heterogeneous & multi-schema
Data refresh, source of truth, organizational politics and even fear come into the picture. Deal with them in a positive way

2. Build Data Products

Now that you have unified and built a data hub, it is time to think about building products out of the data and the insights.

1. tbd

One more for the road …

Jeff Dean : Lessons Learned While Building Infrastructure Software at Google

September 16, 2013 by ksankar

Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning on a series of blogs highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

Google’s Jeff Dean had an interesting presentation about his experience building GFS, MapReduce, BigTable & Spanner. For those interested in these papers, I have organized them – A Path through NOSQL Reading

Highlights in pictures (Full slides at XLDB 2013 site):

The Big Data Convergence

April 8, 2013 by ksankar

As we scan the concepts, technologies, products and the practices in the big data space, a lot of things get muddier.

Neither the progression nor the boundaries are clear. We are still in the descriptive stage in terms of the application of the analytics technologies.

I had a good conversation with Bob Friday yesterday – his question was “What prevents us from answering 80% of the questions via automatic inferences ?” And that is the “Adaptive” stage we need to be at …

I think a diagram is much better than me writing 100,000 words. So here it is :

In many ways, a lot of the underlying technologies are converging.

For example, A(rtificial) I(ntelligence) = NLP + N(atural) L(anguage) U(nderstanding) + ML + K(nowledge) R(epresentation) + Reasoning

Are Amazing Intelligent Machines in the works ?

Big Data State Of The Union

March 21, 2013 by ksankar

An informative study by TCS on the current state of Big Data “The Emerging Big Returns on Big Data”

Of course, you should download and read the whole report. Some interesting highlights:

There’s a polarity in spending on Big Data, with a minority of companies spending massive amounts and a larger number spending very little
The business functions expecting the greatest ROI on Big Data are not the ones you may think – while Sales & Marketing have initiatives, finance & logistics are betting on big data for efficiencies & insights
The biggest challenges to getting business value from Big Data are as much cultural as they are technological
Nearly half the data (49%) is unstructured or semi-structured, while 51% is structured. The heavy use of unstructured data is remarkable given that just a few years ago it was nearly zero in most companies – Enterprises have gone multi-structured !
Monitoring how customers use their products to detect product and design flaws is seen as a critical application for Big Data

Cheers & Happy Reading …

Big Data – Technologies, Platforms & Products

February 17, 2013 by ksankar

I came across a good diagram depicting the Big Data Eco System companies. I wanted to overlay the products as well as the Data management/Data Science pipeline viz Collect-Store-…. Here is my diagram. Let me know how I can improve it and what it is missing.

5 Steps to Pragmatic Data …er… Big Data

January 6, 2013 by ksankar

It is 2013 & Big Data is big news … Time to revisit my older (Nov’11) blog “Top 10 Steps to A Pragmatic Big Data Pipeline” … Some things have changed but many have remained the same …

5. Chuck the hype, embrace the concept …

This seems to be the first obvious step for organizations. Everyone from Edd Dumbill (“Big data” is an imprecise term...) to TechCrunch (“Perhaps it’s about the actual functionality of apps vs. the data”) agrees with the concept, but the terms and the marketing hype have hit the proverbial roof. The point is, there are many ponies in this pile & there is tremendous business value (so long as one is willing to discount the hype and think Big Data = All Data) …

I really like Mike Gualtieri’s very insightful definition of Big Data as

… the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers

4. Don’t implement a Technology, implement THE Big Data pipeline

Think of Big Data in multiple dimensions rather than as a point technology & evolve the pipeline focussing on all the aspects of the stages

The technologies, the skill sets and the tools are evolving, so are the business requirements.

Chris Taylor addresses this very clearly (“Big Data must not be an elephant riding a bicycle“) – viz. One has to address the entire spectrum to get value …

Simply applying distributed storage and processing (like Hadoop) to extremely large data sets is like putting an elephant on a bicycle .. it just doesn’t make business sense — Chris Taylor

3. Think Hybrid – Big Data Apps, Appliances & Infrastructure

I had addressed this one in my earlier blog(“Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms“)

The moral of the story : Think out-of-the-box & inside-the-box.

Match the impedance of the use cases with appropriate technologies

2. Tell your stories, leveraging smart data, based on crisp business use cases & requirements

Evolve the systems incrementally focussing on the business values that determine the stories to tell, the inferences to derive, the feature sets to influence & the recommendations to make

Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. In fact, integration with BI is an interesting challenge for Big Data …

No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence leveraging currently installed technologies & skill sets. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL

From a business perspective Patrick Keddy of Iron Mountain has a few excellent suggestions on managing Big Data:

Big data informs and enhances judgement and intuition, it should not replace them

Opt for progress over perfection

View the data in context

1. Apply the art of Data Science & Smart Data, paying attention to touch points

This still remains my #1. Data Science is the key differentiator resulting in new insights, new products, order of magnitude performance, new customer base et al – “a cohesive narrative from the numbers & statistics”

“Data science is about trying to create a process that allows you to create new ways of thinking about problems that are novel, or you are trying to use data to create or make something.” says D.J.Patil

Smart Data = Big Data + context + inference + declaratively interactive visualization

Smart Data is (inference) model driven & declaratively interactive
For example,
- The information like Wikipedia is big data; the in-memory representation Watson referred to is smart data
- Device logs from 1000 good mobile handsets and 1000 not-so-good phones is big data; a gam or glm over the log data after running through several stages of MapReduce is smart data, because it could give you an insight as to what factors or combination of factors make a good phone a bad phone
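
As a toy version of the handset example, the sketch below fits a logistic-regression GLM by plain gradient ascent in pure Python. The eight rows, the two features (`dropped_calls_per_day`, `firmware_is_old`) and the step settings are all invented for illustration; in real life the rows would be the per-handset summaries coming out of the MapReduce stages and you would reach for a proper stats package.

```python
import math

# Hypothetical per-handset summaries:
# (dropped_calls_per_day, firmware_is_old, is_bad_phone)
handsets = [
    (0.5, 0, 0), (1.0, 0, 0), (1.5, 1, 0), (3.0, 1, 0),
    (2.0, 0, 1), (2.5, 1, 1), (3.5, 1, 1), (4.0, 1, 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, steps=20000, rate=0.1):
    """Fit a logistic-regression GLM by batch gradient ascent on the log-likelihood."""
    weights = [0.0, 0.0, 0.0]  # intercept, dropped calls, old firmware
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for dropped, old_fw, bad in rows:
            x = (1.0, dropped, float(old_fw))
            error = bad - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(3):
                grad[j] += error * x[j]
        weights = [w + rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


intercept, w_dropped, w_old_fw = fit_logistic(handsets)


def predict_bad(dropped, old_fw):
    """Estimated probability that a handset with these readings is a bad phone."""
    return sigmoid(intercept + w_dropped * dropped + w_old_fw * old_fw)


print("dropped-calls coefficient:", round(w_dropped, 2))
print("P(bad | 4.0 drops, old fw):", round(predict_bad(4.0, 1), 2))
```

The raw logs are the big data; the handful of fitted coefficients, which say how much each factor pushes a phone towards “bad”, are the smart data.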

Focus not only on the Vs (ie Volume, Velocity, Variability & Variety) but also on the Cs (ie. Connectedness & Context)

Contextualization – includes integration with BI systems as well as data mashup with structured and unstructured data. An interesting example is how the 2012 US Presidential Election was won – All The President’s Data Scientists and All The President’s DevOps.

The two main Big Data challenges in 2013 would be:

1st : Data integration across silos to get the comprehensive view &

2nd : Matching the real-time velocity of business viz. CEP, sense & respond et al.

For example, I have already seen folks looking outside Hadoop for CEP and near-realtime response

“.. 85% of respondents say the issue is not about the volume of data but the ability to analyze and act on data in real time” says Ryan Hollenbeck quoting a 2012 Cap Gemini study (Italics mine)

Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms

January 4, 2013 by ksankar

I have been following the 2013 predictions for Big Data. Naturally lots of interesting predictions. Here are a few that I understand and (sort of) agree with :

Big Data Borgs & Rise of the Big Data Machines: Big Data apps & Learning Machines will rule, says Information Week’s Jeff Bertolucci
The era of Big Data automation w/ purpose-built BigData robots is upon us – predicts Derrick Harris @gigaom
[Update 1] “… New specialized ARM-based servers to do specialty analytics. Advantage being more performance in smaller footprint, with reduced power requirements” says David Cappuccio
Smart Big Data Appliances can also reduce operational complexity given “… for every 25% increase in functionality of a system, there is 100% increase in complexity” again from Gartner’s David Cappuccio
The Revenge of the Algorithms : Big Data Algorithm Wars will begin, says Forrester’s Mike Gualtieri
- I agree with Mike that Big Data = all data
- Mike’s Definition of Big Data is very insightful & right on the dot
  
  … the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers
- Also Mike is right in saying that Real-Time architectures will be prominent. In my discussions with companies, this need already is on the forefront. A well-crafted analytic infrastructure using Hadoop has a real-time least count of ~15 min (Eg. Facebook’s infrastructure)
Big Data = [Business + Technology] [Requirements + Users] : Big Data will move out of sandbox with crisp business and technology requirements says John Bantleman @wired
[Update 1] The Era of Smart Data : Actually the concept of Smart Data = Big Data + context + inference + declaratively interactive visualization has been there for a couple of years
IBM’s James Kobielus & colleagues are big on cross-scale architectures & cross-application deployments. I agree.
“… big data is about sensing, algorithmic discovery and gaining deeper insight through data” says Edd Dumbill in Forbes. Well said.
- Edd continues to say that “Essentially, the emergence of a global digital nervous system” …
- I am neutral about it … Feedback loop is good, interlinked data is good, but taking Big Data as the digital nervous system … well that makes me nervous ;o)
[Update 1] The era of Pragmatic Big Data : Companies have to decouple the marketing hype from the facts on the ground. Do what works for you, not what the hype says. The Top 10 Steps for a Pragmatic Big Data remains almost the same as when I wrote the blog a year ago ! Have an update “5 Steps To Pragmatic Data …er… Big Data” … What says thee ?

What or Who is a Data Scientist ?

December 15, 2012 by ksankar

In many ways Data Scientist is an elusive role. As we internalize the application of big data, the role and the skill sets would evolve …
The two important questions are – “What would a Data Scientist do ?” & “What tools/skillsets does a Data Scientist have in one’s back pocket ?”
Some believe that “A Data Scientist’s Real Job is Storytelling”
- Jeff Bladt & Bob Filbin at HBR have 3 interesting points
  1. Look only for data that affect your organization’s key metrics
  2. Present data so that everyone can grasp the insights
  3. Return to the data with new questions
Forbes article “The Where, Who, and Why of Data Scientists” is an interesting read
[Update 1] Irving Wladawsky-Berger talks about Data Scientists having hybrid set of IT skills. Interesting blog.
[Update 2] To those who came here directly, a forward pointer to my short blog on Data Science Engineers. To others, this will put you in a loop ;o)
[Update 3] Defining & Differentiating the role of a Data Scientist by Doug Laney is interesting
- Defying the conventional wisdom of Data Scientist = PhD in Statistics, I think Data Scientist is much broader, but with focus on Models, Algorithms, Visualization, Prediction & Exploration of Big Data, of course.
[Update 4] Data Science beyond the Hype – interesting blog
Here is my quick take – an equation and a picture to go with it !

DataScientist = Part Hacker + Part Technologist + Part Detective + Part Scientist + Part Business Analyst + Part Visual Artist

[Update 12/22/12] GigaOm says : Data Science = Data Architecture + Machine Learning + Analytics. Makes sense. I have updated my diagram accordingly
[Update 1/13/14] “Soft Skills – to understand what actually needs to be solved is important” says John Forman. Interesting read
[Update 1/19/14] An interesting article in ACM – Data Science & Prediction
[Update 1/19/14] Information Week : Big Data Analytics – Descriptive vs. Predictive vs. Prescriptive
[Update 1/25/14] Some more thoughts on Data Science in my blog The curious case of the Data Science Profession
