The Best Of the Worst – AntiPatterns & Antidotes in Big Data


Recently I was a panelist at the Big Data Tech Con discussing the Best of the Worst Practices in Big Data. We had an interesting and lively discussion with pragmatic,probing questions from the audience. I will post the video, if they publish it.

Here are my top 5 (Actually Bakers 5 ;o) ) – Anti Patterns and of course, the Antidotes …

1. Data Swamp

swamp-01Typical case of “ungoverned data stores addressing a limited data science audience“.

The scene reads like so :

The company proudly has crossed the chasm to the big data world with a new shiny Hadoop infrastructure. Now every one starts putting their data into this “lake”. After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue on what data is in there, the consistency and the semantic coherence.

Larry at IT starts the data import process every week sometime in Friday night – they don’t want to load the DW with a Hadoop feed during customer hours. Sometimes Larry forgets and so you have no clue if the customer data is updated every week or every 14 days; there could be duplicates or missing record sets …

Gartner has an interesting piece about Data Lakes turning to Data Swamps – The Data Lake Fallacy: All Water and Little Substance … A must read …


The antidote is Data Curation. It needn’t be a very process heavy – have a consistent schema & publish them in a Wiki page. Of course, if you are part of a big organization (say retail) and have petabytes of data, naturally it calls for a streamlined process.

Data quality, data lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …

Semantic consistency across diverse multi-structured multi-temporal transactions require a level of data curation and discipline



2. Technology Stampede & Vendor Pollution

  • You start with a few machines, install Apache Hadoop and start loading data. Marketing fols catchup on your big data project and now you have a few apps running. A few projects come onboard and you have a decent sized cluster
  • A famous cloudy Hadoop vendor approaches the management and before you know it, the company has an enterprise license and you are installing the vendor’s Hadoop Manager & MPP database named after a GM car.
  • Naturally the management expects the organization to morph into a data-driven, buzz-compliant organization with this transition. And, naturally a Hadoop infrastructure alone is not going to suddenly morph the organization to a big data analytics poster child … vendor assertions aside …
  • And behold, the Pivotal guys approach another management layer and inevitably an enterprise license deal follows … their engineers come in and revamp the architecture, data formats, flows,…
    • Now you have Apache, the cloudy Hadoop vendor and this Pivotal thing – all Hadoop, all MPP databases, but of course, subtle differences & proprietary formats prevent you from doing anything substantial across the products …
      • In fact even though their offices are few miles apart in US-101, their products look like they are developed by people in different planets !
  • While all is going on, the web store front analytics folks have installed Cassandra & are doing interesting transformations in the NOSQL layer …
  • Of course the brick and mortar side of the business use HBase; now there is no easy way to combine inferences from the web & store-walk-ins
  • One or two applications (may be more) are using MongoDB and they also draw data from the HDFS store
  • And, the recent water cooler conversations indicate that another analytics vendor from Silicon Valley is having top level discussions with the CIO (who by now is frustrated with all the vendors and technology layers) … You might as well order another set of machines and brace for the next vendor wave …

The antidote is called Architecture & a Roadmap (I mean real architecture & roadmap based on business requirements)

Understand that no vendor has the silver bullet and that all their pitches are at best 30% true … One will need products from different vendors, but choose a few (and reject the rest) wisely ! 



3. Big Data to Nowhere

bridge-to-nowhereUnfortunately this is a very common scenario – IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business facing apps. A conversation goes like this …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !

Business : … (unprintable)


The Antidote is very simple. Build the full stack ie bits to business … do the data management ie collect-store-transform as well as a few apps that span the model-reason-predict-infer

Build incremental Decision Data Science & Product Data Science layers, as appropriate … for example the following conversation is a lot better …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that Males between 34 -36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !

Business : That is interesting … Show me a graph. BTW, do you know what is the revenue is and the profit margin from these buys ?

IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.

IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Now, business got an app, not fully production ready, still useful. IT has proved the value of the platform and can ask for more $ !



4. A Data too far

This is more of a technology challenge. When disparate systems start writing to a common store/file system, the resulting formats and elements would be different. You might get a few .gz files, a few .csv files and of course, parquet files. Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional. The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …



5. Technology Maze

This is the flip side of the data format challenge (#4 above) This one is because of the technology stampede and the resulting incompatible systems where data is locked in.

Antidote : Integrated workflows.

Came across an interesting article by Jay Kreps. Couple of quotes:

” .. being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products. But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.”

“…The current state of this kind of data flow is a scary thing. It is often built out of csv file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.”

“..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”

The solution : A pub-sub infrastructure (Kafka) + “surrounding code that made up this realtime data pipeline—code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of data that was delivered, or derived analytics on these streams



5. Where is the Tofu ?

This is succinctly articulated by Xavier Amatriain when he talked about the Netflix Prize:

It is very simple to produce “reasonable” recommendations and extremely difficult to improve them to become “great”. But there is a huge difference in business value between reasonable Data Set and great …

The Antidote : The insights and the algorithms should be relevant and scalable … There is a huge gap between Model-Reason and Deploy …



What says thou ? Pl add comments about the Anti Patterns that you have observed and lived thru!



Ref:
[1] http://2buiart.deviantart.com/art/Swamp-Temple-Night-443413082

[2] https://www.flickr.com/photos/xdmag/1470243032/

Human Ingenuity puts a Fiat 500 C infront of VW HQ in Google Street View


Image

Interesting Story at Zagg Blog

Fiat’s workers spotted the Google Street view car, which captures panoramic views of locations around the world with a roof camera, driving past its offices. They quickly drove a Fiat 500C to Volkswagen’s head office @ Södertälje 45 minutes away and parked there ! Just in Time for the StreetView Camera to snap the following picture !

Now the Googel Pegman shows the Fiat in front of VW HQ !

An ode to the Easter Eggs, Ecstasies & Agonies of a GoogleIO Ticket


Chronicles of my failed attempt at procuring a GioogleIO Ticket … The Google Wallet ate my GogleIO 2013 Ticket !

It was the night before GoogleIO … Excitement was in the air … Tweets were in order …

Image

The order of the day was to find all Easter Eggs in the page …

Image

I clicked and clicked and clicked … and got thru all the easter Eggs …

Image

Image

Image

Image

Image

Image

Image

Image

Image

Image

Image

And I slept …

It was early AM when I woke up … still 15 min before the GoogleIO stores open …

Image

The wait was agonizing, but all for a good cause, so I thought …

I was there when the GoogleIO Ticket store opened …

Image

I was not disappointed when my first try failed after 6 minutes …

Image

And my optimism payed off when it eventually found me a precious little ticket …

Image

I reviewed the purchase … and gave it to Google Wallet … little did I know that …

Image

But the screen stayed there and the time ticked down ….

By now the verdict was clear – The Google Wallet is going to eat my lucky GoogleIO Ticket ….

And It did …..

Image

And soon after the registration ended …. The cold hand of fate …

Image

Can I find a kind soul at Google to help me or should I wait for GoogleIo 2014 ? ….

Tomcat 7 packaging layout & install in Ubuntu 12.04


Today I installed Tomcat7 in Ubuntu. They have changed the layout of the directories and the changes are for good. We all are used to the Tomcat’s old layout and the new layout takes a little time to get used to … At first I couldn’t make head or tail out of it. Then I looked for things where they should be … and viola … it made perfect sense !

  • /usr/share/tomcat7 <- The bn, lib and the rest
  • /usr/share/tomcat7-root <- Am still figuring out what this does
  • /etc/tomcat7/ <- configuration files
  • /var/log/tomcat7/ <- logs. For some reason this directory is rwxr-x—-. Really should be  wxr-xr-x
  • /var/lib/tomcat7/ <- webapps et al go here