The Bridges of Pittsburgh County … That Autonomous Cars can’t SLAM through !


Same as my post on LinkedIn

Autonomous cars bring out interesting nuances in the normal things that we take for granted and don’t think twice about !

Business Insider’s article “Here’s why self-driving cars can’t handle bridges” fits this category.

“Bridges are really hard,… and there are like 500 bridges in Pittsburgh.”

Of course, it is the infinite (or near-infinite) context that we humans can process and machines aren’t even close … But one would think bridges would be easier – no distractions, well-designed straight roads; of course, with the current GPS accuracy, the car might think that it is in the water and start rolling out its fins !!

“You have a lot of infrastructure on the bridge above the level of the car that we as humans take into account, … But when you sense those things with a sensor that doesn’t have the domain knowledge that we do … you could imagine that the girders coming up from the side of the bridge and that kind of thing would be disturbing or possibly confusing.”

Pittsburgh does have bridges, lots of them … There is even a BBC documentary! Even how the city deals with the bridges is interesting.

In fact Pittsburgh is called “The City of Bridges”, even though some have different interpretations (we will come to that discussion in a minute)

While we are on the subject, I do have a couple of books for the Uber Car to read ! It can even order them through its robotic friend Alexa ! Or drive to wherever fine books are sold, on its own time – Uber might not pay for the impromptu solo drive.

  1. Bob Regan’s Book is the first one to read
  2. Next is Pittsburgh’s Bridges (Images of America)
  3. The book Bridges… Pittsburgh at the Point… a Journey Through History gives interesting perspectives the riders would enjoy (of course, the ones with enquiring minds…)
  4. Finally, the hardcore bridge fans would be thrilled to hear from Pittsburgh’s Bridges: Architecture and Engineering

Now, to SLAM: it is a set of algorithms collectively called Simultaneous Localization And Mapping – a very interesting topic by itself.

In short, a SLAM system needs known points in addition to unknown points to reason about & figure out its trajectory – bridges have fewer known points it can rely on …
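To make the "known points" idea concrete, here is a toy localization sketch (definitely not a real SLAM pipeline – the landmarks, true position and noise-free ranges are all made up for illustration): with a few known landmarks and range measurements, a position can be recovered; on a bridge, with few such anchors, the fit has little to hold on to.

```python
import math

# Three known landmarks (map features a SLAM system can anchor on) and
# noise-free range measurements from an unknown true position.
landmarks = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
ranges = [math.dist(true_pos, lm) for lm in landmarks]

def localize(landmarks, ranges, step=0.1):
    """Brute-force least squares: try candidate positions on a grid and keep
    the one whose predicted ranges best match the measured ranges."""
    best, best_err = None, float("inf")
    grid = [i * step for i in range(101)]   # candidates 0.0 .. 10.0
    for x in grid:
        for y in grid:
            err = sum((math.dist((x, y), lm) - r) ** 2
                      for lm, r in zip(landmarks, ranges))
            if err < best_err:
                best, best_err = (x, y), err
    return best

est = localize(landmarks, ranges)   # est lands very close to true_pos
```

Remove the landmarks and there is nothing to minimize against – which is, in miniature, the bridge problem.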

Deep Learning ConvNets, as well as traditional computer vision with a dash of contextualization, would definitely be a good start … but that is a topic for another time (sooner rather than later …). Probably an interesting opportunity for bridges.ai or openbridges.org

For those snappy Machine Learning experts, there is even a Pittsburgh Bridges Data Set at UCI to start with ! Probably nowhere near the data needed to train modern Convolutional Nets, but one can augment the images with transformations like flips, jitter, random crops and scaling et al.
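For illustration, here are toy stdlib versions of those augmentations (a real pipeline would use a library like torchvision; the 4×4 "image" and the noise amount are made up):

```python
import random

def hflip(img):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in img]

def jitter(img, amount=10, seed=7):
    """Add small random noise per pixel, clamped to the 0-255 range."""
    rnd = random.Random(seed)
    return [[max(0, min(255, p + rnd.randint(-amount, amount))) for p in row]
            for row in img]

def random_crop(img, size, seed=7):
    """Cut a size x size window out of the image at a random offset."""
    rnd = random.Random(seed)
    top = rnd.randint(0, len(img) - size)
    left = rnd.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]

# A made-up 4x4 grayscale "image"
img = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]
augmented = [hflip(img), jitter(img), random_crop(img, 3)]
```

Each transform yields a "new" training example, which is how small datasets get stretched.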

If we think Pittsburgh is difficult, wait until Uber starts autonomous driving in Amsterdam ! While Pittsburgh has 446 bridges, many sources put Amsterdam with over 1000 bridges that cars can travel. There are many bicycle and pedestrian bridges in Amsterdam that an Uber car wouldn’t be interested in – except, of course, to pick up the tired pedestrians ;o). The which-city-has-max-number-of-bridges discussions can be followed here:

  1. http://www.wtae.com/Just-How-Many-Bridges-Are-There-In-Pittsburgh/7685514
  2. http://nolongerslowblog.blogspot.com/2014/02/what-city-has-most-bridges-and-why-is.html
  3. https://www.quora.com/Which-city-has-the-most-bridges

Trivia:

  • The bridges in Pittsburgh are not painted Yellow (as one might tend to think) but Aztec Gold !
  • And yep, it is Allegheny County ! But Pittsburgh rhymes better ;o)

Cheers

 

A Stylistic Chronicled Guide to Isaac Asimov


It is an experience to read and brood over the writings of Isaac Asimov.

If you are just starting, I envy you – you have a wonderful journey ahead of you !!

asimov-00

But don’t just start a series randomly – the journey has a very disciplined roadmap, so that the mysteries of the world of robots will be systematically revealed to you.

Asimov is required reading for anyone working on Autonomous Cars, Artificial Intelligence and, to a lesser extent, Machine Learning.

Detour : Other must-read AI books include “The Master Algorithm” and “Final Jeopardy” – among a lot of good books …

The books are more relevant now than then – you see, then it was science fiction, now the concepts are turning into reality !!!

As the Reddit Series Guide mentions, you can follow the publishing order or the internal story chronological order. But both are non-optimal and I think the orders would interfere with the reader’s thinking.

Isaac Asimov himself has suggested an order, which is closer to my thinking but still not quite …

[Note : I pieced together the list from various discussions in Reddit and will note original comments within quotes]

First things first – read the Robot Series, in chronological/publication order. You have to meet Elijah Baley and R. Daneel Olivaw !

A) The Caves of Steel 

B) The Naked Sun

C) The Robots of Dawn

D) Robots and Empire

asimov-03

Then comes the Foundation Series.

The two common recommendations are to read these either publication order or chronological order.

I have a third recommendation: start with the original trilogy, then read the prequels, and end with Edge and Earth. …

This gives a good arrangement stylistically, with the earlier novels followed by the later ones. Asimov’s writing style changes distinctly over time. It also gives a good arrangement chronologically, with the prequels foreshadowing the final two books, instead of explaining things you’ve already read about.

And best of all, you end with the cliffhanger, instead of reading it and then reading 2-5 more books that don’t resolve it.

The following order “preserves the mystery the first-time reader would have going into the first Foundation book. Part of the enjoyment of the Foundation novel is that you don’t know who Seldon is, in those opening scenes on Trantor, or what role he’s going to play in the story. If you read Prelude and Forward first, you’ll already have an earful about Trantor and Seldon before you get to Seldon’s introduction through Gaal Dornick’s eyes in Foundation.”

asimov-04

E) Foundation

F) Foundation and Empire

G) Second Foundation

H) Foundation’s Edge

I) Foundation and Earth

J) And finally, read Stephen Collins’ conclusion, The Foundation’s Resolve. I found it a satisfying end to a great saga.

(Optional – Rest of the Foundation Books)

i) Prelude to Foundation

ii) Foundation’s Fear (if you really must)

iii) Forward the Foundation

iv) Foundation and Chaos

v) Foundation’s Triumph

Now you are ready for the rest.

K) “Complete Robots”

The books are packaged with overlapping content. Questions like “The Complete Robots” vs “Robot Dreams” & “Robot Visions” vs “I, Robot” come up all the time. This Reddit discussion addresses this dilemma.

Then you can diverge to other books like Nemesis and The End Of Eternity. The [Galactic] Empire Series are not essential, but do read them – “The Currents of Space”, “The Stars, Like Dust” and “Pebble in the Sky”. Publishing order is fine.

You should also explore Isaac Asimov’s Home Page.

Now you are part of the Asimov club – and you have one interesting task to do, which is feedback ! Add comments to this blog with insights – you could even add a new roadmap guide of your own with a very different POV !!!

Yann LeCun @deeplearningcdf Collège de France


I am spending this weekend with Yann LeCun (virtually, of course) studying the excellent video lectures and slides at the Collège de France. A set of 8 lectures by Yann LeCun (BTW pronounced LuCaan) and 6 guest lectures. The translator does an excellent job – especially as it involves technical terms and concepts !

(I will post a few essential slides for each video …)

Inaugural Reading – Deep Learning: a Revolution in Artificial Intelligence

LeCunn-01

My favorite slide – of course !!! And the DGX-1 !!

Missing Pieces of AI – interesting …

The reasoning, attention, episodic memory and rational behavior based on a value system are my focus for autonomous vehicles (cars & drones !)

Convnets are everywhere !

Probably the most important slides of the entire talk – the future of AI.

Parse it a couple of times – it is loaded with things that we should pay attention to …

Can AI beat hardwired bad behaviors ?

LeCunn-06

I agree here, here, here and here – we don’t want AI to imitate us, but take us to higher levels !

Stay tuned for rest of the video summaries …..

What would you want AI to do, if it could do whatever you wanted it to do ?


P.S: Copy of my blog on LinkedIn

Note : I am capturing interesting updates at the end of the blog.

AI-06-01

Exponential Advances:

An interesting article in Nature points out that exponential advances in technology can result in a very different world very soon.

IBM X Prize:

 

AI-07

And the IBM AI X Prize is offering a chance to showcase powerful ideas that tackle challenges.

Got me thinking … What would we want our machines/AI to do ?

I am interested in your thoughts. Please comment on what you would like AI to do.

Earlier I had written about us not wanting our machines to be like us; understand us – maybe; help us – definitely; but imitate us – absolutely not …

So what does that mean ?

  • Driving cars ? – Definitely
  • Image recognition, translation and similar tasks ? – Absolutely
  • Write like Shakespeare just by feeding all the plays to a neural network like the LSTM ? – Definitely not !

I see folks training Deep Learning systems by feeding them Shakespeare plays to see what the AI can write. A good exercise, but is that something we would get an X Prize for ? Of course, that is putting the cart before the horse !

We don’t write just by memorizing the dictionary and Elements of Style !!

  • We write because we have a story to tell.
  • The story comes before writing;
  • Experience & imagination comes before a story …
  • A good story requires both narrative power and powerful content, with its own anti-climax and, of course, the hanging chads ;o)
  • Which the current AI systems do not possess …
  • Already we have robots (Google Atlas) that can walk like a human – leaving aside the goofy gait – which, of course, is mainly a mechanical/balance problem rather than an AI challenge
  • Robots can drive way better than a human
  • They translate a lot better than humans can (Of course language semantics is a lot more mechanical than storytelling)
  • Robots and AI machines do all kinds of stuff (Even though Mercedes Assembly plant found that they cannot handle the versatile customization!)

Is there anything remaining for an AI prize ? One wonders …

In the article “How Google’s impressive new robot demo will fuel your nightmares”, at 2:09, the human (very rudely) pushes the robot to the ground and the robot gets up on its own ! That proves that we have solved the mechanical aspects of human anatomy w.r.t. movements & balance.

[Update 3/17/16] Looks like Google is pushing Boston Dynamics out of the fold !

AI-08

 

But a meta question remains.

  • Would the robot be upset at the human ?
  • Would it know the difference – if it was pushed to keep it away from harm’s way (say a falling object) vs. out of spite ?
  • And, if we later hug the robot (as the author suggests we do) would it feel better ?
  • Will it forget the insult ?

So there is something to be done after all !

Impart into our AI the capability to imagine, the ability to understand what life is; to feel sadness & joy; to understand what it is to struggle through a loss, …

This is important – for example, if we want robots to act as companions for the sick, the elderly and the disabled, maybe even the occasionally lonely, the desolate and, for that matter, even the joyous !

If the AI cannot comprehend sadness, how can it offer condolences as a companion ? Wouldn’t understanding our state of mind help it to help us better? 

AI-09

In many ways, by helping AI to understand us, the ultimate utility might not be whether AI really comprehends us or not, but whether we get to understand ourselves better in the process !! And that might be the best outcome of all of these innovations.

As H2O.ai Chief SriSatish points out,

Over the past 100 years, we’ve been training humans to be as punctual and predictable as machines; … we’re so used to being machines at work—AI frees us up to be humans again ! – Well said SriSatish

With these points in mind, it is interesting to speculate what the AI X-Prize TED talks would look like in 2017; in 2018. And what better way to predict the future than to invent it ? I am planning on working on one or two submissions …

And what says thee ?

[Update 3/12/16] Very interesting article in GoGameGuru about AlphaGo’s 3rd win.

AI-11

  • AlphaGo’s strength was simply remarkable and it was hard not to feel Lee’s pain
  • Having answered many questions about AlphaGo’s strengths and weakness, and exhausting every reasonable possibility of reversing the game, Lee was tired and defeated. He resigned after 176 moves.
  • It’s time for broader discussion about how human society will adapt to this emerging technology !!

And Jason Millar @guardian agrees.

AI-12

Maybe all is not lost after all, WSJ says … !

AI-13.png

[Update 3/9/16] Rolling Stone has a 2-part report – Inside the Artificial Intelligence Revolution. They end the report with a very ominous statement.

AI-10

 

[Update 3/4/16] Baidu Chief Scientist Andrew Ng has insightful observations

  • “What I see today is that computer-driven cars are a fundamentally different thing than human-driven cars and we should not treat them the same”- so true !

[Update 3/6/16] An interesting post from Tom Davenport about Cognitive Computing.

  • Good insights into what Cognitive Computing is, as a combination of Intelligence(Algorithms), Inference(Knowledge) and Interface (Visualization, Recommendation, Prediction,…)
  • IMHO, Cognitive Computing is more than Analytics over unstructured data, it also has touches of AI in there.
  • Reason being, Cognitive Computing understands humans – whether it is about buying patterns, or the way different bodies react to drugs, or the various forms of diseases, or even the way humans work and interact
  • And that knowledge is the difference between Analytics and Cognitive Computing !

I like Cognitive Computing as an important part of AI – probably that is where most of the applications are … again, understanding humans rather than being humans !

Reference & Thanks:

My thanks to the following links from which I created the collage:

  1. http://www.nature.com/news/a-world-where-everyone-has-a-robot-why-2040-could-blow-your-mind-1.19431
  2. http://lifehacker.com/the-three-most-common-types-of-dumb-mistakes-we-all-mak-1760826426
  3. https://e27.co/know-artificial-intelligence-business-20160223/
  4. http://spectrum.ieee.org/tech-talk/computing/software/digital-baby-project-aims-for-computers-to-see-like-humans
  5. http://www.techrepublic.com/article/10-artificial-intelligence-researchers-to-follow-on-twitter/
  6. http://www.lifehack.org/309644/book-lovers-alert-8-the-most-spectacular-libraries-the-world
  7. http://www.headlines-news.com/2016/02/18/890838/can-ai-fix-the-world-ibm-ted-and-x-prize-will-give-you-5-million-to-prove-it
  8. http://www.lifehack.org/366158/10-truly-amazing-places-you-should-visit-india?ref=tp&n=1
  9. http://www.lifehack.org/articles/lifestyle/20-most-magnificent-places-read-books.html

Twitter 2.0 = Curated Signals + Applied Intelligence + Stratified Inference


P.S: Copy of my blog on LinkedIn

Exec Summary:

One possible trajectory and locus (“product cadence”) for Twitter 2.0 is to be a platform – to tell stories with different levels of abstraction – from basic curated signals to aggregated intelligence (ie trends, positions, sentiments and issues) & finally the higher order of exposing stratified inference built on the signals and intelligence.

For example, CPC advertisers might want to know “Who is an NBA Fan” for personalized ad campaigns based on the interest graph (we did a similar project a few years ago, based on Twitter data)

All without sacrificing the core Twitter consumption experience, but adding different dimensions to Twitter consumption …

Constituents like political campaigns (pardon the pun ;o)) can consume the platform at different levels of sophistication. All the (potential !) President’s Data Scientists can run multiple models over the signals while All the President’s Devops build dashboards for the strategists to consume the curated inferences

Twitter Network != Facebook Network; Twitter Graph != LinkedIn graph ie. Twitter is an interest graph, not a social graph. If so, why can’t Twitter expose the interest graph as a first class entity, with appropriate intelligence?

Twitter is the right platform for ad-hoc, ephemeral spaces to exchange quick notes.

[Update : Julia Boorstin’s blog What’s Next for Twitter also echoes many of my recommendations below]

Details:

For various reasons I have been contemplating Twitter in general, and specifically political campaigns as an example of an ecosystem where Twitter has lots of potential

Twitter has been in the news recently with the CEO change as well as the stock dip. Time for Twitter 2.0 ? Definitely !

Interestingly, I had written about Twitter 2.0 in 2011 and most of it is still true ! I will include relevant parts from that blog.

For the technically minded who are into the gory details, please refer to materials from my 2012 OSCON tutorial [Social Network Analysis with Twitter]

What do campaigns want ?

They want curated inference (which they can directly consume for actionable outcomes) and curated intelligence (for overlaying specialized models over the exposed signals at different orders). All the President’s Data Scientists would have interesting data science models over the Twitter signals. A general model is like so:

Twitter 2.0 – Trajectory & Locus

Now Twitter is an agora for pure message-based interactions; but it has a lot more potential – to be a platform (of course, without sacrificing the essential nature of the medium) ! To get there, it needs to be proactive, providing different levels of abstraction – from the basic curated signals to aggregated intelligence (ie trends, positions, sentiments and issues) and the higher level of stratified inference. It should also provide congruences of Twitter to the Rest-Of-The-World, ie how indicative the twitter-verse is of the general population.

Topic Streams a.k.a TweeTopics


I use Twitter for 3 things – to keep current with topics that interest me, keep in touch with friends & acquaintances and finally publish things that I am interested in – many times as a bookmark !

It is almost impossible to follow topics. The List functionality never worked for me. It should be as easy to follow & unfollow at the level of topics as at the level of people. In this day and age, it is not that hard to run the tweets through a set of analytics engines, cluster them by subject and offer the topics, with the same semantics as people ! The current interaction semantics are very relevant – that is what makes Twitter Twitter.
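As a toy sketch of what such a "topic stream" could look like (the tweets are made up, and grouping by hashtag alone is my own simplification – a real engine would cluster on content):

```python
from collections import defaultdict

# Made-up tweets; a real engine would cluster on content, not just hashtags.
tweets = [
    "Great #SLAM paper on loop closure",
    "Who wins tonight ? #NFL",
    "New #SLAM dataset released",
]

def topic_streams(tweets):
    """Group tweets into followable per-topic streams, keyed by hashtag."""
    streams = defaultdict(list)
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                streams[word.lower()].append(tweet)
    return streams

streams = topic_streams(tweets)   # streams["#slam"] holds the two SLAM tweets
```

Following a topic would then just mean subscribing to one of these keys – the same semantics as following a person.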

There were some thoughts about tweet threading – I think that defeats the purpose; tweets are stateless and that attribute is very important.

Twitter is different from Facebook and LinkedIn – it is not a social graph but an interest graph. Many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense. But others, like Cliques and Bipartite Graphs, do.

Why can’t Twitter expose the interest graph, with appropriate intelligence ?

Topic Spaces a.k.a. TweetSpaces a.k.a TweetRooms

Twitter is the right platform for ad-hoc, ephemeral spaces to exchange quick notes.

This was my observation in 2011, and still it is true.

IM is too heavyweight and not that easy for quick things like “Where is that meeting room”, “Which seat are you in” or “What should we discuss next” et al. A one-to-many exchange, between people who are spatially (and even temporally) in separate spaces. They might be in a plane, on a call or even in a hallway ! It should be easy to add a “!” tag and shout the info. Yep, folks need to know what the ! tag is. Actually, come to think of it, we could have many types of tags using a lot of the ‘$’, ‘@’, ‘%’, ‘^’, ‘&’ and ‘*’ characters with different semantics !

Time for a “Tweet Mark-up Language” ?

These are some of my quick thoughts, what says thee ?

The Art of NFL Ranking, the ELO Algorithm and FiveThirtyEight


In this blog, I will focus on the NFL ranking based on the ELO algorithm that Nate Silver’s FiveThirtyEight uses. The guys at 538 have done a good job. The ELO and NFL ranking was part of my workshop at the Global Big Data Conference this Sunday. The full presentation is on SlideShare


ELO – the algorithm made famous by Facebook & depicted in the movie The Social Network

gbdc-r-04-P30


 Basic ELO

gbdc-r-05-P20

The k-Factor is the main leverage point to customize the algorithm for different domains.

  • For example, Chess has no notion of a season; Soccer, Football & Basketball are season-dependent – teams change from season to season
  • Chess has no score to consider except Win, Lose or Draw; but ball games have scores that need to be accommodated
  • For Chess k=10; for soccer it varies from 20 to 60 – from 20 for friendly matches to 60 for World Cup Finals
  • As we will see later, NFL adjusts k with the Margin Of Victory Multiplier
  • NFL also adjusts k to weigh recent games more heavily, w/ exponential decay
  • There are also mechanisms for weighing playoffs higher than regular season games (We will see this in Basketball)
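To make the k-factor discussion concrete, here is a minimal Python sketch of an ELO update with a Margin Of Victory multiplier, in the spirit of 538's NFL methodology (K=20 and the 2.2 / 0.001 damping constants follow my reading of their write-up – treat the exact numbers as assumptions, not gospel):

```python
import math

K = 20  # base k-factor; a low K suits the NFL's short season

def expected(r_a, r_b):
    """Expected score of team A against team B on the logistic ELO curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def mov_multiplier(point_diff, elo_diff_winner):
    """Margin-of-victory multiplier: grows with the log of the point margin,
    damped when the winner was already heavily favored (autocorrelation fix)."""
    return math.log(abs(point_diff) + 1) * (2.2 / (elo_diff_winner * 0.001 + 2.2))

def update(r_winner, r_loser, point_diff):
    """Zero-sum rating update: the winner gains exactly what the loser gives up."""
    k = K * mov_multiplier(point_diff, r_winner - r_loser)
    delta = k * (1.0 - expected(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# A 1600-rated favorite beats a 1500-rated team by 14 points
new_w, new_l = update(1600, 1500, 14)
```

A blowout by an underdog moves ratings a lot; a narrow win by a heavy favorite barely moves them – exactly the behavior the bullets above call for.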

538’s take on ELO

gbdc-r-05-P21

gbdc-r-05-P22


NFL 2014 Predictions & Results

The R program ELO-538.R is on GitHub

2014 Ranking Table

gbdc-r-05-P27

gbdc-r-05-P29

gbdc-r-05-P31

gbdc-r-05-P32


To Do

  1. Exponential decay with more weight for recent games – later in the season
  2. Calculate the rankings from 1940 to present, draw graphs like this from 538
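For to-do #1, one way to sketch the recency weighting is a half-life style exponential decay (the half_life knob here is an arbitrary illustrative choice, not 538's tuning):

```python
import math

def recency_weights(n_games, half_life=8.0):
    """Weights for a season of n_games: the most recent game gets weight 1.0,
    and the weight halves every `half_life` games back into the past."""
    return [math.exp(-math.log(2) * (n_games - 1 - i) / half_life)
            for i in range(n_games)]

w = recency_weights(4, half_life=1.0)
# w climbs toward 1.0 for the most recent game; each earlier game counts
# roughly half as much as the one after it
```

These weights could scale the per-game k before each ELO update, so late-season form dominates the rating.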

The Best Of the Worst – AntiPatterns & Antidotes in Big Data


Recently I was a panelist at the Big Data Tech Con discussing the Best of the Worst Practices in Big Data. We had an interesting and lively discussion with pragmatic, probing questions from the audience. I will post the video if they publish it.

Here are my top 5 (actually a Baker’s 5 ;o)) – Anti Patterns and, of course, the Antidotes …

1. Data Swamp

swamp-01

Typical case of “ungoverned data stores addressing a limited data science audience”.

The scene reads like so :

The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure. Now everyone starts putting their data into this “lake”. After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue what data is in there, its consistency or its semantic coherence.

Larry at IT starts the data import process every week, sometime on Friday night – they don’t want to load the DW with a Hadoop feed during customer hours. Sometimes Larry forgets, so you have no clue if the customer data is updated every week or every 14 days; there could be duplicate or missing record sets …

Gartner has an interesting piece about Data Lakes turning into Data Swamps – The Data Lake Fallacy: All Water and Little Substance … A must-read …


The antidote is Data Curation. It needn’t be very process-heavy – have a consistent schema & publish it on a Wiki page. Of course, if you are part of a big organization (say retail) and have petabytes of data, that naturally calls for a streamlined process.

Data quality, data lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …

Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation and discipline
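A sketch of how light that curation can be – one declared schema, one validation pass before records land in the lake (the field names here are hypothetical, just to illustrate the shape of the check):

```python
# One declared schema (hypothetical fields), published once, enforced always.
SCHEMA = {"customer_id": int, "updated": str, "amount": float}

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing field: {f}" for f in schema if f not in record]
    problems += [f"bad type for {f}: {type(record[f]).__name__}"
                 for f, t in schema.items()
                 if f in record and not isinstance(record[f], t)]
    return problems

ok = validate({"customer_id": 7, "updated": "2015-01-02", "amount": 9.99})
bad = validate({"customer_id": "7", "updated": "2015-01-02"})
```

Rejecting (or at least flagging) nonconforming records at ingest is the cheap insurance that keeps the lake from becoming a swamp.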



2. Technology Stampede & Vendor Pollution

  • You start with a few machines, install Apache Hadoop and start loading data. Marketing folks catch up on your big data project and now you have a few apps running. A few projects come onboard and you have a decent-sized cluster
  • A famous cloudy Hadoop vendor approaches the management and before you know it, the company has an enterprise license and you are installing the vendor’s Hadoop Manager & MPP database named after a GM car.
  • Naturally, the management expects the organization to morph into a data-driven, buzz-compliant organization with this transition. And, naturally, a Hadoop infrastructure alone is not going to suddenly morph the organization into a big data analytics poster child … vendor assertions aside …
  • And behold, the Pivotal guys approach another management layer and inevitably an enterprise license deal follows … their engineers come in and revamp the architecture, data formats, flows,…
    • Now you have Apache, the cloudy Hadoop vendor and this Pivotal thing – all Hadoop, all MPP databases, but of course, subtle differences & proprietary formats prevent you from doing anything substantial across the products …
      • In fact, even though their offices are a few miles apart on US-101, their products look like they were developed by people on different planets !
  • While all is going on, the web store front analytics folks have installed Cassandra & are doing interesting transformations in the NOSQL layer …
  • Of course the brick and mortar side of the business use HBase; now there is no easy way to combine inferences from the web & store-walk-ins
  • One or two applications (maybe more) are using MongoDB and they also draw data from the HDFS store
  • And, the recent water cooler conversations indicate that another analytics vendor from Silicon Valley is having top level discussions with the CIO (who by now is frustrated with all the vendors and technology layers) … You might as well order another set of machines and brace for the next vendor wave …

The antidote is called Architecture & a Roadmap (I mean real architecture & roadmap based on business requirements)

Understand that no vendor has the silver bullet and that all their pitches are at best 30% true … One will need products from different vendors, but choose a few (and reject the rest) wisely ! 



3. Big Data to Nowhere

bridge-to-nowhere

Unfortunately this is a very common scenario – IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business-facing apps. A conversation goes like this …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !

Business : … (unprintable)


The Antidote is very simple. Build the full stack, ie bits to business … do the data management (ie collect-store-transform) as well as a few apps that span model-reason-predict-infer

Build incremental Decision Data Science & Product Data Science layers, as appropriate … for example the following conversation is a lot better …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ? 

IT : Actually, we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34-36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !

Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin are from these buys ?

IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.

IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Now, the business got an app – not fully production-ready, but still useful. IT has proved the value of the platform and can ask for more $ !



4. A Data too far

This is more of a technology challenge. When disparate systems start writing to a common store/file system, the resulting formats and elements will differ. You might get a few .gz files, a few .csv files and, of course, Parquet files. Some will have IDs, some names; some aggregated by week, some aggregated by day, and others purely transactional. The challenge is that we have the data, but there is no easy way to combine it for interesting inferences …



5. Technology Maze

This is the flip side of the data format challenge (#4 above). This one is caused by the technology stampede and the resulting incompatible systems where data is locked in.

Antidote : Integrated workflows.

Came across an interesting article by Jay Kreps. Couple of quotes:

” .. being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products. But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.”

“…The current state of this kind of data flow is a scary thing. It is often built out of csv file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.”

“..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”

The solution : A pub-sub infrastructure (Kafka) + “surrounding code that made up this realtime data pipeline—code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of data that was delivered, or derived analytics on these streams”
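To see why pub-sub breaks the silo problem, here is a toy in-process bus (a stand-in for illustration only – nothing like Kafka's durable, partitioned log): producers publish to named topics and every subscriber sees each event, so systems stop needing point-to-point file dumps.

```python
from collections import defaultdict

class Bus:
    """A toy in-process pub-sub bus. Producers publish to named topics and
    every subscriber receives each event -- the decoupling Kafka provides,
    minus durability, partitioning and everything else that makes it real."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = Bus()
seen = []
bus.subscribe("clicks", seen.append)       # an analytics consumer
bus.subscribe("clicks", lambda e: None)    # a second, independent consumer
bus.publish("clicks", {"user": 42, "page": "/home"})
```

New consumers attach without touching the producers – that is the whole point of the pattern.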



6. Where is the Tofu ?

This is succinctly articulated by Xavier Amatriain when he talked about the Netflix Prize:

It is very simple to produce “reasonable” recommendations and extremely difficult to improve them to become “great”. But there is a huge difference in business value between reasonable and great …

The Antidote : The insights and the algorithms should be relevant and scalable … There is a huge gap between Model-Reason and Deploy …



What says thou ? Please add comments about the Anti Patterns that you have observed and lived through !



Updates:

[Jan 3, 15] An interesting blog in Forbes, “Four Common Mistakes That Can Make For A Toxic Data Lake”, says :

  1. Your big data strategy ends at the data lake

  2. The data in your data lake is abstruse

  3. The data in your data lake is entity-unaware

  4. The data in your data lake is not auditable


Ref:
[1] http://2buiart.deviantart.com/art/Swamp-Temple-Night-443413082

[2] https://www.flickr.com/photos/xdmag/1470243032/