5 Lessons on AI from the Tesla Autopilot Fatality

July 3, 2016 by ksankar

Unfortunately, it takes extreme repercussions for us to feel, in our bones, the limitations of our technologies.

Three points :

I have included relevant links about this incident at the end of the blog (incl. the AutoPilot v8 with Radar). An informative read.
One of my parents died in an automobile accident, so I do know, first hand, the human toll – I do not take this lightly; in fact the reverse is true
And the views expressed in my writing are my own and do not reflect any organization I am part of … now or in the future …

Lesson 1 : Our machines inherit our faults (so far …)

As I pointed out in one of my AI blogs:

[Image: Robot-06]

Lesson 2 : Many domains are not forgiving to byzantine failures

We are learning that painful lesson, whether with rockets, airplanes or cars. Even though we freak out if Snapchat is down for an hour, we can survive that, but not these. Drivers need to understand the downsides of these technologies and stay alert.

Lesson 3 : Mission Critical Systems should have redundancy, over coverage & independence

For example multiple sensor sources & probably independent situational interpretation. I saw the following from somewhere where the Japanese Ministry talks about “correcting the wrong train of thoughts”:

Lesson 4 : Swarm Intelligence

Lesson 5 : This might lead to some level of Standardization & Legislation

Standardization of components & protocols
Legislation/Standardization of algorithms or semantic behaviors incl image recognition, policies and pragmas …
Even driver education and certification to drive autonomous vehicles !

[Image: Robot-05]

Reference:

Yann LeCun @deeplearningcdf Collège de France

June 25, 2016 by ksankar

I am spending this weekend with Yann LeCun (virtually, of course) studying the excellent video lectures and slides at the Collège de France: a set of 8 lectures by Yann LeCun (BTW pronounced as LuCaan) and 6 guest lectures. The translator does an excellent job – especially as it involves technical terms and concepts !

(I will post a few essential slides for each video …)

Inaugural Reading – Deep Learning: a Revolution in Artificial Intelligence

[Image: LeCunn-01]

My favorite slide – of course !!! And the DGX-1 !!

Missing Pieces of AI – interesting …

The reasoning, attention, episodic memory and rational behavior based on a value system are my focus for autonomous vehicles (cars & drones!)

Convnets are everywhere !

Probably the most important slides of the entire talk – the future of AI.

Parse it a couple of times; it is loaded with things that we should pay attention to …

Can AI beat hardwired bad behaviors ?

[Image: LeCunn-06]

I agree here, here, here and here – we don’t want AI to imitate us, but take us to higher levels !

Stay tuned for the rest of the video summaries …

The Best Of the Worst – AntiPatterns & Antidotes in Big Data

November 6, 2014 by ksankar

Recently I was a panelist at the Big Data Tech Con discussing the Best of the Worst Practices in Big Data. We had an interesting and lively discussion with pragmatic, probing questions from the audience. I will post the video, if they publish it.

Here are my top 5 (actually a Baker’s 5 ;o) ) – Anti Patterns and, of course, the Antidotes …

1. Data Swamp

Typical case of “ungoverned data stores addressing a limited data science audience”.

The scene reads like so :

The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure. Now everyone starts putting their data into this “lake”. After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue about what data is in there, or about its consistency and semantic coherence.

Larry at IT starts the data import process every week sometime on Friday night – they don’t want to load the DW with a Hadoop feed during customer hours. Sometimes Larry forgets and so you have no clue if the customer data is updated every week or every 14 days; there could be duplicates or missing record sets …

Gartner has an interesting piece about Data Lakes turning to Data Swamps – The Data Lake Fallacy: All Water and Little Substance … A must read …

The antidote is Data Curation. It needn’t be very process-heavy – have a consistent schema & publish it on a Wiki page. Of course, if you are part of a big organization (say retail) and have petabytes of data, naturally it calls for a streamlined process.

Data quality, data lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …

Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation and discipline
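To make the "consistent schema plus descriptive metadata" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration (the `DatasetEntry` record, the `validate_record` and `is_stale` helpers, the column names and dates); it is not any particular catalog product, just the kind of lightweight check that could sit next to the Wiki page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetEntry:
    """One catalog record: roughly what a wiki page per data set might hold."""
    name: str
    owner: str
    source: str          # lineage: where the data comes from
    schema: dict         # column name -> expected Python type
    last_loaded: date    # when the last import actually ran
    refresh_days: int    # how often a fresh load is expected


def validate_record(entry, record):
    """Return a list of schema problems for one record (empty list means OK)."""
    problems = []
    for column, expected_type in entry.schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(
                f"bad type for {column}: {type(record[column]).__name__}")
    for column in record:
        if column not in entry.schema:
            problems.append(f"undocumented column: {column}")
    return problems


def is_stale(entry, today):
    """True if the last load is older than the promised refresh interval."""
    return (today - entry.last_loaded).days > entry.refresh_days


# Hypothetical catalog entry for the weekly customer feed
customers = DatasetEntry(
    name="customers", owner="it-data", source="DW weekly export",
    schema={"customer_id": int, "name": str, "zip": str},
    last_loaded=date(2014, 10, 17), refresh_days=7)

good = {"customer_id": 1, "name": "Ada", "zip": "94301"}
bad = {"customer_id": "1", "name": "Ada", "segment": "gold"}

print(validate_record(customers, good))
print(validate_record(customers, bad))
print(is_stale(customers, date(2014, 10, 31)))
```

With something like this in place, the "did Larry forget the Friday load?" question becomes a staleness check rather than a guess.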

2. Technology Stampede & Vendor Pollution

You start with a few machines, install Apache Hadoop and start loading data. Marketing folks catch up on your big data project and now you have a few apps running. A few projects come onboard and you have a decent sized cluster
A famous cloudy Hadoop vendor approaches the management and before you know it, the company has an enterprise license and you are installing the vendor’s Hadoop Manager & MPP database named after a GM car.
Naturally the management expects the organization to morph into a data-driven, buzz-compliant organization with this transition. And, naturally a Hadoop infrastructure alone is not going to suddenly morph the organization to a big data analytics poster child … vendor assertions aside …
And behold, the Pivotal guys approach another management layer and inevitably an enterprise license deal follows … their engineers come in and revamp the architecture, data formats, flows,…
- Now you have Apache, the cloudy Hadoop vendor and this Pivotal thing – all Hadoop, all MPP databases, but of course, subtle differences & proprietary formats prevent you from doing anything substantial across the products …
  - In fact, even though their offices are a few miles apart on US-101, their products look like they were developed by people on different planets !
While all is going on, the web store front analytics folks have installed Cassandra & are doing interesting transformations in the NOSQL layer …
Of course the brick and mortar side of the business use HBase; now there is no easy way to combine inferences from the web & store-walk-ins
One or two applications (maybe more) are using MongoDB and they also draw data from the HDFS store
And, the recent water cooler conversations indicate that another analytics vendor from Silicon Valley is having top level discussions with the CIO (who by now is frustrated with all the vendors and technology layers) … You might as well order another set of machines and brace for the next vendor wave …

The antidote is called Architecture & a Roadmap (I mean real architecture & roadmap based on business requirements)

Understand that no vendor has the silver bullet and that all their pitches are at best 30% true … One will need products from different vendors, but choose a few (and reject the rest) wisely !

3. Big Data to Nowhere

Unfortunately this is a very common scenario – IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business facing apps. A conversation goes like this …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?

IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !

Business : … (unprintable)

The Antidote is very simple. Build the full stack, i.e. bits to business … do the data management, i.e. collect-store-transform, as well as a few apps that span the model-reason-predict-infer

Build incremental Decision Data Science & Product Data Science layers, as appropriate … for example the following conversation is a lot better …

Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?

IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !

Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?

IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.

IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Now, business got an app, not fully production ready, still useful. IT has proved the value of the platform and can ask for more $ !

4. A Data too far

This is more of a technology challenge. When disparate systems start writing to a common store/file system, the resulting formats and elements would be different. You might get a few .gz files, a few .csv files and of course, parquet files. Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional. The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
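A small sketch of what "combining them" can involve, in plain Python. The two feeds, the `name_to_id` mapping and the helper names are all hypothetical; the point is only that one source is transactional and keyed by name, the other is aggregated by day and keyed by ID, and both have to be brought to a common key and a common grain (here: customer ID and week) before any inference across them is possible.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def to_weekly(rows):
    """Roll (customer_id, day, amount) rows up to (customer_id, week) totals."""
    totals = defaultdict(float)
    for customer_id, day, amount in rows:
        totals[(customer_id, week_start(day))] += amount
    return dict(totals)


# Hypothetical source A: pure transactions, keyed by customer *name*
web_transactions = [
    ("alice", date(2014, 11, 3), 20.0),
    ("alice", date(2014, 11, 5), 5.0),
    ("bob", date(2014, 11, 4), 12.5),
]

# Hypothetical source B: daily aggregates, keyed by customer *ID*
store_daily = [
    (101, date(2014, 11, 3), 40.0),
    (101, date(2014, 11, 9), 10.0),
    (102, date(2014, 11, 10), 7.0),
]

# The curated mapping that somebody has to own and maintain
name_to_id = {"alice": 101, "bob": 102}

web_weekly = to_weekly(
    [(name_to_id[name], day, amount) for name, day, amount in web_transactions])
store_weekly = to_weekly(store_daily)

# Align both feeds on (customer_id, week): value is (web total, store total)
combined = {
    key: (web_weekly.get(key, 0.0), store_weekly.get(key, 0.0))
    for key in sorted(set(web_weekly) | set(store_weekly))
}

for key, value in combined.items():
    print(key, value)
```

None of this is hard, but every pair of sources needs its own version of it, which is exactly why the data ends up "too far".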

5. Technology Maze

This is the flip side of the data format challenge (#4 above). This one arises from the technology stampede and the resulting incompatible systems where data is locked in.

Antidote : Integrated workflows.

Came across an interesting article by Jay Kreps. A couple of quotes:

“… being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products. But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.”

“…The current state of this kind of data flow is a scary thing. It is often built out of csv file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.”

“..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”

The solution : A pub-sub infrastructure (Kafka) + “surrounding code that made up this realtime data pipeline—code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of data that was delivered, or derived analytics on these streams“
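To show the shape of that idea, here is a toy in-memory sketch in Python. It is deliberately NOT the Kafka API: `InMemoryLog`, `is_valid`, the topic name and the consumer names are all invented for illustration. What it tries to capture is the pattern in the quote: producers append to a shared log, a bit of surrounding code checks the structure of the data, and each downstream system reads at its own pace from its own offset.

```python
from collections import defaultdict


class InMemoryLog:
    """A toy stand-in for a pub-sub log (an assumption, not the Kafka API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only message list
        self._offsets = defaultdict(int)   # (topic, consumer) -> next offset

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, topic, consumer):
        """Return messages this consumer has not seen yet, advancing its offset."""
        start = self._offsets[(topic, consumer)]
        messages = self._topics[topic][start:]
        self._offsets[(topic, consumer)] = start + len(messages)
        return messages


def is_valid(event):
    """Minimal structure check before anything enters the pipeline."""
    return (isinstance(event.get("user"), str)
            and isinstance(event.get("amount"), (int, float)))


log = InMemoryLog()
incoming = [
    {"user": "alice", "amount": 20},
    {"user": "bob"},                      # malformed: no amount, gets dropped
    {"user": "carol", "amount": 7.5},
]
for event in incoming:
    if is_valid(event):
        log.publish("orders", event)

# Two independent downstream systems read the same stream
search_batch = log.poll("orders", "search-indexer")
hadoop_batch = log.poll("orders", "hadoop-loader")

# A later event only shows up as new data for whoever polls again
log.publish("orders", {"user": "dave", "amount": 3})
search_next = log.poll("orders", "search-indexer")

print(search_batch, hadoop_batch, search_next)
```

The design choice worth noticing is that consumers never talk to each other or to the producers; adding a new system means adding a new reader of the log, not a new point-to-point copy job.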

6. Where is the Tofu ?

This is succinctly articulated by Xavier Amatriain when he talked about the Netflix Prize:

It is very simple to produce “reasonable” recommendations and extremely difficult to improve them to become “great”. But there is a huge difference in business value between reasonable and great …

The Antidote : The insights and the algorithms should be relevant and scalable … There is a huge gap between Model-Reason and Deploy …

What says thou ? Please add comments about the Anti Patterns that you have observed and lived through !

Updates:

[Jan 3, 15] Interesting blog in Forbes “Four Common Mistakes That Can Make For A Toxic Data Lake” says :

Your big data strategy ends at the data lake
The data in your data lake is abstruse
The data in your data lake is entity-unaware
The data in your data lake is not auditable


Of Building Data Products

November 17, 2013 by ksankar

[Update 5/9/15] Lots of good pointers from “Everything We Wish We’d Known About Building Data Products”
- Data Products Need to Be Built Differently
- Keep Your data clean
- Give Data Back in a Powerful Way – But don’t confuse or overwhelm the users
  - The users have to feel safe
  - The users have to feel they are in control
- Never try to launch a complicated data product on a fixed schedule
[Update 11/28/13] Notes from blog by Jon “Data Driven Disruption at Shutterstock” on what a data products company is
1. Data is your product, regardless of what you sell
2. Data is your lens into your business – Jon echoes Peter’s insights viz. invest in data access; feel the pulse of the business & iterate
3. Data creates your growth
Back to the main feature, Peter’s talk
A very insightful & informative talk by Peter Skomoroch of LinkedIn via Zipfian academy
It is short & succinct, only 37 minutes. I urge all to watch
The slides of the talk “Developing Data Products” are at slideshare
Quick Notes:
- A Data Product understands the world through inferential probabilistic models built on data
  - So collecting right data through “thoughtful” data design is very important
  - The data determines & precedes the feature set & the intelligence of your app
    - LinkedIn is a prime example – as they get more data, the app has become more intelligent, intuitive and ultimately more useful
    - Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments – customer segmentation & stratification is not just for retail !
- While more data is good (see the “Unreasonable Effectiveness of Data” Distinguished Lecture by Peter Norvig), complex models eventually require a deep understanding of the models and feature engineering (beyond the “black box”)
  - Data products about people are usually complex, in terms of models as well as the data

[Update 12/13/13] Remember, a data product usually has the three layers – Interface, Inference & Intelligence

XLDB Conference at Stanford – Quotable Quotes

September 16, 2013 by ksankar

The Extremely Large Database/XLDB 2013 Conference & the invited Workshop at Stanford had lots of good speakers and extremely interesting view points. I was able to attend and participate this year.

Previously I wrote two blogs on presentations by Google’s Jeff Dean and NEA’s Greg Papadopoulos.

Here are the highlights from the presentations. Of course, you should read thru all the XLDB 2013 presentation slides.

Greg Papadopoulos : Make it Big by Working Fast and Small

September 16, 2013 by ksankar

Last week I attended the XLDB Conference and the invited Workshop at Stanford. I am planning on a series of blogs highlighting the talks. Of course, you should read thru all the XLDB 2013 presentation slides.

NEA’s Greg Papadopoulos had a view point on innovation and startups. Highlights in pictures. Of course, you should read thru the full presentation.

I really liked the “Common Characteristics Of Success”. Golden words indeed !

Scaling Big Data – Impermium

May 8, 2013 by ksankar

Came across an informative blog on scaling big data – “Built to Scale: How does Impermium process data?” Quick notes from the blog:

Don’t fall in love with a technology so much that you cannot be separated – Be flexible in scaling as you grow
- “Parting is such sweet sorrow”, but change is an essential component of an infrastructure at scale
- The technology selection and consumption should be a continuous process, introducing new technologies as needed by the growth. I found Impermium’s path from grep to Solr to Elastic Search very illuminating; I have done the same before.
Technology needs are not static
- A corollary of #1 above – Growth on all parts of the stack will not be uniform.
- For example Impermium found scaling challenges in search and they moved to Solr & then to Elastic Search
There are no perfect technologies
- If you are doing interesting work, be ready to tango with open source code. This is essential – I also found this to be true.
- Even if you don’t plan to change the code, many times deep understanding comes from reading the code
Select technologies that you can dance with
- The flip side is that you should select technologies that you are comfortable working with under the hood.
- In my case, while I love Erlang, I am not that comfortable with that language. So given a chance, I will go with Java or Scala
Benchmark is nothing but a story in a specific context
- So true. Benchmarks are transitory & personal.
- Understand them, but they need not be true for your transforms, your data model and your processing.
- Benchmark early & benchmark often … with your scenarios, models, transformations, mapreduces & data
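The "benchmark with your own scenarios & data" point can be sketched with nothing more than the standard library's `timeit`. The two word-count candidates and the `sample` data below are placeholders I made up; the idea is to swap in your own transforms and a representative slice of your own data, and to treat the numbers as a story about that context only.

```python
import timeit
from collections import Counter


def count_with_counter(words):
    """Candidate 1: word counts via collections.Counter."""
    return dict(Counter(words))


def count_with_dict(words):
    """Candidate 2: word counts via a plain dict."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def benchmark(candidates, data, repeat=3, number=10):
    """Time each candidate on the given data; returns name -> best seconds per call."""
    results = {}
    for name, fn in candidates.items():
        timings = timeit.repeat(lambda: fn(data), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


# Placeholder data: replace with a representative sample of your own
sample = ["a", "b", "a"] * 1000

results = benchmark(
    {"counter": count_with_counter, "dict": count_with_dict}, sample)

for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s per call")
```

Which candidate wins will depend on the machine, the Python version and the data, which is rather the point of the quote above.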

Thanks Young for the short but very interesting blog. Keep up the good work …

Cheers

<k/>

Google Doodle Celebrating Lady Ada Lovelace

December 10, 2012 by ksankar

The Guardian says it all – “Mathematician and daughter of Lord Byron left legacy as role model for young women entering technology careers”.

Ada programming language is interesting. I have developed systems in the Ada language.

All the President’s DevOps

December 8, 2012 by ksankar

On the heels of “All the President’s Data Scientists”, another interesting article on the Obama campaign’s cloud infrastructure.

Update : A similar article The Atlantic’s “When the Nerds Go Marching In”

Update : Case Study from New Relic How the Obama For America team improved resilience

They realized the campaign needed a scalable system: “2008 was the ‘Jaws’ moment,” said Obama for America’s Chief Technology Officer Harper Reed. “It was, ‘Oh my God, we’re going to need a bigger boat.’”
They built a single shared data tier with APIs to build lots of interesting applications: “Being able to decouple all the apps from each other has such power; It allowed us to scale each app individually and to share a lot of data between the apps, and it really saved us a lot of time.”
They leveraged internet architecture: “We aggressively stood on the shoulders of giants like Amazon, and used technology that was built by other people,”
It doesn’t look like they used esoteric technologies. The system is built around Python APIs over RDS, SQS and so forth. Excellent, and the fact that the systems can be built this way is a testament to the cloud capabilities – IaaS & PaaS
In short, Reed says it all: “When you break it down to programming, we didn’t build a data store or a faster queue. All we did was put these pieces together and arrange them in the right order to give the field organization the tools they needed to do their job. And it worked out. It didn’t hurt that we had a really great candidate and the best ground game that the world has ever seen.”

Facebook Infrastructure @ New Years Eve – A study in Scalability

December 31, 2011 by ksankar

Another interesting article on how Facebook is preparing for New Year’s Eve, this time from our own San Jose Mercury News, by Mike Swift.

Interesting points:

New Year is one of the busiest times for social network sites as people post pictures & exchange best wishes

CEO Mark Zuckerberg has long been focused on having the digital horsepower to support unbridled growth — are a key reason behind the .. network’s success

It received > 1 B photo uploads during Halloween 2010
Since then Facebook added 200 million more members and so New Year’s Eve 2012 can see more than 1.5 B uploads !
My favorite quote from the article:

The primary reason Friendster died was because it couldn’t handle the volume of usage it had. … They (Mark, Dustin and Sean) always talked about not wanting to be ‘Friendstered,’ and they meant not being overwhelmed by excess usage that they hadn’t anticipated

The engineers at Facebook just finished a preflight checklist and are geared up for the scale
In terms of scale “Facebook now reaches 55 percent of the global Internet audience, according to Internet metrics firm comScore and accounts for one in every seven minutes spent online around the world.”
From a Big Data perspective, Facebook data has all the essential properties viz. Connected & Contextual in addition to large scale – Volume & Velocity (see my earlier blog on big data)
Facebook has the “Emergency Parachutes” which let the site degrade gracefully (for example display smaller photos when the site is heavily loaded)
Their infrastructure instrumentation is legendary (for example, the MySQL talk here)
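The "Emergency Parachute" idea (degrade gracefully instead of falling over) can be sketched in a few lines of Python. This is only my illustration of the pattern, not Facebook's mechanism: the function name, the load scale of 0.0 to 1.0 and the 0.7 / 0.9 thresholds are all assumptions.

```python
def choose_photo_size(load, thresholds=(0.7, 0.9)):
    """Pick a photo size from the current load (0.0 = idle, 1.0 = saturated).

    The thresholds are hypothetical; a real system would tune them
    from its own instrumentation.
    """
    busy, critical = thresholds
    if load < busy:
        return "large"     # normal operation: full-size photos
    if load < critical:
        return "medium"    # getting busy: shed some bandwidth
    return "small"         # parachute open: keep the site up, serve less


for load in (0.3, 0.8, 0.95):
    print(load, choose_photo_size(load))
```

The point is that the site stays up with a slightly poorer experience, rather than offering the full experience to nobody.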

To manage Facebook’s data infrastructure, you kind of need to have this sense of amnesia. Nothing you learned or read about earlier in your career applies here …
