Data Science is the new Electronics


P.S.: This is a copy of my blog post on LinkedIn.


A good friend of mine asked me “What exactly is this Data Science?”

That got me thinking – we have tons of blogs on “Who or What is a Data Scientist”, including mine.

One can explain the intuition behind Data Science and the pragmatics of the profession, but not the essence !

Then I remembered an engineer on a flight to Tokyo, who was in seat 61G; I was in 61H. It was years ago, probably a lot more years than many (or most) of the readers would remember. I asked him what he was doing and his answer was “Helping companies to embed electronics in their products!”. I remember when autos had no electrical circuits except for the lights. Then came ignition electronics, engine electronics and now powerful computers that control almost all functions; except, of course, rolling, for which we still need old-fashioned wheels & tires !

We are at that stage with Data Science, where the three Amigos of Data Science (Intelligence, Inference and Interface) can be embedded in enterprise systems, giving them capabilities that far exceed the current ones !

We can really build adaptive systems … not descriptive, not reactive but truly adaptive, with malleable intelligence instead of brittle Newtonian rules !

As Sonny Elliot would say – Exactically!

Exactically similar to Electronics some years ago ! Now is the time to think of Data Science as embeddable modules, with Intelligence/Inference at the systems level and interesting Interfaces for the users …

And that, probably, is the mission of Data Scientists …

If they choose to accept … This blog could self-destruct in 5 seconds …5…4…3…2

Data Science with Spark on the Databricks Cloud – Training at SparkSummit (East)


We had a good Data Science training session at the Sheraton, Times Square, NY, on the second day of SparkSummit (East). It was my privilege to co-author and lead the Data Science track, along with Reza, Paco, Andy, Hossein, TD, Joseph and Xiangrui. I have shared the slide set on SlideShare as well as on the Databricks site.

[Update 4/12/15]: The video is posted on YouTube (5 hrs!)

This was the second time I was involved with a training based entirely on the Databricks cloud, and it worked out very well ! The Databricks cloud was very robust and resilient. Unfortunately we had problems with the wireless at the Sheraton Hotel !

The training was a mixture of hands-on and lecture. We started out with a dataset of 30 records, then moved on to the Titanic dataset (900), then the MovieLens medium dataset (1,000,000) and finally the RecSys Challenge dataset (33,000,000!). What a progression in a day !

You can see the details in the slides. Ping me if you have any questions.

Data wrangling over the RecSys Challenge 2015 data captures the essence of the Databricks cloud. I will quickly cover the RecSys Challenge dataset as an illustration.

The training data consists of 33,003,944 clicks and 1,150,753 buys. Our mission, if we choose to accept it, is to predict the session-items bought from a test dataset of 8,251,791 clicks.
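To get a feel for how sparse the buy signal is, the counts quoted above can be turned into rough ratios. This is only back-of-the-envelope arithmetic on those three numbers; the variable names are mine and nothing here touches the actual data files.

```python
# Counts as quoted in the text above (RecSys Challenge 2015).
clicks_train = 33_003_944
buys_train = 1_150_753
clicks_test = 8_251_791

# Buy events per training click: a crude indicator of how rare buys are.
buys_per_click = buys_train / clicks_train

# Share of all clicks (train + test) that sit in the test set.
test_share = clicks_test / (clicks_train + clicks_test)

print(f"buy events per training click: {buys_per_click:.2%}")
print(f"test clicks as a share of all clicks: {test_share:.1%}")
```

Roughly 3.5 buy events per 100 clicks: whatever model we build has to find a fairly thin signal in a lot of browsing.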

A quick data exploration workflow:

[Screenshots: dbc-01, dbc-02, dbc-03]

All at scale, in an elastic cloud, seamlessly moving between dev, model, stage and prod ! The magic of Databricks Cloud !

BTW, we also explored the State of the Union speeches from Washington, Lincoln, FDR, Clinton, Bush & Obama. The graphs below show a succinct view of the mood of the nation in each period …

dbc-04

And finally, 100 slides later …!

DataSci-03-P100

The Art of NFL Ranking, the ELO Algorithm and FiveThirtyEight


In this blog, I will focus on the NFL Ranking based on the ELO algorithm that Nate Silver’s FiveThirtyEight uses. The guys at 538 have done a good job. The ELO and NFL ranking was part of my workshop at the Global Big Data Conference this Sunday. The full presentation is on SlideShare.


ELO – the algorithm made famous by Facebook & depicted in the movie The Social Network

gbdc-r-04-P30


 Basic ELO

gbdc-r-05-P20

The k-Factor is the main leverage point to customize the algorithm for different domains.

  • For example, Chess has no notion of a season; Soccer, Football & Basketball are dependent on seasons – teams change between seasons
  • Chess has no score to consider except Win, Lose or Draw; but ball games have scores that need to be accommodated
  • For Chess k=10 (at the top level); for soccer it varies from 20 to 60: 20 for friendly matches up to 60 for World Cup Finals
  • As we will see later, NFL adjusts k with the Margin Of Victory Multiplier
  • NFL also adjusts k to weigh recent games more heavily, w/ exponential decay
  • There are also mechanisms for weighing playoffs higher than regular season games (We will see this in Basketball)
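The points above all hang off the same basic update, so here is a minimal sketch of the textbook ELO step with k as an explicit parameter. The function names and the default k=20 are my own choices for illustration; this is not taken from the ELO-538.R program.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B (between 0 and 1) on the usual 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k controls how far a single result moves the ratings.
    """
    shift = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + shift, rating_b - shift


if __name__ == "__main__":
    # Two evenly rated sides: the winner gains exactly k/2.
    print(elo_update(1500.0, 1500.0, 1.0, k=20.0))
    # Same game with a World-Cup-sized k: the same result moves ratings three times as far.
    print(elo_update(1500.0, 1500.0, 1.0, k=60.0))
```

The update is zero-sum: whatever one side gains, the other loses, and a larger k simply makes the ranking react faster to each result.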

538’s take on ELO

gbdc-r-05-P21

gbdc-r-05-P22
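As a rough sketch of how the Margin Of Victory Multiplier plugs into the basic update: the formula and the constants below (K = 20, 2.2 and 0.001) are how I recall 538's published description of its NFL ratings, so treat them as assumptions and check them against 538's article and ELO-538.R. Home-field advantage and ties are left out to keep it short.

```python
import math

K = 20.0  # assumed NFL k-factor; verify against 538's write-up


def expected_score(elo_a, elo_b):
    """Expected score of A against B on the usual 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def mov_multiplier(point_diff, winner_elo, loser_elo):
    """Margin-of-victory multiplier (constants assumed, see above).

    The log term rewards bigger wins with diminishing returns; the second
    term shrinks the reward when the winner was already the favorite.
    """
    return math.log(abs(point_diff) + 1.0) * (
        2.2 / ((winner_elo - loser_elo) * 0.001 + 2.2)
    )


def update_after_game(winner_elo, loser_elo, point_diff, k=K):
    """Return the new (winner_elo, loser_elo) after a decided game."""
    shift = (
        k
        * mov_multiplier(point_diff, winner_elo, loser_elo)
        * (1.0 - expected_score(winner_elo, loser_elo))
    )
    return winner_elo + shift, loser_elo - shift


if __name__ == "__main__":
    # Favorite wins by 7 versus underdog wins by 7.
    print(update_after_game(1600.0, 1400.0, 7))
    print(update_after_game(1400.0, 1600.0, 7))
```

In effect the multiplier scales k per game: an upset by a wide margin moves the ratings the most, a narrow win by the favorite the least.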


NFL 2014 Predictions & Results

The R program ELO-538.R is on GitHub

2014 Ranking Table

gbdc-r-05-P27

gbdc-r-05-P29

gbdc-r-05-P31

gbdc-r-05-P32


To Do

  1. Exponential decay with more weight for recent games – later in the season
  2. Calculate the rankings from 1940 to present, draw graphs like this from 538

Augmented Cognitive Intelligence


I have been working on this architecture for a couple of years. The idea is to build an AI machine that augments human capabilities. I know IBM has Watson; Google and FB have their own versions that address different domains.

The diagram below is more for my understanding and to clarify the thinking. I will write more as I get time. Hope you all find it useful.

AI-01

Business Users Shouldn’t touch Hadoop even with a 99-foot pole !


Yep, I know, it is a 10-foot pole; and the origin is from “10-foot poles that river boatmen used to pole their boats with”[1]

Back to the main feature: I was reading a piece by Andrew J. Brust at GigaOM arguing that “Hadoop needs a better front-end for business users”

Yikes. This is terrible … I would argue, no, make that insist, that business users be kept as far away as possible from Hadoop (& similar frameworks)

Allow me to elaborate …

  • Business users do need highly interactive analytic dashboards with knobs & dials into our deep machine learning models and sliders onto our AI machines. No doubt.
  • We don’t want to abandon our beloved business users with static-rigid-Newtonian-deterministic artifacts; we want them to have living, (fire) breathing intelligent-inferential-predictive-models

  • But that control & interactivity is into a business analytics beast that has multiple layers, not directly onto a Hadoop or hadoop-like system.

Also, separating the “what” from the “how” by a declarative interface is very important.

You see, analytics has at least four layers viz. Infrastructure, Intelligence, Inference & Interface

I4

  • Hadoop is Infrastructure, Spark is Infrastructure – The “How”
  • Machine Learning algorithms are Intelligence – Again lots of “How”
  • Models are Inference – the “What” Plus some “How”
  • Dashboard is the Interface (usually) – definitely the “How”
  • Interface can be recommendations, financial predictions, ad forecasts or even actual devices that interface to predictive models

  • And business needs knobs & dials at the Inference & Interface layers
    • The Infrastructure then fires up the appropriate framework: Hadoop or Spark or Java or IPython …
  • Digging deeper, Hadoop itself has three layers – none of them operable by a business user, but real workhorses

    • HDFS – the distributed File System

    • MapReduce – the distributed data parallel computation engine

    • HBase – the NoSQL data store
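The layering above can be sketched in a few lines. Everything in this snippet is hypothetical: the class and function names, the engine labels and the row threshold are invented for illustration. The only point is that the business-facing request states the “what” plus a few knobs, while the choice of engine (the “how”) stays behind the interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Request:
    """Interface layer: a declarative 'what' plus business-facing knobs."""
    model: str                                   # e.g. "ad_forecast"
    knobs: Dict[str, float] = field(default_factory=dict)


def choose_engine(rows: int) -> str:
    """Infrastructure layer: decides the 'how'. The threshold is made up."""
    return "local-python" if rows < 1_000_000 else "spark"


def answer(request: Request, rows: int) -> dict:
    """Inference layer: route the request; the caller never names an engine."""
    return {
        "model": request.model,
        "knobs": dict(request.knobs),
        "engine": choose_engine(rows),
    }


if __name__ == "__main__":
    req = Request(model="ad_forecast", knobs={"horizon_days": 7})
    print(answer(req, rows=900))          # small data stays local
    print(answer(req, rows=33_000_000))   # large data goes to the cluster
```

The business user turns the `horizon_days` dial; whether that ends up on a laptop or a cluster is not something they ever have to see.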

Back to Andrew’s points: Hadoop (and its cousins) should remain a tool for the chefs. But diners do need to express their choices and be able to “tweak” the seasonings, portions or even the amount of cooking. A declarative interface (which tells what but not how) comes from the domain-specific menus offered by restaurants that focus on their respective culinary styles, or even a fusion !

Now I am getting hungry ! On my way downstairs (am at the Hilton – NY Fashion District) to my favorite Chipotle, which in fact gives me the declarative freedom without my getting into their kitchen and having to handle the saucepans ;o) It is better that way because I am terrible with cooking and spice measures – I can say “less salt” but not the amount !


[1] http://en.wiktionary.org/wiki/not_touch_something_with_a_ten_foot_pole

[2] Interface from http://img1.mxstatic.com/wallpapers/1bb91493c637d7c5ed6e1cefbef87ec1_large.jpeg

Building a Data Organization that works and works with business


One thing that caught my attention in Netflix’s Neil Hunt’s interview with Gigaom was this:

A Data Organization that works & works with business

Well said. That explains Netflix’s Data Science in a nutshell, and it is something all Data Scientists should emulate !

From a Chief Data Scientist’s perspective, I really like their way of looking at Data Science, viz.:

  • The folks who do Data Science for the whole business
  • The folks who build algorithms &
  • The folks who do data engineering

In fact, I had a blog on this specialization of Data Science skills.

Netflix is putting more weight on actual behavior ! Interesting: we are also seeking similar effects, i.e. differentiating between falling asleep on a couch vs. actually watching a TV show ! It is a hard inference … Netflix has the blocker, I have nothing ;o(

Binge watching … interesting … We are actually working on algorithms to figure that out and change the ad mix. I plan to talk more at the TM Forum Digital Disruption Panel on December 9th !

Finally, the point that the importance of the Netflix recommendation engine is underrated is so true. In many ways the recommendation algorithms and engines are core to many systems.

In fact, we have a reverse recommendation strategy ! We recommend users to ads !