Data Science Engineers – The new breed of Data Scientists ?


While there are lots of interesting discussions about Data Scientists, or the lack thereof, the role of Data Science in Big Data is well understood. I think the need is actually for Data Science Engineers. I had a set of pictures explaining this concept and, interestingly, came across a blog by Hortonworks on the topic of Data Scientists.

[Update May 19, 2013] An informative blog in the Wall Street Journal about Data Scientists by IBM’s Irving Wladawsky-Berger – Data Science is a multidisciplinary evolution from business intelligence & analytics. In addition to having a solid foundation in statistics, math, data engineering and computer science, data scientists must also have domain expertise.

DataScienceTeam

Image

Data Science Engineers 01-01

What says thee ?

Is our Neocortex a Giant Semantic Bloom Filter ? Of Natural Intelligence, Machine Learning & Jeff Hawkins


L’Apéritif:

Image

In a set of four lectures spanning about 3 years, Jeff Hawkins explains how & why big data problems can only be solved by evolutionary-adaptive-continuously-learning models incorporating principles from the workings of the Neocortex.
It does make sense – especially for NLP, NLU & Knowledge Representation. I am a big fan of the Borgs and their coordinated intelligence.

These are my annotated picture-notes …

L’Entrée:

Let me begin at the beginning. The other day I came across 4 very interesting talks by Jeff Hawkins on Biologically Inspired Machine Intelligence.

Call it serendipity, because we have been looking for more effective ways to do Knowledge Representation (KR) & Natural Language Understanding (NLU).

For example, movie names are very easy for humans to understand, but a MaxEnt NER finds them very hard.  Knowledge Representation & Association is even harder !

We are experimenting with a few techniques like word-based tries (i.e. spell-checking sentences word by word), higher-order federated Bloom Filters and n-gram hashing. Planning to incorporate some of Jeff’s ideas …
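A minimal sketch of how word n-gram hashing and a Bloom filter might fit together for spotting known movie titles. This is only an illustration under assumptions: the `BloomFilter` class, the filter size, the SHA-256-derived hash positions and the sample titles are all made up here, and it is not our actual implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting the item with a seed (an assumption,
        # any family of independent hash functions would do).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def word_ngrams(text, n=2):
    """Split text into lower-cased word n-grams (bigrams by default)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical "known titles" filter, fed with the bigrams of each title.
known_titles = BloomFilter()
for title in ["the matrix reloaded", "gone with the wind"]:
    for gram in word_ngrams(title):
        known_titles.add(gram)


def title_score(phrase):
    """Fraction of the phrase's bigrams that the filter has (possibly) seen."""
    grams = word_ngrams(phrase)
    if not grams:
        return 0.0
    return sum(gram in known_titles for gram in grams) / len(grams)
```

A phrase whose bigrams mostly hit the filter is a candidate title; because Bloom filters can return false positives, the score is a hint to be confirmed by a slower exact lookup, not a verdict.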

I digress … Topics for another day … back to Jeff & Machine Intelligence …

Very inspiring, extremely thought provoking talks – as usual the inimitable Jeff Hawkins at his best

  1. Google Tech Talk : Jeff Hawkins, “Building Brains to Understand the World’s Data”
  2. UC Berkeley Graduate Lectures
  3. “Advances in Modeling Neocortex and its impact on Machine Intelligence” by Jeff Hawkins,  Smith Group Lecture presented at the Beckman Institute for Advanced Science & Technology at the University of Illinois at Urbana-Champaign

Le Plat Principal:

The four talks have a lot of depth and are packed. Moreover, Jeff talks very fast – I listened to the talks a few times – at least 3 hrs per one-hour talk. You should listen to them slowly & rewind as required. It takes a few hours to get one’s head around the various ideas.

Let me annotate a few of his slides – those I was able to internalize to some extent:

Focus & premise[3]:

Hawkins-100-02-01

The assertion that many problems can only be solved by incorporating principles from the workings of the Neocortex is interesting.

BTW, it does make sense – especially for NLU & Knowledge Representation.

As Jeff mentions later, the behavior need not be human-like, but the representation, interpretation & “understanding” would be.

Neocortex Architecture[3]:

“Neocortex is just a sheet of cells  2mm thick, the size of a dinner napkin” – Amazing what it can do!

Hawkins-100-03-01

The Six Principal Essentials of Biological Intelligence

The picture says it all.

Hawkins-100-04-01

Learning involves training and adaptive connections

Hawkins-100-05-01

The concept of streaming events & the learning mechanisms

Patterns from complex data streams

Hawkins-100-06-01

The paper “Hierarchical Temporal Memory” has the gory details of Hierarchical Temporal Learning.

Future

Hawkins-100-09-01

Interesting observation: Emotion, the fundamental aspect of being human, is not a requirement for intelligence – reminds us of Spock, of course.

Machine intelligence is not about replicating human behavior or even passing the Turing test. I agree on this – we need the machines to think & do things we cannot do, thus augmenting us. Make us stronger where we are weak !

Le Digestif

What interested me most was the semantic knowledge representation, NLP & NLU. The ability to understand and store concepts, the capacity to generalize as well as the mechanisms of strengthening and weakening connections based on external signals – just beautiful …

Agree that the Sparse Distributed Representation could be the language of all the intelligent machines.

The SDR looks a lot like a giant Bloom Filter

Hawkins-100-10-01

Hawkins-100-11-01

The planes can be considered as rows and a column as the temporal dimension of the semantic mapping (the memory of sequences). This equates to a giant n-dimensional Bloom Filter – a data structure we can grok (pun intended, as Jeff’s product is called Grok!).

The bloom filter analogy, while extremely simplistic, is conceptually congruent, in the sense that “similar values have similar representation”, of course depending on the hash algorithm.
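To make the “similar values have similar representation” point concrete, here is a hedged sketch. It assumes an SDR can be modeled as the set of active bit positions and that a toy encoder hashes each semantic feature to a few bits; the real encoding is presumably learned rather than hashed, and `encode`, `overlap`, the sizes and the feature names are all hypothetical.

```python
import hashlib

SDR_SIZE = 2048          # total number of bits (assumed)
BITS_PER_FEATURE = 8     # active bits contributed by each feature (assumed)


def encode(features):
    """Map a set of semantic features to a set of active bit positions."""
    active = set()
    for feature in features:
        for k in range(BITS_PER_FEATURE):
            digest = hashlib.sha256(f"{feature}/{k}".encode("utf-8")).digest()
            active.add(int.from_bytes(digest[:4], "big") % SDR_SIZE)
    return active


def overlap(a, b):
    """Number of shared active bits: a crude similarity score."""
    return len(a & b)


# Made-up concepts described by made-up features.
cat = encode({"animal", "pet", "furry", "meows"})
dog = encode({"animal", "pet", "furry", "barks"})
car = encode({"machine", "wheels", "engine", "metal"})

print(overlap(cat, dog), overlap(cat, car))
```

Concepts that share features share bits, so the overlap count behaves like a similarity measure; accidental collisions between unrelated concepts are the Bloom-filter-style false positives.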

After listening to the talks and thinking them over, I have a thousand questions in many directions. I will post the answers as we develop this further for our needs. Please send in your insights as comments to this blog. Am sure it will help a few folks !

Hawkins-100-12-01

  1. How do we handle semantic categories ? 
  2. How do we build more sophisticated representations based on spatial patterns ?
  3. What is the hash function that maps a slice of semantic to this giant Bloom Filter ?
  4. How does it handle collision? Corruption ? Clustering for resiliency/self adjusting representation ?
    • Collision might be good, and I think that is what Jeff calls semantic generalization
  5. How does the semantic slice mapping function differentiate between a search & computation to trigger appropriate actions?
    • For example the following two questions require different actions: 
      • “What is the stock price of IBM ?” vs.
      • “What is the volatility as reflected in the beta of IBM for this quarter ?” 
      • The first one is a search while the second has computation …
  6. Is the hash function same for all of us or is it different for each person ?
    • Most probably the function is a learned artifact.
  7. Another interesting vector is the Hierarchy & higher patterns of temporal coalescence/slowness – the high-order capability, tweaking the learning rates across the layers.
    • How can this be modeled with the analytical data structures we have?
    • And what are the mechanics for stable representation of pattern sequences – because with dynamicity and temporality comes the difficulty of snapshots and consistency between them.
    • The unique representation of the same sequence, at a later time in context of the earlier invocation is interesting …
  8. How do we “put a classifier on the top” ?
    • Play with permanence? Probability?
  9. What are the algorithms to prevent runaway prediction?
    • I agree that we could account for rapid state difference vs. slower state; we still will have to encapsulate it in some form of code

Finally, can we build “Amazingly Intelligent Machines?” Yes We can !

And agree with Jeff that “It is essential, for the survival of the species, that we build them” …

The Big Data Convergence


As we scan the concepts, technologies, products and practices in the big data space, a lot of things get muddier.

Neither the progression nor the boundaries are clear. We are still in the descriptive stage in terms of the application of the analytics technologies.

I had a good conversation with Bob Friday yesterday – his question was “What prevents us from answering 80% of the questions via automatic inferences ?” And that is the “Adaptive” stage we need to be in …

I think a diagram is much better than me writing 100,000 words. So here it is :

Image

In many ways, a lot of the underlying technologies are converging.

For example, A(rtificial) I(ntelligence) = NLP + N(atural) L(anguage) U(nderstanding) + ML + K(nowledge) R(epresentation) + Reasoning
Are Amazing Intelligent Machines in the works ?

Big Data State Of The Union


An informative study by TCS on the current state of Big Data “The Emerging Big Returns on Big Data”


Image

Of course, you should download and read the whole report. Some interesting highlights:

  • There’s a polarity in spending on Big Data, with a minority of companies
    spending massive amounts and a larger number spending very little
  • The business functions expecting the greatest ROI on Big Data are not the ones
    you may think – while Sales & Marketing have initiatives, Finance & Logistics are betting on big data for efficiencies & insights
  • The biggest challenges to getting business value from Big Data are as much
    cultural as they are technological
  • Nearly half the data (49%) is unstructured or semi-structured, while 51% is
    structured. The heavy use of unstructured data is remarkable given that
    just a few years ago it was nearly zero in most companies – Enterprises have gone multi-structured !
  • Monitoring how customers use their products to detect product and design
    flaws is seen as a critical application for Big Data

Cheers & Happy Reading …

5 Steps to Pragmatic Data …er… Big Data


It is 2013 & Big Data is big news … Time to revisit my older (Nov’11) blog “Top 10 Steps to A Pragmatic Big Data Pipeline” … Some things have changed but many have remained the same …

5.  Chuck the hype, embrace the concept …

This seems to be the first obvious step for organizations. Everyone from Edd Dumbill (“Big data” is an imprecise term...) to TechCrunch (“Perhaps it’s about the actual functionality of apps vs. the data”) agrees with the concept, but the terms and the marketing hype have hit the proverbial roof. The point is, there are many ponies in this pile & there is tremendous business value (so long as one is willing to discount the hype and think Big Data = All Data) …

I really like Mike Gualtieri’s very insightful definition of Big Data as

… the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers

Big Data 01

4. Don’t implement a Technology, implement THE Big Data pipeline

Think of Big Data in multiple dimensions rather than as a point technology & evolve the pipeline, focussing on all aspects of every stage

Data Science 02

The technologies, the skill sets and the tools are evolving, so are the business requirements.

Chris Taylor addresses this very clearly (“Big Data must not be an elephant riding a bicycle”) – viz. one has to address the entire spectrum to get value …

Simply applying distributed storage and processing (like Hadoop) to extremely large data sets is like putting an elephant on a bicycle .. it just doesn’t make business sense — Chris Taylor

3. Think Hybrid – Big Data Apps, Appliances & Infrastructure

I had addressed this one in my earlier blog (“Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms”)

The moral of the story : Think out-of-the-box & inside-the-box.

Match the impedance of the use cases with appropriate technologies

2. Tell your stories, leveraging smart data, based on crisp business use cases & requirements

Evolve the systems incrementally focussing on the business values that determine the stories to tell, the inferences to derive, the feature sets to influence & the recommendations to make

Augment, not replace the current BI systems

Notice the comma (I am NOT saying “Augment not, Replace”!)

“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. In fact, integration with BI is an interesting challenge for Big Data …

No doubt Hadoop & NoSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also lower the barrier to entry for folks who are familiar with SQL

From a business perspective Patrick Keddy of Iron Mountain has a few excellent suggestions on managing Big Data: 

Big data informs and enhances judgement and intuition, it should not replace them

Opt for progress over perfection

View the data in context

1. Apply the art of Data Science & Smart Data, paying attention to touch points

This still remains my #1. Data Science is the key differentiator resulting in new insights, new products, order of magnitude performance, new customer base et al – “a cohesive narrative from the numbers & statistics”

“Data science is about trying to create a process that allows you to create new ways of thinking about problems that are novel, or you are trying to use data to create or make something.” says D.J. Patil

Smart Data = Big Data + context + inference + declaratively interactive visualization

smartData02

  • Smart Data is (inference) model driven & declaratively interactive
  • For example,
    • The information like Wikipedia is big data; the in-memory representation Watson referred to is smart data
    • Device logs from 1000 good mobile handsets and 1000 not-so-good phones is big data;  a gam or glm over the log data after running through several stages of MapReduce is smart data, because it could give you an insight as to what factors or combination of factors make a good phone a bad phone

Focus not only on the Vs (i.e. Volume, Velocity, Variability & Variety) but also on the Cs (i.e. Connectedness & Context)

The two main Big Data challenges in 2013 would be:

1st : Data integration across silos to get the comprehensive view &

2nd : Matching the real-time velocity of business viz. CEP, sense & respond et al.

For example, I have already seen folks looking outside Hadoop for CEP and near-realtime response

“.. 85% of respondents say the issue is not about the volume of data but the ability to analyze and act on data in real time” says Ryan Hollenbeck quoting a 2012 Cap Gemini study (Italics mine)

Big Data Borgs, Rise of the Big Data Machines & Revenge of the Fallen Algorithms


I have been following the 2013 predictions for Big Data. Naturally, there are lots of interesting predictions. Here are a few that I understand and (sort of) agree with :

What or Who is a Data Scientist ?


DataScientist = Part Hacker + Part Technologist + Part Detective + Part Scientist + Part Business Analyst + Part  Visual Artist

[Update 12/22/12] GigaOm says : Data Science = Data Architecture + Machine Learning + Analytics. Makes sense. I have updated my diagram accordingly
Data Science 02
DataScienceTeam

All the President’s Data Scientists


The Time Election Commemorative Edition has an interesting article on the role of Data Science, “Inside the Secret World of the Data Crunchers Who Helped Obama Win”. A few quick lessons (of course, you should read the full Time article):

[Update 2/14/13] InfoWorld has an interesting take on Big Data Analytics and the Obama Campaign. In addition to Time’s narration of 4 lessons, InfoWorld adds the following:

  • Combined efforts of Analysts & Engineers
  • Implemented in weeks rather than months
  • Built around unconstrained, yet centralized environment (This is important for big data)
    • This enabled the analysts to ask questions irrespective of wherever the data originated from
  • Continuous improvement, with built-in feedback loop

Note : I discuss the 5 Pragmatic Steps for Data …. er… Big Data in another blog
[update 2/28/13] AWS case study “Obama For America” has interesting details

My blog “All the President’s DevOps” on the infrastructure side of this system

1. Elevate Data Science to a 1st class Citizen

  • Campaign manager Jim Messina had promised a totally different, metric-driven kind of campaign in which politics was the goal but political instincts might not be the means. “We are going to measure every single thing in this campaign” … And hired a team of Data Scientists headed by Rayid Ghani
  • Rayid had visited Stanford to recruit budding Data Scientists – I wanted to attend, but couldn’t; am sure they would have also visited other campuses
  • Exactly what that team of dozens of data crunchers was doing, however, was a closely held secret. “They are our nuclear codes,” as the campaign guarded what it believed to be its biggest institutional advantage over Mitt Romney’s campaign: its data.

2. Collect, Unify & Leverage Big Data

  • As I had written in one of my earlier blogs, the spectacular results of Data Science (inference and predictions) come from an effective data pipeline.
  • The Obama campaign has interesting pipelines of big data streams
  • While the 2008 campaign was very successful, the team realized that they had too many databases & “None of them talked to each other.”
  • So over the first 18 months, the campaign merged the information collected from pollsters, fundraisers, field workers and consumer databases as well as social-media and mobile contacts with the main Democratic voter files in the swing states. — Brings tears to the eyes of a data architect!
  • They actually built an awesome data mining infrastructure
  • This “megafile” was the foundation for simulation runs for contributions, “persuadability” analysis and so forth

3. Practice Metric-driven Data Science

  • Don’t be afraid to create bold models, but back them up with reality
  • The Data Scientists developed interesting models & predictions, but tested them with e-mails with different subject lines, monitored results from e-mail and phone campaigns et al.
  • … assumptions were rarely left in place without numbers to back them up

4. Effective Modeling comes from weaving Big Data & Live Data

  • “The analytics team used four streams of polling data to build a detailed picture of voters in key states”
  • The polling and voter-contact data were processed and reprocessed nightly to account for every imaginable scenario.
  • We ran the election 66,000 times every night
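As a rough illustration of what “running the election” thousands of times could look like, here is a minimal Monte Carlo sketch. Everything in it is an assumption for illustration: the state names, electoral votes, win probabilities and the independence between states are made up, and this is not the campaign's actual model.

```python
import random

# Hypothetical inputs: (electoral votes, estimated win probability) per swing state.
SWING_STATES = {
    "State A": (29, 0.55),
    "State B": (18, 0.60),
    "State C": (13, 0.48),
    "State D": (10, 0.52),
}
SAFE_VOTES = 240      # made-up votes assumed already secured
VOTES_TO_WIN = 270


def simulate_once(rng):
    """Simulate one election: flip a weighted coin for every swing state."""
    total = SAFE_VOTES
    for votes, p_win in SWING_STATES.values():
        if rng.random() < p_win:
            total += votes
    return total


def win_probability(num_runs=66_000, seed=2012):
    """Estimate the chance of winning as the share of simulated wins."""
    rng = random.Random(seed)
    wins = sum(simulate_once(rng) >= VOTES_TO_WIN for _ in range(num_runs))
    return wins / num_runs


print(f"Estimated win probability: {win_probability():.3f}")
```

Re-running this nightly with refreshed polling and voter-contact data would show how the estimate moves as individual states shift.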

And finally the article ends with an insightful statement,

In politics, the era of #BigData has arrived !

I rest my case with one more observation on Obama’s Digital Gurus

Inside The Cave report is a good read

Cheers

<k/>


Book Review – In the Plex : How Google Thinks, Works, and Shapes Our Lives


Prelude:

I liked the book a lot; it reads like a thriller – at least to me. I couldn’t put it down and was reading the book late at night, during work days – to the chagrin of the family !

Steven Levy has clearly chronicled Google’s ascent and the tribulations it encountered – internal and external – on the way. What is more interesting is the fact that he has written a set of very crisp & detailed explanations of the innovations that Google brought into the search & advertisement domains.

I agree with Steven that Google is a “clever internet-startup-named-after-a-100-digit-number turned into a corporate phenomenon”. It is very interesting to read about its agony on the way to the IPO (and the ecstasy of the investors!) If Google had its way, it would have added a requirement of a min SAT score (and a Stanford PhD – at least an MMDS Certificate) for buying its shares ! Am forced to quote Scott Reeves (Forbes Aug 2004) on Google’s targeted price of $108/$135 “Only those who were dropped on their head at birth [will] plunk down that kind of cash for an IPO” – ouch ! (I myself was ready for around $50)

Google – A sum of its Obsessions

Search (Of course!)

  • PageRank, of course, refers to Larry Page’s Ranking Algorithm ! The PageRank estimates the importance of a page by the web pages that link to it. “We convert the entire web into a big equation with several hundred million variables”
  • The concept of signals – viz factors like terms, capitalization, font size, position et al – as traits added with PageRank is the secret sauce that made Google’s search very effective.
  • The search engines get major and minor rewrites “like changing the components of a flying plane – without the passengers knowing about it, but the ride becomes more comfortable and they get there faster “ not a perfect analogy but an effective simile!
  • The engineers fret about any queries that do not get answered on the first page – in many ways, clicking next page in a search result is a failure of the brilliant engineers behind the search engine. You have to read about the query “Audrey Fino” that vexed Amit Singhal, Google’s chief of search. The search showed lots of Audrey Hepburn and that bothered Amit – “There’s a person somewhere named Audrey Fino and we didn’t have the smarts in the system to know this” and the remedy was of course – to quote Steven – a multi-year name detection and name classifier “algorithmic therapy” with a dash of “bigram breakage” added to taste !
  • Rokc is rock unless it has little in front of it (when it becomes the capital of a state) or if preceded by Noah becomes ark ! Another such query was “Eika Kerzen” which requires translation (to German in this case) to get to the right search result.
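The “big equation” idea behind PageRank can be sketched with a few lines of power iteration. This is only a textbook-style illustration under assumptions (a damping factor of 0.85, a fixed number of iterations and a four-page toy web), not Google's production algorithm, and it ignores the additional signals mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from who links to whom (power iteration).

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Page c comes out on top because three pages link to it, and a ranks second because it is endorsed by the most important page.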

Algorithmic purity & ubiquitous

  • Google is an algorithmic company driven by computer science ! We can see that everywhere – successes and failures. For example, the amount Google set out to raise at its IPO ($2,718,281,828) echoes the digits of Napier’s constant e ! During the bidding for patents, Google was bidding numbers like pi for Nortel’s patents
  • Even the Google ad sales people consider themselves mediators between Madison Avenue and algorithms – only Google can say both the words in the same sentence, make it sensible and in the process create an industry where it makes billions of dollars – as one SEO chief puts it “It is not that we want to put all our eggs in one basket,  but there is only one basket in the industry”
  • The great lengths the Google team would go to in order to make search relevant are exemplified by the “running shoes gnome sculpture”. The engineers believed in algorithmic purity – and before the launch of the Froogle product search, “running shoes” would show a “garden gnome sculpture that happened to wear sneakers”. The team could not ship a product that failed to differentiate between lawn art and footwear. It seems that within a couple of days the offending link disappeared ! And the team learned that one of their teammates had gone ahead and bought the one-of-a-kind sculpture, thus taking it off the web site !   “The algorithm started showing the right results, … and we launched!”
  • Search algorithmics sometimes had very strange effects – like showing the now defunct main office of Bell Telephone for the query “weather.com Philadelphia” – the reason being that the telephone company used to tell the weather over the phone and this factoid was unearthed by the search algorithm !
  • It is interesting to read how Google re-invented the bidding system as a “Vickrey second-price auction” system because the engineer (Eric Veach) wanted to avoid “bid shading”. In the end, like anything else that Google touches, they created an innovative system that combined a few factors like bidding and ad-positioning, adding competition & customer satisfaction, and creating a rolling revenue stream in the order of billions of dollars for Google  – all in all a nifty feat!
  • The concept of compressing data to understand it was a brilliant stroke – the Google project called Phil (Probabilistic Hierarchical Inferential Learner) resulted in understanding the essence of web pages and … a service that contextually matched ads with a web page’s content, called “Google content-targeted advertising”, which later became AdSense (after acquiring the company Applied Semantics!)
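The second-price idea in the bidding bullet above can be sketched as follows: each winner pays just enough to beat the next bidder rather than their own bid, which removes the incentive to shade bids. This is a simplified illustration that ranks purely by bid; the `increment` and `reserve` values and the bidder names are made up, and the system described in the book combined additional factors.

```python
from decimal import Decimal


def second_price_auction(bids, slots, increment=Decimal("0.01"),
                         reserve=Decimal("0.05")):
    """Assign ad slots by bid; each winner pays the next bid plus one increment.

    bids maps bidder name to a Decimal bid. Returns [(bidder, price), ...]
    in slot order. Bids below the (hypothetical) reserve are ignored.
    """
    ranked = sorted(
        ((bidder, bid) for bidder, bid in bids.items() if bid >= reserve),
        key=lambda pair: pair[1],
        reverse=True,
    )
    results = []
    for i, (bidder, bid) in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            price = ranked[i + 1][1] + increment   # just beat the next bidder
        else:
            price = reserve                         # nobody below: pay the floor
        results.append((bidder, min(bid, price)))  # never pay more than the bid
    return results


bids = {"alice": Decimal("2.00"), "bob": Decimal("1.50"), "carol": Decimal("0.40")}
print(second_price_auction(bids, slots=2))
```

Note that alice pays 1.51 whether she bids 2.00 or 5.00, so overbidding costs her nothing extra and underbidding only risks losing the slot.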

Scale

  • The success of their algorithms (which gave eigenvectors some credence) and the change of scale that came with it were what made Google Google ! As Luiz Barroso observed, “There are programs that do not run on anything smaller than a 1000 machines,  which means you are looking at a datacenter as a computer”
  • Google affects whatever it touches in unpredictable ways – for example, Google’s racks so maxed out the power & cooling at Exodus that Exodus drove an 18-wheeler up to the colo, punched 3 holes in the wall and pumped cold air into Google’s cage through PVC pipes!

The movers

  •  As I was reading the book, there were a few people I knew who played prominent roles in Google – was wondering when Hal Varian would show up – he did (P.116) and stayed relevant in a lot of pages with his team of “econometricians”, a cross between statisticians and economists !
  • Was wondering when Sundar Pichai would show up; he did (P.205) and remained relevant as Steven eloquently narrated the advent of Google Chrome and the JavaScript engine V8 … leveraging Google’s insistence on speed …
  • Steven has interviewed most of, if not all, the technology leaders and we get to meet them at the relevant topics.

Trivia:

  • I think building 40 is called Building 0 or Nullplex. It is interesting as I work next door – in the only non-Google building among the sea of bicycle trotting Googlers !
  • Page’s Law according to Brin – “Every 18 months, software becomes twice as slow” !
  • Danger, which Andy Rubin cofounded, moved into the Palo Alto office when Google moved out of it in 1999 ! Eventually he left Danger and started Android …
  • Google always was structured like a PhD program dorm in a university – as Andy Rubin puts it “There is an implied grading on a 4.0 scale of the questions during interview and anybody less than 3.0 is rejected; the GPS (Google Product Strategy) meetings are run like PhD defenses”
  • As told by Alan Eustace to Andy Rubin “Google’s brain is like a baby’s – an omnivorous sponge that was always getting smarter from the information it soaked up”!
  •  “We want Google to be that third half of your brain” – Sergey, P.386
  • “It’s quite amazing how the horizon of impossibility is drifting these days” – Thrun
  • The locus and trajectory of Google – “put Google in the driver’s seat on many decisions – large and small – that people make over the course of a day and their lives” ! [P.68]

Epilogue:

  • In this review I touched only a minimal set of interesting points (interesting to me!). The book has a lot of good reading, from Google’s China syndrome to how the Googlers shaped the last presidential election and later worked for the Obama administration to controversies like Google Street View and the struggle with digitizing books.
  • One important development that Steven couldn’t include, due to the timing of the release of the book, was Google+. But don’t despair – Steven has written that part of the story as an article in Wired ! Best to read it after finishing the book.
  • Readwriteweb has an article on the data scientist behind Google+
  • And Steven’s blog on the Motorola Mobility purchase is another good read, again an important step by Google.
  • I just now saw a write-up by InfoWorld on Google’s 5 biggest hits and misses.
  • Next book on my reading list: “I’m Feeling Lucky: The Confessions of Google Employee Number 59” by Douglas Edwards; it is on hold, 3 of 7, from the San Jose public library.