The other day I was pondering the role of the Data Scientist & model deployment at scale, as we are developing our data science layers consisting of Hadoop, HBase & Apache Spark. Interestingly, earlier today I came across two artifacts – a talk by Cloudera’s @josh_wills and a presentation by (again) Cloudera’s Ian Buss.
The talks make a lot of sense independently, but offer a lot more insight collectively! The context, of course, is the exposition of the curious case of data scientists as devops. Data products need an evolving data science layer …
It is well worth your time to follow the links above and listen to Josh as well as go thru Ian’s slides. Let me highlight some of the points that I was able to internalize …
Let me start with one picture that “rules them all” & summarizes the synergy: the “Shift In Perspective” slide from Josh & the Spark slide from Ian.
The concept of Data Scientist devops is very relevant. It extends the curious case of the Data Scientist profession to the next level.
Data products live & breathe in the wild; they cannot be developed and maintained with a static set of data. Developing an R model and then throwing it over the wall for a developer to translate won’t work. Secondly, we need models that can learn & evolve in their parameter space.
I agree with the current wisdom that Apache Spark is a good framework that spans the reason, model & deploy stages of data.
Other interesting insights from Josh’s talk.
The virtue of being really smart is massively overrated; the virtue of being able to learn faster is massively underrated
Well said Josh.
P.S: Couldn’t find the video of Ian’s talk at the Data Science London meetup. Should be an interesting talk to watch …
I came across an interesting talk by Google’s Peter Norvig at NASA.
Of course, you should listen to the talk – let me blog about a couple of points that are of interest to me:
Algorithms that get better with Data
Peter had two good points:
- Algorithms behave differently as they churn thru more data. For example, in the figure, the blue algorithm was better with a training dataset of a million. If one had stopped at that scale, one would be tempted to optimize that algorithm for better performance
- But as the scale increased, the purple algorithm started showing promise – in fact, the blue one starts deteriorating at larger scales. The old adage “don’t do premature optimization” is true here as well.
- In general, Google prefers algorithms that get better with data. Not all algorithms are like that, but Google likes to go after the ones with this type of performance characteristic.
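The point about algorithms behaving differently at scale can be sketched with a toy experiment (the models and data here are hypothetical and purely illustrative, not from Peter’s talk): a simple global-mean predictor plateaus almost immediately, while a 1-nearest-neighbor model keeps improving as the training set grows — exactly the kind of “gets better with data” behavior described above.

```python
import random
import statistics

random.seed(42)

def make_data(n):
    # noisy samples of y = x^2 on [0, 1]
    xs = [random.random() for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.05) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    return statistics.mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def fit_mean(xs, ys):
    # "simple" model: always predict the global mean, ignoring x
    m = statistics.mean(ys)
    return lambda x: m

def fit_1nn(xs, ys):
    # "complex" model: predict the y of the nearest training x
    pairs = sorted(zip(xs, ys))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

test_x, test_y = make_data(1000)
for n in (5, 50, 500):
    xs, ys = make_data(n)
    print(n,
          round(mse(fit_mean(xs, ys), test_x, test_y), 4),
          round(mse(fit_1nn(xs, ys), test_x, test_y), 4))
```

The mean model’s error barely moves between n=5 and n=500, while the nearest-neighbor error keeps dropping — so judging the two at small n would pick the wrong horse for web scale.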
There is no serendipity in Google Search or Google Translate
- There is no serendipity in search – it is just rehashing. It is good for finding things, but not at all useful for understanding, interpolation & ultimately inference. I think Intelligent Search is an oxymoron ;o)
- Same with Google Translate. Google Translate takes all its cues from the web – it wouldn’t help us communicate with either the non-human inhabitants of this planet or any life form from other planets/galaxies.
- In that sense, I am a little disappointed with Google’s Translation Engines. OTOH, I have only a minuscule view of the work at Google.
The future of human-machine & Augmented Cognition
And, don’t belong to the B-Ark !
Data Science & the profession of a Data Scientist is being debated, rationalized, defined and refactored … I think the domain & the profession are maturing and our understanding of the Mythical Data Scientist is getting more pragmatic.
Now to the highlights:
1. Data Scientist is multi-faceted & contextual
- Two points – it requires a multitude of skills & different skill sets in different situations; and it definitely is a team effort.
- This tweet sums it all up
- Sometimes a Data Scientist has to tell a good business story to make an impact; other times the algorithm wins the day
- Harlan in his blog identifies four combinations – Data Business Person, Data Creative, Data Engineer & Data Researcher
- I don’t fully agree with the diagram – it shows a lot less programming & a little more math than I would expect; math is usually built into the ML algorithms, and the implementation is embedded in math libraries developed by the optimization specialists. A Data Scientist shouldn’t be twiddling with the math libraries
- I had proposed the idea of a Data Science Engineer last year with similar thoughts; and elaborated more at “Who or what is a Data Scientist?“
- The BAH Field Guide suggests the following mix:
- I would prefer to see more ML than M. ML is the higher form of applied M and also includes Statistics
- Domain Expertise and the ability to identify the correct problems are very important skills of a Data Scientist, says John Forman.
- Or as Rachel Schutt at Columbia quotes:
- Josh Wills (Cloudera)
Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician
- Will Cukierski (Kaggle) retorts
Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer
2. The Data Scientist team should be building data products
3. To tell the data story effectively, the supporting cast is essential
- As Vishal puts it in his blog,
- Data must be there & processable – the story definitely depends on the data
- Processes & buy-in from management – many times, it is not the inference that is the bottleneck but the business processes that need to be changed to implement the inferences & insights
- As the BAH Field Guide puts it:
4. Pay attention to how the Data Science team is organized
5. Data Science is a continuum of Sophistication & Maturity – a marathon rather than a sprint
- I am sure organizations understand this intuitively, but many times the understanding is not reflected in their actions.
- Simply Put:
- Descriptive = What Happened
- Reactive = Take corrective Actions for what happened
- Proactive = Take actions based on fixed predictions
- Adaptive = Dynamic actions based on learning Predictive Models, embedded business rules and augmented cognition
- Prescriptive = Actionable inferences based on Data Science Models
- Jeff Bertolucci has a quick blog on the Descriptive, Predictive & Prescriptive Analytics.
- Michael Wu, Chief Scientist at Lithium has a series of blogs on this topic
Neither Jeff nor Michael talks about adaptiveness. For example, recommendation systems (like collaborative filtering) constantly incorporate new data and “tweak” the running models
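As a sketch of that adaptiveness (the users, items and numbers below are hypothetical, not from either blog), here is a minimal online matrix-factorization recommender in the collaborative-filtering family: each incoming rating nudges the running model with a single SGD step, instead of waiting for a batch retrain.

```python
import random

random.seed(0)
K = 4            # latent dimensions
LR, REG = 0.05, 0.02

# factor vectors, created lazily as new users/items appear in the stream
user_f, item_f = {}, {}

def vec():
    return [random.uniform(-0.1, 0.1) for _ in range(K)]

def predict(u, i):
    pu = user_f.setdefault(u, vec())
    qi = item_f.setdefault(i, vec())
    return sum(a * b for a, b in zip(pu, qi))

def update(u, i, rating):
    """One online SGD step: fold a fresh rating into the running model."""
    err = rating - predict(u, i)
    pu, qi = user_f[u], item_f[i]
    for k in range(K):
        pu[k], qi[k] = (pu[k] + LR * (err * qi[k] - REG * pu[k]),
                        qi[k] + LR * (err * pu[k] - REG * qi[k]))

# ratings stream in as they arrive; no batch retrain needed
events = [("alice", "matrix", 5.0), ("bob", "matrix", 4.0),
          ("alice", "clue", 1.0)] * 300
for u, i, r in events:
    update(u, i, r)

print(round(predict("alice", "matrix"), 1))  # close to the observed 5.0
```

Production systems add tricks (bias terms, decay of stale factors, periodic full retrains), but the core idea — the model “tweaks” itself as data flows — is just this update loop.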
Let me stop here, I think the blog is getting long already …
- [Update 11/28/13] Notes from the blog by Jon, “Data Driven Disruption at Shutterstock”, on what a data products company is
- Data is your product, regardless of what you sell
- Data is your lens into your business – Jon echoes Peter’s insights viz. invest in data access; feel the pulse of the business & iterate
- Data creates your growth
- Back to the main feature, Peter’s talk
- A very insightful & informative talk by Peter Skomoroch of LinkedIn via Zipfian Academy
- It is short & succinct, only 37 minutes. I urge all to watch it
- The slides of the talk “Developing Data Products” are at slideshare
- Quick Notes:
- A Data Product understands the world through inferential probabilistic models built on data
- So collecting right data through “thoughtful” data design is very important
- The data determines & precedes the feature set & the intelligence of your app
- LinkedIn is a prime example – as they get more data, the app has become more intelligent, intuitive and ultimately more useful
- Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments – customer segmentation & stratification is not just for retail !
- While more data is good (see the “Unreasonable Effectiveness of Data” Distinguished Lecture by Peter Norvig), for complex models a deep understanding of the models and feature engineering would eventually be necessary (beyond the “black box”)
- Data products about people are usually complex, in terms of both the models and the data
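Peter’s point that a data product understands the world through inferential probabilistic models can be illustrated with the simplest such model (a hypothetical click-through-rate example, not from the talk): a Beta-Binomial posterior that sharpens as impressions stream in.

```python
# Estimate a click-through rate: Beta(a, b) prior, conjugate updates.
a, b = 1.0, 1.0            # uniform prior over the CTR

def observe(clicked):
    """Fold one impression into the posterior (conjugate update)."""
    global a, b
    if clicked:
        a += 1
    else:
        b += 1

def mean_ctr():
    return a / (a + b)

# simulate 100 impressions, of which every 5th one is a click (20 clicks)
for i in range(100):
    observe(i % 5 == 0)

print(round(mean_ctr(), 3))  # posterior mean (1+20)/(2+100) → 0.206
```

With 5 impressions the estimate is mostly prior; with 100,000 it is mostly data — the product’s “understanding” is literally a distribution that tightens as evidence accumulates.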
[Update 12/13/13] Remember, a data product usually has the three layers – Interface, Inference & Intelligence.
5. Don’t implement a technology infrastructure but the end-to-end pipeline a.k.a. Bytes To Business
Simple reason: business doesn’t care about a shiny infrastructure, but about capabilities they can take to market …
4. Think Business Relevance and agility from multiple points of view
Aggregate Even Bigger Datasets, Scenarios and Use Cases
- Be flexible, tell your stories, leveraging smart data, based on ever changing crisp business use cases & requirements
3. Big Data cuts across enterprise silos – facilitate organization change and adoption
- Data always has been siloed, with each function having its own datasets – transactional as well as data marts
- Big Data, by definition, is heterogeneous & multi-schema
- Data refresh, source of truth, organizational politics and even fear come into the picture. Deal with them in a positive way
2. Build Data Products
The Extremely Large Database/XLDB 2013 Conference & the invited Workshop at Stanford had lots of good speakers and extremely interesting view points. I was able to attend and participate this year.
Previously I wrote two blogs on presentations by Google’s Jeff Dean and NEA’s Greg Papadopoulos
Here are the highlights from the presentations. Of course, you should read thru all the XLDB 2013 presentation slides.