The other day I was pondering the role of the Data Scientist in model deployment at scale, as we are building our data science layers on Hadoop, HBase & Apache Spark. Interestingly, earlier today I came across two artifacts: a talk by Cloudera’s @josh_wills and a presentation by (again) Cloudera’s Ian Buss.
The talks make a lot of sense independently, but offer a lot more insight collectively! The context, of course, is the curious case of data scientists as devops. Data products need an evolving data science layer …
- Josh talked about Building a Production Machine Learning Infrastructure, as well as the Data Scientist in the lab vs. the Data Scientist in an operational setting; his talk was part of Cerner’s Tech Talk series
- Ian Buss talked about why Apache Spark is a hit with Data Scientists: it lets them span the Investigative & Operational aspects of their work; his talk was at the Data Science London Meetup
It is well worth your time to follow the links above, listen to Josh and go through Ian’s slides. Let me highlight some of the points that I was able to internalize …
Let me start with one picture that “rules them all” & summarizes the synergy: the “Shift In Perspective” from Josh & the Spark slide from Ian.
The concept of Data Scientist devops is very relevant. It extends the curious case of the Data Scientist profession to the next level.
Data products live & breathe in the wild; they cannot be developed and maintained with a static set of data. Developing an R model and then throwing it over the wall for a developer to translate won’t work. Secondly, we need models that can learn & evolve in their parameter space as new data arrives.
I agree with the current wisdom that Apache Spark is a good framework that spans the reason, model & deploy stages of data.
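To make the idea of a model that keeps evolving in its parameter space a bit more concrete, here is a minimal sketch in plain Python. It is not from either talk; the class name `OnlineLogisticRegression`, the learning rate and the toy data stream are all my own assumptions, chosen only to illustrate updating a model one observation at a time instead of refitting on a static data set.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical example: logistic regression updated one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # Probability that the label is 1 for feature vector x.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # One stochastic gradient step on the log loss for a single example.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Toy stream (made up): positive feature means label 1, negative means label 0.
model = OnlineLogisticRegression(n_features=1)
stream = [([1.0], 1), ([-1.0], 0)] * 200
for features, label in stream:
    model.update(features, label)
```

The parameters move a little with every observation, so the same object can be trained in the lab and keep learning in an operational setting. A real deployment would also need logging, monitoring and a way to roll back.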
Other interesting insights from Josh’s talk:
- “Do the simplest thing that could possibly work” and then build complex models as needed
- Log everything. Josh mentions an excellent blog post by Jay Kreps on everything one wants to know about logs & was afraid to ask!
- If you have meaningful problems to solve & an environment for people to iterate on these problems, data scientists will find you!
“The virtues of being really smart is massively overrated; the virtues of being able to learn faster is massively underrated”
Well said, Josh.
P.S.: I couldn’t find the video of Ian’s talk at the Data Science London meetup. It should be an interesting talk to watch …