I came across a good diagram depicting the Big Data Eco System companies. I wanted to overlay the products as well as the Data management/Data Science pipeline viz Collect-Store-…. Here is my diagram. Let me know how I can improve it and what it is missing.
I have been following the 2013 predictions for Big Data. Naturally, lots of interesting predictions. Here are a few that I understand and (sort of) agree with:
- Big Data Borgs & Rise of the Big Data Machines: Big Data apps & Learning Machines will rule, says Information Week’s Jeff Bertolucci
- The era of Big Data automation w/ purpose-built BigData robots is upon us – predicts Derrick Harris @gigaom
- [Update 1] “… New specialized ARM-based servers to do specialty analytics. Advantage being more performance in smaller footprint, with reduced power requirements” says David Cappuccio
- Smart Big Data Appliances can also reduce operational complexity given “… for every 25% increase in functionality of a system, there is 100% increase in complexity” again from Gartner’s David Cappuccio
- The Revenge of the Algorithms : Big Data Algorithm Wars will begin, says Forrester’s Mike Gualtieri
- I agree with Mike that Big Data = all data
- Mike’s Definition of Big Data is very insightful & right on the dot
… the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers
- Also Mike is right in saying that Real-Time architectures will be prominent. In my discussions with companies, this need is already at the forefront. A well-crafted analytic infrastructure using Hadoop has a real-time least count of ~15 min (e.g. Facebook’s infrastructure)
- Big Data = [Business + Technology] [Requirements + Users] : Big Data will move out of sandbox with crisp business and technology requirements says John Bantleman @wired
- [Update 1] The Era of Smart Data : Actually the concept of Smart Data = Big Data + context + inference + declaratively interactive visualization has been around for a couple of years
- IBM’s James Kobielus & colleagues are big on cross-scale architectures & cross-application deployments. I agree.
- “… big data is about sensing, algorithmic discovery and gaining deeper insight through data” says Ed Dumbill in Forbes. Well said.
- Ed continues to say that it is “Essentially, the emergence of a global digital nervous system” …
- I am neutral about it … Feedback loop is good, interlinked data is good, but taking Big Data as the digital nervous system … well that makes me nervous ;o)
- [Update 1] The era of Pragmatic Big Data : Companies have to decouple the marketing hype from the facts on the ground. Do what works for you, not what the hype says. The Top 10 Steps for a Pragmatic Big Data remain almost the same as when I wrote the blog a year ago ! I have an update “5 Steps To Pragmatic Data …er… Big Data” … What says thee ?
- In many ways Data Scientist is an elusive role. As we internalize the application of big data, the role and the skill sets would evolve …
- The two important questions are – “What would a Data Scientist do ?” & “What tools/skillsets does a Data Scientist have in one’s back pocket ?”
- Forbes article “The Where, Who, and Why of Data Scientists” is an interesting read
- [Update 1] Irving Wladawsky-Berger talks about Data Scientists having hybrid set of IT skills. Interesting blog.
- [Update 2] To those who came here directly, a forward pointer to my short blog on Data Science Engineers. To others, this will put you in a loop ;o)
- [Update 3] Defining & Differentiating the role of a Data Scientist by Doug Laney is interesting
- Defying the conventional wisdom of Data Scientist = PhD in Statistics, I think the Data Scientist role is much broader, but with focus on Models, Algorithms, Visualization, Prediction & Exploration of Big Data, of course.
- Here is my quick take – an equation and a picture to go with it !
DataScientist = Part Hacker + Part Technologist + Part Detective + Part Scientist + Part Business Analyst + Part Visual Artist
[Update 12/22/12] GigaOm says : Data Science = Data Architecture + Machine Learning + Analytics. Makes sense. I have updated my diagram accordingly
Two things happened today that made me ask this question – and I am not offering any serious answer, yet !
- First I had a quick chat with folks at MongoDB, which got me thinking about MapReduce in Mongo and where it could go. MongoDB also has the new declarative aggregation framework.
- My thesis is that, while the MongoDB aggregation framework is currently JSON semantics + $ keywords, it could come to look a lot like a functional programming language – with higher-order declarative functions like map/reduce, discriminated unions (like F#) and currying.
- And later in the day I read Edd’s blog “5 Big Data Predictions”, also in Forbes. (While both are the same blog, there might be interesting comments in each)
- Lots of interesting observations from Edd. He is predicting better programming language support, but maybe we are looking at it the wrong way – what we need is better stored procedure support in the data layer. It could also enable the next point Edd was talking about – streaming data processing ! Where better to have that feature than at the data layer ?
- Would we be able to write a social science data platform using the MongoDB aggregators ? Would MongoDB mapReduce fit the bill now ? If not, what would it take to make it so ?
- There are two obvious paths – a connector to an application artifact (for example the Hadoop connector) or embedding the map/reduce in the data layer. Both have their advantages and disadvantages. With the connector the mapReduce can scale orthogonally, but with the embedded feature one can achieve real-time processing (within limits). Maybe this is the time for an application data store !
- Would datastores like MongoDB gain features like Twitter Storm, real-time map reduce, hierarchical iterative functional aggregators and so forth ?
- GreenPlum’s Chorus is interesting – Can NOSQL datastores gain some of the relevant capabilities that Chorus has?
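The functional-programming thesis above can be sketched without MongoDB at all: the $-keyword pipeline stages behave like curried higher-order functions over lists of documents. A minimal Python sketch – the `match`/`group` operators here only mimic MongoDB’s `$match`/`$group` semantics, and all names and data are illustrative:

```python
from functools import reduce, partial

# Curried pipeline stages mimicking MongoDB's $match and $group.
def match(predicate, docs):
    return [d for d in docs if predicate(d)]

def group(key, agg, docs):
    buckets = {}
    for d in docs:
        buckets.setdefault(d[key], []).append(d)
    return [{"_id": k, "value": agg(v)} for k, v in buckets.items()]

def aggregate(docs, *stages):
    # Fold the documents through each stage, left to right.
    return reduce(lambda acc, stage: stage(acc), stages, docs)

docs = [
    {"type": "a", "qty": 2},
    {"type": "b", "qty": 5},
    {"type": "a", "qty": 3},
]

# Currying via functools.partial turns each stage into docs -> docs.
result = aggregate(
    docs,
    partial(match, lambda d: d["qty"] > 1),
    partial(group, "type", lambda v: sum(d["qty"] for d in v)),
)
# result: [{"_id": "a", "value": 5}, {"_id": "b", "value": 5}]
```

The point of the sketch is that once stages are plain `docs -> docs` functions, declarative composition (and even currying, as in F#) falls out for free.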
Finally, the beginning as the end,
- Is Hadoop the new stored procedure, or would the new stored procedures look like Hadoop ?
- Are Data and Application becoming inseparable at scale ?
- What says thee?
As you know Big Data is capturing lots of press time. Which is good, but what does it mean to the person in the trenches ? Some thoughts … as a Top 10 List :
[update 11/25/11 : Copy of my guest lecture for Ph.D students at the Naval Post Graduate School The Art Of Big Data is at Slideshare]
10. Think of the data pipeline in multiple dimensions rather than as a point technology & evolve the pipeline with focus on all the aspects of the stages
- While technologies are interesting, they do not work in isolation and neither should you think of them that way
- Dimension 1 : Big Data (I had touched upon this in my earlier blog “What is Big Data anyway”). One should not only look at the Volume-Velocity-Variety-Variability dimensions but also at Connectedness & Context.
- Dimension 2 : Stages – The degrees of separation as in collect, store, transform, model/reason & infer stages
- Dimension 3 : Technology – This is the SQL vs. NOSQL, MapReduce vs. Dryad, BI vs. other forms discussion, et al
- I have captured the dimensions in the picture. Did I succeed ? Let me know
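The stages in Dimension 2 compose naturally, which is part of why they should be evolved together rather than as point technologies. A toy Python sketch of the idea – the stage functions are hypothetical placeholders and the data and threshold are invented:

```python
from functools import reduce

# The five stages from Dimension 2, as placeholder functions.
# Each stage takes the output of the previous one.
def collect(source):
    return [{"user": u, "clicks": c} for u, c in source]

def store(records):
    # in a real pipeline this would write to HDFS / a datastore
    return list(records)

def transform(records):
    return [r for r in records if r["clicks"] > 0]

def model(records):
    total = sum(r["clicks"] for r in records)
    return {"avg_clicks": total / len(records)}

def infer(summary):
    return "engaged" if summary["avg_clicks"] > 2 else "passive"

def pipeline(*stages):
    # Compose stages left to right into a single callable.
    return lambda x: reduce(lambda acc, s: s(acc), stages, x)

run = pipeline(collect, store, transform, model, infer)
print(run([("a", 5), ("b", 0), ("c", 4)]))  # prints "engaged"
```

Swapping one stage’s implementation (say, the store) without touching the others is exactly the kind of evolution the multi-dimensional view is meant to enable.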
9. Evolve incrementally focussing on the business values – stories to tell, inferences to derive, feature sets to influence & recommendations to make
Don’t get into the technologies & pipeline until there are valid business cases. The use cases are not hard to find, but they won’t come if you are caught up in the hype and forget to do the homework and due diligence …
8. Augment, not replace the current BI systems
Notice the comma (I am NOT saying “Augment not, Replace”!)
“Replace Teradata with Hadoop” is not a valid use case, given the current state of the technologies. No doubt Hadoop & NOSQL can add a lot of value, but make the case for co-existence, leveraging currently installed technologies & skill sets. Products like Hive also minimize the barrier to entry for folks who are familiar with SQL
7. Match the impedance of the use case with the technologies
The stack in my diagram earlier is not required for all cases:
- for example, if you want to leverage big data for Product Metrics from logs in Splunk, you might only need a modest Hadoop infrastructure, plus an interface to an existing dashboard, plus Hive for analysts who want to perform analytics
- But if you want Behavioral Analytics with A/B testing at a 10-min latency, a full-fledged Big Data infrastructure with, say, Hadoop, HDFS, HBase plus some modeling interfaces would be appropriate
- I had written an earlier blog about the Hadoop infrastructure as a function of the degrees of separation from the analytics end point
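The “modest Hadoop infrastructure” case above is essentially a streaming map/reduce over log lines. A sketch of what such a mapper/reducer pair might look like, with the shuffle/sort phase simulated in-process – the log format and the click metric are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Hadoop-streaming-style mapper: one log line -> (product, 1) pairs.
def mapper(line):
    # assumed log format: "timestamp product event"
    _, product, event = line.split()
    if event == "click":
        yield (product, 1)

# Reducer: a product and its grouped counts -> (product, total).
def reducer(product, counts):
    return (product, sum(counts))

def run_job(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return [reducer(k, [v for _, v in g])
            for k, g in groupby(pairs, key=itemgetter(0))]

logs = [
    "t1 widget click",
    "t2 gadget view",
    "t3 widget click",
]
print(run_job(logs))  # prints [('widget', 2)]
```

On a real cluster the same mapper/reducer bodies would run under Hadoop streaming, with the framework supplying the shuffle; the point is how little machinery the simple metrics use case actually needs.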
6. Don’t be afraid to jump the chasm when the time is appropriate
Big Data systems have a critical mass at each stage – that means lots of storage, or maybe a few fast machines for analytics, depending on the proposed project. If you have done your homework from a business and technology perspective, and have proven your chops with effective projects on a modest budget, this would be a good time to make your move for a higher budget. And when the time is right, be ready to get the support for a dramatic increase & make the move …
5. Trust But Verify
True for working with teenagers, arms treaties between superpowers, a card game, and, closer to our discussion, Big Data Analytics. In fact, one of the core competencies of a Data Scientist is a healthy dose of skepticism – said John Rauser [here & here]. I would add that as you delegate more and more inferences to a big data infrastructure across the stages, make sure there are checks and balances – independent verification of some of the stuff the big data is telling you.
Another side note along the same line is oscillation – as the feedback rate, volume and velocity increase, there is also a tendency to overreact. Don’t equate the feedback velocity to the response velocity – for example, don’t change your product feature set based on high-velocity big data product metrics at a faster rate than the users can consume. Have a healthy respect for the cycles involved. For example, I came across an article that talks about fast & slow big data – interesting. OTOH, be ready to make dramatic changes when you get faster feedback that indicates things are not working well, for whatever reason.
4. Morph from Reactive to Predictive & Adaptive Analytics, thus simplifying and leveraging the power of Big Data
As I was writing this blog, I came across Vinod Khosla’s speech at the Nasscom meeting. A must read – here & here. His #1 and #2 in the ‘cool dozen’ were about Big Data! The ability to infer the essentials from an onslaught of data is in fact the core of a big data infrastructure. Always make sure you can make a few fundamental, succinct inferences that matter out of your infrastructure. In short, deliver “actionable” …
3. Pay attention to the How & the Who
Edd wrote about this in Google+. Traditional IT builds the infrastructure for the Collect and Store stages in a Big Data Pipeline. It also builds and maintains the infrastructure for analytics processing, like Hadoop, and the visualization layer, like Tableau. But the realm of Analyze, Model, Reason and the rest requires a business view, which a Data Analyst or a Data Scientist would provide. Pontificating further, it makes sense for IT to move in this direction by providing a ‘landing zone’ for the business-savvy Data Scientists & Analysts and thus lead the new way of thinking about computing, computing resources and talent …
2. Create a Big Data Strategy considering the agility, latencies & the transitory nature of the underlying world
[1/28/12] Two interesting articles – one “7 steps for business success with big data” in GigaOm and another “Why Big Data Won’t Make You Smart, Rich, Or Pretty” in Fast Company. Putting aside the well-publicized big data projects, a normal big data project in an organization has its own pitfalls and opportunities for success. Prototyping is essential, modelling and verifying the models is essential, and above all, have a fluid strategy that can leverage this domain …
<This is WIP. Am collecting the thoughts and working thru the list – deliberately keeping two slots (and maybe one more to make a baker’s dozen!), please bear with me … But I know how it ends ;o)>
1. And, develop the Art of Data Science
As Ed points out in Google+, big data is also about exploration and the art of Data Science is an essential element. IMHO this involves more work in the contextual, modeling and inference space, with R and so forth – resulting in new insights, new products, order of magnitude performance, new customer base et al. While this stage is effortless and obvious in some domains, it is not that easy in others …
Computer hardware has evolved a lot since the early days of Hadoop – I am always looking for insights on this topic. At my previous company we had spent some time on the infrastructure and architecture side of Hadoop.
So when I came across Eric’s blog at the Hortonworks site, I spent some time reading and thinking it through.
Suddenly it dawned on me
Hadoop Hardware selection is not an abstract science at all ! – it is a function of the degrees of separation between the Hadoop Infrastructure & the Analytics end point (which could be a visualization system like Tableau or a recommendation/inference end point or even R program) !
Let me iterate a couple of examples, showing how the degrees of separation affect the hardware selection and the Hadoop infrastructure:
- Six Degrees Of Separation :
- When the Hadoop infrastructure is this far away (logically speaking, of course), most probably it is used as feeder in an ETL flow – for example a phone manufacturer processing logs from the carrier or an on-line game operator capturing events from the game system or an e-commerce system that is aggregating multiple streams…
- Hardware characteristics :
- Speeds and feeds rule here. More boxes (e.g. Dell 310s), higher GHz, more memory (24 GB+), smaller disk space (4 TB/box) & a 1 Gb network
- Hadoop as an intermediate layer – you have some data in the Hadoop infrastructure, there is some integration, but your Data Warehouse has all your analytics systems
- Hardware Characteristics
- Definitely more storage than the previous systems, but the GHz and memory depends on the level of integration (for example joins) and so forth
- You are using the Hadoop & Co (Hbase, Pig et al) as a processing ‘big data’ store, most probably with moderate data integration from other systems.
- You also might have HBase, your users are employing Pig/Hive & the Hadoop infrastructure is directly wired to visualization tools like Tableau.
- Most probably you are augmenting or delegating some data warehouse functions to the Hadoop Infrastructure
- Hardware Characteristics:
- 10 Gb is in the cards; start with 1 Gb and explore link aggregation
- 6 or 12 spindles per box would be optimal; 2 TB or 3 TB drives are becoming affordable anyway (An interesting link from hothardware)
- General Thoughts:
- Is it MHz or network-storage ? As Eric pointed out, the ‘dedicated disk’ thread and Scott’s answer are interesting
- Brad’s blog has some insights on the network side of things
- Last time I talked with the Datascope project at JHU, they were experimenting with running Hadoop on Atom boards !
- A lot depends on the data pipeline – if it is ETL into DW systems, then less storage is OK as it is transitory anyway
- But if you are running HBase et al, then 2U 12 spindle or 6 spindle is ideal. 2 TB definitely, 3 TB is on its way
- Another dimension is the complexity
- If you are doing joins – mapside or reduce side, or using distributed cache, boxes with more memory would clearly help.
- For joins, I like the memcache-based solution: you can set up a few memcache nodes, and it opens up more system-wide integration in an orthogonally extensible way …
- Finally one caveat: The new YARN/HadoopNextgen/0.23 is a wild card. We have to study how that beast behaves vis-a-vis memory, disk and network. I see some newer opportunities (and challenges) !
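The memcache-based join idea above can be sketched as a map-side join: the small dimension table lives in an in-memory cache, and each mapper joins fact records against it locally, avoiding a reduce-side shuffle of the dimension data. A hedged Python sketch – the plain dict stands in for memcached or Hadoop’s distributed cache, and all field names are illustrative:

```python
# Map-side join sketch: the small "dimension" table lives in an
# in-memory cache (a dict here, standing in for memcached or the
# Hadoop distributed cache); mappers join against it locally.
user_cache = {
    "u1": {"country": "US"},
    "u2": {"country": "IN"},
}

def map_side_join(event, cache):
    # Look up the dimension row for this fact record.
    user = cache.get(event["user_id"])
    if user is None:
        return None  # no matching dimension row: drop the record
    return {**event, **user}

events = [
    {"user_id": "u1", "bytes": 100},
    {"user_id": "u3", "bytes": 50},
]
joined = [j for e in events if (j := map_side_join(e, user_cache))]
# joined: [{"user_id": "u1", "bytes": 100, "country": "US"}]
```

This is also why boxes with more memory help for joins, as noted above: the whole dimension table has to fit in the cache that each mapper consults.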
Based on what I am seeing in the blogosphere as well as offerings from companies, I think a transition from Big Data to Smart Data is happening – and it is a good thing. A quick diagram first and then some musings …
- Big Data = data at scale, connected and probably with hidden variables …
- Smart Data = Big Data + context + embedded interactive (inference/reasoning) models
- [Update via @kidehen] Smart Data = Linked (Big) Data + Context + Inference
- And Analytic clouds turn Big Data into Smart Data
- Smart Data is (inference) model driven & declaratively interactive
- Contextualization – includes integration with BI systems as well as data mashup with structured and unstructured data
- Information like Wikipedia is big data; the in-memory representation Watson referred to is smart data
- Device logs from 1000 good mobile handsets and 1000 not-so-good phones are big data; a GAM or GLM over the log data, after running through several stages of MapReduce, is smart data
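The handset example above can be made concrete: raw logs are first reduced to per-device features (the MapReduce stages), and only then does a model run over them. A toy Python sketch using ordinary least squares as a stand-in for the GLM (identity link) – the data and field names are invented for illustration:

```python
# "Big data" -> "smart data" sketch: raw per-handset log records are
# first reduced to features (a MapReduce-style aggregation), then a
# simple linear model is fit over the reduced features.
logs = [
    {"device": "d1", "signal": 1.0, "dropped": 0},
    {"device": "d1", "signal": 2.0, "dropped": 0},
    {"device": "d2", "signal": 3.0, "dropped": 1},
    {"device": "d2", "signal": 4.0, "dropped": 1},
]

# Reduce stage: aggregate to (mean signal, drop rate) per device.
def features(logs):
    agg = {}
    for r in logs:
        s, d, n = agg.get(r["device"], (0.0, 0, 0))
        agg[r["device"]] = (s + r["signal"], d + r["dropped"], n + 1)
    return [(s / n, d / n) for s, d, n in agg.values()]

# Model stage: ordinary least squares y = a + b*x, in closed form.
def fit(points):
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return sy / n - b * sx / n, b  # (intercept, slope)

pts = features(logs)   # [(1.5, 0.0), (3.5, 1.0)]
a, b = fit(pts)        # intercept -0.75, slope 0.5
```

The fitted coefficients, not the raw logs, are the “smart data”: a compact, query-able model that an analytics cloud could embed and serve.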
We have been seeing many waypoints along this trend towards smart data –
- GigaOm mentions “Organizations won’t necessarily need data scientists to “turn information into gold” if the data scientists employed by their software vendors have already done most of the work. Think about it like functions within spreadsheet applications tuned to specific industries, or … PaaS … Just feed the application some data, push a button, and get results …”
- My take – Big data will get smarter with embedded models and Analytics clouds will turn big data into PaaS ! My thoughts exactly ! Analytics frameworks will mature !
- So while the profession of Data Scientists is in no danger, smarter data will make it easy for everybody to understand data and make inferences …
- An interesting article in Channel Register screams “Platform wants to out-map, out-reduce Hadoop: Teaching financial grids to dance like stuffed elephants”
- While I don’t want to comment on the headline force-fitting all the Hadoop buzzwords, I do think this is an important development ! From framework to de facto interface standard !
- Makes sense in many ways
- The Hadoop API has a declarative way of specifying the data parallelism, while still enumerating what one wants to do at the data processing stages
- There is nothing else available and, as mentioned, it has become the de facto standard
- Actually the popularity is not accidental, because it has been evolved by many as they use it to solve their problems (in fact the evolution was much better than what Google has done internally – see link)
- One reason Platform chose to implement the API might be that it makes it easy for folks to try things out with Hadoop and then migrate to Platform’s Symphony – proposing and implementing a new set of programming models and interfaces would have been unsuccessful
- Actually Hadoop is a PaaS, with its own stack that reaches out as deep as the network adjacency of racks (simple yet effective) and as high as application topology, and everything in between (from storage to block size to number of tasks)
- What it did not do was the cloud aware elastic infrastructure which is being addressed with Hadoop 2.0 (another link)
- [Update] “Data will be the platform for Web 3.0”, says LinkedIn’s Reid Hoffman – yep, of course, smart data !
- And MapIncrease can add pizazz …
P.S.: Interesting blog on data blog influencers. We are #274 – made it just inside the top 300 !
The new Computer Overlords – A Race Of machines Or The Rise Of the Machines ?
Just a collection of Links – There are lots of blogs and videos and I thought of collecting them in a single place.
Let me know when you find interesting links :
- Book Review : Stephen Baker’s “Final Jeopardy”
- First a little philosophy on what we are :
- A Fight to Win the Future: Computers vs. Humans
- The epilogue says it all – “The essence of being human involves asking questions, not answering them,”
- And The New Yorker says
- “Thoughts are bigger than the things that deliver them. Our contraptions may shape our consciousness, but it is our consciousness that makes our credos, and we mostly live by those.”
- Or is it “Our trouble is not the over-all absence of smartness but the intractable power of pure stupidity, and no machine, or mind, seems extended enough to cure that.”
- NOVA – Smartest Machine On Earth
- Watson’s Visual Identity [Blog]
- Jeopardy, IBM and Wolfram | Alpha [Link]
- Watson Schematic – Good interactive visualization of its internals !
- Dissecting IBM Watson[Link]
- 90 IBM Power 750 servers / 10 racks / 16 TB Memory/ 2,880 cores / Linux / Estimated cost $1-$2 Billion [Pictures from Time]
- And NLP Of course,…
- Watson is not connected to the Internet.
- Of course, it is part of IBM Research’s DeepQA project
- But it has Hadoop ! Jeopardy Goes Hadoop [paper] And uses Apache UIMA
- The questions it did best at are the ones where, if you entered them into Google or Bing, you would get the same answers.
- Watson was marvelous at coming up with “cut and dried” answers … but it seemed to falter though at those “nuanced” questions so prevalent in Jeopardy!
- And it cannot hear, so it repeats wrong answers from other contestants
- No matter how a product fares, there is always a healthy respect for the sweat and ingenuity that developers, engineers and computer scientists put into the products
- YouTube Video ~25 min : Building Watson – A Brief Overview of the DeepQA Project
- “Watson does suck in large corpuses of text, but it doesn’t just search through: it uses natural language processing to construct sets of semantic relationships between words in different contexts.”
- “It’s also capable of (to use the designers’ phrases) temporal reasoning, geospatial reasoning, and statistical paraphrasing. Watson also does a lot of meta-learning; it figures out which of its approaches tend to be useful for particular kinds of questions.”
- Dataset & Storage
- Watson’s FJ wagering strategies include Two-Thirds Betting and Shore’s Conjecture
- Book On Watson – Final Jeopardy: Man vs. Machine and the Quest to Know Everything
- Jim’s AI Blog with lots of videos on Watson
- The Aftermath
- And finally interview with Ken Jennings and an opinion on the new computer overlords !
- The question now is not “can” but “when” computers will pass the Turing Test [Link]
- Watson still can’t think !
- Time’s 10 Questions for Watson
- [Update Feb 23,xi] Interesting Blogs
- Of course, while you might not be able to buy one, you can build one !
- Watson plans to say “Yes, Doctor” ! One is coming to a doctor near you ! [Another article]
- Or maybe this is all a farce and there is actually a wizard behind Watson !
Exciting News, Hadoop is evolving ! As I was reading Arun Murthy’s blog, multiple thoughts crossed my mind on the evolution of this lovable toy animal — My first impressions …
- From Data at Scale To Data at Scale + with complexity – connected & contextual
- From (relatively) static MapReduce To (somewhat) dynamic analytic platform
- While we might not see a real-time Hadoop soon, the proposed ideas do make the platform more dynamic
- The “support for alternate programming paradigms to MapReduce” by decoupling the computation framework is an excellent idea
- I think it is still MapReduce at the core (am not sure if it will deviate) but generic computation frameworks can choose their own patterns ! I am looking forward to BioInformatics applications
- The “Support for short-lived services” is interesting. I had blogged a little about this. Looking forward to how this evolves …
- I am hoping that it would be possible via extensible programming models to interface with programming systems like R.
- Embeddable, domain specific capabilities (for example algorithmics specific to bioinformatics) could be interesting
There are also a few things that might not be part of this evolution
- From Cluster to Cloud ?
- There is a proposed keynote by Dr. Todd Papaioannou/VP/Cloud Architecture at Yahoo, titled “Hadoop and the Future of Cloud Computing”.
- Actually I would prefer to see “Cloud Computing in the future of Hadoop” ;o) I had a blog a few weeks ago … I was hoping for a project fluffy !
We need to move from a cluster to an elastic framework (from a compute and storage perspective) – especially as Hadoop moves to an Analytic Platform. “The separation of management of resources in the cluster from management of the life cycle of applications and their component tasks” is a good first step; now the resources can be instantiated via different mechanisms – cloud being the premier one
- Streamlined logging, monitoring and metering ?
- One of the challenges we are facing in our Unified Fabric Big Data project is that it is difficult to parse the logs and make inferences that help us to qualify & quantify MapReduce jobs.
- This will also help to create an analytic platform based on the Hadoop ecosystem. For now, services like EMR most probably do second-order metering by charging for the cloud infrastructure, as they spin up separate VMs for every user (from my limited view)
In short, exciting times are ahead for Hadoop ! There is a talk tomorrow at the Bay Area HUG (Hadoop User Group) on this topic … plan to attend, and later contribute – this is exciting, cannot remain in the sidelines … Will blog on the points from tomorrow’s talk … [Update : HUG Presentations and Video Link]
I leave you with this picture from The Polar Express … time to jump aboard and enjoy the ride …
In our lab, we are working on a few ideas on Big Data, Cloud and Dataset Storage Infrastructure …. I was working on a slice of it, this weekend …
- First buy four good-sized machines (C200-M2 with 48 GB RAM & dual quad-core Intel E5520, 2.26 GHz)
- Add dual port 10Gb hardware iSCSI-4 w/ToE (TCP Offload Engine) Card (Broadcom 57711)
- In short iSCSI and TCP in hardware!
- Add a SAS mezzanine card (LSI 1064E – 4-port SAS)
- The routing of the SAS cables is a little tricky …
- Open Up the case & take out the baffles
- Now comes the hard part – disconnect the short SATA cable, connect the long SAS cable to the LSI 1064E card & install the card
- Then route the cable properly, connect & tie the cables
- Looks good, except for the black tie wrap
- Baffles back & the new machine ready to rumble HDFS – fast & furious !
Voilà – you have the satisfaction of hacking together a set of mean storage machines (4 × 8 TB) that can host HDFS in a cloud, either as pure Data Nodes (over iSCSI) or as Data/Task Nodes (w/local storage)
And off to the proving ground …
I finished building the 4 machines, installed Ubuntu 10.10, assigned IP addresses for the Integrated Management Controller and the data Ethernet ports, checked everything (one port not working, …) and am ready for Hadoop tomorrow — all in all, good work for a long weekend … !
And then to the C3L Lab … which has actually doubled since then …
For those inquiring minds, the architecture is:
- Mean compute blades – doesn’t matter bare metal or virtualized, but definitely in an elastic infrastructure
- W/ Top Of Rack intermediate nodes hanging off of hardware assisted storage/network (iSCSI/ToE)
- Thus decoupling the storage & compute to form a Hadoop Cloud
- This is still IaaS as we also need a Hadoop Cloud framework for a multi-tenant PaaS
I will post our benchmarks – on multiple dimensions (we have built a unified monitoring system using Netflow, Ganglia, UCS and Hadoop monitoring to qualify and quantify across these dimensions) :
- virtualized vs. baremetal
- virtualized with VIC cards
- different application archetypes – I/O bound vs memory hogs vs. hybrid