For those who wandered in a little late:
This is a series of blog posts annotating my OSCON-2012 slides with notes where needed. Some things are detailed in the slides, but the slides miss a lot of what I talked about at the tutorial. This is Part 2 of 6. Link to Part 1.
Big data pipeline:
As I had written earlier, one cannot just take some big data and analyze it. One has to approach the architecture in a systematic way, with a multi-stage pipeline. For a Twitter API system, the big data pipeline has a few more interesting nuances.
- The first tip, of course, is to implement the application as a multi-stage pipeline. We will see a little later what the stages are.
- A Twitter-based big data application needs orthogonal extensibility, so that components can be added at each stage: maybe a transformation, maybe an NLP (Natural Language Processor), maybe an aggregator, a graph processor, multiple validators at different levels, and so forth. So think ahead and have a pipeline infrastructure which is flexible and extensible. For our current project, I am working on a Python framework-based pipeline. More on this later …
- Take a functional programming approach to the pipeline: build it from small components, each doing a well-defined function at the appropriate granularity. Caching, command buffers, checkpointing, restart mechanisms, validation, and control points are all candidates for pipeline component functions.
- Develop the big data pipeline following the Erlang-style supervisor tree design principle. Expect errors, fail fast, and program defensively, and know which is which. For example, the rate limit kicks in at 300 calls, and the API also tells you the ratelimit-reset. So if you hit the rate limit, calculate how long it is until the reset (ratelimit-reset minus current time) and then sleep until the reset.
- A couple of ideas on this. First, there should be control numbers as much as possible: either very specific numbers (for example, check the tweets received against the number of tweets reported by the user API) or general numbers (say 440 million tweets per day, or 4.4 million per day at 1%; if you see substantially lower numbers, investigate).
- Second, the spiders that do the collecting should not worry about these numbers, or even about recovering from errors that they cannot control. The supervisory processes will compare the control numbers and rerun the workers as needed.
- Checkpoint frequently – this is easier with a carefully tuned pipeline because you can checkpoint at the right granularity
- And pay attention to the handoff between stages and to the assumptions the downstream functions make and need. For example, an API might return a JSON array while the next stage runs a query that requires the array to be unrolled into multiple documents (documents in the MongoDB sense).
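The rate-limit tip above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the tutorial: it assumes the reset time arrives as Unix epoch seconds, and the function names `seconds_until_reset` and `wait_for_reset` are made up for illustration.

```python
import time


def seconds_until_reset(reset_epoch, now=None):
    """Return how long to wait until the rate-limit window resets.

    reset_epoch: reset time in Unix epoch seconds (an assumption about
                 the format the API reports).
    now: current time in epoch seconds; defaults to time.time().
    """
    if now is None:
        now = time.time()
    # ratelimit-reset minus current time, clamped so it is never negative.
    return max(0.0, reset_epoch - now)


def wait_for_reset(reset_epoch, sleep=time.sleep, now=None):
    """Sleep until the reset and return the number of seconds waited.

    The sleep function is injectable so the logic can be exercised
    without actually blocking.
    """
    delay = seconds_until_reset(reset_epoch, now)
    if delay > 0:
        sleep(delay)
    return delay
```

A worker that sees a rate-limit response would call `wait_for_reset(reset_epoch)` and then retry the same request. Whether to retry at all is better left to the supervisor, not the worker.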
Big Data Pipeline for Twitter:
While the picture is a little ugly and busy, it captures the essence. Let us go through each stage and dig a little deeper.
- Collect is most probably the simplest stage, yet the most difficult in a big data world. The volume and velocity make this stage harder to manage.
- For the store stage, I prefer a NoSQL store, for example MongoDB, as it stores JSON-like documents natively. Keep the schema as close to Twitter's as possible; it will be easier to debug.
- The store stage is also a feedback loop.
- First, have control numbers to check that the collect stage got what it was supposed to get.
- If not, go back to collect and repeat the Crawl-Store-Validate-Recrawl-Refresh loop. This is the Erlang-style supervisor I was talking about earlier.
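The collect-store feedback loop can be sketched as a small supervisor function. This is only an illustration under assumptions: `collect` and `store` are hypothetical callables standing in for the real stages, and the `tolerance` and `max_attempts` parameters are invented for this sketch.

```python
def supervise(collect, store, expected_count, max_attempts=3, tolerance=0.9):
    """Rerun collect/store until the stored count meets the control number.

    collect: callable returning a list of tweets (hypothetical worker).
    store: callable that persists the tweets and returns the total stored.
    expected_count: the control number to validate against.
    tolerance: fraction of expected_count accepted as good enough.
    Returns (ok, stored, attempts_used).
    """
    stored = 0
    for attempt in range(1, max_attempts + 1):
        try:
            tweets = collect()
        except Exception:
            # The worker failed fast; the supervisor decides to recrawl.
            continue
        stored = store(tweets)
        # Validate against the control number before moving on.
        if stored >= tolerance * expected_count:
            return True, stored, attempt
    return False, stored, max_attempts
```

The worker stays simple: it either returns tweets or raises. All the retry and validation policy lives in one place, which is the point of the supervisor style.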
Transform & Analyze
- This is where you can unroll data structures, dereference ids to screen_names, and so forth. This may need some Twitter API calls (for example, to get user information). It is better to make these calls and collect the metadata at the collect-store stage.
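As a rough sketch of the unroll and dereference steps, assuming a stored response shaped like `{"user_id": 42, "ids": [7, 8]}` and a lookup table of screen names gathered at the collect-store stage. The field names and both helper functions are hypothetical, chosen only for this example.

```python
def unroll(doc, array_field, item_field):
    """Return one flat document per element of doc[array_field]."""
    # Copy every field except the array itself into each output document.
    base = {k: v for k, v in doc.items() if k != array_field}
    return [dict(base, **{item_field: item}) for item in doc[array_field]]


def dereference(docs, id_field, name_field, screen_names):
    """Attach a screen name next to each id, using metadata gathered earlier.

    Ids missing from the lookup table get None, so gaps stay visible.
    """
    return [
        dict(d, **{name_field: screen_names.get(d[id_field])})
        for d in docs
    ]
```

For instance, `unroll({"user_id": 42, "ids": [7, 8]}, "ids", "follower_id")` yields two documents, one per follower id, which can then be passed to `dereference` with a dictionary mapping ids to screen names.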
Model & Reason
This is the stage where one applies the domain data science, for example NLP for sentiment analysis or graph theory for social network analysis. It is crucial that the principles borrowed from these domains are congruent with Twitter. Twitter is not the same as friendship networks like LinkedIn or Facebook. Understanding the underlying characteristics before applying the various domain ideas is very important …
Predict, Recommend & Visualize
This, of course, is when we see the results and use them.
Now we are ready for Part 3, The API and Object Models for Twitter API