Human Ingenuity puts a Fiat 500 C in front of VW HQ in Google Street View



Interesting story at the Zagg blog:

Fiat’s workers spotted the Google Street View car, which captures panoramic views of locations around the world with a roof-mounted camera, driving past their offices. They quickly drove a Fiat 500C to Volkswagen’s head office in Södertälje, 45 minutes away, and parked it there, just in time for the Street View camera to snap the following picture!

Now the Google Pegman shows the Fiat in front of VW HQ!

An ode to the Easter Eggs, Ecstasies & Agonies of a GoogleIO Ticket


Chronicles of my failed attempt at procuring a GoogleIO ticket … The Google Wallet ate my GoogleIO 2013 ticket!

It was the night before GoogleIO … Excitement was in the air … Tweets were in order …


The order of the day was to find all the Easter Eggs on the page …


I clicked and clicked and clicked … and got through all the Easter Eggs …


And I slept …

It was early in the morning when I woke up … still 15 minutes before the GoogleIO ticket store opened …


The wait was agonizing, but all for a good cause, so I thought …

I was there when the GoogleIO Ticket store opened …


I was not disappointed when my first try failed after 6 minutes …


And my optimism paid off when it eventually found me a precious little ticket …


I reviewed the purchase … and gave it to Google Wallet … little did I know that …


But the screen stayed there and the time ticked down ….

By now the verdict was clear – The Google Wallet was going to eat my lucky GoogleIO ticket …

And it did …


And soon after the registration ended …. The cold hand of fate …


Can I find a kind soul at Google to help me, or should I wait for GoogleIO 2014? …

Tomcat 7 packaging layout & install in Ubuntu 12.04


Today I installed Tomcat 7 on Ubuntu. They have changed the layout of the directories, and the changes are for the better. We are all used to Tomcat’s old layout, and the new one takes a little time to get used to … At first I couldn’t make head or tail of it. Then I looked for things where they should be … and voila … it made perfect sense!

  • /usr/share/tomcat7 <- The bin, lib and the rest
  • /usr/share/tomcat7-root <- I am still figuring out what this does
  • /etc/tomcat7/ <- configuration files
  • /var/log/tomcat7/ <- logs. For some reason this directory is rwxr-x---. It really should be rwxr-xr-x
  • /var/lib/tomcat7/ <- webapps et al go here
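
If you want to double-check where things landed on your own box, here is a tiny Python sketch that walks the directories listed above and prints their permission bits. The paths are just the ones from the list; nothing beyond that is assumed.

    #!/usr/bin/env python
    # Sanity-check the Ubuntu Tomcat 7 layout described above.
    import os

    tomcat_dirs = {
        "/usr/share/tomcat7": "bin, lib and the rest",
        "/usr/share/tomcat7-root": "purpose unclear",
        "/etc/tomcat7": "configuration files",
        "/var/log/tomcat7": "logs",
        "/var/lib/tomcat7": "webapps et al",
    }

    for path, purpose in sorted(tomcat_dirs.items()):
        if os.path.isdir(path):
            mode = oct(os.stat(path).st_mode & 0o777)  # e.g. 0750 for the log directory
            print("%-26s %-7s %s" % (path, mode, purpose))
        else:
            print("%-26s missing %s" % (path, purpose))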

MongoDB mongorestore Assertion failure b.empty error


I encountered this error while trying to restore 2.1 mongodumps (it happened only after installing MongoDB 2.2.0):

The Error:

Wed Sep 19 18:33:26 Assertion failure b.empty() src/mongo/db/json.cpp 645
0x10036b5fb 0x10009ad86 0x1004af6f2 0x100016f85 0x100016944 0x100016944 0x100019e54 0x100313b5d 0x100315697 0x10000126a 0x1000011e4 
 0 mongorestore 0x000000010036b5fb _ZN5mongo15printStackTraceERSo + 43
 1 mongorestore 0x000000010009ad86 _ZN5mongo12verifyFailedEPKcS1_j + 310
 2 mongorestore 0x00000001004af6f2 _ZN5mongo8fromjsonEPKcPi + 1634
 3 mongorestore 0x0000000100016f85 _ZN7Restore9drillDownEN5boost11filesystem210basic_pathISsNS1_11path_traitsEEEbbb + 4117
 4 mongorestore 0x0000000100016944 _ZN7Restore9drillDownEN5boost11filesystem210basic_pathISsNS1_11path_traitsEEEbbb + 2516
 5 mongorestore 0x0000000100016944 _ZN7Restore9drillDownEN5boost11filesystem210basic_pathISsNS1_11path_traitsEEEbbb + 2516
 6 mongorestore 0x0000000100019e54 _ZN7Restore5doRunEv + 3140
 7 mongorestore 0x0000000100313b5d _ZN5mongo8BSONTool3runEv + 1325
 8 mongorestore 0x0000000100315697 _ZN5mongo4Tool4mainEiPPc + 5447
 9 mongorestore 0x000000010000126a main + 58
 10 mongorestore 0x00000001000011e4 start + 52

The Cause:

The offending file is the <dump directory>/<database>/<collection>.metadata.json file. It has a line like so:

{options : { "create" : <database>, undefined, undefined, undefined }, indexes:[{ "v" : 1, "key" : { "_id" : 1 }, "ns" :<database>, "name" : "_id_" }]}

The “undefined,undefined,…” is an artifact from 2.1 beta.

The Cure:

Delete the “, undefined, undefined, undefined”, save the JSON metadata file, and rerun mongorestore.

You might have to do this for a few databases. You can see which database it is by looking at the lines just before the error message, like so:

Wed Sep 19 18:33:26 <dump directory>/<database>/<collection>.bson 
Wed Sep 19 18:33:26 going into namespace [<database>.<collection>] 
Wed Sep 19 18:33:26 Assertion failure b.empty() src/mongo/db/json.cpp 645
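
If many collections are affected, a small script can do the same edit across the whole dump. This is a minimal sketch, assuming the standard <dump directory>/<database>/<collection>.metadata.json layout and that the only change needed is removing the literal “, undefined” tokens; keep a copy of the dump before running it.

    import glob
    import re
    import sys

    # Usage: python fix_metadata.py <dump directory>
    dump_dir = sys.argv[1]

    for path in glob.glob("%s/*/*.metadata.json" % dump_dir):
        with open(path) as f:
            original = f.read()
        # Strip the ", undefined" artifacts left behind by the 2.1 beta
        patched = re.sub(r",\s*undefined", "", original)
        if patched != original:
            with open(path, "w") as f:
                f.write(patched)
            print("patched %s" % path)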

Cheers

<k/>

MongoDB Patterns : The 13th record – Lost & Found


Motivation & Usage

  • You suddenly realize that you cannot find a few records in your production MongoDB & you want to restore them
    • The pattern assumes that you know which records to restore, i.e. you know a couple of unique field values in the lost records
  • Our use case: In our system we have lots of interesting Fact Tables & System Collections. While these collections are small, they are very complex & strategic to the application (the why is a topic for another day, another pattern). And we know how many records there are in these collections.
    • A few days ago we realized that we had only 12 records in one of our system collections; it should have 13. We know which record is missing, but it is not easy to recreate.
    • Luckily I have been backing up the database regularly.
    • So, the mission, if I choose to accept it, is to find the 13th record in the latest backup that contains it and restore the record to our prod database

Non-usage

  • As we are storing data in json format, this will not work if we have blobs
  • Not good for large scale data restore

Code & Program

  • The backup command we have been using:
    • mongodump --verbose --host <host_name> --port <host_port>
    • This creates a snapshot under the dump directory. I usually rename it to mongodump-2012-MM-DD
  • Start a local instance of mongo – mongod
  • Restore the database from the last backup (for example mongodump-2012-08-21) locally
  • Check to see if the missing record is in this backup and find the _id of the record using the command line. If not, try the next-to-last backup – for example mongodump-2012-08-17. (In our case, I found it in the next-to-last one)
  • Copy the record to a lostandfound collection (a pymongo version of this step follows the list)
    • db.<collection_where_the_record_is>.find({"_id" : <id_of_the_record>}).forEach( function(x){ db.lostandfound.insert(x); } );
  • Export the collection
    • mongoexport -v -d <database_where_the_lostandfound_is> -c lostandfound -o paradise_regained.json
  • Import the collection to the prod database
    • ~/mongodb-osx-x86_64-2012-05-14/bin/mongoimport --verbose -u <user_name> -p <password> -h <host_name:port_number> -d <production_database> -c <collection_where_the_record_was_lost> --file paradise_regained.json
  • Check & verify that the correct document is recovered using command line

Notes & References

  • TBD

Big Data with TwitterAPI – Twitter Objects & APIs


For those who wandered in, slightly late:

These are a series of blogs annotating my OSCON-2012 slides with notes, as required. Some things are detailed in the slides, but the slides miss a lot of the stuff I talked about at the tutorial. This is Part 3 of 6. Link to Part 1.

Twitter Object Model:

I found the Twitter object model simple, intuitive and congruent. There are a couple of terms one gets used to – tweets are called statuses and the act of tweeting is a status update. Users whom you follow are friends and the users who follow you are followers.

I have a few slides describing the various objects, as well as a few Python programs (in GitHub) to actually get the JSON via the Twitter API and inspect the actual objects. The setup slide talks about what libraries are needed. As the slides cover the objects in some detail, I will not repeat them here …
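
To give a flavor of the objects, here is a tiny sketch that inspects a status that has already been fetched and saved to disk. The file name and the particular fields are my picks for illustration; the slides and the GitHub programs remain the reference.

    import json

    # A status (tweet) previously fetched via the API and saved as JSON;
    # "status.json" is just a hypothetical file name.
    with open("status.json") as f:
        status = json.load(f)

    # A few of the fields a status object carries
    print(status["id_str"])                   # tweet id as a string
    print(status["created_at"])               # when it was posted
    print(status["text"])                     # the tweet text itself
    print(status["user"]["screen_name"])      # the embedded user (author) object
    print(status["user"]["followers_count"])  # ... with its follower count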

Twitter API Model:

 

I found the APIs straightforward, albeit with a few glitches and mismatches. You have to assume errors and occasional dropped connections, so program accordingly: command buffers (to catch up from where you stopped), checkpoints (to know where you stopped), control numbers monitored by supervisor processes, handling of rate limits, and so forth.
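
To make the checkpoint idea concrete: persist the id of the last status you fully processed, and resume from it after a crash or restart. A minimal sketch; fetch_since() and process() are stand-ins for your real API call and downstream stage, and checkpoint.json is a hypothetical file name.

    import json
    import os

    CHECKPOINT = "checkpoint.json"  # where we remember how far we got

    def load_checkpoint():
        # The id of the last status we fully processed, if any
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["since_id"]
        return None

    def save_checkpoint(since_id):
        with open(CHECKPOINT, "w") as f:
            json.dump({"since_id": since_id}, f)

    def fetch_since(since_id):
        # Stand-in for the real API call (e.g. a timeline request with since_id);
        # returns an empty list so the sketch runs as-is
        return []

    def process(status):
        # Stand-in for the next pipeline stage
        print(status["text"])

    def run_collector():
        for status in fetch_since(load_checkpoint()):
            process(status)
            save_checkpoint(status["id_str"])  # checkpoint after each unit of work

    if __name__ == "__main__":
        run_collector()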

Let us move on to the Applications in the next blog …

Big Data with TwitterAPI – Data Pipeline


For those who wandered in, a little late:

These are a series of blogs annotating my OSCON-2012 slides with notes, as required. Some things are detailed in the slides, but the slides miss a lot of the stuff I talked about at the tutorial. This is Part 2 of 6. Link to Part 1.

Big data pipeline:

As I had written earlier, one cannot just take some big data and analyze it. One has to approach the architecture in a systemic way with a multi-stage pipeline.  For a Twitter API system, the big data pipeline has a few more interesting nuances.

Tips:

  1. The first tip, of course, is to implement the application as a multi-stage pipeline. We will see a little later what the stages are
  2. A Twitter-based big data application needs orthogonal extensibility to add components at each stage, viz. maybe a transformation, maybe an NLP (Natural Language Processor), maybe an aggregator, a graph processor, multiple validators at different levels and so forth. So think ahead and have a pipeline infrastructure which is flexible and extensible. For our current project, I am working on a Python framework-based pipeline. More on this later …
  3. Have a functional programming approach for the pipeline, consisting of small components, each doing a well defined function at the appropriate granularity. Caching, command buffers, checkpointing, restart mechanisms, validation, control points – all are candidates for pipeline component functions.
  4. Develop the big data pipeline on the “Erlang”-style supervisor tree design principle. Expect errors, fail fast and program defensively – and know which is which. For example, the rate limit kicks in at 300 calls and the response also tells you the rate-limit reset. So if you hit the rate limit, calculate when the reset will happen (rate-limit reset minus current time) and then sleep until the reset (see the sketch after this list).
    • A couple of ideas on this: First, there should be control numbers, as much as possible – either very specific numbers (for example, check the tweets received against the number of tweets reported by the user API) or general numbers (say 440 million tweets per day, or 4.4 million per day for the 1% sample; if you see substantially lower numbers, investigate)
    • Second, the spiders that do the collecting should not worry about these numbers or even about recovering from errors they cannot control. The supervisory processes will compare the control numbers and rerun the workers as needed
  5. Checkpoint frequently – this is easier with a carefully tuned pipeline because you can checkpoint at the right granularity
  6. And pay attention to the handoff between stages and the assumptions the downstream functions make and need – for example, an API might return a JSON array while the next stage might be doing a query which requires that the array be unrolled into multiple documents (documents as in mongodb-speak)
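
As mentioned in tip 4, the sleep-until-reset logic is short once you have the reset timestamp. A minimal sketch, assuming the reset time comes back as epoch seconds (as the Twitter rate-limit headers of the time did); how you read it out of the response is up to your HTTP layer.

    import time

    def wait_for_rate_limit_reset(reset_epoch):
        # reset_epoch: the rate-limit reset time reported by the API, in epoch seconds
        sleep_for = reset_epoch - time.time()
        if sleep_for > 0:
            time.sleep(sleep_for + 5)  # small cushion so we do not wake up too early

    # Usage inside a collector, in pseudocode around a real HTTP call:
    # if response.status_code == 429:  # rate limited
    #     wait_for_rate_limit_reset(int(response.headers["x-rate-limit-reset"]))
    #     ... retry the request ...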

Big Data Pipeline for Twitter:

While the picture is a little ugly and busy, it captures the essence. Let us go through each stage and dig a little deeper.

  1. Collect

    • Collect is probably the simplest stage, yet the most difficult in a big data world. The volume & velocity make this stage harder to manage.
  2. Store

    • I prefer a NoSQL store – for example MongoDB, as it also stores JSON natively. Keep the schema as close to Twitter’s as possible – it would be easier to debug
    • The store stage is also a feedback loop.
      • First, have control numbers to check that the collect stage did get what it was supposed to get
      • If not, go back to collect and repeat the Crawl-Store-Validate-Recrawl-Refresh loop. This is the Erlang-style supervisor I was talking about earlier
  3. Transform & Analyze

    • This is where you can unroll data structures, dereference ids to screen_names and so forth. There might be some Twitter API calls here (for example to get user information); it is better to make these calls and collect the metadata at the collect-store stage (see the unrolling sketch after this list)
  4. Model & Reason

    This is the stage where one applies the domain data science – for example NLP for sentiment analysis or graph theory for social network analysis and so forth. It is crucial that the principles from those domains are congruent with Twitter. Twitter is not the same as friendship networks like LinkedIn or Facebook. Understanding the underlying characteristics before applying the various domain ideas is very important …

  5. Predict, Recommend & Visualize

    This, of course, is when we see the results and use them.
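
To illustrate the unrolling mentioned under Transform & Analyze: an API call that returns an array of statuses is usually easier to query downstream if each element becomes its own MongoDB document. A minimal pymongo sketch; the database and collection names, and the "statuses" key, are hypothetical.

    from pymongo import MongoClient  # assumed installed

    db = MongoClient("localhost", 27017)["twitter_pipeline"]  # hypothetical database

    def unroll(api_response):
        # api_response: the raw JSON dict from an API call that wraps an array,
        # e.g. {"statuses": [ {...}, {...}, ... ]}
        for status in api_response["statuses"]:
            # One tweet per document, keyed by the tweet id, so later stages can
            # query individual tweets instead of digging through arrays
            status["_id"] = status["id_str"]
            db.statuses.replace_one({"_id": status["_id"]}, status, upsert=True)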

Now we are ready for Part 3, The API and Object Models for the Twitter API

Cheers