AWS VPC Distilled – for MongoDB DevOps

Finally we have our VPC and Mongo replica sets working. I still have to figure out the snapshots. Some notes – would appreciate comments, ideas, insights & wisdom. I have the full slides at slideshare.


I will post my notes from snapshot configuration …


MongoDB Patterns : The 13th record – Lost & Found

Motivation & Usage

  • You suddenly realize that you cannot find a few records in your production MongoDB & you want to restore them
    • The pattern assumes that you know which records to restore, i.e. you know a couple of unique field values in the lost records
  • Our use case : In our system we have lots of interesting Fact Tables & System Collections. While these collections are small, they are very complex & strategic to the application (the why is a topic for another day, another pattern). And we know how many records there are in these collections.
    • A few days ago we realized that we had only 12 records in one of our system collections; it should have 13. We know which record is missing, but it is not easy to recreate.
    • Luckily I have been backing up the database regularly.
    • So the mission, if I choose to accept it, is to find the 13th record in the latest backup that contains it and restore the record to our prod database


  • As we are storing data in json format, this will not work if we have blobs
  • Not good for large scale data restore

Code & Program

  • Backup command we have been using:
    • mongodump --verbose --host <host_name> --port <host_port>
    • This creates a snapshot under the dump directory. I usually rename it as mongodump-2012-MM-DD
  • Start a local instance of mongo – mongod
  • Restore the database from the last backup (for example mongodump-2012-08-21) locally
  • Check to see if the missing record is in this backup and find the _id of the record using command line. If not try the next to last backup – for example mongodump-2012-08-17. (In our case, I found it in the next to last)
  • Copy the record to lostandfound collection
    • db.<collection_where_the_record_is>.find({"_id": <id_of_the_record>}).forEach( function(x){ db.lostandfound.insert(x) } );
  • Export the collection
    • mongoexport -v -d <database_where_the_lostandfound_is> -c lostandfound -o paradise_regained.json
  • Import the collection to the prod database
    • ~/mongodb-osx-x86_64-2012-05-14/bin/mongoimport --verbose -u <user_name> -p <password> -h <host_name:port_number> -d <production_database> -c <collection_where_the_record_was_lost> --file paradise_regained.json
  • Check & verify that the correct document is recovered using command line
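
The search through the backups can be sketched in Python. This is a minimal sketch, not the script we actually ran: it assumes the dump directories follow the mongodump-YYYY-MM-DD naming used in the examples above, the helper names (backups_newest_first, export_command, import_command) and the "scratch" database name are made up for illustration, and it only builds the command lines without running them. The credentials (-u, -p) are left out on purpose.

```python
import re
from datetime import date

# Assumed naming scheme, taken from the examples above: mongodump-YYYY-MM-DD
BACKUP_NAME = re.compile(r"^mongodump-(\d{4})-(\d{2})-(\d{2})$")


def backups_newest_first(dir_names):
    """Return backup directory names sorted from newest to oldest.

    Names that do not match the assumed scheme are ignored.
    """
    dated = []
    for name in dir_names:
        match = BACKUP_NAME.match(name)
        if match:
            year, month, day = (int(part) for part in match.groups())
            dated.append((date(year, month, day), name))
    return [name for _, name in sorted(dated, reverse=True)]


def export_command(database, out_file="paradise_regained.json"):
    """Build the mongoexport argument list for the lostandfound collection."""
    return ["mongoexport", "-v", "-d", database,
            "-c", "lostandfound", "-o", out_file]


def import_command(host_and_port, database, collection,
                   in_file="paradise_regained.json"):
    """Build the mongoimport argument list (credentials omitted here)."""
    return ["mongoimport", "--verbose", "-h", host_and_port,
            "-d", database, "-c", collection, "--file", in_file]


if __name__ == "__main__":
    candidates = ["mongodump-2012-08-17", "mongodump-2012-08-21", "notes.txt"]
    for backup in backups_newest_first(candidates):
        print("restore locally and check:", backup)
    print(" ".join(export_command("scratch")))
```

Walking the list newest-first mirrors the manual procedure: restore the latest dump locally, look for the record, and fall back to the previous dump only if it is missing.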

Notes & References

  • TBD

Big Data With Twitter API : Twitter Tips – A Baker’s Dozen


I conducted a tutorial at OSCON 2012 – “The Art of Social Media Analysis with Twitter & Python”. Slides are at slideshare and the Python/MongoDB/Networkx programs are in GitHub. The next day I was fortunate to be interviewed by Mac Slocum – Mac has a way of asking interesting questions. Thanks, Mac.

These are a series of blogs annotating the slides with notes, as required. Some things are detailed in the slides, but the slides miss a lot of what I talked about at the tutorial. I am planning to add the notes in a series of six blogs. This is Part 1 of 6.

The hands-on project, patterns & code ended up handling ~970,000 unique users and a social graph with ~500,000 cliques; some Twitter REST API runs took 19 hrs to complete, and the MongoDB database was ~6 GB on an m2.large aws instance. I will point out some of the interesting big data patterns related to the Twitter API and the social graph.


  • Aug 4, 2012 – Part 1 Completed
  • Aug 4, 2012 – Part 2 Completed
  • Aug 4, 2012 – Part 3 Completed
  • Aug 5, 2012 – Part 4 Being contemplated


Twitter is at a fork – it has achieved a certain amount of status and popularity, not to mention utility and value to society. We are all slowly adapting to the medium and finding ways to use it. My thoughts on the recent changes in API “branding”:

Twitter Tips – A Baker’s Dozen:

The slides capture the detailed bullet points.

Big Data with TwitterAPI – Twitter Tips

In the next blog, we will look at the Big Data Pipeline for a Twitter API ecosystem and then move on to APIs and Twitter Object Models.



Notes on MongoDB @ AWS-Ubuntu-12.04 XFS, RAID10 & LVM

Notes & References from our experiences on the MongoDB Data Layer for BioInformatics. Like they say, don’t blindly execute any scripts & question everything. As I researched each aspect, I came across a set of good references. I have annotated and contextually ordered the reference list at the end of this blog, to help one make informed NoSQL data infrastructure design & optimization decisions.

This is Part 1. Part 2 – MongoOps will cover Backup, Replication & Sharding, and finally Part 3 will cover the Aggregation Framework. Let us see how I fare …

Our setup and rationale:

  • Ubuntu – Easy admin & maintenance
  • Mongo 2.1.x – We need the aggregation framework now. I will update along with newer versions of mongo in the dev branch. So you can count on me to test the latest aggregation framework code!
  • m2.xlarge – This (baby) beast has 17 GB memory & 2 virtual cores (3.25 EC2 Compute Units each). We have multiple collections, fact tables, datasets, analytics and so forth.
    • I plan to load test and find the best configuration.
  • xfs – Performance, Extensibility.
    • “supports I/O suspend & write-cache flushing – critical for multi-disk consistent snapshots”
    • “XFS better performance from these moody disks!”
  • RAID10 – Availability, “Enhance Operational Durability” & Resilience.
    • One drawback : Cannot use ebs snapshots
      • Replica with a non-RAID ebs (to backup from) should do the trick. Will explain the configuration in Part 2 – MongoOps:Backup
    • “AWS protects against drive failure, RAID10 protects against failures at the EBS technology layer”
  • 8 X 32 GB – Striping across multiple disk spindles
    • Diminishing performance returns after 8 disks
  • LVM – gives the extensibility (RAID10 cannot be extended)
    • And use xfs_growfs at the file system layer
  • Replica Sets & Sharding (Future, will blog)
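
A quick back-of-the-envelope check on what the 8 X 32 GB layout yields. This is a minimal sketch assuming the plain two-copy RAID10 layout, where every block is mirrored once and half of the raw capacity is usable; the function name raid10_usable_gb is made up for illustration, and file system and LVM overhead are ignored.

```python
def raid10_usable_gb(volume_count, volume_size_gb):
    """Usable capacity of a two-copy RAID10 array of equally sized volumes.

    Every block is stored twice, so half of the raw capacity is usable.
    """
    if volume_count < 4 or volume_count % 2 != 0:
        raise ValueError("this sketch assumes an even number of volumes, at least 4")
    return volume_count * volume_size_gb // 2


if __name__ == "__main__":
    # 8 x 32 GB ebs volumes, as in the setup above
    print(raid10_usable_gb(8, 32), "GB usable")
```

For the 8 X 32 GB setup this gives 128 GB of usable space out of 256 GB raw, which is the price of the mirroring.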

I started from the excellent writeup on Amazon EC2 Quickstart by Sandeep Parikh. He has done a wonderful job. I found a few things I had to change to fit our setup.

  1. Ubuntu 12.04 requires a few installs
    • sudo apt-get install mdadm
    • sudo apt-get install lvm2 xfsprogs
  2. Three useful commands to inspect the volumes, devices & partitions
    1. df -h
    2. fdisk -l
    3. cat /proc/partitions
  3. The devices are named xvdf, xvdg, … not sdf, sdg
    • For example, sudo mdadm --verbose --create /dev/md0 --metadata 1.2 --level=10 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
  4. Added --metadata 1.2 for the mdadm command
  5. There were a few discussions (Ref: 1.F. Best Practices discussion in mongodbuser group) whether we need --chunk 256 in the command. The conclusion is that while this might help other applications, it is not needed for mongodb
  6. sudo blockdev --setra 65536 /dev/md0 (not 128) <- change in command
  7. The conf file is /etc/mongodb.conf in Ubuntu 12.04 (not mongod.conf)
  8. The init.d file is /etc/init.d/mongodb (not mongod)
  9. The logpath & dbpath in /etc/mongodb.conf don’t take effect. They are overshadowed by the defaults in /etc/init.d/mongodb. This threw me off for some time; finally I made the changes in the /etc/init.d/mongodb file
    • I think there is room for improvement. If I get time, I will refactor this file & make it simpler. I don’t want to jump in without enough deep thought – I am sure the good folks at 10Gen have good reasons for the complexity
  10. sudo mkfs.xfs -f /dev/vg0/data to format xfs
  11. And appropriate command for fstab
    • echo '/dev/vg0/data /data xfs defaults,auto,noatime,noexec 0 0' | sudo tee -a /etc/fstab
  12. The volume scheme in the Quickstart (90% /data, 5% /journal and 5% /log) might not work. I had first set up 4 X 30 GB ebs and this gave 3 GB for journal. Then mongod wouldn’t start, with the error
    • Fri May 4 03:36:57 [initandlisten] ERROR: Insufficient free space for journal files
    • Fri May 4 03:36:57 [initandlisten] Please make at least 3379MB available in /data/journal or use --smallfiles
    • Fri May 4 03:36:57 [initandlisten]
    • Fri May 4 03:36:57 [initandlisten] exception in initAndListen: 15926 Insufficient free space for journals, terminating
    • This also threw me a little bit.
    • I think a single volume with three directories is better than three separate volumes, especially as they are in the same logical volume. Maybe the three separate volumes are meant to limit the disk usage, but if the disk is full in any one of them, the system would be down anyway. So I am not sure the three separate volumes buy us anything
  13. Need apt support for MongoDB dev releases
    • 10Gen should add support for development releases in apt. I had to install 2.0.x, download 2.1.x and then copy over the /usr/bin/ directory. It works for now, but I don’t think it is safe
  14. I will update this blog as I complete the data layer
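
Two of the numbers in the steps above are easy to sanity-check with a little arithmetic. This is a minimal sketch with made-up function names: it assumes that 4 X 30 GB in RAID10 leaves 60 GB usable, takes the 3379 MB requirement straight from the mongod log line, and relies on blockdev --setra counting readahead in 512-byte sectors. File system overhead is ignored, so the real journal volume is a bit smaller still.

```python
SECTOR_BYTES = 512  # blockdev --setra counts readahead in 512-byte sectors


def readahead_bytes(sectors):
    """Convert a blockdev --setra value into bytes."""
    return sectors * SECTOR_BYTES


def journal_share_mb(usable_gb, journal_percent):
    """Size in MB of a journal volume carved out as a percentage of the array."""
    return usable_gb * 1024 * journal_percent // 100


def journal_fits(usable_gb, journal_percent, required_mb=3379):
    """True if the journal volume meets the size mongod asked for in the log."""
    return journal_share_mb(usable_gb, journal_percent) >= required_mb


if __name__ == "__main__":
    print(readahead_bytes(65536) // (1024 * 1024), "MiB readahead")
    print(journal_share_mb(60, 5), "MB journal on the 4 x 30 GB array")
    print("journal fits:", journal_fits(60, 5))
```

With 60 GB usable, a 5% journal volume comes to 3072 MB, which is below the 3379 MB mongod asks for, so the 90/5/5 split fails on the small array. The readahead value of 65536 sectors works out to 32 MiB, compared with 64 KiB for 128 sectors.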

Finally, the references (they helped me understand the nuances & the details). I have annotated them and ordered them contextually to help one make informed infrastructure design & optimization decisions; start from the beginning & read through sequentially :

  1. 10Gen Resources
    1. EC2 Quickstart – The bible ;o) This is for Amazon Linux, so needs some changes for Ubuntu
      1. The best source and should be used as a base to build your MongoDB infrastructure.
      2. A good plan is to read the links, move on to the rest of the references and then come back to build your infrastructure.
    2. MongoDB on AWS – An excellent paper (in pdf format) by Miles Ward
    3. Overview & Topology
    4. MongoDB on EC2 overview
    5. MongoDB Best Practices – Good overview tips
  2. MongoDB/AWS
    1. EBS Overview – must read to understand EBS
    2. RAID10 your EBS – good blog on why RAID10 for ebs; as aws provides redundancy, why do we mirror ebs volumes ?
    3. ServerFault Q&A on RAID10 & aws
    4. Getting good I/O from ebs – A few good points. Make sure to read the rest of this section before taking any drastic steps ;o)
    5. AWS ebs benchmarks – gives one an insight into RAID0, RAID10 et al
    6. Best Practices for MongoDB/RAID – discussion thread at mongodb-user Google group
    7. MongoDB memory usage & working set – discussion at mongodb-user Google group. Might help you with sizing the instance
  3. MongoDB/Ubuntu
    1. MongoDB, RAID10 & Ubuntu – includes detailed commands & explanations
    2. MongoDB, LVM, XFS, RAID10 – Another list of commands
    3. Installing MongoDB – Quick overview. Would be good to read the rest before jumping into installation
  4. RAID/LVM blogs
    1. Quick intro to LVM
    2. Managing RAID & LVM with Linux – good intro
    3. Linux RAID smackdown
    4. Managing RAID 10
    5. Grow XFS on LVM
    6. LVM commands Cheat sheet
  5. RAID Gory Details
    3. Complex RAID10 with mdadm
  6. LVM gory details
  7. More Gory Details for enquiring minds who want to know