Finally we have our VPC and Mongo replica sets working. I still have to figure out the snapshots. Some notes – would appreciate comments, ideas, insights & wisdom. I have the full slides at slideshare.
I will post my notes from snapshot configuration …
I conducted a tutorial at OSCON 2012: “The Art of Social Media Analysis with Twitter & Python”. Slides are at slideshare and the Python/MongoDB/Networkx programs are in GitHub. The next day I was fortunate to be interviewed by Mac Slocum. Mac has a way of asking interesting questions. Thanks, Mac!
This is a series of blogs annotating the slides with notes, as required. Some things are detailed in the slides, but the slides miss a lot of what I talked about at the tutorial. I am planning to add the notes in a series of six blogs. This is Part 1 of 6.
The hands-on project, patterns & code ended up handling ~970,000 unique users and a social graph with ~500,000 cliques. Some Twitter REST API runs took 19 hrs to complete, and the MongoDB database was ~6 GB on an m2.large AWS instance. I will point out some of the interesting big data patterns related to the Twitter API and the social graph.
- Aug 4, 2012 – Part 1 Completed
- Aug 4, 2012 – Part 2 Completed
- Aug 4, 2012 – Part 3 Completed
- Aug 5, 2012 – Part 4 Being contemplated
Twitter is at a fork: it has achieved a certain amount of status and popularity, not to mention utility and value to society. We are all slowly adapting to the medium and finding ways to use it. My thoughts on the recent changes in API “branding”:
Twitter Tips – A Baker’s Dozen:
The slides capture the detailed bullet points.
Big Data with TwitterAPI – Twitter Tips
In the next blog, we will look at the Big Data Pipeline for a Twitter API ecosystem and then move on to APIs and Twitter Object Models.
Notes & references from our experiences with the MongoDB data layer for bioinformatics. As they say, don’t blindly execute any scripts & question everything. As I researched each aspect, I came across a set of good references. I have annotated and contextually ordered the reference list at the end of this blog, to help one make informed NoSQL data infrastructure design & optimization decisions.
This is Part 1. Part 2 (MongoOps) will cover Backup, Replication & Sharding, and Part 3 will cover the Aggregation Framework. Let us see how I fare …
Our setup and rationale:
- Ubuntu – Easy admin & maintenance
- Mongo 2.1.x – We need the aggregation framework now. I will keep updating to newer versions of mongo in the dev branch, so you can count on me to test the latest aggregation framework code!
- m2.xlarge – This (baby) beast has 17 GB memory & 2 virtual cores (6.5 EC2 Compute Units in total). We have multiple collections, fact tables, datasets, analytics and so forth.
- I plan to load test and find the best configuration.
- xfs – Performance, Extensibility.
- “supports I/O suspend & write-cache flushing – critical for multi-disk consistent snapshots”
- “XFS better performance from these moody disks!”
- RAID10 – Availability, “Enhance Operational Durability” & Resilience.
- One drawback: cannot use EBS snapshots
- Replica with a non-RAID ebs (to backup from) should do the trick. Will explain the configuration in Part 2 – MongoOps:Backup
- “AWS protects against drive failure, RAID10 protects against failures at the EBS technology layer”
- 8 X 32 GB – Striping across multiple disk spindles
- Diminishing performance returns after 8 disks
- LVM – gives the extensibility (RAID10 cannot be extended)
- And use xfs_growfs at the file system layer
- Replica Sets & Sharding (Future, will blog)
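As a quick sanity check on the sizing above, here is a minimal Python sketch of the usable capacity of a RAID10 array. It assumes the usual two-copy layout, where half of the raw space holds mirrors, and it ignores file system and LVM overhead. The function name raid10_usable_gb is hypothetical and only for illustration.

```python
def raid10_usable_gb(num_disks: int, disk_size_gb: float, copies: int = 2) -> float:
    """Usable capacity of a RAID10 array, assuming `copies` copies of every block."""
    if num_disks < copies:
        raise ValueError("need at least as many disks as copies")
    # Every block is stored `copies` times, so only 1/copies of the raw space is usable.
    return num_disks * disk_size_gb / copies


# The 8 X 32 GB layout described above: 256 GB raw, 128 GB usable.
print(raid10_usable_gb(8, 32))  # 128.0
```

So the eight EBS volumes buy striping and mirroring at the cost of half the raw space.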
I started from the excellent writeup on Amazon EC2 Quickstart by Sandeep Parikh. He has done a wonderful job. I found a few things I had to change to fit our setup.
- Ubuntu 12.04 requires a few installs
- sudo apt-get install mdadm
- sudo apt-get install lvm2 xfsprogs
- Three useful commands to inspect the volumes, devices & partitions
- df -h
- fdisk -l
- cat /proc/partitions
- The devices are named xvdf, xvdg, … not sdf, sdg
- For example, sudo mdadm --verbose --create /dev/md0 --metadata 1.2 --level=10 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
- Added --metadata 1.2 for the mdadm command
- There were a few discussions (Ref: 1.F. Best Practices discussion in mongodb-user group) on whether we need --chunk 256 in the command. The conclusion is that while this might help other applications, it is not needed for mongodb
- sudo blockdev --setra 65536 /dev/md0 (not 128) <- change in command
- The conf file is /etc/mongodb.conf in Ubuntu 12.04 (not mongod.conf)
- The init.d file is /etc/init.d/mongodb (not mongod)
- The logpath & dbpath in /etc/mongodb.conf don’t take effect. They are overridden by the defaults in /etc/init.d/mongodb. This threw me off for some time; finally I made the changes in the /etc/init.d/mongodb file
- I think there is room for improvement. If I get time, I will refactor this file & make it simpler. Don’t want to jump-in without enough deep thought – I am sure the good folks at 10Gen have good reasons for the complexity
- sudo mkfs.xfs -f /dev/vg0/data to format xfs
- And appropriate command for fstab
- echo '/dev/vg0/data /data xfs defaults,auto,noatime,noexec 0 0' | sudo tee -a /etc/fstab
- The Quickstart volume scheme of 90% /data, 5% /journal and 5% /log might not work. I first set up 4 X 30 GB EBS volumes, which gave 3 GB for the journal. mongod then wouldn’t start, with the error:
- Fri May 4 03:36:57 [initandlisten] ERROR: Insufficient free space for journal files
- Fri May 4 03:36:57 [initandlisten] Please make at least 3379MB available in /data/journal or use --smallfiles
- Fri May 4 03:36:57 [initandlisten]
- Fri May 4 03:36:57 [initandlisten] exception in initAndListen: 15926 Insufficient free space for journals, terminating
- This also threw me a little bit.
- I think a single volume with three directories is better than three separate volumes, especially as they are in the same logical volume. Maybe the three separate volumes are meant to limit disk usage, but if the disk fills up in any one of them, the system would be down anyway. So I am not sure the three separate volumes buy us anything
- Need apt support for MongoDB dev releases
- 10Gen should add support for development releases in apt. I had to install 2.0.x, download 2.1.x and then copy it over the /usr/bin/ directory. It works for now, but I don’t think it is safe
- I will update this blog as I complete the data layer
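The journal sizing problem above is easy to check before building the volumes. Below is a minimal Python sketch of that arithmetic, assuming 1 GB = 1024 MB, a two-copy RAID10 array, and no file system overhead. The 3379 MB threshold is taken from the mongod error message quoted above; the helper names split_volumes_mb and journal_fits are hypothetical.

```python
REQUIRED_JOURNAL_MB = 3379  # taken from the mongod error message quoted above


def split_volumes_mb(usable_gb: float, shares: dict) -> dict:
    """Split usable space (GB) into per-volume sizes in MB, given percentage shares."""
    return {name: usable_gb * 1024 * pct / 100 for name, pct in shares.items()}


def journal_fits(journal_mb: float, required_mb: float = REQUIRED_JOURNAL_MB) -> bool:
    """True if the journal volume meets the minimum mongod asked for."""
    return journal_mb >= required_mb


# 4 X 30 GB in RAID10 leaves 60 GB usable (half the raw space holds mirrors).
usable_gb = 4 * 30 / 2
volumes = split_volumes_mb(usable_gb, {"data": 90, "journal": 5, "log": 5})

print(volumes["journal"])                # 3072.0
print(journal_fits(volumes["journal"]))  # False
```

With the 90/5/5 split, the journal volume on the 4 X 30 GB array comes to 3072 MB, which is short of the 3379 MB that mongod asked for. The same split on 8 X 32 GB (128 GB usable) passes the check.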
Finally, the references, which helped me understand the nuances & the details. I have annotated them and ordered them contextually to help one make informed infrastructure design & optimization decisions; start from the beginning & read through sequentially:
- 10Gen Resources
- EC2 Quickstart – The bible ;o) This is for Amazon Linux, so needs some changes for Ubuntu
- The best source and should be used as a base to build your MongoDB infrastructure.
- A good plan is to read the links, move on to the rest of the references and then come back to build your infrastructure.
- MongoDB on AWS – An excellent paper (in pdf format) by Miles Ward
- Overview & Topology
- MongoDB on EC2 overview
- MongoDB Best Practices – Good overview tips
- EBS Overview – must read to understand EBS
- RAID10 your EBS – good blog on why RAID10 for ebs; as aws provides redundancy, why do we mirror ebs volumes ?
- ServerFault Q&A on RAID10 & aws
- Getting good I/O from ebs – A few good points. Make sure to read the rest of this section before taking any drastic steps ;o)
- AWS ebs benchmarks – gives one an insight into RAID0, RAID10 et al
- Best Practices for MongoDB/RAID – discussion thread at mongodb-user Google group
- MongoDB memory usage & working set – discussion at mongodb-user Google group. Might help you with sizing the instance
- MongoDB, RAID10 & Ubuntu – includes detailed commands & explanations
- MongoDB, LVM, XFS,RAID10 – Another list of commands
- Installing MongoDB – Quick overview. Would be good to read the rest before jumping into installation
- RAID/LVM blogs
- Quick intro to LVM
- Managing RAID & LVM with Linux – good intro
- Linux RAID smackdown
- Managing RAID 10
- Grow XFS on LVM
- LVM commands Cheat sheet
- RAID Gory Details
- Complex RAID10 with mdadm
- LVM gory details
- More Gory Details for enquiring minds who want to know