Thursday, April 28, 2011

Google Summer of Code Acceptance

A few days ago, I was delighted, and somewhat relieved to find out that my proposal to the Google Summer of Code 2011 program was accepted. Since 2005, each summer, Google funds undergraduates, masters and PhD students from all over the world to develop open source software. To apply, one needs to find an open source organization of his/her interest, find a suitable problem, and submit a proposal detailing the project to Google.

You can see my proposal here.

Apache Software Foundation is a well known body in the field of open source. As a matter of fact, the Apache Web Server was one of the first adopters of the Open Source Software license in late 90's. Since then, ASF has spawned a number of sub projects, and Mahout is one of them.

Apache Mahout is a highly scalable machine learning library written in Java. Mahout's aim is to provide users with easy to use data mining and machine learning tools capable of handling extremely large data sets, with a commercially friendly license. Anyone can download and use the library for free and use it for business, academic research or educational purposes.

I have proposed to extend one of Mahout's existing programs, called the Baum-Welch Hidden Markov Model Trainer (BW) so that it can be simultaneously run on any number of computers connected to each other on a network. This is important because the program in its current state can only handle as much data as can be fit on one computer. The trainer performs best when it is give a lot of data. One of the ways to handle a large data set is to divide it into smaller pieces, distribute the splits to a bunch of machines and have each machine run the same program on its share of data.

The BW trainer has a lot of applications in places where we have to predict something by looking at data which may not present the complete picture of the underlying process. For example, handwriting recognition. Here, the challenge is that people may scribble "A" in almost infinite ways and one has to develop a program which can predict that the letter was indeed A, by accounting for all the variability in human handwriting. Similarly in speech recognition, the BW is used to predict what the speaker said which can be then be printed, emailed or stored. One can extend these simple applications to more exotic cases, for instance finding out the parts-of-speech of a sentence and using that information to translate from one language to another. BW is also extremely widely used to analyze DNA and protein sequences to discover genes.

One must feed the BW some initial data using which it "learns" the nature of the problem, and then one can apply the "learned" model to predict. The prediction accuracy of BW improves if it is given more data to train, hence my motivation to improve the existing Mahout's code which as of now can only handle limited data.

The challenge for me was to find out a way to "break" the sequential nature of the BW and show that it could be made to run in parallel. I was not too concerned with the programming aspect of the project which although being non-trivial, presents less of a problem because I code for a living each day at UMass. However, investigating whether a sequential algorithm invented in 1950's could be made to fit on a particular parallel computation scheme (Map Reduce) was much more challenging because of its fundamental, mathematical nature. I wanted to be absolutely sure that the algorithm could indeed be made parallel before proposing the project. Since I had started early, around mid February, I had some time to think and research about the problem more. It took me almost one and half months of brainstorming to figure it out. During this process, I read a lot of research papers on machine learning, probability, statistics and parallel computation. At the end of it all, I was pretty confident of the feasibility of the project and knew that the BW could be made to handle data in chunks and still arrive at the same result. The final proof came just a few days before I formally proposed the project when I discovered this excellent book which contained a technique similar to what I was thinking.

Soon after I submitted the project proposal, I was interviewed telephonically by Grant Ingersoll which went great. I will be mentored by Grant for the project who incidentally, is an alum of the Amherst College.

So excited, I'm all set for a Summer of Code!

About Me

Northampton, MA, United States
I'm a curious character who likes doing intellectually and physically demanding things.