I spent the summer working on the Azure Research Engagement project within the Cloud Computing Futures (CCF) team at Microsoft Research’s eXtreme Computing Group (XCG). My project was to design and build CloudClustering, a scalable clustering algorithm on the Windows Azure platform. CloudClustering is the first step in an effort by CCF to create an open-source toolkit of machine learning algorithms for the cloud. Within this context, my goal was to lay the foundation for our toolkit and to explore how suitable Azure is for data-intensive research.
Unfortunately, high school ends late and Berkeley starts early, so the internship was compressed into just seven weeks. In the first week, I designed the system from scratch, so I got to control its architecture and scope. I spent the next two weeks building the core clustering algorithm, and the three weeks after that implementing and benchmarking various optimizations, including multicore parallelism, data affinity, efficient blob concatenation, and dynamic scalability.
I presented my work to XCG in the last week, in a talk entitled "CloudClustering: Toward a scalable machine learning toolkit for Windows Azure." On my last day, it was very gratifying to receive a request from the Azure product group to give this talk at a training session for enterprise customers :) Here are the slides in PowerPoint and PDF, and here’s the video of the talk, split into six parts:
- Introduction by Roger Barga, my manager - http://www.youtube.com/watch?v=Sy6MyB_w0fs
- General introduction - http://www.youtube.com/watch?v=djkiyhG0e4A
- Technical introduction - http://www.youtube.com/watch?v=N9BsoXze61Y
- Algorithm and implementation - http://www.youtube.com/watch?v=MpAGwyFQqHw
- Optimizations (Part 1) - http://www.youtube.com/watch?v=bU43KnbCfxs
- Optimizations (Part 2) and Results - http://www.youtube.com/watch?v=vxucDtIpttI