"How to write a great research paper"

I stumbled across some helpful slides in a comment on FemaleScienceProfessor's blog: "How to write a great research paper" (by Simon Peyton Jones, a researcher at MSR!)

Looking back at my summer with Microsoft

August 13th was my last day as a Microsoft intern. Ever since then, I’ve been missing working with great people, reading lots of interesting papers, and contributing to a larger effort in the best way I know -- by writing code :)

I spent the summer working on the Azure Research Engagement project within the Cloud Computing Futures (CCF) team at Microsoft Research’s eXtreme Computing Group (XCG). My project was to design and build CloudClustering, a scalable clustering algorithm on the Windows Azure platform. CloudClustering is the first step in an effort by CCF to create an open source toolkit of machine learning algorithms for the cloud. My goal within this context was to lay the foundation for our toolkit and to explore how suitable Azure is for data-intensive research.

Unfortunately, high school ends late and Berkeley starts early, so the internship was compressed into just seven weeks. In the first week, I designed the system from scratch, so I got to control its architecture and scope. I spent the next two weeks building the core clustering algorithm, and the three weeks after that implementing and benchmarking various optimizations, including multicore parallelism, data affinity, efficient blob concatenation, and dynamic scalability.

I presented my work to XCG in the last week, in a talk entitled "CloudClustering: Toward a scalable machine learning toolkit for Windows Azure." Here are the slides in PowerPoint and PDF, and here’s the video of the talk. On my last day, it was very gratifying to receive a request from the Azure product group to give this talk at a training session for enterprise customers :)

These seven weeks were some of the best I've ever had -- and for that I especially want to thank my mentors, Roger Barga and Wei Lu. I'd love to come back and work with them again next year! :)


First week as a Microsoft Research intern

My first week as a Microsoft Research intern has been a lot of fun! Here are a few highlights:

MSR Intern Technology Connections: I attended a fascinating series of talks by the team leaders of Microsoft's various dev tools on Tuesday morning. Some of the best ones:
  • A behind-the-scenes look at how LINQ works in C# by Eric Lippert.
  • A demo of some of Visual Studio 2010 Ultimate's cool features by Justin Marks. (It costs $11,899 :O)
    • IntelliTrace, a way to step backwards through a program's execution history
    • Architecture Explorer, a neat visualization of program flow and dependencies.
MSR lectures: Interns can sign up for a stream of invitations to Microsoft Research lunches with notable researchers. In the next two weeks, I'm going to attend "brown-bag" lunches with Dan Reed (the leader of the eXtreme Computing Group) and Leslie Lamport (the creator of LaTeX). Super cool! :)

What I'll be working on: Building a dynamically scalable, fault-tolerant distributed k-means algorithm on Windows Azure.

The environment: I'm the only high school intern in XCG, and they don't generally take college interns, so I'm surrounded by PhD interns. It's a great learning opportunity :)


    Resources for getting started with Windows Azure

    My internship at Microsoft Research's Cloud Computing Futures Group is starting next Monday, and I'm trying to get ramped up on high-performance computing with Windows Azure as quickly as possible so I can start developing real code sooner. Here are two of my favorite resources so far:
    • "Windows Azure for Research," a presentation from the same group that I'll be working with over the summer. This is a concise summary of Azure's features and possibilities -- and a great way to get excited about the platform!

    • Programming Windows Azure, a new book from O'Reilly -- by a member of the Azure product group. This is a well-organized and up-to-date guide, and the author's enthusiasm for the subject comes through :)
      (Unfortunately, some of the code samples are poorly formatted in terms of indentation and variable naming. Still readable enough, though.)


    Interning at Microsoft Research over the summer

    A few months ago, I decided to apply to Microsoft as a summer intern. I recently heard back from them, and I'm looking forward to joining Microsoft Research's Cloud Computing Futures Group.

    I'll be working on the "Client + Cloud" effort. Currently, researchers need access to their own clusters to do heavy data processing. It would be more efficient to do number crunching in the cloud, where resources can scale along with researchers' needs. But many research algorithms require very low inter-node latencies, and clouds built of commodity hardware can't guarantee that. Over the summer, I'll be adapting these kinds of algorithms to work with the cloud's relatively high inter-node latencies, specifically using Windows Azure.

    In many ways, this is my ideal internship. It provides a nice start in the research field, with the potential for a paper in a year or two. It's in an area of Microsoft that's on the leading edge -- as Steve Ballmer stated, cloud computing is Microsoft's future. And the Cloud Computing Futures Group has strong ties with UC Berkeley, so I'll be able to collaborate even beyond this summer.


    More LaTeX document classes: resume and cover-letter

    I was making a resume and cover letter to apply to some internships recently, and I was trying to use res.cls and letter.cls to make them. But whenever I wanted to tweak something, the complexity and TeX-ness (as opposed to LaTeX-ness) of these standard document classes made things more difficult than I liked.

    So, since what I wanted was fairly simple, I decided to reinvent the wheel with the resume and cover-letter classes. Here's the source, screenshot and usage for each.



    \name{Ankur Dave}
    \addressone{1234 Abc Road}
    \addresstwo{San Jose, CA 95101}




    \employer{Berkman Center for Internet and Society at Harvard University}
    \location{Cambridge, MA}
    \jobtitle{Summer Intern}
    \dates{July---August 2009}
    Position description.


    \schoolname{Interlake High School}
    \dates{September 2006---June 2010}




    Your address and contact info

    Recipient's address




    Cover letter body



    Getting started at the Berkman Center

    It's been a week since I arrived in Cambridge for my internship with Harvard's Berkman Center for Internet and Society. It's been fun and interesting living on my own and working at the Berkman Center.

    I'm working in the HerdictWeb team, which runs the Herdict web site. HerdictWeb's goal is to detect Internet censorship around the world through large volumes of reports by a community of volunteers in different countries. It's an effort to use the power of the crowd (also known as "crowdsourcing") to create transparency in governments around the world. The Herdict project is run by Prof. Jonathan Zittrain.

    My first task in HerdictWeb is to create an alert system similar to Google Alerts that allows users to sign up for email alerts based on changes in Herdict censorship report data. The alerts would have three parameters:
    • Content (a country or a site) -- a reporter might want updates on reports in Iran, and a site owner might want updates on reports regarding their site.
    • Threshold (a number of reports or a percentage change) -- the level that reports must reach before the alert is triggered. For example, an alert might fire only when there are at least 20 reports from Iran in a day, or when the number of reports in any country increases by 20% over the previous day.
    • Frequency (a time period) -- a cap that prevents alerts from being sent more frequently than specified. For example, an alert may be triggered at most once per day.
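
    As a rough formalization of the threshold and frequency parameters above, the trigger conditions could be written as follows. The symbols are illustrative notation only, not names from the HerdictWeb codebase, and the display assumes the amsmath package for \text.

    ```latex
    % Illustrative notation (assumed, not from the HerdictWeb codebase):
    %   r_c(d)  : number of reports for content c (a country or a site) on day d
    %   N       : absolute threshold (e.g. 20 reports)
    %   p       : relative threshold (e.g. 0.20 for a 20% increase)
    %   T       : minimum number of days between alerts (e.g. 1)
    %   d_last  : day on which this alert was last sent
    \[
      \text{absolute rule: } r_c(d) \ge N
      \qquad
      \text{relative rule: } \frac{r_c(d) - r_c(d-1)}{r_c(d-1)} \ge p
      \quad \bigl(r_c(d-1) > 0\bigr)
    \]
    \[
      \text{frequency cap: send only if } d - d_{\text{last}} \ge T
    \]
    % Worked example with p = 0.20: 50 reports yesterday and 60 today give
    % (60 - 50)/50 = 0.20, so the relative rule fires; 59 today gives 0.18,
    % so it does not.
    ```

    The relative rule is undefined when the previous day had no reports, so a real implementation would need to decide how to treat that case.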

    Normally this would be at most a week-long project, but it's complicated by the fact that I'm not yet familiar with the HerdictWeb codebase, and neither is anyone else on the team (two others), since they're new to the project as well. In addition, the previous developers seem not to have checked a full build configuration into revision control, so even after I got the project to build, a lot of the output was missing. As a result, I've spent the past week setting up the build environment and reading different parts of the project's code.

    More generally, life so far in Cambridge has been pleasant. I'm living in an apartment that I'm renting from my cousin, and my two roommates are nice (if a little messy). I've been cooking my own meals based on instructions my mom wrote for me, which has turned out to be surprisingly easy and tasty. There's a Whole Foods two blocks away and a Trader Joe's not much further, so buying groceries is very easy.

    My uncle, who lives in Syracuse, NY, lent me his bike while I'm here, and that's been really helpful. Even though everything I need to do day-to-day is within a two-mile radius of my apartment, having a bike makes it possible to go back and forth between the Berkman Center's two offices without thinking twice. There are plenty of bike lanes around Cambridge, but most streets are one-way, complicating my routes a little.

    So overall, Cambridge is better than I expected, and independence is a nice feeling!


    Creating your own LaTeX document class

    I've been using LaTeX for a few years, and every time I make a new document, I always start by copying a similar document I've made in the past. So over time, the preamble to each document—the header, where formatting and package includes go—has kept getting bigger. And since I frequently create new commands to simplify things while I'm writing a document, those document-specific commands end up in unrelated documents.

    For example, I wrote my IB Extended Essay last year, and the formatting I used (1-inch margins, ruled headers and footers with my IB candidate number, 3-point paragraph skips, etc.) became my standard formatting for all IB documents. So whenever I was starting a new IB assignment, I copied my Extended Essay, often neglecting to delete the unnecessary document-specific commands like the following:
    \newcommand{\coderef}[2]{\ref{#1}, page \pageref{#1:#2}, line \ref{#1:#2}}
    So I finally decided to package the common document types I use into custom LaTeX classes, just like the built-in article.cls class. So far, I have two classes, interlake-assignment.cls and ib-assignment.cls. Both of those let you define several fields like candidatenum and wordcount that then get printed out into an appropriately formatted title section.

    That makes it possible for me to start a fully-formatted new assignment with just a few lines:
    \title{Modeling the Course of a Viral Illness and Its Treatment}
    \subtitle{IB Math HL Type II Portfolio}
    \author{Ankur Dave}
    \updateheaders % a bit of a kludge to get the title working properly

    ... % document-specific packages and macros


    ... % document body

    Here's the source for the two classes, as well as a screenshot of each one being used:
    (Update 2009-04-20: I fixed the links; thanks to Lincoln Berkley from New Zealand for pointing out that they were broken.)
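
    For readers who just want the general shape, here is a minimal sketch of what a class along these lines might look like. It is not the actual source of ib-assignment.cls: the class name ib-assignment-sketch, the header layout, and the title block are assumptions made for illustration, and only the field names \subtitle, \candidatenum, and \wordcount and the formatting choices (1-inch margins, ruled headers and footers, 3-point paragraph skips) come from the description above.

    ```latex
    % ib-assignment-sketch.cls -- hypothetical name; a sketch, not the real class.
    \NeedsTeXFormat{LaTeX2e}
    \ProvidesClass{ib-assignment-sketch}[2009/04/20 Sketch of an assignment class]

    % Build on the standard article class, forwarding any class options to it.
    \DeclareOption*{\PassOptionsToClass{\CurrentOption}{article}}
    \ProcessOptions\relax
    \LoadClass{article}

    % Shared formatting: 1-inch margins, with room for the header line.
    \RequirePackage[margin=1in,headheight=14pt]{geometry}
    \RequirePackage{fancyhdr}

    % 3-point paragraph skips instead of paragraph indentation.
    \setlength{\parskip}{3pt}
    \setlength{\parindent}{0pt}

    % User-settable fields: each setter stores its argument in an internal macro.
    \newcommand{\subtitle}[1]{\def\@subtitle{#1}}
    \newcommand{\candidatenum}[1]{\def\@candidatenum{#1}}
    \newcommand{\wordcount}[1]{\def\@wordcount{#1}}
    \subtitle{}      % defaults, so unset fields are simply empty
    \candidatenum{}
    \wordcount{}

    % Ruled header and footer carrying the author and candidate number.
    \pagestyle{fancy}
    \fancyhf{}                              % clear the default header/footer
    \fancyhead[L]{\@author}
    \fancyhead[R]{\@candidatenum}
    \fancyfoot[C]{\thepage}
    \renewcommand{\headrulewidth}{0.4pt}
    \renewcommand{\footrulewidth}{0.4pt}

    % Title section that prints whichever fields were set.
    \renewcommand{\maketitle}{%
      \begin{center}
        {\Large\bfseries \@title\par}
        \ifx\@subtitle\@empty\else {\large \@subtitle\par}\fi
        \@author
        \ifx\@wordcount\@empty\else \par Word count: \@wordcount\fi
      \end{center}%
    }
    ```

    Saved next to a document as ib-assignment-sketch.cls, this would be loaded with \documentclass{ib-assignment-sketch}, with the fields set in the preamble and \maketitle called after \begin{document}. Keeping the setters in the class means document-specific macros stay in the document that needs them.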


    Microsoft Seattle regional FIRST robotics competition

    I just got back from the competition. Our team, Saints Robotics, didn't do so well, mainly because various components of our robot failed throughout the competition. We ended up seeding in 30th place out of 31 teams, and though the top alliance in the finals chose us to be part of their alliance (probably out of pity or impulse), our robot failed again during one of the quarter-final rounds, so we were out of the finals.

    So we messed up in this year's competition. It's not that big of a deal, but what can we do next year to avoid it happening again? First let's see what went wrong.

    The most important contributing factor to our failure was that we introduced unnecessary complexity. For our drivetrain, rather than using the known-good tank drive system (the wheels on each side run together), we tried to build a car-like drive system (the rear wheels are powered and the front wheels steer). We chose this much more complicated system in the hopes of getting more maneuverability, but as far as I know we didn't have any hard evidence that the maneuverability justified the extra complexity. So next year we should either just use tank drive, or prototype a different system well in advance so we can really understand whether the new system is worth it.

    Another problem was robustness. During most of the competition rounds, when our robot was jarred a little too hard, something would go wrong. The problems included a short in the electronics, the battery coming loose, and something bad, the details of which I never figured out, happening to the wheels. This is a hard problem to avoid, but an important countermeasure we can take is stress testing. We need to stop being so protective of the robot, getting scared if it so much as runs into a wall, and instead kick it around a bit. Run it into bricks, kick it, drop stuff on it, and so on. Also, we need to design it to be robust. The main thing we can do to that end is make sure everything is organized and planned out. One important example is the wiring. We should have a map of the electronics layout done ahead of time, with places for wires, so nothing comes loose.

    Our software suffered from the same kind of unnecessary complexity as our hardware. Because we had so many sensors, most of which weren't worth the complexity they introduced, we had to add lots of sensor code all over the place. Our code also wasn't version controlled. This was a major problem that arose mainly because the mentor who knew how to program the sensors didn't want to use version control, instead littering the code with #ifdef's (for building for FRC or Vex) and #if 0's (for disabling code). He also didn't communicate well about what changes he was making to the code, so we were constantly confused about whether problems were caused by the software or the hardware. Next year, we will start from scratch, and I will make sure everyone on the development team understands what every part of the code does so that anyone can fix problems and no one treats the code like a magical black box. We will also only attach sensors if we are certain that their benefits will outweigh their complexity and added risk of failure. We'll keep the code under version control, and make sure to have different branches that each work around a missing sensor so we're always prepared for failures.

    So we did pretty badly this year. But I'm still glad that next year at least we'll have a big list of things not to do :)

    Update 2008-03-22 20:23: Well let me correct myself—we weren't second to last but rather second because the alliance that chose us finished second. Of course, we only got second because our robot died and got swapped out with a more useful one. Still, it's certainly a good feeling to be second :D


    Upgrading TiVo's hard drive the second time

    We've had a TiVo for three or four years. It started out as an 80-hour Series 2 model, but two years ago I replaced its 80GB hard drive with a 300GB one using the Hinsdale guide. That was plenty of space until a few months ago, when we ran into the space limit again.

    So we bought a 750GB drive. I tried using the same guide to upgrade from the old drive to the 750GB one, but the MFS Tools CD wouldn't even boot this time. I found a newer version, MFS Live, and that at least booted. But when I tried to run the command to copy the contents of the old drive onto the new one, it gave the error "Backup target not large enough." I decided just to use Linux's dd command to do a bit-by-bit copy of the old drive onto the new one, and from there I planned to expand the copied image using the mfsadd tool provided.

    I left the disk copy running overnight, and the next morning I tried to expand the image. That didn't work (the command said the image was already expanded), and after looking around I found out that mfsadd can't expand an already-expanded image. Finally, though, I found a tool that advertises that it can: WinMFS. Unfortunately, this requires Windows, but luckily I had an install of Vista lying around on my computer. So I used the MFSSuperSize and MFSadd tools in WinMFS, and that worked, resizing the image to the full 750GB.

    So now we have a TiVo with around 1000 hours of recording space. That should last a few more years :)