NICAR 2010 talk: Good habits

This is a script for a talk I’ll be delivering shortly, with Jacob Fenton’s assistance, at NICAR 2010 in Phoenix. Readers may find it similar to, though more complete than, my ONA talk, a few posts back. Consider this version better.

For more frequent updates on what I’m up to, visit the News Apps Blog.

UPDATE: The smiley face next to my little Rails joke wasn’t strong enough, added a bit, plus a link.

We’re here to talk about some boring stuff. Get-more-fiber-in-your-diet kind of stuff. It’s titled “Development Techniques” on the schedule, but it might be better to call this talk “Best Practices in Software Engineering”, or “Good Habits When Making Software”, or “Ass-saving Shit That Some Other Smart People Figured Out, Because Your Problems Aren’t New.”

My favorite metaphor for explaining programming to non-coders is that it’s like carpentry. You can put together a chest of drawers with nails and glue, and it’ll fall apart in a year, or you can build something lasting and use dovetail joints. We’re not plumbers providing a utility, but neither are we artists. It’s nice if our work is beautiful, but it also must be durable. We’re craftsmen. We make things that people use.

The point of all this is that craftsmanship matters. So, I’m here to ask you to change your ways, to consider adopting some processes, not because they’re fun, but because they’ll save your ass and help you do better work. And once you’re in the habit of writing tests and deployment scripts, of tracking your defects and versioning your code, you’ll wonder how you ever went without.

So, we’re trying something new today. I’m gonna run through these concepts fairly quickly, and in-between, Jacob will reflect on his work adopting many of these practices. It shouldn’t take very long, and at the end we’ll take questions.

Version Control

Version control software is both a safety net and a collaboration tool. It’s a place, usually away from your machine, where you store your code. And when you write new code, it hangs on to your previous versions. Even on a one-person project, version control is essential. When your hard drive crashes, you don’t lose your work. And, when you’re working with others on a common codebase, it acts as a central repository to help coordinate everyone’s changes.

We use Git. Other folks like Mercurial. Subversion would also be a fine choice, though it’s no longer the cool kids’ favorite.

Task Tracking

It may sound bossy, but task tracking is not about micromanagement, or at least it doesn’t have to be. In my experience, on any project, you’ll only really know how deep in the weeds you are if you can see all the tasks, listed out. Also, I find that forgetting to do something is extremely embarrassing. You can track tasks in a text file or in a spreadsheet on your desktop, but I’ve found that teams work better if the TODO list is out in the open. So, go low-tech and use 3×5 cards pinned to the wall — or go high-tech and use one of many software packages designed for the purpose.

We use Unfuddle. Trac is also a fine choice. If you’re using GitHub for hosted Git version control, it comes with issue tracking, but I haven’t heard many people express their love for it. That said, it might be worth a shot.

Defect Tracking

When you find a defect, log it. Take a screenshot, and type up sufficient details to reproduce the problem. This may seem heavy-handed, but defects are your unplanned tasks — they must always be addressed, either by fixing them or by explicitly choosing to let them slide. Known defects are totally okay. Unknown defects, on the other hand, are the devil. So, always, always, please record your defects, even if you’re going to fix them immediately. One of these days, you *will* get distracted half-way through a fix. And you *will* forget. Unlike tasks, I’d say always take the high-tech route with defects. They’re best tracked with software.

We use the same system, Unfuddle, to track both our tasks and our defects — most tracking tools handle both, and that’s usually how it’s done. Another catchall option that might work for you is FogBugz.

Staging Environment

Similar to defect tracking, your staging environment is there to reduce uncertainty. It’s an environment — servers, databases, applications, everything — that you run in parallel to production, and it should be identical to your production system. (If you’re using Amazon EC2, this is pretty much as simple as copying your production instance!) Your goal is this: knowing that, if your application works in staging, it will work in production. You can execute load tests and performance tests against your staging environment, as well as test your deployment scripts, and, as a bonus, it can host your work for demos, etc.

We use Amazon EC2 for our hosting, and keep carbon-copy instances running in staging and production at all times. We’ve written about how to set up your own EC2 environment on our team blog.
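To give you a sense of how little code that takes, here’s a sketch using boto, the Python library for Amazon’s web services. The AMI ID, key pair, and instance type below are placeholders — substitute whatever your production box was built from:

    import boto

    # boto picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    # from your environment
    ec2 = boto.connect_ec2()

    # launch a staging box from the same image as production
    reservation = ec2.run_instances(
        'ami-12345678',
        instance_type='m1.small',
        key_name='newsapps',
    )
    print 'staging instance:', reservation.instances[0].id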

Load Testing

The Tribune news apps team learned an important lesson in February, when Illinois voters went to the polls for the primaries and our Election Center app was put to the test. We had thought our production setup was great. The harder we abused it, the more load we threw at it in our tests, the better it performed. “Great!”, we thought, “This system is gonna work awesome.” Well, you can probably guess where I’m going with this.

We crashed and burned on election day. The Election Center was useless. (For the server nerds in the audience — our top was pegged well over 100.) Luckily, a few Google searches gave us a way to route around the bottleneck (using the awesome pgpool), and we were back up and running after only a half hour or so. The lesson we learned was this: A good test must fail. You need to know your breaking point. Make the servers effing cry. Because they *will* cry. And if you don’t know your limits, you’re asking for trouble. We got very lucky. There was a readily-googleable, turnkey fix for our problem. We might not be so lucky next time.

We use ab to make our servers cry.
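If you haven’t met it, ab is the ApacheBench tool that ships with the Apache web server. An invocation like this one — the URL is made up, of course — throws ten thousand requests at your staging box, two hundred at a time; keep raising -n and -c until something breaks:

    ab -n 10000 -c 200 http://staging.example.com/election-center/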

Push-button Deployment

When everything is running smoothly, a multi-step deployment process (gather the code, FTP it all to the server, restart apache, etc.) doesn’t seem like so much of a hassle. But when the shit hits the fan, your editor is breathing down your neck, and you’ve gotta fix that bug, fast — let’s say, on an important election day — you’ll screw up. You’ll forget something, and your minor bug will become a nightmare. Everything will break, and you’ll be even more freaked out.

Push-button deployment won’t fix your bugs, but it will help you keep your cool. It will also save you from the tedium of redeployment, and act as a guide when you need to redeploy your project months or years down the line. If you’re running an identical staging environment, you’re even better off, because you can develop your deployment script for staging, use it a few dozen times, and then, when it’s time to roll to production, you know it’ll work.

You can write deployment scripts on your own, but there are lots of great tools out there built to make deployment dead-easy. We use Fabric, and have written about our scripts in great detail. If you’re into Ruby, I’m pretty sure that Capistrano is the current state of the art.
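For the curious, here’s a bare-bones sketch of a fabfile. The host names, path, and restart command are stand-ins, not our actual setup:

    # fabfile.py
    from fabric.api import env, run, sudo, cd

    env.user = 'deploy'

    def staging():
        env.hosts = ['staging.example.com']

    def production():
        env.hosts = ['www.example.com']

    def deploy():
        # grab the latest code, then bounce the web server
        with cd('/srv/myapp'):
            run('git pull origin master')
        sudo('/etc/init.d/apache2 restart')

With that in place, ‘fab staging deploy’ rolls your code to staging, and once you trust the script, ‘fab production deploy’ does the exact same dance on the live boxes.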

Web Frameworks and Agility

Making websites used to be slow work. Web frameworks make you fast. If you’re fast, you can, obviously, turn around projects in a more timely fashion. But the less obvious advantage of high-speed development tools is that they enable you to fail fast. What I mean is, it used to be that you’d have to write code for a month before you had anything you could show off. Using frameworks, you can create something interesting very quickly, in days or hours, and the faster you create, the faster you can be critiqued. We never go more than a day or two between show-and-tell sessions with reporters, and when we’re working on a long-running project, we hold reviews with our stakeholders every Friday afternoon. Frameworks enable us to learn from our mistakes and correct course very quickly. They enable us to be agile.

We use Django, a web framework with deep roots in the news industry. There are people here who will tell you to instead use Ruby on Rails. They are not to be trusted. I kiiid. Check out Aron Pilhofer’s post, How Not to Choose a Web Framework.

Testing

Automated tests kick ass. It’s not immediately obvious, but ‘testing’ is about more than merely ensuring correctness. Tests can help you write code faster, and they can save you six months down the road when you’ve half-forgotten about your project. But before they can save you, you’ve gotta write ’em.

The tests I most commonly write are called ‘unit tests’. A unit test is a bit of code that checks if another bit of code you’ve written works properly. For example, let’s say you’re writing a web application that calculates people’s income tax obligations. There are a lot of special cases that depend on how much money you make, whether you’re paying a mortgage, etc. To test your calculations, you could visit the web page you wrote, over and over again, typing in each special case you can think of. If you’re especially thorough, you might even keep a spreadsheet to check off correct numbers. This would be rigorous, but insane. Instead, you should write unit tests — code that exercises each special case automatically, by testing your calculations directly. First, you won’t waste countless hours reloading a web page, and second, when the laws change six months later and you’ve gotta fix your code, you can test all the permutations again at a keystroke.
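Here’s roughly what those unit tests might look like in Python. The calculate_tax function, its arguments, and the expected figures are all invented for illustration:

    import unittest
    from taxes import calculate_tax  # the hypothetical calculator under test

    class TaxCalculationTest(unittest.TestCase):
        def test_typical_filer(self):
            # the expected value is made up -- in real life, work it out by hand
            self.assertEqual(calculate_tax(income=50000), 8694)

        def test_mortgage_lowers_the_bill(self):
            with_mortgage = calculate_tax(income=50000, mortgage_interest=10000)
            self.assertTrue(with_mortgage < calculate_tax(income=50000))

        def test_no_income_means_no_tax(self):
            self.assertEqual(calculate_tax(income=0), 0)

    if __name__ == '__main__':
        unittest.main()

Each special case becomes a test method, and the whole suite runs at a keystroke.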

Most web frameworks include a rig for easily testing your work.

Further Reading

I’ll keep the book list short. Pick these two up. Know them. Love them.

All the fun stuff we’ve been up to at the Trib

The blog has been a bit quiet lately (to the disappointment of very few, I’m sure) — but we’ve been releasing apps and blogging furiously over at our team site. Here’s a roundup of our recent posts:

Tools we love to use

Development techniques and best practices we’ve discovered

Sharing our infrastructure

For links to our recent projects, and to keep up on our work, visit apps.chicagotribune.com!

Got a job

Next week, my internship at ProPublica will end. The chance to work here was an extraordinarily lucky break, and I can say without reservation that this is the best job I’ve ever had. Never before have I worked with so many brilliant, interesting, and damn nice people.

I love living in New York, and am disappointed to be leaving so soon. The Grand Army Plaza green market just turned from great to brilliant, and I only had my first, proper NYC pastrami on rye this week.

So it’s somewhat bittersweet to announce that in a couple of weeks, I’ll be leaving NYC and returning to my adopted hometown, sunny Chicago, Illinois.

The World’s Greatest Newspaper

In June I’ll start my first full-time journalism gig, as the News Applications Editor at the Chicago Tribune. The team I’ll be leading will be a new one, composed of programmers and investigative journalists, and we’ll be building news applications in conjunction with the Trib’s fantastic investigative team.

Specifically what we’ll make, I don’t know, but I anticipate building a wide variety of data-driven web applications to visualize data and present investigative stories online. (If only the PolitiFact crew hadn’t set the bar so high…)

For the nerds in the audience

What I do know is that we’ll be using Python, Django and lots of other open-source tools. Chicago has quietly become a very important place in the open-source world — the Second City is home to both Django and Ruby on Rails, the two hottest web frameworks — and I’m committed to making the Chicago Tribune a contributing member of the community.

If you haven’t figured it out yet — I’m geeked. This’ll be fun.

So, adios, City That Never Sleeps. The City That Works is calling me home.

From concept to sketch to software: Building a new way to visualize votes… mmm, environminty!

Ryan Mark and I built enviroVOTE to help people visualize the environmental impact of the 2008 elections. We designed it in two evenings and made it real in a three-and-a-half-day long bender of data crunching and code.

This is the story of that time.

Sketch of enviroVOTE + coffee = enviroVOTE is real, live software

Sunday evening, 26 October: the concept

The idea struck us when Ryan and I discovered we had a common problem: homework. Ryan was on the hook to produce a story about the environment for News 21‘s election night coverage, and I needed to build an example presenting news data in some interesting way using charts and graphs. So we decided to combine our efforts and make something that would visualize environmental information about the election.

We searched for data to present, and found that it came in many shapes: a candidate’s track record of support on environmental issues, statistics on national parks, nuclear power, and everything in-between. But the most compelling data set we found was not stats- or issue-based: endorsements made by environmental groups.

Statistics were cut because they’re only peripherally related to the races being run. It’s not particularly interesting to say something like “in states with more than five hydroelectric power sources, the Democratic candidate prevailed 18% of the time.”

Only sportscasters can get away with that crap.

Why not issues, then? They’re hard to quantify. Candidate websites are frequently slippery, ambiguous things, and we found that few politicians responded to efforts to make their positions crystal clear, like Project Vote Smart’s Political Courage Test and Candid Answers’ Voters Guide to the Environment. The best data we could find were candidates’ voting records, but without understanding the nuance of each piece of legislation, it’s nearly impossible to determine if a vote was for or against the goodness of the earth. (Also, only incumbents have voting records.)

An endorsement is a true-false, unambiguous, easy-to-count thing. Environmental groups like the Sierra Club and the League of Conservation Voters publish their support for candidates online. Even better, the aforementioned Project Vote Smart — a volunteer group dedicated to strengthening our democracy through access to information — aggregates endorsements and makes them readily available for current and historic races. And it offers them via an API, so others can mash up the data, just like we were itching to do.

Wednesday evening, 29 October: the design

A second fury of inspiration led to the design of the site. Marcel Pacatte, my instructor and head of Medill’s Chicago newsroom, was our source of journalistic wisdom. He and I identified our audience and discussed the angles and presentation methods that would best serve them. Obvious ideas like red/blue states and a map of the nation’s greenness were tossed — maps aren’t all that good at showing off numbers. (Notable exceptions include cartograms and the famous diagram of Napoleon’s march to Moscow, neither of which seemed sensible metaphors to adopt.)

Working out the enviroVOTE concepts on a whiteboard
Scope creep, be damned!

We decided not to make a voter’s guide, since there was little time before the election for folks to find the site, and to instead make something that would be interesting on the day of the elections and useful in the days following. So we looked for numbers to support that mission.

Counting environmentally-friendly victories would be both timely on election night and purposeful later. We could calculate a win for the earth by counting endorsements: if the winning candidate had more endorsements, it was a green race. This was easy to aggregate nationally as well as by state.

And by running the same numbers on the previous races (two years ago for the House, six for the Senate, etc.) we could calculate the change in the environmental-friendliness of the nation’s elected officials, a figure that became known as “environmintiness.”
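The arithmetic behind that figure is nothing fancy. Here’s a toy version — the data structures are invented, but it’s the gist of what enviroVOTE computes:

    def is_green(race):
        # a race is green if the winner out-endorsed every other candidate
        winner = max(race['candidates'], key=lambda c: c['votes'])
        losers = [c for c in race['candidates'] if c is not winner]
        return all(winner['endorsements'] > c['endorsements'] for c in losers)

    def green_share(races):
        # the fraction of decided races that went green
        return sum(1 for race in races if is_green(race)) / float(len(races))

    def environmintiness(current_races, previous_races):
        # change, in percentage points, in green victories since last cycle
        return 100 * (green_share(current_races) - green_share(previous_races))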

In addition, some races potentially held more impact for the environment than others — because of their location or the candidates running — so we decided it was necessary to highlight these key races alongside the numbers.

The sketch that served as the primary design document for enviroVOTE

In a whirlwind sketch-a-thon, the design for the site flew together. We would show off the two big numbers in the simplest possible way. No maps, pies or (praise the lord!) Flash necessary. They’re both just percentages. To set off one from the other, we decided on a percentage for the percent change, and a one-bar chart for the victory counts, in aggregate and for individual states.

Users would be interested in seeing results from their home state, so we made the states our primary navigation, and listed them, along with their bar chart, down the left side of the page. (We explicitly decided to not use a map for navigation, like most sites do. If I lived in Rhode Island, I’d effing hate those sites.)

Putting the big numbers front and center and listing the incoming race results down the right gave users an up-to-the-minute snapshot of the evening. The writeups about key races, though important, were our least timely information, so we made them big and bold, but placed them mostly below the fold.

We produced a simple design, just three pages — home, a state and a race — each presenting more detail as you drilled down.

Saturday and Sunday morning, 1-2 November: the development

Development began Saturday morning. We decided to build the site on Django, the free and open source web development framework that we were concurrently using to build News Mixer, the big final project of our master’s degree program. (If you’re interested in our reasons why, and how it all works, check out my post that explains the same stuff re: News Mixer.)

We brainstormed names for our new baby, and immediately checked to see if the URLs were available. envirovote.us was the first one we really liked, so we bought it and started running. Ryan designed a logo and whipped up a color scheme, and thus a brand was born.

Improvising the details, we built the site very close to the way it was designed. (The initial sketches were mine, but Ryan gets the props for making it look so damn sexy.) Coding the site took about a day and a half, minus time for Ryan to go home and sleep, and for me to cook soup.

We used the awesome, free tools at Google Code to list tasks and ideas, manage our source code, and track defects. The simple concept and excellent tools helped make this a relatively issue-free development cycle. Django, FTW!

Sunday afternoon and Monday, 2-3 November: the gathering, massaging, and jamming in of data

Pretty much finished with the code, minus subsequent bug fixes and tweaks, we started on the data.

Ryan used the Project Vote Smart API to gather information on current and historical races: the states, districts, and candidates that form the backbone of our system. He wrote Python scripts to repeatedly call the API, munge the response, and aggregate all of the races, candidates, wacky political parties, and the rest into files we could then pump into the database.
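His scripts looked more or less like this sketch — the method name, parameters, and response handling are illustrative here, so check the Vote Smart API docs for the real interface:

    import json
    import urllib2

    API_KEY = 'your-key-here'
    BASE_URL = 'http://api.votesmart.org/'

    def call_api(method, **params):
        # every request carries your key and asks for JSON output
        params.update(key=API_KEY, o='JSON')
        query = '&'.join('%s=%s' % pair for pair in params.items())
        return json.loads(urllib2.urlopen(BASE_URL + method + '?' + query).read())

    for state in ('IL', 'IN', 'WI'):  # ...and so on, through the rest
        data = call_api('Election.getElectionByYearState',
                        year=2008, stateId=state)
        print state, data  # the real scripts munged this into loader files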

I attacked from the other side and scoured environmental groups’ websites, as well as the endorsement data provided by Project Vote Smart, to collect the endorsements we use to calculate the big numbers.

Once all the data was collected into text files, we wrote more scripts to read those files, scrub the data of inconsistencies, poor spelling, and other weirdness, and finally fill the database.

All of this took a day and a half, far longer than we had hoped, and as much time as was necessary to build the website. I did not cook soup. We ordered in.

enviroVOTE is real, live software
Coffee, nerd sweat… smells like software. Yet, curiously minty-fresh.

Tuesday, 4 November

After attending class all day in Evanston, Ryan and I headed downtown for an evening of data input and cursing at screens.

Julia Dilday and Alexander Reed watched the AP wire all night, tracking races and gathering results and entering them into the system. I cannot express how much more difficult this was than we anticipated. Julia and Alex: thank you thank you thank you thank you.

Ryan kept the system humming through the night. He tamed the beast: keeping the site online, fixing bugs, and updating the administrative interface in an effort to improve the poor working conditions of Julia and Alex.

I ran the public relations effort: taking interviews, helping input incoming races, and getting the word out about our little project. I also gave enviroVOTE a voice. We set up a Twitter account to tell the nation about environmintiness as the results came in. (For a time, the site automatically twittered with each race result, until we realized that it was sending far more tweets than anyone would ever want to read, and turned it off.)

The aftermath

I’m told the presidential race was noteworthy, though I can’t recall who won — it was just one of nearly 500 races we recorded that night, and we weren’t watching the TV.

Since the 2nd, we’ve fixed a few bugs and we’ve slowly added the final race results as they’ve trickled in. The site is not nearly as dynamic as it was election night, but maybe we’ll have another few days free next year.

How we built News Mixer, part 1: free and open-source software

This post is first in a three-part series on News Mixer — the final project of my master’s program for hacker-journalists at the Medill School of Journalism. It’s adapted (more or less verbatim) from my part of our final presentation. Visit our team blog at crunchberry.org to read the story of the project from its conception to birth, and to (soon) read our report and watch a video of our final presentation.

We could not have built News Mixer without free and open-source software. For those of you who aren’t familiar with the term, this is how the Free Software Foundation describes it:

“Free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer.”

Free software is a matter of the users’ freedom to run, copy, distribute, study, change and improve the software.

The Free Software Definition, Free Software Foundation

Now, journalists in the room might be surprised to hear a nerd talking like this, but the truth is that we’re remarkably similar, journalists and technologists — free software and free speech are the backbone of the web. The Internet runs on free software — from the data center to your desktop.

LAMP =
Linux (operating system) +
Apache (web server) +
MySQL (database) +
Python (teh codez)

I won’t dwell too long on the super-nerdy stuff, but for those interested, News Mixer runs on a LAMP stack, sort of the standard for developing in the open-source ecosystem. Notably missing from the list are non-free technologies you may have heard of, like Java or Microsoft’s .NET.

The biggest tech choice we made was to use Django. It’s a free and open-source web development framework put together by some very clever folks at The Lawrence Journal-World. For those of you in the know, it’s a framework similar to ASP.NET or the very popular Ruby on Rails, but with a bevy of journalism-friendly features. Django is how we built real, live software so freakin’ fast.

And you can have your very own News Mixer, gratis, right now, because News Mixer is also free and open source. We’ve released our source code under the GNU General Public License, and it’s available for download right now on Google Code. So, please, stand on our shoulders! We’re all hoping that folks will take what we’ve done, and run with it.

That’s it for part one! Can’t wait and hungry for more? Check out the Crunchberry blog, or my other posts on using free and open source software to practice journalism.