Deconstructing the FeedLounge Downtime

If this were Slashdot, I’d file this under the so-meta-it-hurts department. It’s not, though. It’s IJSM.org, which means that the audience is smaller, “FR1ST P50T!” is a rarity, and Natalie Portman isn’t pouring hot grits down anyone’s pants. [See? Even my Slashdot jokes are three years old.]

Anyhow, FeedLounge had some downtime, related to a minor oversight that ended up being a colossal Charlie-Foxtrot: Alex’s server was, for a bunch of understandable but inexcusable reasons, the single point of failure for DNS authority for FeedLounge.com. When Alex’s server burned up, it all went to shit: FeedLounge the server was running fine, but no one could reach it. DNS couldn’t route around the damage, because every lookup for the domain depended on that one box.
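For the curious, here is roughly what "checking for a single point of failure" looks like once you have a domain's nameserver list in hand. This is only a sketch in Python, not anything the FeedLounge guys actually ran: the function name, the nameserver hostnames, and the addresses (taken from the reserved documentation ranges) are all made up for illustration, and it assumes you have already looked up which addresses each nameserver resolves to.

```python
def dns_single_points_of_failure(nameservers):
    """Return a list of warnings for a zone's set of authoritative nameservers.

    `nameservers` maps each nameserver hostname to the set of IP addresses it
    resolves to. Gathering that data is left to the caller (hypothetical input).
    """
    warnings = []

    # One listed nameserver means one failure takes the whole domain offline.
    if len(nameservers) < 2:
        warnings.append("fewer than two nameservers listed")

    # Two hostnames that point at the same address are still one box.
    distinct_ips = set().union(*nameservers.values())
    if len(distinct_ips) < 2:
        warnings.append("fewer than two distinct nameserver addresses")

    return warnings


# Hypothetical examples: one fragile setup, one redundant setup.
fragile = {"ns1.example.com": {"192.0.2.10"}}
redundant = {
    "ns1.example.com": {"192.0.2.10"},
    "ns2.example.net": {"198.51.100.20"},
}

print(dns_single_points_of_failure(fragile))
print(dns_single_points_of_failure(redundant))
```

The second check is the sneaky one: two nameserver names that both resolve to the same machine look redundant on paper and are not. Even two distinct addresses can still share a rack or a data center, which this sketch makes no attempt to detect.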

I think Alex is pretty clear, in retrospect, that he didn’t address this well enough the first time, so he did, as he termed it, an “O’Grady style Q&A” to answer the questions regarding FeedLounge’s outage. I want to slice-and-dice to the relevant parts:

Why didn’t you also check this for feedlounge.com?

Unfortunately, the answer is really simple - we forgot. We moved the feedlounge.com web site to a dedicated server in a data center in New Jersey last summer around the same time we moved the ‘Lounge onto our big servers in our rack space in San Francisco.

Since feedlounge.com was no longer on boxes at the Austin data center, we didn’t think to check the DNS records for feedlounge.com.

Do you now feel that was monumentally stupid?

Um, yeah. And then some.

Are you saying that if this had happened a week ago, it wouldn’t have caused any trouble?

Most likely, yes. We’d have replaced the fried box just like we’ve done, but the backup DNS servers would have shouldered the load while we did so.

Sounds like you guys should have paid more attention to this.

Agreed. Lesson learned - the hard way.

So what do you do now?

Besides offering an apology to our users, there isn’t much we can do. We have to wait for the changes we’ve already made to take effect.

I’m not satisfied or happy about this.

Trust me, we’re not either.

Ok, now what?

It was a bad day for FeedLounge. With apologies to our users, we fix the problem and move forward. As we do, we’ll continue to work hard to make FeedLounge as reliable as possible and continue building the features our users are asking for.

That’s really all they can do. The guys just learned a painful (and probably unprofitable) lesson. We’re at the “mistakes were made” portion of the program. As a user of the system, it’s easy for me to be upset, and yes, I was inconvenienced by this. But I talked to Alex (and some to Scott) while it was ongoing, so I knew what the problems were and that it would be a pain to fix them. DNS is a royal pain in my ass, and downtime is, too.

While there’s room to be disappointed in the early responses (telling users to hack their DNS isn’t a good solution; I’m reasonably handy with computers, and I didn’t know how to do it straight off), those are simply the responses that you unthinkingly give when you’re reacting rather than acting. To be fair to the guys, they had a lot of reacting to do: “I’ve tried A! I’ve tried B!” Sometimes you have to do what Deke Slayton (at least I think it was him) referred to as “the JC maneuver”: “Take your hands off the controls and put it in the hands of a su-per-nat-ur-al pow-er.”

Some days, you’re the cow, and some days, you’re the pasture. FeedLounge was definitely the pasture recently, but remember … fertilized pasture grows really good grass. :)

Posted May 23rd, 2006 in FeedLounge.

