Long day yesterday with the Columbia Missourian site up and down, but mostly down.
Here’s the basic explanation of what happened from our programmer, Noah Medling:
Our website runs on four physical servers: one for media (photos, videos, etc.), one for the database (articles, user permissions, almost everything textual in nature), and two webservers that take that information and assemble it into what you see in the browser.
A load balancer sits between you and the two webservers. It sends requests alternately to one or the other, spreading the work across both as evenly as it can. If one of the webservers stops functioning, the load balancer stops directing traffic to it and we run off of just one webserver until the other comes back online.
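For readers who like to see it in code, here is a minimal sketch of that alternating ("round-robin") behavior with failover. It is only an illustration, not our load balancer's actual configuration; the class name `RoundRobinBalancer` and the server names `web1` and `web2` are made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer (hypothetical, for illustration only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)   # servers currently in rotation
        self._ring = cycle(self.servers)   # endless web1, web2, web1, ...

    def mark_down(self, server):
        # Take a server out of rotation.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Put a known server back into rotation.
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Return the next healthy server, or None if none are left.
        if not self.healthy:
            return None
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        return None
```

With both servers healthy, successive calls to `pick()` alternate between them. After `mark_down("web1")`, every request goes to `web2`, and if `web2` is marked down as well, `pick()` returns `None`, which is the "no website" case.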
The load balancer was using HTTP response time to determine if a server was working or not. Our website is, as you probably know, not very fast, but its response times are usually not so bad that the load balancer complains. This morning, around 8:00, one of the webservers stopped being responsive. The load balancer steered all of our traffic to the other one, which was now handling everything rather than just half, and we managed to get record amounts of traffic at the same time. This slowed down the remaining server to unacceptable levels and the load balancer took it offline, leaving us with no website.
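The chain reaction described above can be sketched with a toy model. Everything here is an assumption made for illustration: the function names, the linear relationship between load and response time, and the numbers in the usage note are invented, not measurements from our servers.

```python
def response_time(requests_per_server, capacity):
    # Toy model (an assumption): 1.0 second at full capacity, growing linearly.
    return requests_per_server / capacity


def run_health_checks(active, total_requests, capacity, max_seconds,
                      evict_slow=True):
    """Return the servers still in rotation once health checks settle."""
    active = list(active)
    while active and evict_slow:
        # Traffic is split evenly across whatever servers remain.
        share = total_requests / len(active)
        if response_time(share, capacity) <= max_seconds:
            break  # everyone is fast enough, nothing more to evict
        # A server is judged too slow and removed, which pushes
        # its share of the traffic onto the remaining servers.
        active.pop()
    return active
```

With two servers, `run_health_checks(["web1", "web2"], 10000, 6000, 1.0)` keeps both in rotation, because each handles only half the load. Start with one server already gone, `run_health_checks(["web2"], 10000, 6000, 1.0)`, and the survivor is evicted too, leaving an empty list. Passing `evict_slow=False` keeps the slow server online, so the site stays up but slow.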
I got here at 8:30 and worked on the problem constantly from then until it was finally resolved around 4:00. I never did figure out what happened to the first webserver, but I rewrote a sizable chunk of configuration code in resolving other issues, so it’s likely that whatever caused that problem in the first place no longer exists. Because the load balancer is outside of my control, it took a long time to discover that it was part of the problem. It has now been reconfigured to no longer take servers offline for being slow, and Rob and I now have contact information for a number of people in networking and CSG who have access to the parts that I don’t, and can help in getting this sort of thing resolved much faster in the future.
On top of that, we had a couple of other issues once it came back late yesterday:
Because the site was down pretty much all day, all its caches expired, so it took forever to re-cache items, leading to further slowness.
We were also being linked to from the Drudge Report, which was quite possibly the source of some of the heavy traffic from Monday morning (the story about cotton balls being dropped at the Black Culture Center on the MU campus).
At around 4:30 on Monday, for example, the site had about 10,000 concurrent connections (our previous record was 7,000), a combination of Drudge traffic, search engines/bots and regular connections.
The good news, I suppose, is that our site was handling the traffic without completely crumbling. The bad news, of course, is that it was running ridiculously slowly in the meantime.
Anyhow, we’re slowly recovering and will hopefully have things running more speedily as time goes on. Thanks for being patient in the meantime.