Arriving back at my desk at 1:30 after a delicious lunch and almond bubble tea, I noticed quite a few tickets in my queue with subject lines akin to “OMG SERVER SLOW!!!! ?!??!!?!! DEPTS SLOW!” Checking my email, I found a few alerts from our service center about slow website response times; they had spotted the trend and escalated the issue.
I quickly fire up our server monitoring page, which is, of course, hosted on the very servers it monitors. Without getting into a long and detailed explanation, there’s a legacy reason for this that no longer makes much sense, and I haven’t had time to re-engineer it. But at first, the server monitoring page doesn’t load.
“Hmm,” I thought, “this is weird. We’re in the middle of summer quarter, so loads should be at some of the lowest levels of the year.” Additionally, we had just completed an expansion and upgrade of the servers, so they’re much more powerful than our older infrastructure.
At this point, the server monitor finally opens up, and the results are decidedly pedestrian: no increased load anywhere, and all of the database hosts look absolutely normal. In other words, the systems seem to be working perfectly well.
So I go and check Apache’s server-status page, which shows that every server has 100% of its available slots filled, but that most workers are stuck “reading” requests rather than serving them (a rough sketch of that check follows the list below). Now this is even odder, since when we see server issues it’s usually because of one of the following:
- The database servers are overloaded, causing PHP threads to hang while waiting for database I/O. This usually shows up as elevated load averages on the database hosts along with a pile of waiting PHP threads. Both were absent.
- The file systems are overloaded, which would cause load to shoot up on both the database servers and the web servers. No sign of that either.
- A CPU-intensive site is being slammed hard and threads are piling up. Again, this would show up in the server-status page as workers busy sending replies. Nope.
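For the record, here’s a minimal sketch of the kind of server-status check I mean, assuming mod_status is enabled and the machine-readable /server-status?auto endpoint is reachable (the web1.example.edu hostname is just a placeholder):

```python
# Quick-and-dirty summary of Apache's mod_status scoreboard.
# Assumes mod_status is enabled and /server-status?auto is reachable;
# the hostname below is a placeholder.
from collections import Counter
from urllib.request import urlopen

STATES = {
    "_": "waiting for connection", "S": "starting up",
    "R": "reading request", "W": "sending reply", "K": "keepalive",
    "D": "DNS lookup", "C": "closing connection", "L": "logging",
    "G": "gracefully finishing", "I": "idle cleanup", ".": "open slot",
}

def scoreboard_summary(host="web1.example.edu"):
    with urlopen(f"http://{host}/server-status?auto") as resp:
        text = resp.read().decode()
    # The ?auto output includes a single "Scoreboard:" line of state characters.
    board = next(line.split(":", 1)[1].strip()
                 for line in text.splitlines()
                 if line.startswith("Scoreboard:"))
    return Counter(STATES.get(ch, ch) for ch in board)

if __name__ == "__main__":
    for state, count in scoreboard_summary().most_common():
        print(f"{count:5d}  {state}")
```

On a healthy server most workers sit in “waiting for connection” or “sending reply”; a scoreboard dominated by “reading request” with no open slots was exactly the picture we were looking at.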
So at this point I’m thinking, “Well, I guess this is just a plain old traffic issue,” and I SSH to one of the servers and tail the access log. All of the requests are for an obscure course website that looked pretty harmless. It didn’t seem to be compromised, it was just some .html files and small images, and there was nothing particularly CPU- or load-intensive about it.
Then I saw the referrers. Which were all from reddit. And the light dawned. We were being linked to from a front-page reddit post and were being hit around 120 times a second for this site, which contained lots of images and CSS files. Even though the servers were responding within a fraction of a second, they couldn’t keep up with the load; HTTP requests stacked up and caused failures for everyone else on the servers.
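In case it’s useful, this is roughly how you can pull both numbers, the peak request rate and the top referrers, out of an Apache combined-format access log. The log path is a placeholder and the parsing is deliberately crude:

```python
# Crude pass over an Apache "combined" access log: peak requests per
# second and the most common referrers. The log path is a placeholder.
import re
from collections import Counter

# [timestamp] "request" status bytes "referrer" ...
LINE = re.compile(r'\[([^\]]+)\] "[^"]*" \d{3} \S+ "([^"]*)"')

def summarize(path="/var/log/apache2/access.log", top=5):
    per_second = Counter()
    referrers = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            timestamp, referrer = m.groups()
            per_second[timestamp] += 1   # combined-format timestamps are per second
            referrers[referrer] += 1
    peak = per_second.most_common(1)
    print("peak req/s:", peak[0][1] if peak else 0)
    for ref, count in referrers.most_common(top):
        print(f"{count:6d}  {ref}")

if __name__ == "__main__":
    summarize()
```

In our case the referrer column was one long wall of reddit, and the per-second counts were sitting right around that 120 mark.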
We quickly disabled the site in question, and everything went back to normal.
Lesson learned? Not sure there is one here. All but the largest fall before the great reddit. (Of course, reddit will never be as awesome as the superior Slashdot.)