HOW TO: Hurt your server really bad
I am sure you can think of quite a few ways to do it. Here is mine: use X509 certificate-based authentication in a web app. Isn't it simple?
Here is what happened: after thorough testing we deployed our app to a farm of W2003 servers. It worked really well for a while. After a couple of months one of the servers started to slow down - the response time went up. We even started to see occasional timeouts in the logs. We looked at everything - CPU utilization, memory consumption, Windows handles - nothing to put your finger on. Meanwhile the server fell to its knees - 50% of the requests were timing out. Even an RDP login to the server was taking up to 2 minutes. What's interesting - once you were in, the server seemed pretty responsive, unless you tried to start a program - launching Notepad was taking up to 1 minute. And again, once it was up, it worked fine. What scared us even more was that the other servers in the farm started to deteriorate in the same way.
To solve the mystery we had to take a memory dump from a production server. After looking at the dump we started to suspect that the problem had to do with the fact that some of our single sign-on users are authenticated using X509 certificates. After some digging we stumbled upon this article. And that's where all the pieces finally fell into place. Here is what's happening:
- Apparently, when validating an X509 certificate the Windows crypto module creates a temporary file in the Windows temp directory. I do not see any reason why the expense of creating a temp file would be necessary, but as stupid as it is, in itself this is not a problem.
- What is a problem is the bug described in the article I mentioned. As explained there, as a result of this bug the temp file(s) created during X509 validation are not deleted. No big deal - just a few zero-length files nobody cares about - right?
- Well, right - at least for some time. The problem is that the files keep piling up. As long as the number of files in the temp directory stays below 65000, everything works fine, but once this number is exceeded, as stated in the article, "the computer can experience significant delays". In our case it took our servers around 4 months to hit this number - and that is when we started to see these problems.
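A check like the following could catch the pile-up before it reaches the danger zone. It is only a minimal sketch in Python: the 65000 figure comes from the article quoted above, while the function names, the 80% warning ratio, and the idea of polling the directory at all are assumptions made for illustration.

```python
import os

# Threshold quoted in the post; everything else here is illustrative.
FILE_COUNT_LIMIT = 65000


def count_files(directory):
    """Count regular files directly inside `directory` (no recursion)."""
    count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count


def temp_dir_status(directory, limit=FILE_COUNT_LIMIT, warn_ratio=0.8):
    """Return "ok", "warning" or "critical" based on the file count.

    `warn_ratio` is a hypothetical safety margin: the status turns to
    "warning" once the count reaches that fraction of `limit`.
    """
    count = count_files(directory)
    if count >= limit:
        return "critical"
    if count >= limit * warn_ratio:
        return "warning"
    return "ok"
```

Pointing `temp_dir_status` at the temp directory from a monitoring script would have flagged the problem weeks before the timeouts started, assuming the file count grows steadily.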
The solution was simple - just get rid of the files and everything goes back to normal. There is a hotfix available for the bug, but for some reason it is not included in the latest service pack. Because of this we decided against applying the hotfix. Instead we just configured a scheduled job that cleans up the temp directories overnight.
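A cleanup job along those lines could be sketched as follows. This is an illustration only, assuming Python is available on the server; the function name, the one-hour age cutoff, and the restriction to zero-length files are assumptions, chosen so that files still in use by a running process are less likely to be touched.

```python
import os
import time


def remove_stale_empty_files(directory, min_age_seconds=3600):
    """Delete zero-length regular files in `directory` that have not been
    modified for at least `min_age_seconds`. Returns the number removed."""
    now = time.time()
    # Snapshot the listing first so we never delete while iterating.
    with os.scandir(directory) as entries:
        candidates = list(entries)

    removed = 0
    for entry in candidates:
        try:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip subdirectories and links
            info = entry.stat(follow_symlinks=False)
            if info.st_size != 0:
                continue  # only the empty leftovers are targeted
            if now - info.st_mtime < min_age_seconds:
                continue  # too recent, may still be in use
            os.remove(entry.path)
            removed += 1
        except OSError:
            # File vanished or is locked by another process; skip it.
            continue
    return removed
```

Run nightly from the system scheduler against the temp directory, something like this keeps the file count far below the point where the delays begin. The age cutoff is a design choice: a shorter value cleans more aggressively, a longer one is safer for files that a process created a moment ago.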