Yesterday at 5:20pm, the same server that died a few weeks ago died again. We went to investigate this morning and it had the exact same problem: the CPU fan quit, causing the CPU to overheat and the machine to shut down.
So we just sat there on the floor of our co-location center and thought about our options. Thanksgiving is coming next week, which is the 2nd most-trafficked day on the site (Christmas is #1), so we wanted to have all servers running just in case something bad happens to one of them. That meant there wasn’t enough time to just order a replacement part and wait for it to arrive. We thought to call our friends at Silicon Mechanics, who had helped us out in a pinch before. Given that this wasn’t one of their machines, they had every right to refuse to help (after all, from personal experience, Dell doesn’t help you when you have a problem with one of their machines!). But when we called them, they were more than happy to help and said to bring it over.
We arrived, interrupting their lunch. They opened it up, I explained the problem, and they gave it a once-over. This is a machine I built myself, and they knew it, but they politely pointed out my errors. They cleaned off the gallon of excess thermal paste that I had put on the CPU and put in a new (hopefully better) CPU fan. But they didn’t stop there. They noticed that the CD-ROM and floppy drives were not hooked up (because I didn’t have a cable long enough to reach across the case), so they made a custom cable and installed it so the drives worked again. Then they sliced the flat ribbon cables and wrapped them up to improve airflow through the case. And they tied up all the loose cables that I had left in there. Just amazing service!
We took the machine back, popped it back in the rack and powered it up. A repaired server in under 3 hours!