Mon, 31 Oct 2005
Power failure, storage failures, IBM support failures
We had a power outage last Wednesday, apparently caused by a faulty breaker or something like that. Even the UPSes and the generator were not able to bridge the outage, so the whole server room went down.
Aside from the usual problems like "this server booted up before that one, so this service was not working", two hard drives failed after the power outage (one on Thursday and the second one on Friday). I suspect they were faulty immediately after the power outage, and the storage array only discovered the failures while running some kind of internal tests on Thursday and Friday, respectively.
So we have met IBM support again. We bought the storage from an IBM reseller, and they claimed they could handle the support for us. However, we experience the same problems every time we try to handle a disk failure:
- We: "We have a drive failure in our storage array, in slot #5."
- Reseller hotline: "Send us the serial number of the array." (or the drive, or the enclosure, or something like that).
- We: "We have a single array from you. You definitely know the serial number yourselves. Why should we repeat this every time? Why are you delaying the fix of the problem by asking this again? But OK, here is the number you have requested."
- After a few hours, a call from IBM support: "We have a report of a drive failure in the storage array. Can you send me the serial number of the array?"
- We: "We have already sent it to your reseller. Why did they not report it to you? But nevertheless, here is the serial number."
- A few hours later, IBM support calls back: "OK, we are sending the new drive to you. You should have it by the next day or maybe the day after."
WTF? Why can't the reseller communicate with IBM themselves, as they promised? And what is worse, it seems that the reseller or the IBM hotline demands a different parameter of our storage array every time: sometimes it is the serial number of the array itself, or the serial number of the drive, or the entire storage array profile, and the last one was some IBM part number which is not even visible remotely from the storage manager and is only printed on the array itself (so we had to walk to the array, which is located in a remote server room). SGI support is definitely better, although we had a similar problem last time (they demanded something called the "revision number" of the faulty drive, which is only printed on the drive itself and cannot be read remotely by the storage manager).