Fri, 18 Nov 2005
Uptime: 92 minutes
Last Saturday I installed an additional 4 GB of memory in Odysseus. I also compiled a new kernel and checked the BIOS settings. The kernel booted OK, but when I returned home in the evening, I got an SMS saying that Odysseus had crashed.
I logged in and Odysseus was up, with an uptime of a minute or so. I configured GRUB so that next time it would boot the older kernel. During the night I received two or three more messages that Odysseus was down. I tried limiting the amount of memory used to the original 4 GB. Even then Odysseus kept crashing.
I connected the serial console and recorded it with the script(1) command in order to catch any messages from a possible system crash. Nothing appeared - the server just silently rebooted after some time. I thought about removing the new 4 GB of memory, but it had been thoroughly tested in another box for the previous two weeks, so it was definitely not bad.
After some time I looked up the exact times of the crashes using last(1), and guess what - the server had rebooted after about 92 minutes each time. So, subtracting the time for BIOS startup, it seemed that it rebooted after 90 minutes of Linux uptime. I reverted some changes in the BIOS setup (disabled the ACPI HPET timer, disabled ACPI 2.0 support, disabled ECC BG scrub and a few other ECC settings), and the problem was fixed. I am not sure which setting was the cause of the problem, though, and I don't have time to play with it now.
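
For anyone chasing a similar problem, the check I did by eye on the last(1) output can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function names boot_intervals and looks_periodic and the two-minute tolerance are made up for this example, and it assumes the boot times have already been copied out of last(1) by hand as datetime values.

```python
from datetime import datetime, timedelta


def boot_intervals(boot_times):
    """Return the gaps between consecutive boot times, oldest first."""
    ordered = sorted(boot_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_periodic(intervals, tolerance=timedelta(minutes=2)):
    """True if there are at least two gaps and all are within
    `tolerance` of the first one (the tolerance is an arbitrary choice)."""
    if len(intervals) < 2:
        return False
    first = intervals[0]
    return all(abs(gap - first) <= tolerance for gap in intervals)
```

Feeding it the list of boot times and printing the result of boot_intervals makes a fixed period such as 92 minutes stand out immediately, which points at a timer or watchdog-like cause rather than at random memory errors.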