Hi,
matviy wrote:
I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
This is quite a problem. ECC is usually capable of fixing one faulty bit, and that happens transparently. If it can't, that means more than one bit is bad in the same protected word (typically an even number of errors on the same row or column).
matviy wrote:
However, there doesn't seem to be a way to figure out which data bits specifically are bad.
Sure, if there were a way, then it would be a correctable error.
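To illustrate why the syndrome alone cannot name the bad bits, here is a minimal sketch using a toy Hamming(7,4) code plus an overall parity bit. This is an assumption for illustration only: real memory controllers use wider codes (for example 64 data bits plus 8 check bits), and the function names encode and check are made up here. The principle is the same, though: a single flipped bit produces a syndrome that points at its position, while two flipped bits produce a syndrome that several different bit pairs share.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit word: Hamming(7,4) plus overall parity."""
    code = [0] * 8                      # index 0 holds the overall parity bit
    # Data bits go to positions 3, 5, 6, 7; parity bits live at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        for pos in range(1, 8):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Return (status, syndrome) for a possibly corrupted 8-bit word."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_mismatch = sum(code) % 2     # 1 if an odd number of bits flipped
    if syndrome == 0 and not parity_mismatch:
        return ("ok", None)
    if parity_mismatch:
        # Treated as a single-bit error: the syndrome is the position of the
        # flipped bit (0 means the overall parity bit itself).
        return ("corrected", syndrome)
    # Even number of flips: detected, but the syndrome is ambiguous.
    return ("uncorrectable", syndrome)
```

Flipping position 5 of a valid word gives ("corrected", 5). Flipping positions 1 and 2, or positions 5 and 6, gives the same ("uncorrectable", 3) in both cases, so the syndrome cannot tell you which pair went bad.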
matviy wrote:
Any ideas how to figure out which chip is failing?
Well, programmatically from your OS? You'll need the memory bank information for that, usually found in the SMBIOS and ACPI tables.
For Linux, just execute "dmidecode -t 6" and "dmidecode -t 17". It will tell you which RAM area corresponds to which memory bank (at least; if you're lucky, you will get the exact chip too). You can also read the hardware fault log if your firmware supports that.
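If you want to process that output from a script, a rough sketch could look like the following. It assumes the usual dmidecode text layout (records separated by blank lines, a "Memory Device" title line, then "Key: Value" fields); the helper names run_dmidecode, parse_memory_devices and populated are made up for this example, and which fields your firmware actually fills in varies from board to board.

```python
import subprocess


def run_dmidecode(dmi_type="17"):
    """Run dmidecode for one DMI type and return its text output (needs root)."""
    result = subprocess.run(
        ["dmidecode", "-t", dmi_type],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_memory_devices(text):
    """Turn 'Memory Device' records into a list of {field: value} dicts."""
    devices = []
    for block in text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if "Memory Device" not in lines:
            continue                    # skip headers and other record types
        fields = {}
        for ln in lines:
            key, sep, value = ln.partition(": ")
            if sep:                     # only keep "Key: Value" lines
                fields[key] = value
        devices.append(fields)
    return devices


def populated(devices):
    """Keep only slots that report an installed module."""
    return [d for d in devices if d.get("Size") != "No Module Installed"]
```

Typical use would be populated(parse_memory_devices(run_dmidecode())), then printing the "Locator" and "Bank Locator" fields of each entry to see which physical slot a bank maps to.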
Cheers,
bzt