OSDev.org


All times are UTC - 6 hours




 Post subject: Determining the failing bits of an uncorrectable ECC error.
PostPosted: Thu Sep 19, 2019 1:45 pm 
I'm running a system with an x86 AMD CPU that has a 64-bit data bus plus 8 bits of ECC, so 72 bits in total. I have nine x8 DRAMs on my DIMM.

I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.

However, there doesn't seem to be a way to figure out which data bits specifically are bad. The syndrome is only valid for correctable errors, so it doesn't map to any entry in the ECC syndrome lookup table.

The system uses a BCH ECC with 4-bit symbols that is single-symbol-correcting and dual-symbol-detecting. Since the error is uncorrectable, it must span at least two symbols. In fact, I'm fairly sure that an entire DRAM (8 bits) is dying.
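To illustrate why one failing x8 device is enough to defeat this code, here is a minimal sketch that splits a 72-bit codeword into eighteen 4-bit symbols and groups the bits by DRAM. The bit ordering (chip k drives bits 8k to 8k+7, aligned to symbol boundaries) and all function names are assumptions for illustration only; the real mapping depends on the board routing and on the symbol layout the memory controller uses.

```python
SYMBOL_BITS = 4   # symbol size of the ECC code described above
CHIP_BITS = 8     # each DRAM is a x8 device
WORD_BITS = 72    # 64 data bits + 8 ECC bits


def symbols_touched(error_mask):
    """Return the set of 4-bit symbol indices containing a flipped bit."""
    return {bit // SYMBOL_BITS
            for bit in range(WORD_BITS) if (error_mask >> bit) & 1}


def chips_touched(error_mask):
    """Return the set of x8 chip indices (assumed ordering) with a flipped bit."""
    return {bit // CHIP_BITS
            for bit in range(WORD_BITS) if (error_mask >> bit) & 1}


def classify(error_mask):
    """Classify an error pattern for a single-symbol-correcting,
    dual-symbol-detecting code."""
    n = len(symbols_touched(error_mask))
    if n == 0:
        return "no error"
    if n == 1:
        return "correctable"
    if n == 2:
        return "detected, uncorrectable"
    # Three or more bad symbols are outside what such a code guarantees.
    return "beyond guaranteed detection"
```

Under these assumptions, `classify(0xF << 24)` (one bad nibble) is correctable, while `classify(0xFF << 24)` (all eight bits of one chip) touches two symbols and is only detected, which matches the symptom described.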

If the AMD CPU had a register somewhere containing the 72 bits read from memory that caused the error, I could probably look at it to see which bits are obviously wrong. But there doesn't seem to be such a register anywhere.
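If raw read-back data could be obtained some other way, the "look at which bits are wrong" step could be sketched as below. This assumes a hypothetical source of (expected, observed) 64-bit pairs, for example a pattern test run with ECC checking turned off; whether this platform allows that is not established here. The function name and the lane-to-chip mapping (byte lane k equals chip k) are also assumptions.

```python
from collections import Counter


def failing_lanes(samples, lane_bits=8, word_bits=64):
    """Count flipped bits per byte lane over (expected, observed) pairs."""
    counts = Counter()
    for expected, observed in samples:
        diff = expected ^ observed          # 1 bits mark mismatches
        for bit in range(word_bits):
            if (diff >> bit) & 1:
                counts[bit // lane_bits] += 1
    return counts
```

A lane whose count dwarfs all the others would be the prime suspect. Only the 64 data bits are covered here; the ECC lane would need its own raw read path.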

The datasheet for this CPU is here: https://www.amd.com/system/files/TechDo ... h_BKDG.pdf

Any ideas how to figure out which chip is failing?

EDIT: To be specific, the reason I need to know which chip is failing is that they're all soldered to the main board.


 Post subject: Re: Determining the failing bits of an uncorrectable ECC err
PostPosted: Sun Sep 22, 2019 9:06 am 
Hi,

matviy wrote:
I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
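This is quite a problem. ECC is usually capable of fixing one faulty bit (or, with a symbol-based code like yours, one faulty symbol), and that happens transparently. If it can't, the damage exceeds what the code can correct: more than one bit, or in your case more than one symbol, is wrong in the same codeword.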

matviy wrote:
However, there doesn't seem to be a way to figure out which data bits specifically are bad.
Sure; if there were a way, it would be a correctable error.

matviy wrote:
Any ideas how to figure out which chip is failing?
Well, programmatically from your OS? You'll need the memory bank information for that, which is usually found in the SMBIOS and ACPI tables.

On Linux, just run "dmidecode -t 6" and "dmidecode -t 17". This will tell you which RAM area corresponds to which memory bank (and, if you're lucky, the exact chip too). You can also read the hardware fault log if your firmware supports that.
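As a rough sketch, the dmidecode output can be collected into one record per SMBIOS handle like this. The parser assumes dmidecode's usual text layout (a "Handle ..." line followed by indented "Key: Value" lines); the function names are made up for illustration, and running dmidecode normally requires root.

```python
import subprocess


def parse_dmidecode(text):
    """Split dmidecode text output into one dict per 'Handle ...' record."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("Handle "):
            current = {"_header": line.strip()}
            records.append(current)
        elif current is not None and line[:1].isspace() and ":" in line:
            # Indented "Key: Value" field belonging to the current record.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return records


def memory_devices():
    """Run 'dmidecode -t 17' and return the parsed Memory Device records."""
    out = subprocess.run(["dmidecode", "-t", "17"],
                         capture_output=True, text=True, check=True).stdout
    return parse_dmidecode(out)

# Example use (as root):
#   for dev in memory_devices():
#       print(dev.get("Locator"), dev.get("Bank Locator"), dev.get("Size"))
```

This gives slot or bank granularity at best; whether the firmware describes individual soldered chips is up to the vendor.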

Cheers,
bzt

