Numbering of CPU and bank #84

Open
sm8ps opened this issue Oct 27, 2020 · 2 comments
Comments

@sm8ps

sm8ps commented Oct 27, 2020

/var/log/mcelog contains the following entry. The machine runs Ubuntu 16.04, which reports the mcelog version as 128+dfsg-1.

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8 TSC 235983e523450 
MISC 2000000a6646 ADDR 93e6e4300 
TIME 1603741601 Mon Oct 26 20:46:41 2020
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 46
Memory DIMM ID of error: 2
Memory channel ID of error: 2
Memory ECC syndrome: 2000
STATUS 8c0000400001009f MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 44
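
For locating the hardware, the fields of interest in a record like the one above are CPU, BANK, SOCKETID and the DIMM and channel IDs. A minimal Python sketch that pulls them out of the text, assuming the record layout shown here (the function name `parse_mce_record` is made up for illustration and is not part of mcelog):

```python
import re

# Excerpt of the record above; only the lines that carry location fields.
RECORD = """\
CPU 1 BANK 8 TSC 235983e523450
MISC 2000000a6646 ADDR 93e6e4300
Memory DIMM ID of error: 2
Memory channel ID of error: 2
STATUS 8c0000400001009f MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
"""

def parse_mce_record(text):
    """Return the location-related fields of one mcelog record as a dict."""
    patterns = {
        "cpu": r"\bCPU (\d+)",
        "bank": r"\bBANK (\d+)",
        "socket": r"\bSOCKETID (\d+)",
        "dimm": r"Memory DIMM ID of error: (\d+)",
        "channel": r"Memory channel ID of error: (\d+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:  # fields missing from the record are simply left out
            fields[name] = int(match.group(1))
    return fields

print(parse_mce_record(RECORD))
```

The sketch only extracts the numbers as printed; it does not say whether they count from zero or from one.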

I tried swapping the DIMM in memory bank 8 at CPU 1, as labeled on the motherboard. However, the same kind of error was reported again at the same location (CPU 1 BANK 8). Before swapping DIMMs around at random, I am hoping that somebody can tell me which numbering mcelog uses: starting from zero or from one (presumably the same for all kinds of objects).

For comparison, dmidecode uses labels like PROC {1,2} DIMM {1..9} which would make the numbering from one an obvious candidate. However, I have seen examples of mcelog counting the CPUs from zero. As for using both numberings, lshw lists cpu:{0,1} in slot: Proc {1,2} and memory:{0,1} with bank:{0..8} as physical id: {0..9} in PROC {1,2} DIMM {1..9}. Finally, it could even depend on the kind of machine and how its BIOS reports to the kernel.

I have been totally unsuccessful in finding any answer to my question and I am afraid that I would not be any more successful when digging through the code. Can anybody answer this question authoritatively? Thanks in advance for your consideration!

@andikleen
Owner

andikleen commented Oct 28, 2020 via email

@sm8ps
Author

sm8ps commented Oct 28, 2020

Thanks so much for your answer @andikleen!

Thus the value CPU 1 in mcelog corresponds to the core listed as /sys/devices/system/cpu/cpu1/, right? Unfortunately the sub-directory topology/ contains only the files

core_id  core_siblings  core_siblings_list  physical_package_id  thread_siblings  thread_siblings_list

So I am still trying to find the right socket. The value in physical_package_id is always 1 in all the cpu#-directories corresponding to the values listed in core_siblings_list and is always 0 in all the other cpu#-directories. That seems to refer to the socket numbering, right?
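
The grouping described above can be listed in one go. A small Python sketch, assuming the standard Linux sysfs layout under /sys/devices/system/cpu; the function name `cpus_by_package` and its parameter are hypothetical, and the directory is a parameter only so the code can be pointed at a copy of the tree:

```python
from collections import defaultdict
from pathlib import Path

def cpus_by_package(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Map each physical_package_id to the sorted list of logical CPU numbers."""
    packages = defaultdict(list)
    # cpu[0-9]* skips sibling directories such as cpufreq and cpuidle.
    for topo in Path(sysfs_cpu_dir).glob("cpu[0-9]*/topology/physical_package_id"):
        cpu_number = int(topo.parent.parent.name[3:])  # "cpu12" -> 12
        packages[int(topo.read_text().strip())].append(cpu_number)
    return {pkg: sorted(cpus) for pkg, cpus in sorted(packages.items())}

if __name__ == "__main__":
    for package, cpus in cpus_by_package().items():
        print(f"package {package}: cpus {cpus}")
```

On a two-socket machine this should print two lines, one per package ID, which can then be compared with the SOCKETID value in the mcelog record.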

All in all I would guess that it is socket nr. 1 out of (0,1) which seems to correspond to the value SOCKETID 1 in mcelog (which I had overlooked). The information for the motherboard mentions processor sockets 1 and 2 so I think it should be the second one. Does that sound right?

Next to identify the faulty DIMM module! Is the information Memory DIMM ID of error: 2 in mcelog the one I am looking for? I was at first convinced that it had to be nr. 8 as in (memory) BANK 8.
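
For reference, BANK in an mcelog record is the number of the machine-check register bank that logged the event, whereas the DIMM and channel IDs are bit fields of the MISC value. A Python sketch of that decoding, assuming the bit layout that reproduces the decoded lines of this particular record (RTId in bits 0-7, DIMM in bits 16-17, channel in bits 18-19 and syndrome in bits 32-63 of MISC; corrected error count in bits 38-52 of STATUS). Other CPU models use different layouts, and the helper names are hypothetical:

```python
def bits(value, low, high):
    """Extract the inclusive bit range low..high from an integer."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)

def decode_misc(misc):
    """Split the MISC value into the model-specific memory error fields."""
    return {
        "rtid": bits(misc, 0, 7),
        "dimm": bits(misc, 16, 17),
        "channel": bits(misc, 18, 19),
        "syndrome": bits(misc, 32, 63),
    }

def decode_status(status):
    """Split the STATUS value into the flags and counters shown by mcelog."""
    return {
        "valid": bool(bits(status, 63, 63)),
        "uncorrected": bool(bits(status, 61, 61)),
        "misc_valid": bool(bits(status, 59, 59)),
        "addr_valid": bool(bits(status, 58, 58)),
        "corrected_count": bits(status, 38, 52),
        "mca_code": bits(status, 0, 15),
    }

misc = decode_misc(0x2000000A6646)            # MISC from the record
status = decode_status(0x8C0000400001009F)    # STATUS from the record
print(misc)
print(status)
```

For the values in the record this yields DIMM 2, channel 2 and a corrected error count of 1, matching the decoded lines mcelog printed. Whether DIMM 2 on channel 2 corresponds to a particular labeled slot depends on the motherboard and is not something the sketch can determine.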

So in conclusion I should replace DIMM nr. 3 (out of 1..9) connected to socket 2 (nr. 1 out of 0..1), right?

Sorry for these very simple questions! This is the very first time that I have to dive into such matters and they have got me quite a bit confused. May others find the answers helpful, too!
