Numbering of CPU and bank #84
Comments
On Tue, Oct 27, 2020 at 08:59:02AM -0700, sm8ps wrote:
/var/log/mcelog contains the following. This happens on Ubuntu 16.04 reporting
the mcelog version as (128+dfsg-1).
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8 TSC 235983e523450
MISC 2000000a6646 ADDR 93e6e4300
TIME 1603741601 Mon Oct 26 20:46:41 2020
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 46
Memory DIMM ID of error: 2
Memory channel ID of error: 2
Memory ECC syndrome: 2000
STATUS 8c0000400001009f MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
I have been trying to switch the DIMM in memory bank 8 at CPU 1 as labeled on
the motherboard. However, that particular kind of error has been reported again
at the same location (CPU 1 BANK 8). Before wildly switching DIMMs around, I am
hoping that somebody might be able to tell me what kind of numbering mcelog
uses, starting from zero or from one (presumably the same for all kinds of
objects).
From zero.
Actually CPUs could be offlined. In that case there would be holes.
For comparison, dmidecode uses labels like PROC {1,2} DIMM {1..9} which would
make the numbering from one an obvious candidate. However, I have seen examples
of mcelog counting the CPUs from zero. As for using both numberings, lshw lists
cpu:{0,1} in slot: Proc {1,2} and memory:{0,1} with bank:{0..9} as physical id:
{0..9} in PROC {1,2} DIMM {1..9}. Finally, it could even depend on the kind of
machine and how its BIOS reports to the kernel.
mcelog uses the same numbering as Linux, which in turn depends on the
BIOS and the machine. Also, these CPUs are of course cores and threads,
while any labels on the motherboard would refer to sockets.
The mapping from CPUs to sockets (and other topology levels) is in /sys/devices/system/cpu/cpuX/topology/
However, whether that corresponds to the labels on the motherboard
depends on the motherboard and BIOS.
…-Andi
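A minimal sketch of how the topology directory mentioned above can be read to see which socket Linux assigns to each logical CPU. It assumes the standard sysfs files `physical_package_id` and `core_id`; the function name `read_cpu_topology` and its `sysfs_root` parameter are made up for this illustration. Whether the socket numbers match the silkscreen on the board still depends on the BIOS, as noted above.

```python
import re
from pathlib import Path


def read_cpu_topology(sysfs_root="/sys/devices/system/cpu"):
    """Return {logical_cpu: (socket, core)} as numbered by the Linux kernel."""
    topology = {}
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        match = re.fullmatch(r"cpu(\d+)", cpu_dir.name)
        topo = cpu_dir / "topology"
        # CPUs without a topology directory (for example offlined ones)
        # are skipped, which is where holes in the numbering come from.
        if match is None or not topo.is_dir():
            continue
        socket = int((topo / "physical_package_id").read_text())
        core = int((topo / "core_id").read_text())
        topology[int(match.group(1))] = (socket, core)
    return dict(sorted(topology.items()))
```

Calling `read_cpu_topology()` on the affected machine and looking up the CPU number from the mcelog record gives the kernel's socket number for that logical CPU, which can then be compared with the SOCKETID field in the same record.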
Thanks so much for your answer @andikleen! Thus the value

So I am still trying to find the right socket. The value in

All in all I would guess that it is socket nr. 1 out of (0,1) which seems to correspond to the value

Next to identify the faulty DIMM module! Is the information

So in conclusion I should replace DIMM nr. 3 (out of 1..9) connected to socket 2 (out of 0..1), right?

Sorry for these very simple questions! This is the very first time that I have to dive into such matters and they have got me quite a bit confused. May others find the answers helpful, too!
/var/log/mcelog contains the following. This happens on Ubuntu 16.04 reporting the mcelog version as (128+dfsg-1).
I have been trying to switch the DIMM in memory bank 8 at CPU 1 as labeled on the motherboard. However, that particular kind of error has been reported again at the same location (CPU 1 BANK 8). Before wildly switching DIMMs around, I am hoping that somebody might be able to tell me what kind of numbering mcelog uses, starting from zero or from one (presumably the same for all kinds of objects).
For comparison, dmidecode uses labels like PROC {1,2} DIMM {1..9} which would make the numbering from one an obvious candidate. However, I have seen examples of mcelog counting the CPUs from zero. As for using both numberings, lshw lists cpu:{0,1} in slot: Proc {1,2} and memory:{0,1} with bank:{0..8} as physical id: {0..9} in PROC {1,2} DIMM {1..9}. Finally, it could even depend on the kind of machine and how its BIOS reports to the kernel.
I have been totally unsuccessful in finding any answer to my question and I am afraid that I would not be any more successful when digging through the code. Can anybody answer this question authoritatively? Thanks in advance for your consideration!
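For readers trying to relate the raw STATUS value in the log to the decoded lines mcelog prints, here is a rough sketch that unpacks the architectural fields of an MCi_STATUS value. The bit positions follow the machine-check register layout in Intel's Software Developer's Manual as I understand it; the function name `decode_mci_status` and the dictionary keys are invented for this example, and the model-specific bits are returned raw rather than interpreted.

```python
# Transaction types (MMM field) of a memory-controller error code
# of the form 000F 0000 1MMM CCCC.
MEMORY_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status):
    """Unpack the architectural fields of a 64-bit MCi_STATUS value."""

    def bit(n):
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    decoded = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        # Bits 52:38 hold the corrected error count when the CPU supports it.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors; bit 12 (the filter bit F) is ignored.
    if (mca_code & 0xEF80) == 0x0080:
        channel = mca_code & 0xF
        decoded["transaction"] = MEMORY_TRANSACTIONS.get(
            (mca_code >> 4) & 0x7, "reserved"
        )
        # Channel 1111 means the channel is not specified in this field.
        decoded["channel"] = None if channel == 0xF else channel
    return decoded
```

Applied to the value from the log, `decode_mci_status(0x8c0000400001009f)` reports a valid, corrected memory read error with MISC and ADDR valid, a corrected count of 1, and no channel in the architectural code, which matches the "RD_CHANNELunspecified_ERR" line. The channel and DIMM IDs in the log come from the MISC register and are model specific, so they are not covered here.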