hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet
Closed, Resolved | Public | Request

Assigned To

Authored By

	RKemper
	Jun 13 2024, 4:25 PM

Description

an-worker1085.eqiad.wmnet

Urgency: Medium-High. This host is part of Hadoop, which is a very important data platform service; however, like Elasticsearch, the Hadoop cluster can handle the occasional node failure.

Note: This host is out of warranty, but we're hoping a compatible memory module can be scrounged from somewhere.

Box checking

Machine not part of a pool
Has been marked as failed in Netbox: https://netbox.wikimedia.org/dcim/devices/1969/

Issue

racadm getsel on the DRAC shows the following:

Record:      1021
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1022
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.

There are also some older messages about correctable errors on DIMM_B1, going back a couple of years:

-------------------------------------------------------------------------------
Record:      1014
Date/Time:   12/04/2022 18:31:22
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   12/04/2022 23:09:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1016
Date/Time:   08/09/2023 15:27:32
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1017
Date/Time:   08/09/2023 17:09:53
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1018
Date/Time:   09/30/2023 14:01:49
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1019
Date/Time:   10/09/2023 03:31:44
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
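
For anyone who wants to tally entries like these per DIMM rather than reading them by eye, here is a minimal sketch in Python. It assumes the `racadm getsel` output has already been saved as text in the same "Key: value" layout shown above. The names `parse_sel`, `dimm_summary`, and `SAMPLE_SEL` are hypothetical helpers made up for illustration, not part of racadm or any Wikimedia tooling.

```python
import re
from collections import Counter

# Two records copied from the log above, used only as sample input.
SAMPLE_SEL = """\
Record:      1019
Date/Time:   10/09/2023 03:31:44
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1022
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""

# Field names as they appear in the log above; other lines are ignored.
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}

# Assumption: DIMM slots are named like DIMM_B1 or DIMM_A2, as in this task.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split SEL text into a list of dicts, one dict per record."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            # A blank line or a dashed separator closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimm_summary(records):
    """Count records per (DIMM slot, severity) pair."""
    counts = Counter()
    for rec in records:
        for dimm in DIMM_RE.findall(rec.get("Description", "")):
            counts[(dimm, rec.get("Severity", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    for (dimm, severity), n in sorted(dimm_summary(parse_sel(SAMPLE_SEL)).items()):
        print(f"{dimm} {severity}: {n}")
```

Grouping by both slot and severity keeps the long-running correctable-error warnings distinguishable from the recent multi-bit errors when the full log is fed in.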

Related Objects

Mentioned Here: T367825: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093

Event Timeline

RKemper created this task. Jun 13 2024, 4:25 PM

RKemper updated the task description.

Maintenance_bot added a project: SRE. Jun 13 2024, 4:29 PM

@RKemper Is there a preference for when we could schedule this?

VRiley-WMF claimed this task. Jun 13 2024, 5:47 PM

VRiley-WMF moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

VRiley-WMF added a subscriber: Jclark-ctr.

In T367442#9890151, @VRiley-WMF wrote:

@RKemper Is there a preference for when we could schedule this?

Whenever's convenient. As long as we're only taking one* Hadoop node offline at a time, the system will be resilient to failure. Ideally we'd get one day's notice so we can downtime the host and keep it from generating alert noise.

* In fact, the system can handle at least two.

Hey @RKemper would Thursday work for you? Around 12:00 EST?

In T367442#9903813, @VRiley-WMF wrote:

Hey @RKemper would Thursday work for you? Around 12:00 EST?

@VRiley-WMF Sounds great!

Host has been downtimed. The downtime was accidentally associated with the wrong ticket: https://phabricator.wikimedia.org/T367825#9908323

Swapped out the DIMM in B1 for another compatible DIMM; the unit should be coming back online.

VRiley-WMF closed this task as Resolved. Jun 20 2024, 7:06 PM