hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet
Closed, Resolved | Public | Request

Description

an-worker1085.eqiad.wmnet

Urgency: Medium-High. This host is part of Hadoop, which is a very important data platform service; however, like Elasticsearch, the Hadoop cluster can handle the occasional node failure.

Note: This host is out of warranty but we're hoping there's a compatible memory module that can be scrounged from somewhere.

Issue

racadm getsel on the DRAC shows the following:

Record:      1021
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1022
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.

There are also some earlier, less severe messages about DIMM_B1 (correctable memory errors) stretching back to December 2022:

-------------------------------------------------------------------------------
Record:      1014
Date/Time:   12/04/2022 18:31:22
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   12/04/2022 23:09:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1016
Date/Time:   08/09/2023 15:27:32
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1017
Date/Time:   08/09/2023 17:09:53
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1018
Date/Time:   09/30/2023 14:01:49
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      1019
Date/Time:   10/09/2023 03:31:44
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
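
For anyone who wants to tally these entries rather than read them by eye, here is a minimal Python sketch that parses text in the layout shown above and counts events per DIMM and severity. It assumes the `racadm getsel` output has already been saved as plain text. The function names `parse_sel` and `dimm_summary` are made up for illustration, and the `DIMM_<letter><number>` pattern is an assumption based on the slot name seen in this log.

```python
from collections import Counter
import re

# Two records copied from the log above, used as sample input.
SAMPLE = """\
Record:      1022
Date/Time:   06/12/2024 21:48:15
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      1019
Date/Time:   10/09/2023 03:31:44
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
"""


def parse_sel(text):
    """Split SEL text on dashed separator lines and return a list of field dicts."""
    records = []
    for chunk in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimm_summary(records):
    """Count records per (DIMM slot, severity) pair."""
    counts = Counter()
    for rec in records:
        # Assumed slot naming: DIMM_ followed by a letter and a number.
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[(dimm, rec.get("Severity", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    for (dimm, severity), n in sorted(dimm_summary(parse_sel(SAMPLE)).items()):
        print(f"{dimm} {severity}: {n}")
```

Feeding the full log into `parse_sel` instead of `SAMPLE` would give the same kind of per-slot count for the whole history.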

Event Timeline

RKemper updated the task description.

@RKemper Is there a preference on when we could schedule this?

Whenever's convenient. As long as we're only taking one* Hadoop node offline at a time, the system will be resilient to failure. One day's notice would be ideal, so we can downtime the host and it won't generate alert noise.

* In fact, the system can handle at least 2.

Hey @RKemper would Thursday work for you? Around 12:00 EST?

@VRiley-WMF Sounds great!

Host has been downtimed. The downtime was accidentally associated with the wrong ticket: https://phabricator.wikimedia.org/T367825#9908323

Swapped out DIMM_B1 for another compatible DIMM; the host should be coming back online.