Memory upgrade request for prometheus100[56]
Closed, Resolved (Public)

Description

Quote/Hardware Request & Specifications

Hello! Due to the work in T350592, T359633, SLO onboarding, and other metrics initiatives, we’re seeing significant metrics growth and increased resource utilization on the prometheus hosts. To help cope, we would like to plan an in-place vertical scale-up of the eqiad/codfw prometheus hardware hosts by way of memory upgrades.

To recap the current setup: today the prometheus systems are running 192G of memory at mixed speeds, with dual Xeon(R) Silver 4114 CPUs whose maximum supported memory speed is 2400MHz.

Current memory slot layout:
A1 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
A2 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
A3 32GB DDR4 2666 (HMA84GR7DJR4N-VK or M393A4K40BB1-CRC)

B1 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
B2 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
B3 32GB DDR4 2666 (HMA84GR7DJR4N-VK or M393A4K40BB1-CRC)

We’d like to upgrade each of these hosts to 384G of RAM and, at the same time, move to a speed-matched (for simplicity/compatibility) and balanced DIMM layout for best performance. Based on my understanding of https://www.dell.com/support/manuals/en-us/poweredge-r440/per440_ism_pub/general-memory-module-installation-guidelines?guid=guid-acbc0f13-dedb-492b-a0b0-18303ded565a&lang=en-us, this would translate to a final config of:

Proposed memory layout:
6x 32G 3200 DIMMs in slots A1-A6
6x 32G 3200 DIMMs in slots B1-B6

Taking into account the 4x 32G DDR4-3200 sticks already present in the servers, stepwise this is asking for something like (per host):

  • Obtain/purchase 8x 32GB DDR4 3200 DIMMs (of matching spec to the existing 4x DDR4 32GB 3200 DIMMs)
  • Take downtime on one server at a time
  • Remove the 2x DDR4 32GB 2666 DIMMs in slots A3 and B3 (for spares/discard)
  • Install 4 new DDR4 32GB 3200 sticks in slots A3-A6 and
  • Install 4 new DDR4 32GB 3200 sticks in slots B3-B6
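As a cross-check of the arithmetic in the steps above, here is a minimal Python sketch that models the current and proposed layouts and derives which DIMMs are removed and which are installed. The `Dimm` class and helper names are hypothetical, made up for illustration. The `effective_speed` helper assumes the common rule of thumb that all DIMMs run at the slowest module's rated speed, capped by the CPU maximum quoted above; it ignores any further derating from DIMMs-per-channel population, so treat that number as an estimate only.

```python
from dataclasses import dataclass

CPU_MAX_SPEED = 2400  # max memory speed quoted above for the Xeon Silver 4114


@dataclass(frozen=True)
class Dimm:  # hypothetical helper type, not part of any real tool
    slot: str
    size_gb: int
    speed: int  # rated DDR4 speed of the module


def total_gb(dimms):
    """Sum the capacity of all populated slots."""
    return sum(d.size_gb for d in dimms)


def effective_speed(dimms, cpu_max=CPU_MAX_SPEED):
    """Assumption: memory clocks to the slowest DIMM, capped by the CPU."""
    return min(cpu_max, min(d.speed for d in dimms))


# Current layout: slots 1-2 hold DDR4-3200, slot 3 holds DDR4-2666, per CPU.
current = [
    Dimm(f"{cpu}{n}", 32, 3200 if n < 3 else 2666)
    for cpu in "AB"
    for n in (1, 2, 3)
]

# Proposed layout: 6x 32G DDR4-3200 in A1-A6 and in B1-B6.
proposed = [Dimm(f"{cpu}{n}", 32, 3200) for cpu in "AB" for n in range(1, 7)]

kept = [d for d in current if d.speed == 3200]
removed = [d for d in current if d.speed != 3200]
kept_slots = {d.slot for d in kept}
to_install = [d for d in proposed if d.slot not in kept_slots]

print("current total GB: ", total_gb(current))
print("proposed total GB:", total_gb(proposed))
print("remove from slots:", [d.slot for d in removed])
print("new DIMMs needed: ", len(to_install))
```

Under these assumptions the sketch reproduces the figures in the request: 192G today, 384G after the upgrade, two DIMMs removed (A3 and B3), and eight new DIMMs per host. It also suggests that, with a 2400 CPU cap, the rated speed of the modules mainly matters for matching and compatibility rather than for the speed the memory actually runs at.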

Please double check me on this proposal and let me know if I can answer any questions, thanks!

Need By Date

Earliest reasonable date (non-emergency)

Budget Details

Add to Q3/Q4 expendables on the Upcoming Procurement Gsheet.

Refresh / Replacement / Expanding / New Service

Upgrading prometheus hosts in place

Hostname / Racking / Installation Details

Coordinate hardware installation with Keith Herron for scheduling of host downtime.

Quote Review

This section will list/link to each quote for review.

Order Details

This section will be updated to list the order details.

Event Timeline

herron renamed this task from Memory upgrade request for prometheus[12]00[56] to Memory upgrade request for prometheus100[56]. (Mar 25 2024, 2:27 PM)
herron added a project: ops-eqiad.
herron updated the task description.

@herron As it turns out, we currently don't have spare memory at 32Gig DDR4 3200. However, we have plenty of 32Gig DDR4 2666. Would this be an acceptable substitute? Let me know, thanks!

Ah excellent! I thought we would have to order new. Yes, in that case let's go ahead with 32Gig DDR4 2666 please. Thank you!

wiki_willy mentioned this in Unknown Object (Task). (Mar 29 2024, 4:49 PM)

CC from IRC chat -- We've tentatively scheduled this for this Weds afternoon (Eastern TZ, 4/3/2024)

Mentioned in SAL (#wikimedia-operations) [2024-04-03T17:04:43Z] <herron> performing rolling memory upgrades on prometheus100[56] T360687

Worked with @herron and added the 32Gig DDR4 2666 to the requested slots. Both servers came back up and reported the correct sizes as expected. Closing this ticket.

Reopening -- today we experienced a memory issue on prometheus1005 which presumably relates to this maintenance. Could we arrange to swap the faulty DIMM outlined in T362990? Thanks in advance!

Hey @herron, is there a specific time you'd like us to arrange for this activity? Let us know, thanks!

Prometheus1005 is down and depooled, any time works!

@herron B3 DIMM has been replaced and the server should be coming back online. Please check and confirm. Thank you!

Thanks! Looks good!

JFTR I made a backup of the ipmi sel in /root/ipmi-sel.log-20240423 and then cleared the sel for a clean slate from here