Misbehaving mw-api-ext pods serving 5xx
Open, Low, Public

Description

On two occasions we have been paged for an elevated error rate by ATSBackendErrorsHigh. Both times the problem was a single pod serving only 5xx responses.

2024-07-17 10:56: mw-api-ext.eqiad.main-5945d96d65-s87pl
2024-07-18 07:57: mw-api-ext.eqiad.main-7686884f77-ql69d
Actions

On both occasions the pods were deleted: on the 17th manually (via kubectl -n mw-api-ext delete <po>), while on the 18th the MediaWiki train deployment replaced the pod.
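
For reference, a minimal sketch of the manual remediation, assuming the misbehaving pod has already been identified from the ATS/envoy 5xx metrics (the pod stays Running, so kubectl alone will not flag it); the pod name below is the one from the 2024-07-18 occurrence:

  # Inspect the suspect pod's recent events and restart counts first
  kubectl -n mw-api-ext describe pod mw-api-ext.eqiad.main-7686884f77-ql69d

  # Delete it; the owning controller schedules a fresh replacement
  kubectl -n mw-api-ext delete pod mw-api-ext.eqiad.main-7686884f77-ql69d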

Observations
  • Suddenly (?), Apache started reporting only 1 busy worker (see the quick check sketched after this list)
  • Resources, at a first glance, are not an issue
  • POST requests
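
A quick way to confirm the busy-worker observation from inside the pod, assuming Apache's mod_status endpoint is enabled on localhost and that the httpd container is literally named httpd (both are assumptions about the mw-api-ext chart):

  # Container name, port and status path are assumptions; adjust to the chart
  kubectl -n mw-api-ext exec mw-api-ext.eqiad.main-7686884f77-ql69d -c httpd -- \
    curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers'
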
Open questions
  • Why didn't k8s delete the pods?
    • We need to figure out whether our readiness/liveness probes serve us as they should (see the probe-inspection sketch after this list)
  • How did the pod get into that state?
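
As a first step on the probe question, a sketch of dumping what is actually configured on the pod, so the probe types and thresholds can be compared against this failure mode:

  # Print each container's name and its liveness/readiness probe definitions
  kubectl -n mw-api-ext get pod mw-api-ext.eqiad.main-7686884f77-ql69d \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{.livenessProbe}{"\n"}{.readinessProbe}{"\n\n"}{end}'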

[Attachment: image.png (247 KB)]

[Attachment: image.png (81 KB)]

Event Timeline

jijiki triaged this task as High priority. Jul 18 2024, 3:04 PM
jijiki created this task.

In both cases, workers start failing with SIGILL at the start of badness, e.g. (from mw-api-ext.eqiad.main-7686884f77-ql69d):

[18-Jul-2024 06:23:21] WARNING: [pool www] child 8 exited on signal 4 (SIGILL) after 32835.333263 seconds from start

Basically, all the long-lived workers fail with SIGILL, and then all subsequently spawned workers do as well fairly promptly (roughly matching the rate of 500s we see in the envoy metrics).

That smells like persistent memory corruption ... which as a new-to-PHP person I find surprising. I've seen folks refer to opcache as something to look at here, and now that I see what that is, it certainly at least sounds like a vector for behavior like this.
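
For anyone re-running this analysis, the log check is roughly the following; the php-fpm container name is a placeholder, not the actual name used in the mw-api-ext chart:

  # Count SIGILL worker exits in the pod's php-fpm output around the incident window
  kubectl -n mw-api-ext logs mw-api-ext.eqiad.main-7686884f77-ql69d -c <php-fpm-container> \
    --since-time='2024-07-18T06:00:00Z' | grep -c 'exited on signal 4 (SIGILL)'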

Interesting! Thank you for digging a wee bit further!

From Thursday until today I have not observed the issue again, so I suggest we wait and see if it happens again.

The SIGILL thing happened on bare metal as well, albeit quite rarely. We never properly tracked down what happened, but it seemed to have some relation to accessing the shared anonymous memory and the related semaphores, so I guess one of apcu or opcache is responsible. I'm starting to think we might need a liveness probe of some kind for the pod to depend on.

We could consider it, given we have gone down opcache/apcu rabbit holes more than once. What is on your mind?
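
For illustration only, a minimal sketch of the kind of check such an exec liveness probe could run inside the php-fpm container; the status URL, port, and the idea of keying on php-fpm's own status page are assumptions, not something the chart currently does:

  #!/bin/bash
  # Hypothetical liveness check: fail the probe if php-fpm cannot serve its
  # status page or reports no worker processes at all. In the incidents above
  # every worker was dying with SIGILL, so this request should start failing
  # while a healthy pod keeps passing.
  set -euo pipefail

  STATUS_URL="http://127.0.0.1:9181/fpm-status"   # port and path are assumptions

  out="$(curl -sf --max-time 2 "$STATUS_URL")" || exit 1

  total="$(printf '%s\n' "$out" | awk '/^total processes/ {print $NF}')"
  [ -n "$total" ] && [ "$total" -ge 1 ] || exit 1

If wired up as an exec livenessProbe, the kubelet would restart the container after failureThreshold consecutive failures, which is exactly what did not happen here.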

jijiki lowered the priority of this task from High to Low. Jul 26 2024, 8:07 AM

Lowering priority as, for the time being, the impact is minor. Will raise it again if we observe more occurrences.