Misbehaving mw-api-ext pods serving 5xx
Open, Low, Public

Description

On two occasions we have been paged for an elevated error rate by ATSBackendErrorsHigh. Both times the problem was a single pod serving only 5xx responses.

2024-07-17 10:56: mw-api-ext.eqiad.main-5945d96d65-s87pl
2024-07-18 07:57: mw-api-ext.eqiad.main-7686884f77-ql69d
Actions

On both occasions the pods were deleted: on the 17th manually (via kubectl -n mw-api-ext delete <po>), while on the 18th the MediaWiki train deployment replaced the pod.
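
For reference, a minimal sketch of the manual remediation, assuming the misbehaving pod has already been identified from the ATS/envoy 5xx metrics (the pod stays Running, so kubectl alone will not flag it); the pod name below is the one from the 2024-07-18 occurrence:

  # Inspect the suspect pod's recent events and restart counts first
  kubectl -n mw-api-ext describe pod mw-api-ext.eqiad.main-7686884f77-ql69d

  # Delete it; the owning controller schedules a fresh replacement
  kubectl -n mw-api-ext delete pod mw-api-ext.eqiad.main-7686884f77-ql69d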

Observations
  • Suddenly (?), Apache started reporting only 1 busy worker (see the quick check sketched after this list)
  • Resources, at a first glance, are not an issue
  • POST requests
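
A quick way to confirm the busy-worker observation from inside the pod, assuming Apache's mod_status endpoint is enabled on localhost and that the httpd container is literally named httpd (both are assumptions about the mw-api-ext chart):

  # Container name, port and status path are assumptions; adjust to the chart
  kubectl -n mw-api-ext exec mw-api-ext.eqiad.main-7686884f77-ql69d -c httpd -- \
    curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers'
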
Open questions
  • Why didn't k8s delete the pods?
    • We need to figure out whether our readiness/liveness probes serve us as they should (see the probe-inspection sketch after this list)
  • How did the pod get into that state?
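
As a first step on the probe question, a sketch of dumping what is actually configured on the pod, so the probe types and thresholds can be compared against this failure mode:

  # Print each container's name and its liveness/readiness probe definitions
  kubectl -n mw-api-ext get pod mw-api-ext.eqiad.main-7686884f77-ql69d \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{.livenessProbe}{"\n"}{.readinessProbe}{"\n\n"}{end}'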

[Attachment: image.png (247 KB)]

[Attachment: image.png (81 KB)]

Event Timeline

jijiki triaged this task as High priority. Jul 18 2024, 3:04 PM
jijiki created this task.

In both cases, workers start failing with SIGILL at the start of badness, e.g. (from mw-api-ext.eqiad.main-7686884f77-ql69d):

[18-Jul-2024 06:23:21] WARNING: [pool www] child 8 exited on signal 4 (SIGILL) after 32835.333263 seconds from start

Basically, all the long-lived workers fail with SIGILL, and then all subsequently spawned workers do as well fairly promptly (roughly matching the rate of 500s we see in the envoy metrics).

That smells like persistent memory corruption ... which as a new-to-PHP person I find surprising. I've seen folks refer to opcache as something to look at here, and now that I see what that is, it certainly at least sounds like a vector for behavior like this.
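
For anyone re-running this analysis, the log check is roughly the following; the php-fpm container name is a placeholder, not the actual name used in the mw-api-ext chart:

  # Count SIGILL worker exits in the pod's php-fpm output around the incident window
  kubectl -n mw-api-ext logs mw-api-ext.eqiad.main-7686884f77-ql69d -c <php-fpm-container> \
    --since-time='2024-07-18T06:00:00Z' | grep -c 'exited on signal 4 (SIGILL)'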

Interesting! Thank you for digging a wee bit further!

From Thursday until today I have not observed the issue again, so I suggest we wait and see if it happens again.

The SIGILL thing happened on bare metal as well, albeit quite rarely. We never properly tracked down what happened, but it seemed to have some relation to accessing the shared anonymous memory and the related semaphores, so I guess one of apcu or opcache is responsible. I'm starting to think we might need a liveness probe of some kind for the pod to depend on.

We could consider it, given we have gone down opcache/apcu rabbit holes more than once. What is on your mind?
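
For illustration only, a minimal sketch of the kind of check such an exec liveness probe could run inside the php-fpm container; the status URL, port, and the idea of keying on php-fpm's own status page are assumptions, not something the chart currently does:

  #!/bin/bash
  # Hypothetical liveness check: fail the probe if php-fpm cannot serve its
  # status page or reports no worker processes at all. In the incidents above
  # every worker was dying with SIGILL, so this request should start failing
  # while a healthy pod keeps passing.
  set -euo pipefail

  STATUS_URL="http://127.0.0.1:9181/fpm-status"   # port and path are assumptions

  out="$(curl -sf --max-time 2 "$STATUS_URL")" || exit 1

  total="$(printf '%s\n' "$out" | awk '/^total processes/ {print $NF}')"
  [ -n "$total" ] && [ "$total" -ge 1 ] || exit 1

If wired up as an exec livenessProbe, the kubelet would restart the container after failureThreshold consecutive failures, which is exactly what did not happen here.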

jijiki lowered the priority of this task from High to Low. Jul 26 2024, 8:07 AM

Lowering priority as, for the time being, the impact is minor. Will raise it again if we observe more occurrences.