On two occasions we have been paged for an elevated error rate by ATSBackendErrorsHigh. Both times the problem came from a single pod serving only 5xx responses.
2024-07-17 10:56: mw-api-ext.eqiad.main-5945d96d65-s87pl
- node: mw1399.eqiad.wmnet
- resources / Pod details
- application-level pod metrics
- mediawiki errors
- Varnish logs
2024-07-18 07:57: mw-api-ext.eqiad.main-7686884f77-ql69d
- node: kubernetes1041.eqiad.wmnet
- resources / Pod details
- application-level pod metrics
- mediawiki errors
- slowlog
- Varnish logs
Actions
On both occasions the pods were deleted: on the 17th manually (via kubectl -n mw-api-ext delete <po>), while on the 18th the MediaWiki train rolled the pod over.
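For reference, a minimal sketch of the manual remediation from the 17th (pod and node names are the ones from the incident above; deleting the pod only lets the owning ReplicaSet schedule a replacement, it does not explain how the pod got into that state):

  # Confirm which pod the 5xx responses come from; the NODE column cross-checks against mw1399.
  kubectl -n mw-api-ext get pods -o wide | grep mw1399.eqiad.wmnet
  # Delete it; the ReplicaSet will schedule a fresh pod.
  kubectl -n mw-api-ext delete pod mw-api-ext.eqiad.main-5945d96d65-s87pl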
Observations
- Suddenly (?), Apache started reporting only 1 busy worker (a way to check this from inside the pod is sketched after this list)
- Resources, at first glance, are not an issue
- POST requests
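One way to double-check the busy-worker count directly would be to query mod_status from inside the pod. This is only a sketch: it assumes mod_status is enabled and reachable on localhost, that curl is available in the container, and the container name (httpd) and <pod> placeholder are assumptions, not verified against our chart:

  # Scoreboard in machine-readable form; look at BusyWorkers / IdleWorkers.
  kubectl -n mw-api-ext exec <pod> -c httpd -- curl -s 'http://localhost/server-status?auto'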
Open questions
- Why didn't k8s delete the pods on its own?
- We need to figure out whether our readiness/liveness probes serve us as they should (see the sketch after this list)
- How did the pod get into that state?
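A starting point for the probe question is to dump what the pods actually have configured: if both probes only hit a cheap health endpoint, a pod that answers real application routes with 5xx can still pass them and stay in service. A sketch, using the namespace from the incidents above (the columns show only httpGet paths and will read <none> for exec/tcpSocket probes):

  # Show readiness/liveness probe paths for every container in the namespace.
  kubectl -n mw-api-ext get pods -o custom-columns='POD:.metadata.name,READINESS:.spec.containers[*].readinessProbe.httpGet.path,LIVENESS:.spec.containers[*].livenessProbe.httpGet.path'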