User Details
- User Since: Oct 3 2014, 8:06 AM
- Availability: Available
- IRC Nick: godog
- LDAP User: Filippo Giunchedi
- MediaWiki User: FGiunchedi (WMF)
Yesterday
I captured otelcoll traffic on mw2310 (it looks like it is always, and only, the otel-coll instance running on this host that causes problems) in the hope of understanding what the problematic trace is. I've left /root/otelcoll-T370043.cap in place in case it is of interest; I could decode the http2 traffic with wireshark (select traffic with source port 4317, then "decode as" http2), though I couldn't get wireshark to decode the grpc traffic.
I'm investigating this again today, and yes there's a single otel-coll pod in codfw that is problematic: https://w.wiki/AhDr
I'm tentatively resolving this: even though the topic isn't perfectly balanced, I'm calling it good enough™. We can revisit as needed.
Thank you all for the clarification, I'm glad we're able to do the move without re-addressing after all!
Wed, Jul 17
So I looked at where the probes come from: they are part of the generic "probe mgmt network hosts for ssh" job, and the data comes from Netbox, specifically via these bits in modules/profile/manifests/prometheus/ops.pp:
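For illustration only (this is not the ops.pp content, which isn't reproduced here), an ssh-banner probe of this kind boils down to a blackbox_exporter module along these lines, following the upstream example configuration:

```yaml
# Sketch based on the upstream blackbox_exporter example config, not on our Puppet-generated config:
# a TCP probe that checks for an SSH banner on the mgmt hosts discovered from Netbox.
modules:
  ssh_banner:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
```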
I'm removing grafana since it turns out this issue is unrelated.
I have reached out to upstream at https://github.com/redpanda-data/connect/issues/2705 to make them aware, in case they have seen this before. I'm not overly concerned just yet, in the sense that IIRC this is the first time we've seen such a failure and it could be an unlucky coincidence.
There are similar logs in benthos@webrequest_live on centrallog1002:
There is also this in the logs around the time consuming stopped:
Tue, Jul 16
This is done! We went with the procedure I suggested above, namely: I took care of the host-side configuration by logging back in via console and adjusting the network configuration for the new interface name, @Jhancock.wm did the physical host move by connecting everything in its final location, and @Papaul did the Netbox + network part.
Perhaps unsurprisingly, the problem is back, only on a different pod, which in my mind excludes an otel-coll problem and points to a client sending troublesome spans.
A very low-hanging fruit to make progress on this task is adding the following Prometheus-based checks:
Thank you @Papaul that is quite helpful!
I'm tentatively resolving this since the increase in partitions in T369256 has helped by increasing concurrency.
I'm happy to report that increasing the partition count and restarting the benthos mw sampler did the trick: lag was reduced even before the restart and completely gone after the restart:
This is ongoing; to exclude an otel-coll problem I've deleted the main-opentelemetry-collector-agent-msvrc pod in wikikube codfw. Let's see if it happens again.
Mon, Jul 15
Thank you @LSobanski ! I'll be reaching out to the individual service owners
Fri, Jul 12
Today I've done extensive tests and tweaking of benthos@webrequest_live settings, namely:
Thank you @Jhancock.wm, that's great! Please let me know a day and time next week that would work for you.
Thu, Jul 11
A bit of context: benthos@webrequest_live normally runs on centrallog1002 and centrallog2002, consuming the webrequest firehose from jumbo-eqiad. While at steady state this is more or less fine, on sudden spikes of messages the codfw consumer struggles to clear its backlog/lag.
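As a rough sketch of how that lag can be watched in Prometheus (the metric and label names below are assumptions in kafka_exporter style, not necessarily what jumbo-eqiad exposes, and the consumer-group regex is a stand-in):

```yaml
# Sketch only: aggregate per-topic lag for the benthos webrequest_live consumer group.
# kafka_consumergroup_lag is the kafka_exporter metric name; the labels here are assumptions.
groups:
  - name: benthos_webrequest_live
    rules:
      - record: consumergroup_topic:kafka_consumergroup_lag:sum
        expr: >
          sum by (consumergroup, topic) (
            kafka_consumergroup_lag{consumergroup=~"benthos.*webrequest.*"}
          )
```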
I've taken a look at this as well for the benthos bits, and there was significant kafka lag for the benthos consumer group at the time, which would explain the delay in data
Thank you @EUwandu-WMF ! Everything checks out: you are now part of the wmf group and therefore have access. I'm tentatively resolving, though please reopen as needed.
The patch is merged and I've added soda to the nda LDAP group; tentatively resolving, though please reopen if needed.
Wed, Jul 10
Tue, Jul 9
Calling this done: Pontoon now supports Puppet 7 (puppetserver) and I've updated the wikitech documentation.
This is done (the monitoring project is no longer a thing)
No worries at all and totally fair @Aklapper !
Definitely +1 to using recording rules for histograms, since they are broken already anyway. I think switching the istio dashboard to thanos might also help, as there is caching within thanos (the stateless part, thanos-query-frontend and thanos-query).
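For reference, the kind of recording rule I have in mind would precompute the expensive histogram_quantile over the bucket series; a minimal sketch, assuming the standard Istio metric name (rule and label names are illustrative, not what's deployed):

```yaml
# Sketch: precompute a p99 latency series so the istio dashboard queries the recorded
# result instead of re-aggregating raw buckets on every panel refresh. Names are illustrative.
groups:
  - name: istio_latency
    rules:
      - record: destination_service:istio_request_duration_milliseconds:p99_5m
        expr: >
          histogram_quantile(0.99,
            sum by (le, destination_service) (
              rate(istio_request_duration_milliseconds_bucket[5m])
            )
          )
```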
@Lferreira you are now part of the wmf LDAP group. I'm optimistically resolving the task, though please reopen if something is amiss!
You should be able to access the dashboards in ~30 min from now, please confirm that is the case.
Mon, Jul 8
Hello @XiaoXiao-WMF; I've added you to the wmf LDAP group. I'm tentatively resolving the task, though please reopen if something is amiss.
Hello @EUwandu-WMF, I couldn't find the uniquemia account on wikitech, or at least not one with euwandu-ctr@wikimedia.org as its email. Which wikitech account should we be using? Thank you!
Fri, Jul 5
There is actually an mwlog02 bullseye instance which is already active, so I've proceeded to delete mwlog01.
Got it now, thank you @Marostegui. My recommendation in this case is to use mtail to get metrics out of logs; we can decide where such parsing should happen. My suggestion is to run mtail on the db hosts, tailing the log files directly; this way you don't depend on rsyslog delivering to centrallog, centrallog working, etc. There are some examples in puppet already for roles that use mtail (e.g. hieradata/role/common/mail/mx.yaml), which should be a good starting point; we (o11y) are of course happy to help too.
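As a sketch of what the hieradata side could look like (the key names below are purely hypothetical; the actual parameters are whatever the mtail profile in puppet expects, as in the mx.yaml example above):

```yaml
# Hypothetical hieradata sketch: key names are illustrative only, not the real puppet profile parameters.
# The idea is simply: declare which mtail programs to load and which log files to tail on the db hosts.
profile::mtail::programs:
  - mysql_slowlog
profile::mtail::logs:
  - /var/log/mysql/mysql-slow.log
```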
I think I'm missing context here; would you mind expanding on what you are trying to achieve?
Thu, Jul 4
I suspect this is related to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic; we'll know more once that task is done