

CDanis (Chris Danis)
SRE @ WMF


User Details

User Since
Nov 5 2018, 2:54 PM (303 w, 6 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF) [ Global Accounts ]

Recent Activity

Fri, Aug 23

CDanis updated the task description for T373230: auto-create grafana annotations from VictorOps pages.
Fri, Aug 23, 9:56 PM · Patch-For-Review, observability
CDanis triaged T373230: auto-create grafana annotations from VictorOps pages as Medium priority.
Fri, Aug 23, 9:54 PM · Patch-For-Review, observability
CDanis added a comment to T373230: auto-create grafana annotations from VictorOps pages.

Example invocation of the WIP implementation:

klaxon/victorops.py -v incidents_to_annotations -a 2024-07-01  -b 2024-08-01 \
  | xargs -d$'\n' -I@ curl -H "Authorization: Bearer $GRAFANA_SVC_ACCT_TOKEN" -H 'content-type: application/json' --data '@' https://grafana.wikimedia.org/api/annotations -s
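
The pipeline above feeds one JSON document per line to the Grafana annotations endpoint. A minimal sketch of what producing such a line could look like follows. It is not the klaxon implementation: the incident field names (start, end, summary, service) and the tag values are hypothetical. Only the payload keys (time and timeEnd in epoch milliseconds, tags, text) follow the Grafana annotations HTTP API.

```python
import json
from datetime import datetime, timezone


def to_epoch_ms(iso: str) -> int:
    """Convert an ISO 8601 timestamp to epoch milliseconds (naive values are treated as UTC)."""
    dt = datetime.fromisoformat(iso)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def incident_to_annotation(incident: dict) -> dict:
    """Map a hypothetical incident record to a Grafana annotation payload."""
    return {
        "time": to_epoch_ms(incident["start"]),
        "timeEnd": to_epoch_ms(incident["end"]),
        # Tag names are illustrative assumptions, not the ones used in production.
        "tags": ["victorops", incident["service"]],
        "text": incident["summary"],
    }


def to_json_line(incident: dict) -> str:
    """Serialize one annotation as a single line, suitable for xargs -d'\\n'."""
    return json.dumps(incident_to_annotation(incident))


if __name__ == "__main__":
    example = {
        "start": "2024-07-01T00:00:00",
        "end": "2024-07-01T00:10:00",
        "summary": "example page",
        "service": "example-service",
    }
    print(to_json_line(example))
```

Emitting one compact JSON object per line keeps the output compatible with the xargs -I@ substitution used in the invocation, since each line becomes exactly one --data argument.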
Fri, Aug 23, 9:54 PM · Patch-For-Review, observability
CDanis created T373230: auto-create grafana annotations from VictorOps pages.
Fri, Aug 23, 9:51 PM · Patch-For-Review, observability
CDanis added a comment to T373189: Establish a proper process for repacing kafka nodes.

Just FTR -- we didn't actually saturate anything at all on the Kafka hosts. The hottest any NIC got was about 70% of line rate, which is warm but not hot. There were no drops or errors on the switch ports, and nic-saturation-exporter metrics show that we weren't micro-bursting above 800mbit/s except for a one-minute interval (and nic-sat says we were below 900mbit/s then). Disk I/O queues were fine, all the other NICs were fine, and it wasn't some goofy thing where we were saturating core 0 or anything on kafka-main2006 either (link).

Fri, Aug 23, 8:05 PM · serviceops
CDanis added a comment to T372507: Prepare WMF PHP 8.1 packages for Bullseye.

In puppet, this comment [2] would suggest it's only needed for fundraising use cases. That's a little surprising, as we wouldn't expect that to be used in a fundraising context (although I'm told there are cases where "fundraising" colloquially can refer to specific functionality in, e.g., CentralNotice).

Fri, Aug 23, 7:40 PM · MediaWiki-Platform-Team (Radar), Patch-For-Review, serviceops
CDanis added a comment to T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes.
Concrete proposal: monitor client connections as a load indicator for masters

I really like this idea. We'd need help from IF, but I am happy to help as much as I can. I don't think it will have a huge impact on performance, but we can start by trying it on misc masters. Ideally we should have this tooling/metric everywhere, not just on masters. However, the number of writes is way lower than the number of reads, so we'd need to measure the impact on reads as well.
If we prefer to have this only on masters for performance reasons, we should make it controlled by puppet and the master role, which we assign in the hiera file of each master, so it gets enabled/disabled when we promote/demote a master.
We can of course explore the possibility of enabling it for reads too.

Fri, Aug 23, 7:14 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis added a comment to T317799: Rate limiting for hotlinked images.
11:47:53	<vgutierrez>	hmmm if those are megabytes my intention was to set it to 300 / 8 = 37.5
11:48:00	<cdanis>	ahahaha
11:48:03	<cdanis>	that ALSO sounds good to me
11:48:25	<vgutierrez>	traffic shaping in the cp servers doesn't allow a single stream to go over 300mbps IIRC
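
The arithmetic in the exchange above is a megabit-to-megabyte conversion: a 300 Mbit/s per-stream cap corresponds to 300 / 8 = 37.5 MB/s. A minimal sketch, with a hypothetical helper name:

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a rate in megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


if __name__ == "__main__":
    # The 300 Mbit/s figure comes from the conversation above.
    print(mbit_to_mbyte_per_s(300))
```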
Fri, Aug 23, 3:53 PM · Patch-For-Review, SRE-Sprint-Week-Sustainability-March2023, Traffic, Sustainability (Incident Followup)
CDanis moved T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata from Blocked/waiting to Inbox, needs triage on the MediaWiki-Platform-Team board.
Fri, Aug 23, 3:43 PM · MediaWiki-Engineering, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing
CDanis added a project to T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata: MediaWiki-Engineering.
Fri, Aug 23, 3:43 PM · MediaWiki-Engineering, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing
CDanis updated subscribers of T317799: Rate limiting for hotlinked images.

@Jelto and @fgiunchedi handled a repeat Jio set-top box hotlink issue between 0600-0730 UTC today. The new requestctl support for cache hits in Varnish (T317794) was successfully exercised by Jelto to stop answering the requests, which is awesome.

Fri, Aug 23, 3:17 PM · Patch-For-Review, SRE-Sprint-Week-Sustainability-March2023, Traffic, Sustainability (Incident Followup)
CDanis added a comment to T373192: Improve eventgates health check/readiness probe.

@Ottomata wrote:

If eventgate cannot produce to Kafka, it is broken. In this case, an overloaded Kafka was maybe the problem, but in other cases, it could be misconfigured pods. In that case, it would be better to fail the deployment and roll back.

Fri, Aug 23, 2:28 PM · Patch-For-Review, serviceops-radar, Event-Platform, Data-Engineering

Thu, Aug 22

CDanis added a comment to T372411: Automation to find / summarize "orphaned" traces.

ES has a hardcoded maximum of 1000 top-level fields per index, after which point it will refuse writes that add any more fields, so that's one argument to not enable the flag that puts all tags as fields .... but I could still be convinced. I guess you could just drop the index for that day and start over and it would be fine ... yeah. Yeah actually let's just do all tags as fields? I don't think there's any other downside.

Thu, Aug 22, 7:48 PM · Patch-For-Review, Observability-Tracing
CDanis added a comment to T372411: Automation to find / summarize "orphaned" traces.

Very interesting, yes, we should definitely be experimenting with tags-as-fields. I skimmed the upstream post, and at this time it isn't clear to me what the transition looks like. Or, put another way, whether the Jaeger UI transparently supports reading from both "formats".

Thu, Aug 22, 7:46 PM · Patch-For-Review, Observability-Tracing
CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

image.png (784×1 px, 385 KB)

can anyone besides you see this file?

Thu, Aug 22, 12:32 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis attached a referenced file: F57272257: image.png.
Thu, Aug 22, 12:31 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error

Wed, Aug 21

CDanis triaged T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis as High priority.
Wed, Aug 21, 6:17 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis triaged T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes as High priority.
Wed, Aug 21, 6:17 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis updated subscribers of T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes.

@Ladsgroup @Marostegui I have a mix of concrete proposals and naive questions for you re: MariaDB observability :)

Wed, Aug 21, 3:00 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis updated subscribers of T344171: Reverse DNS for k8s pods IPs.

For the record: The CoreDNS Pods are, like all other Pods, reachable by their Pod IP. So going through a debug Pod to resolve the PTRs is not really necessary.

Wed, Aug 21, 1:26 PM · Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis created T373019: Requesting Wikitech:Content administrators access for CDanis.
Wed, Aug 21, 12:38 PM · User-bd808, wikitech.wikimedia.org
CDanis added a comment to T344171: Reverse DNS for k8s pods IPs.

I think in light of T370304 we need to prioritize this -- and a proper version, not just creating static placeholder records.

Wed, Aug 21, 2:20 AM · Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis added a subtask for T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes: T344171: Reverse DNS for k8s pods IPs.
Wed, Aug 21, 2:20 AM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis added a parent task for T344171: Reverse DNS for k8s pods IPs: T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes.
Wed, Aug 21, 2:20 AM · Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis raised the priority of T344171: Reverse DNS for k8s pods IPs from Low to Needs Triage.
Wed, Aug 21, 2:19 AM · Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis created P67405 (An Untitled Masterwork).
Wed, Aug 21, 1:30 AM
CDanis updated the title for P67404 random sample of pod IPs matched against existing CoreDNS PTRs for Endpoints from Masterwork From Distant Lands to random sample of pod IPs matched against existing CoreDNS PTRs for Endpoints.
Wed, Aug 21, 1:12 AM

Tue, Aug 20

CDanis added a subtask for T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis: T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes.
Tue, Aug 20, 9:23 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis added a parent task for T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes: T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.
Tue, Aug 20, 9:22 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis created T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes.
Tue, Aug 20, 9:22 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis edited P67401 really gross hacks -- k8s pod IPs for T370304.
Tue, Aug 20, 9:17 PM
CDanis created P67401 really gross hacks -- k8s pod IPs for T370304.
Tue, Aug 20, 9:16 PM
CDanis added a comment to T372411: Automation to find / summarize "orphaned" traces.

Nice! This is very cool.

Tue, Aug 20, 1:58 PM · Patch-For-Review, Observability-Tracing

Mon, Aug 19

CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Deletes are much, much more expensive than edits (at least currently, due to T20493): a delete has to remove all the rows from the revision table and insert them into the archive table. We will eventually fix this, and it would then be much easier to allow batching and so on, but for now we even had to move in the opposite direction (the Nuke extension switched to actually queueing jobs that delete each page).

Mon, Aug 19, 3:36 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

With a lot of encouragement I was able to get a long-term graph of that row locks metric in s4:

image.png (1×1 px, 228 KB)

Mon, Aug 19, 2:28 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error

Thu, Aug 15

CDanis added a comment to T320556: Micro-specification for how service owners should propagate tracing headers.

There is a quick way already - Telemetry::getRequestHeaders().

Thu, Aug 15, 1:12 PM · Observability-Tracing

Wed, Aug 14

CDanis updated subscribers of T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.
Wed, Aug 14, 3:49 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis updated subscribers of T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.
Wed, Aug 14, 3:30 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error

Tue, Aug 13

CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Oh, and here's the summarized perf report output from one of those perfs:

Tue, Aug 13, 10:21 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Artifacts I collected:

Tue, Aug 13, 10:16 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis renamed T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis from Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. to Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.
Tue, Aug 13, 10:06 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Internal incident doc: https://docs.google.com/document/d/1lscZB565H5z610ECTpit0lzke-rS3au0Cokn-MAy9xw/edit

Tue, Aug 13, 8:52 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
CDanis added a comment to T320556: Micro-specification for how service owners should propagate tracing headers.

I have a very quick first draft at https://wikitech.wikimedia.org/wiki/Distributed_tracing/Propagating_tracing_context

Tue, Aug 13, 8:13 PM · Observability-Tracing
CDanis updated the task description for T368064: Some mw-api-int traffic is going cross-DC.
Tue, Aug 13, 6:34 PM · MediaWiki-Platform-Team (Radar), serviceops
CDanis updated the task description for T372411: Automation to find / summarize "orphaned" traces.
Tue, Aug 13, 2:34 PM · Patch-For-Review, Observability-Tracing
CDanis created T372411: Automation to find / summarize "orphaned" traces.
Tue, Aug 13, 2:33 PM · Patch-For-Review, Observability-Tracing
CDanis added projects to T372408: Adding a new conftool schema type might require a manual `conftool-merge` invocation: Infrastructure-Foundations, serviceops.
Tue, Aug 13, 2:17 PM · serviceops, Infrastructure-Foundations, conftool
CDanis triaged T372408: Adding a new conftool schema type might require a manual `conftool-merge` invocation as Low priority.
Tue, Aug 13, 2:17 PM · serviceops, Infrastructure-Foundations, conftool
CDanis created T372408: Adding a new conftool schema type might require a manual `conftool-merge` invocation.
Tue, Aug 13, 2:17 PM · serviceops, Infrastructure-Foundations, conftool

Thu, Aug 8

CDanis added a comment to T365361: Establish a process to periodically upgrade the CFSSL infrastructure.

Thanks to @JMeybohm for giving us a good head start on this:

I pushed a branch (wmf-v1.6.5) to our gitlab cfssl repo with the 1.6.5 version (current upstream) plus our patches. I've also updated the debian patches with gbp

Thu, Aug 8, 2:08 PM · CFSSL-PKI, Infrastructure-Foundations

Tue, Aug 6

CDanis updated the task description for T371120: service-utils helper for trace header propagation.
Tue, Aug 6, 5:50 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis updated the task description for T371842: Come up with a roadmap for supporting tracing for batch- or message-driven systems.
Tue, Aug 6, 3:52 PM · Observability-Tracing

Mon, Aug 5

CDanis closed T371129: Extension:CirrusSearch not propagating tracing headers as Resolved.

BTW I opened a new task T371842: Come up with a roadmap for supporting tracing for batch- or message-driven systems with a bunch of pointers for what I think is the current state of the OTel/Jaeger world on this.

Mon, Aug 5, 6:04 PM · MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), Discovery-Search (Current work), Observability-Tracing, CirrusSearch
CDanis closed T371129: Extension:CirrusSearch not propagating tracing headers, a subtask of T320559: Trace header propagation for MediaWiki, as Resolved.
Mon, Aug 5, 6:04 PM · MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), MediaWiki-Platform-Team, Observability-Tracing
CDanis triaged T371842: Come up with a roadmap for supporting tracing for batch- or message-driven systems as Low priority.
Mon, Aug 5, 6:03 PM · Observability-Tracing
CDanis created T371842: Come up with a roadmap for supporting tracing for batch- or message-driven systems.
Mon, Aug 5, 6:03 PM · Observability-Tracing
CDanis closed T371390: Enable Jaeger "archived traces" feature as Resolved.
Mon, Aug 5, 3:04 PM · Observability-Tracing
CDanis closed T371390: Enable Jaeger "archived traces" feature, a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Mon, Aug 5, 3:02 PM · Epic, Observability-Tracing
CDanis added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

image.png (1×1 px, 210 KB)
fix lgtm :)

Mon, Aug 5, 1:15 PM · Observability-Tracing

Aug 2 2024

CDanis renamed T371644: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman from Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman to Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman.
Aug 2 2024, 3:43 PM · Infrastructure-Foundations, Tools
CDanis added a project to T371644: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman: Infrastructure-Foundations.
Aug 2 2024, 3:43 PM · Infrastructure-Foundations, Tools
CDanis added a comment to T368389: WE4.3.1 - IP traffic.
  • I also think that silent-drop occurring during attack time windows might be a good signal... @CDanis, could you share with me an expanded version of the silentdrop dataset covering the entire month of July to see if these bots triggered any silentdrop action?
Aug 2 2024, 3:15 PM · Knowledge-Integrity, OKR-Work, Research (FY2024-25-Research-July-September)
CDanis assigned T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata to Krinkle.
Aug 2 2024, 1:37 PM · MediaWiki-Engineering, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing
CDanis updated the task description for T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata.
Aug 2 2024, 1:31 PM · MediaWiki-Engineering, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing

Aug 1 2024

CDanis added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

I ran otelcol with 1Gi RAM in codfw for half an hour.

Aug 1 2024, 6:01 PM · Observability-Tracing
CDanis added a comment to T368389: WE4.3.1 - IP traffic.

Oh, and, silent-drop occurring during attack time windows should still make for a good signal, I think.

Aug 1 2024, 1:56 PM · Knowledge-Integrity, OKR-Work, Research (FY2024-25-Research-July-September)
CDanis added a comment to T368389: WE4.3.1 - IP traffic.

No, this is not expected behavior. I'm guessing this is external MediaWiki instances using our "Mathoid as a service". I also see some requests there for "Instant Commons".

Aug 1 2024, 1:52 PM · Knowledge-Integrity, OKR-Work, Research (FY2024-25-Research-July-September)
CDanis closed T371439: Requesting temporary lift of IP cap for azwiki as Resolved.
Aug 1 2024, 1:46 PM · Wikimedia-Site-requests

Jul 31 2024

CDanis added a comment to T371129: Extension:CirrusSearch not propagating tracing headers.

There's not a finished plan, no, but I'd be very happy to try some experiments with you next Q perhaps?

Jul 31 2024, 6:31 PM · MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), Discovery-Search (Current work), Observability-Tracing, CirrusSearch
CDanis closed T367905: Application Security Review Request : OpenTelemetry PHP SDK as Resolved.

Thanks so much!

Jul 31 2024, 4:46 PM · Privacy Engineering, MediaWiki-Vendor, secscrum, Security, Application Security Reviews
CDanis closed T367905: Application Security Review Request : OpenTelemetry PHP SDK, a subtask of T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata, as Resolved.
Jul 31 2024, 4:45 PM · MediaWiki-Engineering, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing

Jul 30 2024

CDanis created P67085 (An Untitled Masterwork).
Jul 30 2024, 7:43 PM
CDanis added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

That's really odd -- I see the collector starting itself back up before the container gets restarted ... maybe externally?

Jul 30 2024, 3:24 PM · Observability-Tracing
CDanis created T371390: Enable Jaeger "archived traces" feature.
Jul 30 2024, 2:21 PM · Observability-Tracing
CDanis added a comment to T371144: support the haproxy silent-drop hysteresis gadget from requestctl.
Jul 30 2024, 1:52 PM · User-CDanis, User-Joe, conftool, Traffic

Jul 29 2024

CDanis added a comment to T370739: Figure out how a shellbox instance for the Chart extension would work.

Rate limiting has been broken in service-runner for a long time now. See T200374#9558666. As far as the worker pool goes, we've conducted some experiments in the past. It does have some reliability benefits (faster restarts in case of failures), but otherwise, in a Kubernetes environment, it's more of a nice-to-have than a requirement.

Jul 29 2024, 7:28 PM · Charts (Sprint 3), serviceops, SRE, Shellbox
CDanis added a comment to T266886: Augment NEL reports with a computed timestamp-of-generation.

Yep, Logstash presently, although it would be nice if we had them in Hive some day as well :)

Jul 29 2024, 4:52 PM · SRE Observability (FY2024/2025-Q1), Observability-Logging, Data-Engineering-Icebox, Analytics
CDanis added a comment to T371144: support the haproxy silent-drop hysteresis gadget from requestctl.

A way to limit concurrency from specific IPs, only for requests corresponding to a specific pattern? So limit concurrency for a user over the threshold only if they're hitting url X with user-agent Y?

Jul 29 2024, 4:33 PM · User-CDanis, User-Joe, conftool, Traffic

Jul 27 2024

CDanis updated the task description for T371120: service-utils helper for trace header propagation.
Jul 27 2024, 3:20 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis added a comment to T370739: Figure out how a shellbox instance for the Chart extension would work.

@akosiaris I'm trying to figure out how we should proceed based on your comment. Should we develop a service based on (an up-to-date version of) service-runner, and maybe look at service-template-node loosely for inspiration but not copy all of it verbatim, and then have that reviewed by serviceops + security prior to deployment? We probably don't need most of service-template-node's dependencies anyway, I think we probably just need service-runner, express and maybe body-parser. The other dependencies are either outdated (e.g. bluebird), irrelevant (e.g. domino, we don't need to parse HTML) or overkill for our use case (e.g. swagger-router, we expect there to only be one endpoint).

Jul 27 2024, 12:50 AM · Charts (Sprint 3), serviceops, SRE, Shellbox

Jul 26 2024

CDanis created T371144: support the haproxy silent-drop hysteresis gadget from requestctl.
Jul 26 2024, 7:58 PM · User-CDanis, User-Joe, conftool, Traffic
Ottomata awarded T371120: service-utils helper for trace header propagation a Like token.
Jul 26 2024, 6:25 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis updated the task description for T371120: service-utils helper for trace header propagation.
Jul 26 2024, 6:15 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis added a project to T371129: Extension:CirrusSearch not propagating tracing headers: Observability-Tracing.
Jul 26 2024, 5:29 PM · MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), Discovery-Search (Current work), Observability-Tracing, CirrusSearch
CDanis added a subtask for T320559: Trace header propagation for MediaWiki: T371129: Extension:CirrusSearch not propagating tracing headers.
Jul 26 2024, 5:28 PM · MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), MediaWiki-Platform-Team, Observability-Tracing
CDanis added a parent task for T371129: Extension:CirrusSearch not propagating tracing headers: T320559: Trace header propagation for MediaWiki.
Jul 26 2024, 5:28 PM · MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), Discovery-Search (Current work), Observability-Tracing, CirrusSearch
CDanis created T371129: Extension:CirrusSearch not propagating tracing headers.
Jul 26 2024, 5:28 PM · MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), Discovery-Search (Current work), Observability-Tracing, CirrusSearch
CDanis added a comment to T320556: Micro-specification for how service owners should propagate tracing headers.

Would also like some guidance for how to pass on this info in emitted data events. We already produce the request_id, but perhaps we can make a more standardized tracing schema fragment for this.

Jul 26 2024, 5:13 PM · Observability-Tracing
CDanis added a parent task for T371120: service-utils helper for trace header propagation: T360924: Replace service runner with a simplified library to better support metrics and debugging: service-utils.
Jul 26 2024, 4:39 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis added a subtask for T360924: Replace service runner with a simplified library to better support metrics and debugging: service-utils: T371120: service-utils helper for trace header propagation.
Jul 26 2024, 4:39 PM · Data-Engineering (Q1 2024 July 1st - September 30th)
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T371124: request for new matomo site: trace.wikimedia.org/.
Jul 26 2024, 4:30 PM · Epic, Observability-Tracing
CDanis added a parent task for T371124: request for new matomo site: trace.wikimedia.org/: T320549: distributed tracing v0 [minimum viable].
Jul 26 2024, 4:30 PM · Data-Platform-SRE (2024.07.29 - 2024.08.16), Data-Engineering
CDanis created T371124: request for new matomo site: trace.wikimedia.org/.
Jul 26 2024, 4:30 PM · Data-Platform-SRE (2024.07.29 - 2024.08.16), Data-Engineering
CDanis added a project to T371120: service-utils helper for trace header propagation: Observability-Tracing.
Jul 26 2024, 4:10 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis added a subtask for T340551: distributed tracing epic: T371120: service-utils helper for trace header propagation.
Jul 26 2024, 4:10 PM · Epic, Observability-Tracing
CDanis added a parent task for T371120: service-utils helper for trace header propagation: T340551: distributed tracing epic.
Jul 26 2024, 4:10 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis created T371120: service-utils helper for trace header propagation.
Jul 26 2024, 4:10 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Observability-Tracing
CDanis updated subscribers of T348734: Port defs_from_etcd logic to nftables.

@Jelto @Dzahn FYI ^ about nft and requestctl support without Ferm.

Jul 26 2024, 2:00 PM · Infrastructure-Foundations, SRE

Jul 25 2024

CDanis added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

If you make the traces service also export to the basic logging service, then (post-processing) all traces will get dumped out pretty verbosely into the logs.

Jul 25 2024, 8:00 PM · Observability-Tracing