Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic
Closed, Resolved · Public

Assigned To: fgiunchedi
Authored By: fgiunchedi, Jul 4 2024, 8:25 AM

Description

We've started getting Kafka lag alerts for the benthos-mw-accesslog-sampler consumer group (this is the benthos instance that runs on logstash hosts, consumes mediawiki.httpd.accesslog, and writes a sample to mediawiki.httpd.accesslog-sampled).

There's indeed a background lag of ~4k messages for a bunch of partitions

(screenshot: 2024-07-04-100831_2485x988_scrot.png)
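
For reference, the per-partition lag can also be checked from the CLI; a minimal sketch using the stock Apache Kafka tooling (the bootstrap address/port are placeholders; on the kafka-logging hosts the local kafka wrapper would be the usual entry point):

# per-partition offsets and lag for the sampler's consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group benthos-mw-accesslog-sampler
# columns of interest: CURRENT-OFFSET, LOG-END-OFFSET, LAG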

Given the following:

  • the problems we've had with the same topic in T367076 (cc @kamila)
  • benthos webrequest live (the webrequest firehose), which consumes a different topic, doesn't have lag
  • benthos webrequest live uses ~24 cores at peak (on centrallog hosts) to consume ~160k msg/s and sample them
  • logstash hosts (where the benthos accesslog sampler runs) are not starved for CPU or memory

I've taken a look at the mediawiki.httpd.accesslog topic, and indeed the topic has 6 partitions, though kafka-logging1001 is the leader for 3 of them:

Topic:mediawiki.httpd.accesslog PartitionCount:6        ReplicationFactor:3     Configs:retention.ms=172800000
        Topic: mediawiki.httpd.accesslog        Partition: 0    Leader: 1001    Replicas: 1001,1002,1005        Isr: 1001,1005,1002
        Topic: mediawiki.httpd.accesslog        Partition: 1    Leader: 1004    Replicas: 1004,1001,1002        Isr: 1001,1002,1004
        Topic: mediawiki.httpd.accesslog        Partition: 2    Leader: 1001    Replicas: 1001,1002,1004        Isr: 1001,1002,1004
        Topic: mediawiki.httpd.accesslog        Partition: 3    Leader: 1006    Replicas: 1006,1002,1001        Isr: 1001,1006,1002
        Topic: mediawiki.httpd.accesslog        Partition: 4    Leader: 1001    Replicas: 1001,1006,1002        Isr: 1001,1006,1002
        Topic: mediawiki.httpd.accesslog        Partition: 5    Leader: 1005    Replicas: 1005,1002,1001        Isr: 1001,1002,1005
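
For reference, a listing like the above comes from a topic describe; roughly, with the stock tooling (a sketch; the bootstrap address is a placeholder and older Kafka releases take --zookeeper instead):

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic mediawiki.httpd.accesslog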

Unsurprisingly, 1001 gets 3x the traffic in msg/s of the others (green line):

(screenshot: 2024-07-04-101918_1265x736_scrot.png)

This topic is also by far the one with the most traffic on kafka-logging:

(screenshot: 2024-07-04-102203_2503x961_scrot.png)

Therefore I think we should increase the number of partitions (to 12, just to get an even number), thereby increasing concurrency for consumers and thus performance. We should also rebalance the topic to ensure an even distribution across brokers.
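
For illustration, the two steps could look roughly like this with the stock Kafka tooling (a sketch only: the command actually run is recorded in the timeline below, the broker list and file names are assumptions, and depending on the Kafka version --zookeeper may be needed instead of --bootstrap-server):

# 1) grow the partition count (partitions can only be increased, and keyed
#    messages will hash to different partitions from then on)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic mediawiki.httpd.accesslog --partitions 12

# 2) generate a reassignment plan spreading replicas over all brokers,
#    apply it, then move leadership to the preferred replicas
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "mediawiki.httpd.accesslog"}]}
EOF
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list 1001,1002,1003,1004,1005,1006 --generate
# review the proposed assignment, save it as reassignment.json, then:
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions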

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-07-16T08:32:53Z] <godog> root@kafka-logging1001:~# kafka topics --alter --topic mediawiki.httpd.accesslog --partitions 12 - T369256

Mentioned in SAL (#wikimedia-operations) [2024-07-16T09:20:51Z] <godog> bounce benthos@mw_accesslog_sampler - T369256

I'm happy to report that increasing the partition count and restarting the benthos mw sampler did the trick; lag had already been reduced before the restart and was completely gone after it:

(screenshot: 2024-07-16-113051_1872x1054_scrot.png)

The topic is not perfectly balanced yet, but it is better than it was (green line is kafka-logging1001):

(screenshot: 2024-07-16-113256_1675x703_scrot.png)

Once topicmappr works, or we otherwise have a resolution of https://phabricator.wikimedia.org/T370129, we can rebalance the topic properly.

FTR, benthos@mw_accesslog_metrics got wedged on centrallog1002 in an odd way today, not generating any metrics since 15:12, and even more weirdly, not generating any consumer group lag until *after* we fixed the issue.

We fixed the issue by sending the benthos process a SIGQUIT, which makes golang dump stack traces of all active goroutines.
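
For reference, that can be done along these lines (a sketch; unit name as in this task, the journal window is arbitrary):

# find the unit's main PID and ask the Go runtime for a goroutine dump;
# note the process exits after dumping, so systemd will restart it if the
# unit is configured to do so
pid=$(systemctl show -p MainPID --value benthos@mw_accesslog_metrics)
sudo kill -QUIT "$pid"
# the stack traces land on stderr, i.e. in the journal:
sudo journalctl -u benthos@mw_accesslog_metrics --since '-10min'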

The two goroutine stacks that said anything about Kafka are below:

Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 349737364 [select, 200 minutes]:
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.gopark(0xc02c883ed0?, 0x2?, 0x0?, 0x0?, 0xc02c883ea4?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/proc.go:363 +0xd6 fp=0xc02c883d20 sp=0xc02c883d00 pc=0x43e936
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.selectgo(0xc02c883ed0, 0xc02c883ea0, 0xa?, 0x0, 0xc19dc157146a55ee?, 0x1)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/select.go:328 +0x7bc fp=0xc02c883e60 sp=0xc02c883d20 pc=0x44e81c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).ReadBatch(0x7fa21d409498?, {0x47ebeb0, 0xc000edc7c0})
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:723 +0x111 fp=0xc02c883f20 sp=0xc02c883e60 pc=0x16d1471
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/public/service.AutoRetryNacksBatched.func1({0x47ebeb0?, 0xc000edc7c0?})
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/public/service/input_auto_retry_batched.go:24 +0x3c fp=0xc02c883f98 sp=0xc02c883f20 pc=0xd96a7c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).dispatchReader(0xc000bd8510?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:119 +0x51 fp=0xc02c883fc0 sp=0xc02c883f98 pc=0xd9ecf1
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).Shift.func3()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:144 +0x2a fp=0xc02c883fe0 sp=0xc02c883fc0 pc=0xd9d9ea
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.goexit()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc02c883fe8 sp=0xc02c883fe0 pc=0x470281
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: created by github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).Shift
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:144 +0x21f

Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 65 [select]:
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.gopark(0xc001c53aa8?, 0x3?, 0x88?, 0x38?, 0xc001c539e2?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/proc.go:363 +0xd6 fp=0xc001c53860 sp=0xc001c53840 pc=0x43e936
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.selectgo(0xc001c53aa8, 0xc001c539dc, 0xc001c53a98?, 0x0, 0x47ebeb0?, 0x1)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/select.go:328 +0x7bc fp=0xc001c539a0 sp=0xc001c53860 pc=0x44e81c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/twmb/franz-go/pkg/kgo.(*Client).PollRecords(0xc0012e7000, {0x47ebf20, 0xc0643276e0}, 0xffffffffffffffff)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /tmp/benthos/go/pkg/mod/github.com/twmb/franz-go@v1.13.0/pkg/kgo/consumer.go:500 +0x2cb fp=0xc001c53ae8 sp=0xc001c539a0 pc=0x14fd30b
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/twmb/franz-go/pkg/kgo.(*Client).PollFetches(...)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /tmp/benthos/go/pkg/mod/github.com/twmb/franz-go@v1.13.0/pkg/kgo/consumer.go:363
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).Connect.func5()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:652 +0x13a fp=0xc001c53fe0 sp=0xc001c53ae8 pc=0x16d029a
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.goexit()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc001c53fe8 sp=0xc001c53fe0 pc=0x470281
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: created by github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).Connect
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:630 +0xd93

Also, these were in the logs around the time consuming stopped:

Jul 16 15:11:59 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:13 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1003.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1004.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:28 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
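
The dial timeouts above can be double-checked from the consumer host with something like the following (a sketch; host and TLS port taken from the log lines):

# resolve the broker the client was failing to reach, then try its TLS listener
dig +short AAAA kafka-logging1005.eqiad.wmnet
timeout 5 openssl s_client -connect kafka-logging1005.eqiad.wmnet:9093 </dev/null
# a completed handshake rules out a lingering network problem; a timeout here
# would point at the broker itself or the path to it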

Initially I thought this might have something to do with the centrallog2002 upgrade to 10G, though I don't think it does, because benthos@mw_accesslog_sampler only consumes from its local kafka-logging.

There are similar logs in benthos@webrequest_live on centrallog1002:

Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output

Though there didn't seem to be a problem afterwards, the timing makes me think of T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad

Yeah the times in those logs for connection failures to kafka-jumbo1014 exactly match when it was offline due to the switch upgrade. So probably I'm complicating things for everyone as usual :)

hehe it is odd in this situation for sure, since we've survived kafka brokers being unavailable before, and indeed one other benthos instance did survive the maintenance work fine

I have reached out to upstream at https://github.com/redpanda-data/connect/issues/2705 to make them aware, in case they have seen this before. I'm not overly concerned just yet, in the sense that IIRC this is the first time we've seen such a failure and it could be an unlucky coincidence.

fgiunchedi claimed this task.

I'm tentatively resolving this: even though the topic isn't perfectly balanced, I'm calling it good enough™. We can revisit as needed.

Mentioned in SAL (#wikimedia-operations) [2024-07-18T19:37:08Z] <denisse> Send SIGQUIT signal to the benthos service after a goroutine was waiting forever in webrequest_live.yaml - T369256

Please note that we can also rebalance leadership per broker using topicmappr without increasing the number of partitions.

Thank you, I tried using topicmappr to rebalance and ran into https://phabricator.wikimedia.org/T370129; maybe you have some insight and/or can help?

Change #1057798 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] benthos: smaller batches for mw_accesslog_metrics

https://gerrit.wikimedia.org/r/1057798

Change #1057798 merged by Filippo Giunchedi:

[operations/puppet@production] benthos: smaller batches for mw_accesslog_metrics

https://gerrit.wikimedia.org/r/1057798