Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic
Closed, Resolved · Public

Assigned To: fgiunchedi
Authored By: fgiunchedi, Jul 4 2024, 8:25 AM

Description

We've started getting Kafka lag alerts for the benthos-mw-accesslog-sampler consumer group (this is the benthos instance that runs on logstash hosts, consumes mediawiki.httpd.accesslog, and writes a sample to mediawiki.httpd.accesslog-sampled).

There's indeed a background lag of ~4k messages for a bunch of partitions

(screenshot: 2024-07-04-100831_2485x988_scrot.png)
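
For reference, the per-partition lag can also be checked from the CLI; a minimal sketch using the stock Apache Kafka tooling (the bootstrap address/port are placeholders; on the kafka-logging hosts the local kafka wrapper would be the usual entry point):

# per-partition offsets and lag for the sampler's consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group benthos-mw-accesslog-sampler
# columns of interest: CURRENT-OFFSET, LOG-END-OFFSET, LAG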

Given the following:

  • the problems we've had with the same topic in T367076 (cc @kamila)
  • benthos webrequest live (the webrequest firehose), which consumes a different topic, doesn't have lag
  • benthos webrequest live uses ~24 cores at peak (on centrallog hosts) to consume ~160k msg/s and sample them
  • logstash hosts (where the benthos accesslog sampler runs) are not starved for CPU or memory

I've taken a look at the mediawiki.httpd.accesslog topic, and indeed the topic has 6 partitions, though kafka-logging1001 is the leader for 3 of them:

Topic:mediawiki.httpd.accesslog PartitionCount:6        ReplicationFactor:3     Configs:retention.ms=172800000
        Topic: mediawiki.httpd.accesslog        Partition: 0    Leader: 1001    Replicas: 1001,1002,1005        Isr: 1001,1005,1002
        Topic: mediawiki.httpd.accesslog        Partition: 1    Leader: 1004    Replicas: 1004,1001,1002        Isr: 1001,1002,1004
        Topic: mediawiki.httpd.accesslog        Partition: 2    Leader: 1001    Replicas: 1001,1002,1004        Isr: 1001,1002,1004
        Topic: mediawiki.httpd.accesslog        Partition: 3    Leader: 1006    Replicas: 1006,1002,1001        Isr: 1001,1006,1002
        Topic: mediawiki.httpd.accesslog        Partition: 4    Leader: 1001    Replicas: 1001,1006,1002        Isr: 1001,1006,1002
        Topic: mediawiki.httpd.accesslog        Partition: 5    Leader: 1005    Replicas: 1005,1002,1001        Isr: 1001,1002,1005
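
For reference, a listing like the above comes from a topic describe; roughly, with the stock tooling (a sketch; the bootstrap address is a placeholder and older Kafka releases take --zookeeper instead):

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic mediawiki.httpd.accesslog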

Unsurprisingly, 1001 gets 3x the traffic in msg/s of the others (green line):

(screenshot: 2024-07-04-101918_1265x736_scrot.png)

This topic is also by far the one with the most traffic on kafka-logging:

(screenshot: 2024-07-04-102203_2503x961_scrot.png)

Therefore I think we should increase the number of partitions (to 12, just to get an even number), thereby increasing concurrency for consumers and thus performance. We should also rebalance the topic to ensure an even distribution across brokers.
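
For illustration, the two steps could look roughly like this with the stock Kafka tooling (a sketch only: the command actually run is recorded in the timeline below, the broker list and file names are assumptions, and depending on the Kafka version --zookeeper may be needed instead of --bootstrap-server):

# 1) grow the partition count (partitions can only be increased, and keyed
#    messages will hash to different partitions from then on)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic mediawiki.httpd.accesslog --partitions 12

# 2) generate a reassignment plan spreading replicas over all brokers,
#    apply it, then move leadership to the preferred replicas
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "mediawiki.httpd.accesslog"}]}
EOF
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list 1001,1002,1003,1004,1005,1006 --generate
# review the proposed assignment, save it as reassignment.json, then:
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions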

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-07-16T08:32:53Z] <godog> root@kafka-logging1001:~# kafka topics --alter --topic mediawiki.httpd.accesslog --partitions 12 - T369256

Mentioned in SAL (#wikimedia-operations) [2024-07-16T09:20:51Z] <godog> bounce benthos@mw_accesslog_sampler - T369256

I'm happy to report that increasing the partition count and restarting the benthos mw sampler did the trick; lag had already been reduced before the restart and was completely gone after it:

(screenshot: 2024-07-16-113051_1872x1054_scrot.png)

The topic is not perfectly balanced yet, but it is better than it was (green line is kafka-logging1001):

(screenshot: 2024-07-16-113256_1675x703_scrot.png)

Once topicmappr works, or we otherwise have a resolution of https://phabricator.wikimedia.org/T370129, we can rebalance the topic properly.

FTR, benthos@mw_accesslog_metrics got wedged on centrallog1002 in an odd way today, not generating any metrics since 15:12, and even more weirdly, not generating any consumer group lag until *after* we fixed the issue.

We fixed the issue by sending the benthos process a SIGQUIT, which makes golang dump stack traces of all active goroutines.
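
For reference, that can be done along these lines (a sketch; unit name as in this task, the journal window is arbitrary):

# find the unit's main PID and ask the Go runtime for a goroutine dump;
# note the process exits after dumping, so systemd will restart it if the
# unit is configured to do so
pid=$(systemctl show -p MainPID --value benthos@mw_accesslog_metrics)
sudo kill -QUIT "$pid"
# the stack traces land on stderr, i.e. in the journal:
sudo journalctl -u benthos@mw_accesslog_metrics --since '-10min'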

The two goroutine stacks that said anything about Kafka are below:

Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 349737364 [select, 200 minutes]:
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.gopark(0xc02c883ed0?, 0x2?, 0x0?, 0x0?, 0xc02c883ea4?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/proc.go:363 +0xd6 fp=0xc02c883d20 sp=0xc02c883d00 pc=0x43e936
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.selectgo(0xc02c883ed0, 0xc02c883ea0, 0xa?, 0x0, 0xc19dc157146a55ee?, 0x1)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/select.go:328 +0x7bc fp=0xc02c883e60 sp=0xc02c883d20 pc=0x44e81c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).ReadBatch(0x7fa21d409498?, {0x47ebeb0, 0xc000edc7c0})
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:723 +0x111 fp=0xc02c883f20 sp=0xc02c883e60 pc=0x16d1471
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/public/service.AutoRetryNacksBatched.func1({0x47ebeb0?, 0xc000edc7c0?})
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/public/service/input_auto_retry_batched.go:24 +0x3c fp=0xc02c883f98 sp=0xc02c883f20 pc=0xd96a7c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).dispatchReader(0xc000bd8510?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:119 +0x51 fp=0xc02c883fc0 sp=0xc02c883f98 pc=0xd9ecf1
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).Shift.func3()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:144 +0x2a fp=0xc02c883fe0 sp=0xc02c883fc0 pc=0xd9d9ea
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.goexit()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc02c883fe8 sp=0xc02c883fe0 pc=0x470281
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: created by github.com/benthosdev/benthos/v4/internal/autoretry.(*List[...]).Shift
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/autoretry/auto_retry_list.go:144 +0x21f

Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 65 [select]:
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.gopark(0xc001c53aa8?, 0x3?, 0x88?, 0x38?, 0xc001c539e2?)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/proc.go:363 +0xd6 fp=0xc001c53860 sp=0xc001c53840 pc=0x43e936
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.selectgo(0xc001c53aa8, 0xc001c539dc, 0xc001c53a98?, 0x0, 0x47ebeb0?, 0x1)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/select.go:328 +0x7bc fp=0xc001c539a0 sp=0xc001c53860 pc=0x44e81c
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/twmb/franz-go/pkg/kgo.(*Client).PollRecords(0xc0012e7000, {0x47ebf20, 0xc0643276e0}, 0xffffffffffffffff)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /tmp/benthos/go/pkg/mod/github.com/twmb/franz-go@v1.13.0/pkg/kgo/consumer.go:500 +0x2cb fp=0xc001c53ae8 sp=0xc001c539a0 pc=0x14fd30b
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/twmb/franz-go/pkg/kgo.(*Client).PollFetches(...)
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /tmp/benthos/go/pkg/mod/github.com/twmb/franz-go@v1.13.0/pkg/kgo/consumer.go:363
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).Connect.func5()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:652 +0x13a fp=0xc001c53fe0 sp=0xc001c53ae8 pc=0x16d029a
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: runtime.goexit()
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /usr/lib/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc001c53fe8 sp=0xc001c53fe0 pc=0x470281
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: created by github.com/benthosdev/benthos/v4/internal/impl/kafka.(*franzKafkaReader).Connect
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: /build/benthos-4.15.0/internal/impl/kafka/input_kafka_franz.go:630 +0xd93

Also, these were in the logs around the time consuming stopped:

Jul 16 15:11:59 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:13 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1003.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1004.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:28 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
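
The dial timeouts above can be double-checked from the consumer host with something like the following (a sketch; host and TLS port taken from the log lines):

# resolve the broker the client was failing to reach, then try its TLS listener
dig +short AAAA kafka-logging1005.eqiad.wmnet
timeout 5 openssl s_client -connect kafka-logging1005.eqiad.wmnet:9093 </dev/null
# a completed handshake rules out a lingering network problem; a timeout here
# would point at the broker itself or the path to it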

Initially I thought this might have something to do with the centrallog2002 upgrade to 10G, though I don't think it does, because benthos@mw_accesslog_sampler only consumes from its local kafka-logging.

There are similar logs in benthos@webrequest_live on centrallog1002:

Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output

Though there didn't seem to be a problem afterwards, the timing makes me think of T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad

Yeah the times in those logs for connection failures to kafka-jumbo1014 exactly match when it was offline due to the switch upgrade. So probably I'm complicating things for everyone as usual :)

hehe it is odd in this situation for sure, since we've survived kafka brokers being unavailable before, and indeed one other benthos instance did survive the maintenance work fine

I have reached out to upstream at https://github.com/redpanda-data/connect/issues/2705 to make them aware, in case they have seen this before. I'm not overly concerned just yet, in the sense that IIRC this is the first time we've seen such a failure and it could be an unlucky coincidence.

fgiunchedi claimed this task.

I'm tentatively resolving this: even though the topic isn't perfectly balanced, I'm calling it good enough™. We can revisit as needed.

Mentioned in SAL (#wikimedia-operations) [2024-07-18T19:37:08Z] <denisse> Send SIGQUIT signal to the benthos service after a goroutine was waiting forever in webrequest_live.yaml - T369256

Please note that we can also rebalance leadership per broker using topicmappr without increasing the number of partitions.

Thank you, I tried using topicmappr to rebalance and ran into https://phabricator.wikimedia.org/T370129; maybe you have some insight and/or can help?

Change #1057798 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] benthos: smaller batches for mw_accesslog_metrics

https://gerrit.wikimedia.org/r/1057798

Change #1057798 merged by Filippo Giunchedi:

[operations/puppet@production] benthos: smaller batches for mw_accesslog_metrics

https://gerrit.wikimedia.org/r/1057798