
Failed to recover from broker unavailable using kafka_franz #2705

Open
filippog opened this issue Jul 17, 2024 · 2 comments
Labels: needs more info (An issue that may be a bug or useful feature, but requires more information)

filippog (Contributor) commented Jul 17, 2024

Hello,
we're using Benthos / Redpanda Connect 4.15 with great success, for the most part to read from and write to Kafka using the kafka_franz component.

Yesterday there was a network maintenance that temporarily made a broker unavailable, which caused kafka_franz to complain about failing commits. This has happened before and things recovered on their own, but not this time: consuming stopped (and the observed consumer lag didn't increase, or we would have noticed sooner).

Please find below the logs from benthos

Jul 16 15:11:59 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:13 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-logging1005.eqiad.wmnet:9093" broker=1005 err="dial tcp [2620:0:861:10e:10:64:135:13]:9093: i/o timeout" label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1003.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1004.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
Jul 16 15:12:28 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input

The full goroutine dump which we got via SIGQUIT: https://phabricator.wikimedia.org/P66726
And the configuration which was running at the time: https://phabricator.wikimedia.org/P66727
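For readers who can't open the pastes, here is a rough, hypothetical sketch (not our actual config and not the connect plugin's real code) of the franz-go consumer-group setup that the kafka_franz input drives internally. The broker addresses, consumer group and topic are the ones visible in the logs above; everything else (TLS config, commit interval) is assumed:

```go
// Illustrative only: approximates the client that the kafka_franz input
// builds via franz-go. Broker addresses, group and topic are taken from the
// logs in this report; TLS and the commit interval are assumptions.
package main

import (
	"context"
	"crypto/tls"
	"log"
	"os"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers(
			"kafka-logging1003.eqiad.wmnet:9093",
			"kafka-logging1004.eqiad.wmnet:9093",
			"kafka-logging1005.eqiad.wmnet:9093",
		),
		kgo.ConsumerGroup("benthos-mw-accesslog-metrics"),
		kgo.ConsumeTopics("mediawiki.httpd.accesslog"),
		kgo.DialTLSConfig(&tls.Config{}),        // port 9093 suggests TLS; assumed here
		kgo.AutoCommitInterval(5*time.Second),   // the autocommit the "default commit failed" errors refer to
		kgo.WithLogger(kgo.BasicLogger(os.Stderr, kgo.LogLevelInfo, nil)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	for {
		// The "poll fetches" loop; per the goroutine dump this was still running.
		fetches := client.PollFetches(ctx)
		if fetches.IsClientClosed() {
			return
		}
		// Fetch/connection errors surface here instead of stopping the loop.
		fetches.EachError(func(topic string, partition int32, err error) {
			log.Printf("fetch error topic=%s partition=%d: %v", topic, partition, err)
		})
		fetches.EachRecord(func(r *kgo.Record) {
			_ = r // hand the record to the rest of the pipeline
		})
	}
}
```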

Finally, the metrics showing the outage and recovery (all times UTC everywhere):

[Screenshot: 2024-07-17-143818_1678x566_scrot]

Interestingly enough, a similar but separate benthos instance, reading from a different Kafka cluster/broker that was also affected by the maintenance, was able to recover on its own, as has been our experience with similar maintenance in the past.

mihaitodor (Collaborator) commented Jul 17, 2024

Hey @filippog 👋 Thank you for the detailed report! If you still have that instance running and you can poke at it, does it continue to print unable to dial: dial tcp... errors, or did it get completely stuck and is no longer emitting any logs? Based on the goroutine dump it's definitely still running the loop which is meant to "poll fetches": https://github.com/redpanda-data/connect/blob/8277b6088df5829bead2e2a5e131ed3ad8e518fe/internal/impl/kafka/input_kafka_franz.go#L652C7-L652C23.

If it keeps printing those errors, then I'm wondering if it may be some sort of DNS issue on the host where Benthos is running and, if that's the case, it would be handy to know if you can telnet to the seed broker addresses from that host. I don't know how franz-go handles DNS and whether it does any caching, but we can look into that if you can confirm the above.
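For reference, here's the kind of check that would confirm this from the affected host. It's purely an illustrative sketch (plain telnet or nc to the broker addresses works just as well): it resolves each seed broker named in the logs above and attempts a TCP connection with a short timeout:

```go
// Illustrative connectivity/DNS check, roughly equivalent to telnet'ing to
// each seed broker. Hostnames are the ones that appear in the logs above.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	brokers := []string{
		"kafka-logging1003.eqiad.wmnet:9093",
		"kafka-logging1004.eqiad.wmnet:9093",
		"kafka-logging1005.eqiad.wmnet:9093",
	}
	for _, addr := range brokers {
		host, _, _ := net.SplitHostPort(addr)

		// DNS: does the name still resolve from this host?
		ips, err := net.LookupHost(host)
		if err != nil {
			fmt.Printf("%s: DNS lookup failed: %v\n", addr, err)
			continue
		}

		// TCP: can we actually reach the broker port?
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			fmt.Printf("%s %v: dial failed: %v\n", addr, ips, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s %v: reachable\n", addr, ips)
	}
}
```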

mihaitodor added the "needs more info" label on Jul 18, 2024
filippog (Contributor, Author) commented:
Hi @mihaitodor! Thank you for following up. Unfortunately the instance is no longer running, as it was restarted to restore service. It was indeed stuck and not emitting logs; these are the last lines (the "Value interpolation error" is a known message that is emitted regularly and which we're addressing separately; those regular logs stopped too). You can see the logs stop at 15:16 and then resume at 18:33 when SIGQUIT was issued:

Jul 16 15:12:29 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jul 16 15:12:29 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: SIGQUIT: quit
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: PC=0x4721c1 m=0 sigcode=0
Jul 16 18:33:20 centrallog1002 benthos@mw_accesslog_metrics[2184438]: goroutine 0 [idle]:

And the service was then restarted and came back up here:

Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Running main config from specified file" @service=benthos path=/etc/benthos/mw_accesslog_metrics.yaml
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Listening for HTTP requests at: http://0.0.0.0:4152" @service=benthos
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Receiving messages from Kafka topics: [mediawiki.httpd.accesslog]" @service=benthos label=accesslog path=root.input
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Launching a benthos instance, use CTRL+C to close" @service=benthos
Jul 16 18:33:21 centrallog1002 benthos@mw_accesslog_metrics[2395937]: level=info msg="Dropping messages." @service=benthos label=drop path=root.output

> If it keeps printing those errors, then I'm wondering if it may be some sort of DNS issue on the host where Benthos is running and, if that's the case, it would be handy to know if you can telnet to the seed broker addresses from that host. I don't know how franz-go handles DNS and whether it does any caching, but we can look into that if you can confirm the above.

From the host's perspective, given the network maintenance, the outage manifested itself as temporary unavailability of some broker hosts (i.e. DNS kept working as usual).

For comparison, and in case it is useful, here's what another benthos instance on the same host logged. This instance kept working as usual; it talks to a different Kafka cluster, of which one broker was equally affected by the maintenance:

Jul 16 15:16:29 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output
Jul 16 15:16:39 centrallog1002 benthos@webrequest_live[1437397]: level=warning msg="unable to open connection to broker" @service=benthos addr="kafka-jumbo1014.eqiad.wmnet:9093" broker=1014 err="dial tcp [2620:0:861:10e:10:64:135:16]:9093: i/o timeout" label=webrequest_sampled_out path=root.output

I'm hoping this is a one-off, and IIRC it's the first time we've seen this, but I'm reporting it in case other folks run into the same issue and/or you have seen this before. Thank you!
