Intermittent DNS name resolution failure, likely IPv6-related #29069

schams-net · 2023-09-05T04:54:24Z

systemd version the issue has been seen with

252

Used distribution

Debian 12

Linux kernel version used

6.1.0-10-cloud-amd64

CPU architectures issue was seen on

x86_64

Component

systemd-resolved

Expected behaviour you didn't see

DNS look-ups with a constant 100% success rate for IPv4 and IPv6 addresses.

Unexpected behaviour you saw

Intermittent DNS name resolution failures with a success rate of approx. 90% (approx. 10 errors out of 100 queries), likely IPv6-related.

Official Debian v12 AMIs on AWS show intermittent DNS name resolution failures when they perform DNS look-ups of an RDS Aurora endpoint. The instances use the Amazon-provided DNS server (VPC.2). While approx. 90 percent of the DNS queries succeed, 10 percent fail (name or service not known).

The issue occurs on Debian v12 but doesn't occur on Debian v11 or Amazon Linux 2 systems (same environment). No error occurs if the "systemd-resolved" service is disabled or bypassed on Debian v12. No error occurs if IPv6 is disabled on the Debian v12 system.

Steps to reproduce the problem

A detailed problem description, including analysis, steps to reproduce the problem, and a Terraform plan to provision a test infrastructure, is available in the following Git repository: https://github.com/schams-net/aws-ec2-rds-dns-lookup-issue

Additional program output to the terminal or log subsystem illustrating the issue

No response

wjsc · 2023-10-26T14:55:32Z

Hi! Any update on this issue? Any new workaround?

jackarias · 2023-10-26T19:45:40Z

Hello! Same problem here. It would be nice to have a workaround for this ASAP.

schams-net · 2023-10-26T21:30:31Z

As documented here, the workaround across our server fleet is a small change to the file /etc/nsswitch.conf. Locate this line:

hosts: files resolve [!UNAVAIL=return] dns

...and remove the "resolve" service. After the change, the line reads:

hosts: files dns

Note that this is not the final solution but only a workaround.
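The edit described above can be previewed before touching the real file. The following is a minimal sketch, not part of the original workaround: `strip_resolve` is a hypothetical helper name, and it assumes the hosts line has the shape quoted in this thread. It reads nsswitch.conf content on stdin and prints the result on stdout, so nothing is modified in place.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): drop the "resolve" service,
# together with an optional action list such as [!UNAVAIL=return],
# from the hosts line. All other lines pass through unchanged.
strip_resolve() {
    sed -E '/^hosts:/ s/[[:space:]]+resolve([[:space:]]+\[[^]]*\])?([[:space:]]|$)/\2/'
}

# Demonstration on the line quoted above.
printf '%s\n' 'hosts: files resolve [!UNAVAIL=return] dns' | strip_resolve
# prints: hosts: files dns
```

To preview the change on a real system, something like `strip_resolve < /etc/nsswitch.conf | diff -u /etc/nsswitch.conf -` shows the one-line difference without writing anything.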

levinse · 2023-12-18T23:44:59Z

We have been experiencing this since upgrading to Debian 12 AMIs.

ahmgithubahm · 2024-04-09T07:58:28Z

We're affected by this, using Debian 12 AWS EC2 AMIs. As well as agreeing that it is somehow IPv6 related, it also seems to only cause resolution failures for CNAMEs. At least in my testing with A and CNAME lookups, only CNAMEs exhibit the intermittent failures.

Our workaround has been to simply purge the systemd-resolved package entirely, which has the effect of sorting out nsswitch on its way out of the door. <edit> DO NOT USE THIS WORKAROUND, IT BREAKS DNS NAME RESOLUTION AFTER A REBOOT, SORRY 🤡

apt purge --auto-remove systemd-resolved

Thanks to @schams-net for the detailed repro details. I shamelessly grabbed the bash fragment [1] to easily reproduce the issue. Using the script to resolve a CNAME produced intermittent failures. Pointing it at an A record gave no failures. Telling netcat to use IPv4 (nc -4) also meant no failures, for A or CNAME lookups.

[1] https://github.com/schams-net/aws-ec2-rds-dns-lookup-issue/blob/main/docs/how-to-reproduce-the-issue.md
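For readers who want a rough failure count without netcat, here is a minimal sketch. It is not the script from [1]: it assumes that `getent ahosts` (IPv4 and IPv6) and `getent ahostsv4` (IPv4 only) go through the same glibc NSS path as the failing lookups, and the function names `lookup` and `count_failures` are made up for this example. Caching in systemd-resolved may affect the numbers.

```shell
#!/bin/sh
# lookup HOST [DATABASE]: one name resolution through glibc NSS, and
# therefore through nss-resolve when "resolve" is listed in
# /etc/nsswitch.conf. DATABASE is a getent database: ahosts or ahostsv4.
lookup() {
    getent "${2:-ahosts}" "$1" >/dev/null 2>&1
}

# count_failures HOST [ATTEMPTS] [DATABASE]: print how many lookups failed.
count_failures() {
    host=$1
    attempts=${2:-100}
    db=${3:-ahosts}
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        lookup "$host" "$db" || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Run the loop only when a hostname is passed as the first argument.
if [ -n "${1:-}" ]; then
    echo "failures: $(count_failures "$1" "${2:-100}" "${3:-ahosts}") of ${2:-100}"
fi
```

Saved as a script, it could be run once with the CNAME endpoint and the default `ahosts` database, then again with `ahostsv4` as the third argument, to compare the two cases described above.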

nmeyerhans · 2024-04-09T15:04:57Z

@ahmgithubahm You don't need to remove all of systemd-resolved. Just removing libnss-resolve will accomplish what you need without disabling resolv.conf management altogether. Debian 12.6 images, when they're published, will make this the default.

ahmgithubahm · 2024-04-11T07:54:16Z

Thanks @nmeyerhans, that's good to know. I've edited my post now that I realise my workaround leaves you with no working /etc/resolv.conf after a reboot. Oops. 🤡

nmeyerhans · 2024-04-15T20:48:39Z

@schams-net @ahmgithubahm @jackarias @levinse @wjsc Be aware that the Debian 12 cloud images just released (version 20240415-1718) no longer enable libnss-resolve by default. So this issue should be mitigated in instances launched from these and future images.

If somebody who has encountered this issue on Debian 12 would be able to test the Debian sid images, that would help the systemd maintainers understand if the issue is still present.

kuzminets · 2024-04-16T05:02:07Z

Thanks @nmeyerhans! Is the problem somehow specific to AWS? I can't reproduce it from a local VM with libnss-resolve being enabled.

nmeyerhans · 2024-04-17T20:48:38Z

I don't see any reason to think it'd be specific to AWS, but it's possible. As the problem is intermittent, it could be that something about the AWS DNS infrastructure triggers it more frequently, but that's just speculation.

schams-net · 2024-05-06T05:51:03Z

I can confirm that I can no longer reproduce the issue with the latest AMIs at AWS 🥳
I tested debian-12-amd64-20240429-1732 (for example AMI ID ami-02175b0058d3ce245 in us-east-1). The file /etc/nsswitch.conf contains the following configuration in the old vs. new image.

Previously (with "resolve"):

hosts: files resolve [!UNAVAIL=return] dns

New(er) images (without "resolve"):

hosts: files dns myhostname

(As @nmeyerhans pointed out).

I haven't had the chance to test the Debian sid images yet.
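To tell which of the two configurations a given instance has, a check along these lines may help. This is a sketch under the assumption that the hosts line looks like one of the two variants quoted above; `uses_nss_resolve` is a hypothetical helper name.

```shell
#!/bin/sh
# uses_nss_resolve [FILE]: succeed if the hosts line of FILE lists the
# "resolve" service. FILE defaults to /etc/nsswitch.conf.
uses_nss_resolve() {
    grep -Eq '^hosts:.*[[:space:]]resolve([[:space:]]|$)' "${1:-/etc/nsswitch.conf}"
}

if uses_nss_resolve; then
    echo "hosts line still routes lookups through nss-resolve"
else
    echo "hosts line does not use nss-resolve"
fi
```

The optional file argument makes it possible to check a copy of the file, for example one extracted from an image, instead of the live system.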

schams-net added the bug 🐛 Programming errors, that need preferential fixing label Sep 5, 2023

github-actions bot added the resolve label Sep 5, 2023

liam-lloyd mentioned this issue May 30, 2024

Update Debian AMIs to fix DNS bug PermanentOrg/infrastructure#157

Merged
